Lorand Bendig created PIG-3911:
----------------------------------
Summary: Define unique fields with @OutputSchema
Key: PIG-3911
URL: https://issues.apache.org/jira/browse/PIG-3911
Project: Pig
Issue Type: Improvement
Reporter: Lorand Bendig
Assignee: Lorand Bendig
As a continuation of PIG-2361, I think that {{@OutputSchema}} could be extended
in order to eliminate the repeating patterns of {{EvalFunc#outputSchema()}}
found in most UDFs.
I propose the following syntax:
Complex schema definition:
{code}
@OutputSchema("y:bag{t:tuple(len:int,word:chararray,${0}:int)},${1}:chararray,${2}:bytearray")
@SchemaFields({
@Unique(name="word"),
@Unique(name="${0}"),
@Unique(name="${1}", prefix="id"),
@Unique(name="${2}", prefix="item", postfix="id")}
)
public class MyUDF {...}
{code}
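For illustration only, the two supporting annotations might be declared roughly as follows. This is a minimal sketch, not part of any existing Pig API: the element names mirror the attributes used above, while the retention policy, the target, the empty-string defaults and the {{DemoUdf}} class are assumptions.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical declaration: marks one schema field (or placeholder) as unique.
@Retention(RetentionPolicy.RUNTIME)   // assumed: must be readable via reflection
@Target(ElementType.TYPE)             // assumed: also usable directly on a UDF class
@interface Unique {
    String name() default "";         // field name or placeholder such as "${0}"
    String prefix() default "";       // assumed default: no prefix
    String postfix() default "";      // assumed default: no postfix
}

// Hypothetical container that groups several @Unique rules on one class.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface SchemaFields {
    Unique[] value();
}

// Illustrative class only; a real UDF would extend EvalFunc.
@SchemaFields({
    @Unique(name = "word"),
    @Unique(name = "${2}", prefix = "item", postfix = "id")
})
class DemoUdf {}
```

With runtime retention, the rules could then be read through {{DemoUdf.class.getAnnotation(SchemaFields.class).value()}}.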
Rewrite rules:
{code}
word => "word" + "_" + nextSchemaId
${0} => this.getClass().getName().toLowerCase() + "_" + nextSchemaId
${1} => "id" + "_" + nextSchemaId
${2} => "item" + "_" + nextSchemaId + "_" + "id"
{code}
Result:
{code}
y:bag{t:tuple(len:int,word_1:chararray,com.example.myudf_5:int)},id_8:chararray,item_9_id:bytearray
{code}
The prefix and postfix attributes would apply only to placeholders; for a named field such as {{word}} they would be ignored.
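The rewrite rules could be sketched in plain Java as below. This is a rough illustration under stated assumptions, not the proposed implementation: the class {{UniqueFieldSketch}}, its methods and the {{String[]}} rule encoding are hypothetical, and a local counter stands in for Pig's schema id, so the ids come out sequential rather than 1, 5, 8, 9 as in the result above.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that applies the rewrite rules to a schema string.
class UniqueFieldSketch {
    // Local stand-in for Pig's schema id counter (assumption).
    private final AtomicInteger nextSchemaId;

    UniqueFieldSketch(int firstId) {
        this.nextSchemaId = new AtomicInteger(firstId);
    }

    // Builds the unique name for one rule; prefix/postfix may be empty.
    String uniqueName(String name, String prefix, String postfix, String udfClassName) {
        boolean placeholder = name.startsWith("${") && name.endsWith("}");
        String base;
        if (!placeholder) {
            base = name;                                  // named field keeps its name
        } else if (!prefix.isEmpty()) {
            base = prefix;                                // placeholder with prefix
        } else {
            base = udfClassName.toLowerCase(Locale.ROOT); // placeholder without prefix
        }
        String result = base + "_" + nextSchemaId.getAndIncrement();
        // Postfix is honoured for placeholders only.
        if (placeholder && !postfix.isEmpty()) {
            result += "_" + postfix;
        }
        return result;
    }

    // Each rule is {name, prefix, postfix}; rules are applied in order.
    String rewrite(String schema, String[][] rules, String udfClassName) {
        String out = schema;
        for (String[] rule : rules) {
            String unique = uniqueName(rule[0], rule[1], rule[2], udfClassName);
            // Match the name only as a whole alias directly before ':'.
            Pattern alias = Pattern.compile("(?<![\\w}])" + Pattern.quote(rule[0]) + "(?=:)");
            out = alias.matcher(out).replaceAll(Matcher.quoteReplacement(unique));
        }
        return out;
    }
}
```

Starting the counter at 1 and applying the four rules from the example in order would name the fields {{word_1}}, {{com.example.myudf_2}}, {{id_3}} and {{item_4_id}}.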
Single field definitions:
{code}
@OutputSchema("double")
=> equivalent to:
return new Schema(new Schema.FieldSchema(null, DataType.DOUBLE));
{code}
{code}
@OutputSchema("d:double")
=> equivalent to:
return new Schema(new Schema.FieldSchema("d", DataType.DOUBLE));
{code}
{code}
@OutputSchema("tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName(
this.getClass().getName().toLowerCase(), input), DataType.TUPLE));
{code}
{code}
@OutputSchema("words:tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName("words", input),
DataType.TUPLE));
{code}
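The four single-field cases differ only in how the alias is derived. A simplified sketch of that decision, independent of Pig's {{Schema}} classes: {{SingleFieldSketch}}, {{Field}} and {{resolve}} are hypothetical names, the schema id is passed in explicitly, and the alias is assumed to be just name + "_" + id, as in the rewrite rules above.

```java
import java.util.Locale;

// Hypothetical resolver for single-field @OutputSchema definitions.
class SingleFieldSketch {
    // Resolved alias (null when the field is anonymous) and declared type name.
    static final class Field {
        final String alias;
        final String type;

        Field(String alias, String type) {
            this.alias = alias;
            this.type = type;
        }
    }

    static Field resolve(String definition, boolean unique, String udfClassName, int schemaId) {
        int colon = definition.indexOf(':');
        // "double" has no alias; "d:double" has alias "d".
        String alias = colon < 0 ? null : definition.substring(0, colon);
        String type = definition.substring(colon + 1);
        if (unique) {
            // Without an alias, fall back to the lower-cased class name.
            String base = alias != null ? alias : udfClassName.toLowerCase(Locale.ROOT);
            alias = base + "_" + schemaId;
        }
        return new Field(alias, type);
    }
}
```

Under these assumptions, {{resolve("words:tuple", true, "com.example.MyUDF", 7)}} yields the alias {{words_7}}, while {{resolve("double", false, ...)}} leaves the alias null.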