[ https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lorand Bendig updated PIG-3911:
-------------------------------
Description:
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that
more flexible output schemas can be defined through annotations. As a result,
the repeated {{EvalFunc#outputSchema()}} boilerplate can be eliminated from
most UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}
{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
    return new Schema(new Schema.FieldSchema(
            getSchemaName(this.getClass().getName().toLowerCase(), input),
            input, DataType.BAG));
}
{code}
If the useInputSchema attribute is set, the input schema is applied to the
output schema, provided that:
- the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
- it has a complex field type (tuple, bag, map)
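The two conditions above can be sketched as a single check in plain Java. This is only one possible reading, not the patch's code: the class `InputSchemaRule`, the method `appliesInputSchema`, and the exact grammar accepted by the regular expression (an optional alias or ${n} placeholder, then a complex type keyword or its empty shorthand) are assumptions made for illustration.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; models the "simple and complex-typed" condition only. */
class InputSchemaRule {
    // Optional "alias:" prefix, where alias is an identifier or a ${n} placeholder.
    // The type must be tuple, bag, map, or the empty shorthand (), {}, [].
    private static final Pattern SIMPLE_COMPLEX = Pattern.compile(
            "(?:(?:[A-Za-z_][A-Za-z0-9_]*|\\$\\{\\d+\\}):)?"
            + "(?:tuple|bag|map|\\(\\)|\\{\\}|\\[\\])");

    /** True when the input schema would be attached to the output field. */
    static boolean appliesInputSchema(String outputSchema, boolean useInputSchema) {
        // The whole string must match, so nested definitions are rejected.
        return useInputSchema && SIMPLE_COMPLEX.matcher(outputSchema.trim()).matches();
    }
}
```

Under this reading, "dimensions:bag" and "${0}:bag" qualify, while "chararray" (not a complex type) and "y:bag{t:tuple(len:int)}" (not simple) do not.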
@Unique: this annotation defines which fields should be unique in the schema
- if no parameters are provided, all fields will be unique
- otherwise it takes a string array of field names
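As a rough illustration of these two cases, the annotation could be modelled with a string-array member that defaults to empty. The `@Unique` declaration below is a local stand-in written for this example (the definition in the patch may differ), and `UniqueResolver`, `AllUniqueUdf`, `SomeUniqueUdf`, and `PlainUdf` are hypothetical names.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

/** Stand-in for the proposed annotation; an empty array means "all fields". */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Unique {
    String[] value() default {};
}

@Unique
class AllUniqueUdf {}

@Unique({"word", "${0}"})
class SomeUniqueUdf {}

class PlainUdf {}

class UniqueResolver {
    /** Decides whether the given output field of a UDF class should be made unique. */
    static boolean isUnique(Class<?> udf, String field) {
        Unique unique = udf.getAnnotation(Unique.class);
        if (unique == null) {
            return false;               // no annotation: nothing is renamed
        }
        if (unique.value().length == 0) {
            return true;                // no parameters: every field is unique
        }
        return Arrays.asList(unique.value()).contains(field);
    }
}
```

Because the member is named `value`, the single-field form `@Unique("${0}")` used in the examples also compiles against this declaration.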
Unique field generation:
A unique field name is generated in the same manner as EvalFunc#getSchemaName
generates one.
- if the field has an alias:
  - if it is a placeholder (${i}, i=0..n): fieldName ->
    com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]
- otherwise: com_myfunc_[input_alias]_[nextSchemaId]
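The naming rules above can be sketched in plain Java without any Pig dependency. This is an illustration only: `UniqueFieldNamer`, its `nextSchemaId` counter, and the way the function name and input alias are passed in are hypothetical, and the code does not reproduce Pig's `EvalFunc#getSchemaName`.

```java
import java.util.regex.Pattern;

/** Hypothetical model of the unique-name rules; not the patch's implementation. */
class UniqueFieldNamer {
    // Matches placeholder aliases such as ${0}, ${1}, ...
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    private final String funcName;  // assumed to be normalized already, e.g. "com_myfunc"
    private int nextSchemaId;       // stand-in for the schema id counter

    UniqueFieldNamer(String funcName, int firstSchemaId) {
        this.funcName = funcName;
        this.nextSchemaId = firstSchemaId;
    }

    /** Returns a unique name for one field; alias is null for an unnamed field. */
    String uniqueName(String alias, String inputAlias) {
        int id = nextSchemaId++;
        if (alias == null || PLACEHOLDER.matcher(alias).matches()) {
            // No alias or a placeholder: derive the name from the function and input alias.
            return funcName + "_" + inputAlias + "_" + id;
        }
        // A regular alias keeps its name and only gains the id suffix.
        return alias + "_" + id;
    }
}
```

With a namer created as `new UniqueFieldNamer("com_myfunc", 1)` and an input alias of "a", the aliases "word", "${0}", and null would map to word_1, com_myfunc_a_2, and com_myfunc_a_3 in that order.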
Scripting UDFs:
The following scripting languages have been extended to use the above
modifications:
Python, Jython, Groovy, JRuby
---
The patch incorporates PIG-2361 and contains the following test cases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}
Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}
Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}
Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}
was:
As a continuation of PIG-2361, I think that {{@OutputSchema}} could be extended
in order to eliminate the repeating patterns of {{EvalFunc#outputSchema()}}
found in most UDFs.
I propose the following syntax:
Complex schema definition:
{code}
@OutputSchema("y:bag{t:tuple(len:int,word:chararray,${0}:int)},${1}:chararray,${2}:bytearray")
@SchemaFields({
    @Unique(name="word"),
    @Unique(name="${0}"),
    @Unique(name="${1}", prefix="id"),
    @Unique(name="${2}", prefix="item", postfix="id")
})
public class MyUDF {...}
{code}
Rewrite rules:
{code}
word => "word" + "_" + nextSchemaId
${0} => this.getClass().getName().toLowerCase() + "_" + nextSchemaId
${1} => "id" + "_" + nextSchemaId
${2} => "item" + "_" + nextSchemaId + "_" + "id"
{code}
Result:
{code}
y:bag{t:tuple(len:int,word_1:chararray,com.example.MyUDF_5:int)},id_8:chararray,item_9_id:bytearray
{code}
Prefix and postfix attributes would be applied only to placeholders.
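The rewrite rules of this earlier proposal, including the restriction of prefix and postfix to placeholders, can be sketched as a single function. `RewriteRule` and its `rewrite` signature are hypothetical; the class name and schema id are passed in explicitly so that the example stays self-contained. Following the rule text, the class name is lowercased.

```java
import java.util.Locale;

/** Hypothetical model of the rewrite rules from the earlier proposal. */
class RewriteRule {
    /** prefix and postfix may be null; they are ignored for non-placeholder names. */
    static String rewrite(String name, String prefix, String postfix,
                          String className, int nextSchemaId) {
        boolean placeholder = name.matches("\\$\\{\\d+\\}");
        String base;
        if (!placeholder) {
            base = name;                                  // word => word_<id>
        } else if (prefix != null) {
            base = prefix;                                // ${1} => id_<id>
        } else {
            base = className.toLowerCase(Locale.ROOT);    // ${0} => <class name>_<id>
        }
        String result = base + "_" + nextSchemaId;
        if (placeholder && postfix != null) {
            result += "_" + postfix;                      // ${2} => item_<id>_id
        }
        return result;
    }
}
```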
Single field definitions:
{code}
@OutputSchema("double")
=> equivalent to:
return new Schema(new Schema.FieldSchema(null, DataType.DOUBLE));
{code}
{code}
@OutputSchema("d:double")
=> equivalent to:
return new Schema(new Schema.FieldSchema("d", DataType.DOUBLE));
{code}
{code}
@OutputSchema("tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName(
this.getClass().getName().toLowerCase(), input), DataType.TUPLE));
{code}
{code}
@OutputSchema("words:tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName("words", input),
DataType.TUPLE));
{code}
> Define unique fields with @OutputSchema
> ---------------------------------------
>
> Key: PIG-3911
> URL: https://issues.apache.org/jira/browse/PIG-3911
> Project: Pig
> Issue Type: Improvement
> Reporter: Lorand Bendig
> Assignee: Lorand Bendig