[jira] [Updated] (PIG-3911) Define unique fields with @OutputSchema

Lorand Bendig (JIRA) Sat, 24 May 2014 14:05:27 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Lorand Bendig updated PIG-3911:
-------------------------------

    Description: 
Based on PIG-2361, I took the liberty of extending {{@OutputSchema}} so that 
more flexible output schemas can be defined through annotations. As a result, 
the repeating patterns of {{EvalFunc#outputSchema()}} can be eliminated from 
most of the UDFs.
Examples:
{code}
@OutputSchema("bytearray")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(null, DataType.BYTEARRAY));
}
{code}

{code}
@OutputSchema("chararray")
@Unique
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(
      getSchemaName(this.getClass().getName().toLowerCase(), input),
      DataType.CHARARRAY));
}
{code}
{code}
@OutputSchema(value = "dimensions:bag", useInputSchema = true)
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema("dimensions", input, DataType.BAG));
}
{code}
{code}
@OutputSchema(value = "${0}:bag", useInputSchema = true)
@Unique("${0}")
{code}
=> equivalent to:
{code}
@Override
public Schema outputSchema(Schema input) {
  return new Schema(new Schema.FieldSchema(
      getSchemaName(this.getClass().getName().toLowerCase(), input),
      input, DataType.BAG));
}
{code}

If the useInputSchema attribute is set, the input schema is applied to the 
output schema, provided that:
- the output schema is "simple", i.e. [name][:type] or '()', '{}', '[]', and
- it has a complex field type (tuple, bag, map)
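
One possible reading of the two conditions above, sketched in plain Java. This 
is an illustration only, not the check used in the patch: the class 
InputSchemaRule, the method acceptsInputSchema, and the regular expression are 
hypothetical, and treating '()', '{}', '[]' as shorthand for tuple, bag and map 
is an assumption.

```java
import java.util.regex.Pattern;

final class InputSchemaRule {
    // Optional "name:" prefix (plain identifier or ${i} placeholder),
    // followed by one of the complex type keywords.
    private static final Pattern SIMPLE_COMPLEX =
        Pattern.compile("(?:(?:\\w+|\\$\\{\\d+\\}):)?(?:tuple|bag|map)");

    /** True if, under this reading of the rules, the input schema would be applied. */
    static boolean acceptsInputSchema(String outputSchema) {
        String s = outputSchema.trim();
        // Bare delimiters: assumed here to be shorthand for tuple, bag and map.
        if (s.equals("()") || s.equals("{}") || s.equals("[]")) {
            return true;
        }
        return SIMPLE_COMPLEX.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"dimensions:bag", "${0}:bag", "{}", "d:double", "chararray"};
        for (String s : samples) {
            System.out.println(s + " -> " + acceptsInputSchema(s));
        }
    }
}
```

Under this reading, a nested definition such as "y:bag{t:tuple(len:int)}" is not 
"simple" and would be rejected, as would any scalar type.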

@Unique: this annotation defines which fields should be unique in the schema
- if no parameters are provided, all fields will be unique
- otherwise it takes a string array of field names
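
The two cases above could be modelled roughly as follows. This is a 
self-contained sketch, not the annotation from the patch: the @Unique declared 
here is a stand-in with an assumed value() attribute, and the classes 
AllFieldsUnique, SomeFieldsUnique, NotAnnotated and UniqueLookup are 
hypothetical.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Arrays;

// Stand-in declaration; the real annotation's attributes may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Unique {
    String[] value() default {};
}

@Unique
class AllFieldsUnique {}

@Unique({"word", "${0}"})
class SomeFieldsUnique {}

class NotAnnotated {}

final class UniqueLookup {
    /** Returns true if the given field should get a unique name for this class. */
    static boolean isUnique(Class<?> udf, String field) {
        Unique unique = udf.getAnnotation(Unique.class);
        if (unique == null) {
            return false;                 // no annotation: nothing is made unique
        }
        if (unique.value().length == 0) {
            return true;                  // no parameters: every field is unique
        }
        return Arrays.asList(unique.value()).contains(field);
    }

    public static void main(String[] args) {
        System.out.println(isUnique(AllFieldsUnique.class, "len"));   // true
        System.out.println(isUnique(SomeFieldsUnique.class, "word")); // true
        System.out.println(isUnique(SomeFieldsUnique.class, "len"));  // false
        System.out.println(isUnique(NotAnnotated.class, "word"));     // false
    }
}
```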

Unique field generation:
A unique field name is generated the same way EvalFunc#getSchemaName does it:

- if the field has an alias:
  - and the alias is a placeholder (${i}, i=0..n): fieldName -> com_myfunc_[input_alias]_[nextSchemaId]
  - otherwise: fieldName -> fieldName_[nextSchemaId]

- otherwise: com_myfunc_[input_alias]_[nextSchemaId]
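
The naming rules above can be sketched in plain Java. This is a minimal 
illustration, not the code from the patch: the class UniqueFieldNames, the 
method uniqueName, and passing nextSchemaId in as a parameter are all 
hypothetical, "com_myfunc" is taken verbatim from the rules as the 
UDF-derived prefix, and leaving out the input alias when none is available is 
an assumption.

```java
import java.util.regex.Pattern;

final class UniqueFieldNames {
    // Matches placeholders of the form ${0}, ${1}, ...
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{\\d+\\}");

    /**
     * @param alias        field alias from the schema string, or null if absent
     * @param udfPrefix    UDF-derived prefix, e.g. "com_myfunc"
     * @param inputAlias   alias taken from the input schema, or null if none
     * @param nextSchemaId counter value that makes the name unique
     */
    static String uniqueName(String alias, String udfPrefix,
                             String inputAlias, int nextSchemaId) {
        boolean useUdfPrefix = alias == null || PLACEHOLDER.matcher(alias).matches();
        if (!useUdfPrefix) {
            // Regular alias: fieldName -> fieldName_[nextSchemaId]
            return alias + "_" + nextSchemaId;
        }
        // Placeholder or no alias: com_myfunc_[input_alias]_[nextSchemaId]
        StringBuilder name = new StringBuilder(udfPrefix).append('_');
        if (inputAlias != null) {
            name.append(inputAlias).append('_');
        }
        return name.append(nextSchemaId).toString();
    }

    public static void main(String[] args) {
        System.out.println(uniqueName("${0}", "com_myfunc", "words", 3));       // com_myfunc_words_3
        System.out.println(uniqueName("dimensions", "com_myfunc", "words", 4)); // dimensions_4
        System.out.println(uniqueName(null, "com_myfunc", "words", 5));         // com_myfunc_words_5
    }
}
```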

Scripting UDFs:
The following scripting languages have been extended to use the above 
modifications:
Python, Jython, Groovy, JRuby


---

The patch incorporates PIG-2361 and contains the following test cases:
Modified piggybank UDFs:
{{contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalOutputAnnotation.java}}

Various output schema definitions:
{{/trunk/test/org/apache/pig/test/TestEvalFuncOutputAnnotation.java}}

Modified builtin UDFs:
{{test/org/apache/pig/test/TestBuiltinOutputAnnotation.java}}

Scripting UDFs:
{{test/org/apache/pig/test/TestPythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJythonUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestGroovyUDFOutputAnnotation.java}}
{{test/org/apache/pig/test/TestJRubyUDFOutputAnnotation.java}}

  was:
As a continuation of PIG-2361, I think that {{@OutputSchema}} could be extended 
in order to eliminate the repeating patterns of {{EvalFunc#outputSchema()}} 
found in most UDFs. 
I'd come up with the following syntax:

Complex schema definition:
{code}
@OutputSchema("y:bag{t:tuple(len:int,word:chararray,${0}:int)},${1}:chararray,${2}:bytearray")
@SchemaFields({
  @Unique(name="word"),
  @Unique(name="${0}"),
  @Unique(name="${1}", prefix="id"),
  @Unique(name="${2}", prefix="item", postfix="id")}
)
public class MyUDF {...}
{code}
Rewrite rules:
{code}
word => "word" + "_" + nextSchemaId
${0} => this.getClass().getName().toLowerCase() + "_" + nextSchemaId
${1} => "id" + "_" + nextSchemaId
${2} => "item" + "_" + nextSchemaId + "_" + "id"
{code}
Result:
{code}
y:bag{t:tuple(len:int,word_1:chararray,com.example.MyUDF_5:int)},id_8:chararray,item_9_id:bytearray
{code}

Prefix and postfix attributes would be applied only for placeholders.

Single field definitions:
{code}
@OutputSchema("double")
=> equivalent to:
return new Schema(new Schema.FieldSchema(null, DataType.DOUBLE)); 
{code}
{code}
@OutputSchema("d:double")
=> equivalent to:
return new Schema(new Schema.FieldSchema("d", DataType.DOUBLE)); 
{code}
{code}
@OutputSchema("tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName(  
  this.getClass().getName().toLowerCase(), input), DataType.TUPLE));
{code}
{code}
@OutputSchema("words:tuple")
@Unique
=> equivalent to:
return new Schema(new Schema.FieldSchema(getSchemaName("words", input), 
  DataType.TUPLE));
{code}


> Define unique fields with @OutputSchema
> ---------------------------------------
>
>                 Key: PIG-3911
>                 URL: https://issues.apache.org/jira/browse/PIG-3911
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig



--
This message was sent by Atlassian JIRA
(v6.2#6252)
