[jira] [Commented] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

Thejas M Nair (JIRA) Mon, 25 Jul 2011 20:44:02 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070925#comment-13070925
 ]


Thejas M Nair commented on PIG-1942:
------------------------------------

I think this is a very good idea; it will make it easier to write Python UDFs.
The patch effectively introduces several new APIs: each type conversion 
behavior introduced here will need to be retained to preserve backward 
compatibility. I think we should restrict the conversions to the cases where 
we are sure they make sense (and return either null + warning or an error in 
other cases).

Review of 1942_with_junit.patch
- I think when the tuple schema and the Python UDF return value are not 
compatible (for example, when the schema has fewer fields than the object 
returned by the UDF), it should return null + warning. When the object 
returned by the Python UDF has fewer fields than the schema, null fields 
should be appended to the new tuple to match the schema.

- In JythonUtils.asBag, I am not sure that automatically choosing the type 
conversion from the contents of the UDF output object is worth the increase in 
complexity and potential for surprises. I think the user should wrap the 'list 
type' within another list type for the case when the schema represents a bag of 
tuples, i.e. do type conversions only for the cases where compatible == true.

- In JythonUtils.asBag, why are the null values skipped? This behavior is not 
consistent with the behavior in other places in Pig.

- Why does the patch do type conversions for non-Python datatypes? Are these 
expected from Python UDF output?
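
The tuple rule suggested in the first review point could look roughly like 
the sketch below. It is plain Python for illustration only, not code from the 
patch: the function name to_schema_tuple and the field_names argument are 
hypothetical, and None stands in for a Pig null.

```python
import warnings


def to_schema_tuple(value, field_names):
    """Fit a UDF return value to a tuple schema (hypothetical helper).

    field_names is an assumed stand-in for the fields of the tuple schema;
    None stands in for a Pig null.
    """
    fields = list(value)
    if len(fields) > len(field_names):
        # More fields than the schema declares: incompatible, so null + warning.
        warnings.warn(
            "UDF returned %d fields but the schema declares %d; returning null"
            % (len(fields), len(field_names))
        )
        return None
    # Fewer fields than the schema declares: append nulls to match the schema.
    fields.extend([None] * (len(field_names) - len(fields)))
    return tuple(fields)
```

With a three-field schema, a two-element return value is padded with one 
null, while a four-element return value yields null and a warning.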


> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Assignee: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>         return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote each 
> string element to a tupled element.
> Also, a list/array/tuple/set etc. are all equally convertible to a bag, and 
> list/array/tuple are equally convertible to a Tuple; this conversion can be 
> done in a much less rigid way with the use of the schema.
> This allows much easier reuse of existing Python code and less memory 
> overhead, since no intermediate re-converted objects need to be created.
> I wrote the code to do this a while back as part of my version of the 
> jython script framework; I'll isolate that and attach.
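
For comparison, the explicit wrapping that the quoted example needs when no 
implicit promotion is done can be sketched as follows. The outputSchema 
function defined here is only a stand-in (an assumption) so that the snippet 
runs outside Pig; inside a Pig Jython script the decorator is supplied by Pig 
and should not be redefined.

```python
import re


def outputSchema(schema):
    # Stand-in for the decorator Pig supplies to Jython scripts (assumption:
    # it only needs to attach the schema string for this illustration).
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema("y:bag{t:tuple(word:chararray)}")
def strsplittobag(content, regex):
    # Wrap each string in a one-element tuple so the list matches bag{tuple(...)}.
    return [(word,) for word in re.compile(regex).split(content)]
```

Each word is copied into its own one-element tuple here, which is the 
intermediate re-conversion the issue description wants to avoid.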

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
