[jira] [Commented] (PIG-1942) script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

Thejas M Nair (JIRA) Wed, 27 Jul 2011 22:39:57 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072187#comment-13072187
 ]


Thejas M Nair commented on PIG-1942:
------------------------------------

bq. wrt schema has fewer fields than actual:
I think Pig schemas need to support partial schemas, that is, schemas with a 
variable number of fields of a certain or unspecified type. But I think it is 
better to add this feature to the schema itself than to do conversions based 
on a schema that is not compatible. Also, I think it is a good thing to check 
for schema consistency, so that users know when they make a mistake.

bq. I am against the idea of returning null and WARN (nearly as a rule). I 
think a reasonable interpretation is always better than NULL (with WARN). I 
would only advocate for an actual error that forces a user to rectify their 
code. 
Returning null+WARN is the convention followed in load funcs like PigStorage 
and in type conversion code. But I see that the situation in type conversion is 
different because there is no reasonable interpretation if the type conversion 
fails. 
Does anybody else have opinions on this?

Re: logging 1 line per warning. 
PigLogger.warn(..) can be used to aggregate the warnings. 
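
As a rough illustration of what aggregation means here, the sketch below counts warnings by kind and reports one summary line per kind instead of one line per occurrence. It is not PigLogger's API: `WarningAggregator`, its methods, and the `CONVERSION_FAILED` label are hypothetical names used only for this example.

```python
from collections import Counter


class WarningAggregator:
    """Hypothetical helper: counts warnings per kind instead of logging each one."""

    def __init__(self):
        self._counts = Counter()

    def warn(self, kind):
        # Record the occurrence; nothing is printed per call.
        self._counts[kind] += 1

    def summary(self):
        # One line per warning kind, sorted for stable output.
        return ["%s: %d time(s)" % (kind, n)
                for kind, n in sorted(self._counts.items())]


agg = WarningAggregator()
for value in ["1", "x", "2", "y"]:
    try:
        int(value)
    except ValueError:
        # "CONVERSION_FAILED" is an illustrative label, not a Pig constant.
        agg.warn("CONVERSION_FAILED")

summary_lines = agg.summary()
```

The two failed conversions end up in a single summary line rather than two separate log lines.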

Regarding auto-tupling, I agree that it is useful when -
1. output schema is a tuple
2. output schema is a bag of tuples with single fields. 
But if it is a bag of tuples that have multiple fields, I think it makes 
sense for the output value to have a list type representing the tuple.
{code}
If output schema is 
{(int, int)}

I think the output value should look like - ((1,2),(3,4)). I don't see a need 
to convert (1,2) into ((1,2)).
{code}
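
The rule described above could be sketched in plain Python roughly as follows. This is only an illustration under stated assumptions, not the code in the attached patch: `to_bag_rows` and its `field_count` parameter are hypothetical, and ordinary Python tuples stand in for Pig tuples. Bare values are wrapped only when the bag's tuple has a single field, and a mismatch with the schema raises an error instead of being silently reinterpreted.

```python
def to_bag_rows(value, field_count):
    """Hypothetical converter: turn a Python iterable into a list of tuples
    for a bag whose tuples have `field_count` fields."""
    rows = []
    for item in value:
        if isinstance(item, (list, tuple)):
            # Already tuple-like: use it as the row, do not wrap it again.
            row = tuple(item)
        elif field_count == 1:
            # Auto-tupling: a bare value becomes a one-field tuple.
            row = (item,)
        else:
            raise ValueError(
                "expected a %d-field tuple, got %r" % (field_count, item))
        if len(row) != field_count:
            # Schema consistency check: report the mistake to the user.
            raise ValueError(
                "expected %d fields, got %d in %r" % (field_count, len(row), row))
        rows.append(row)
    return rows


single = to_bag_rows(["a", "b"], 1)         # bare strings are wrapped
pairs = to_bag_rows(((1, 2), (3, 4)), 2)    # (1, 2) stays (1, 2)
```

With a two-field schema, a bare value such as `1` is rejected rather than wrapped, which matches the preference for consistency checks over guessing.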
I also have a concern that with auto-tupling, Python UDF users will have an 
incorrect understanding of Pig bags of primitive types. They might not realize 
that bags always contain tuples. How much of a performance difference did 
you notice when adding tuple wrappers for fields in a bag? I am trying 
to evaluate the option of providing utility libraries that Python UDFs can use 
to convert to Pig types.

Regarding skipping nulls in JythonUtils.asBag, it is at line 491. I am not 
sure whether Pig actually works with null tuples in a bag; I need to check 
that.
{code}
if (it != null) {
    while (first == null && it.hasNext()) {
        first = it.next();
    }
}
{code}
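
For readers more at home in Python than Java, the loop above amounts to scanning an iterator for its first non-null element. A rough Python equivalent, with a hypothetical function name and not taken from JythonUtils, might look like this:

```python
def first_non_null(iterable):
    """Return the first element that is not None, or None if there is none."""
    it = iter(iterable)
    # Leading None entries are consumed and skipped, as in the Java loop.
    return next((item for item in it if item is not None), None)


first = first_non_null([None, None, (1,), None])
```

As in the Java snippet, any leading null entries are dropped before the first usable element is found.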


> script UDF (jython) should utilize the intended output schema to more 
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: PIG-1942
>                 URL: https://issues.apache.org/jira/browse/PIG-1942
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Woody Anderson
>            Assignee: Woody Anderson
>            Priority: Minor
>              Labels: python, schema, udf
>             Fix For: 0.10
>
>         Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>         return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output 
> schema is known, and it would be quite simple to implicitly promote the 
> string element to a tupled element.
> also, a list/array/tuple/set etc. are all equally convertable to bag, and 
> list/array/tuple are equally convertable to Tuple, this conversion can be 
> done in a much less rigid way with the use of the schema.
> this allows much more facile re-use of existing python code and less memory 
> overhead to create intermediate re-converting of object types.
> I have written the code to do this a while back as part of my version of the 
> jython script framework, i'll isolate that and attach.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
