[
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13071048#comment-13071048
]
Woody Anderson commented on PIG-1942:
-------------------------------------
I think your feedback raises some fair questions, but I have some reasons for
disagreeing:
wrt schema has fewer fields than actual:
this is a common case for me b/c pig doesn't allow specification of a *-tuple,
i.e. all rows of data will have the same number of (int) elements, but it's not
known how many. This is a weak area of pig in general imho. If there is only 1
element in the tuple it can be seen to infer some type information for the
remaining rows. (at least i think this is how 'tuple(int)' shows up). I think
that when there is more than 1 column in a tuple, it's not a *generic* tuple,
and i can see an error being appropriate. but for 1-tuples i appreciate
the flexibility of using it for the type and writing udfs that accept tuples of
arbitrary dimension. even if the args-to-function stuff is too simplistic to
apply in this scenario, it's easy enough to write useful udfs that utilize tuple
dimension flexibility.
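
A rough sketch of the kind of udf i mean, assuming the 1-field schema is read
as "the element type for every position" (the interpretation argued for above,
not confirmed Pig behavior). The decorator below is only a stand-in so the
snippet runs outside Pig, and `scale` is a made-up example name:

```python
# Stand-in for the outputSchema decorator so this runs outside Pig
# (assumption: inside a Pig jython script the framework supplies the real one).
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


# Hypothetical udf: the 1-field tuple schema only conveys the element type;
# the actual tuple may have any number of int elements.
@outputSchema("t:tuple(v:int)")
def scale(row, factor):
    # Preserve nulls, multiply everything else; output has the same
    # dimension as the input, whatever that happens to be.
    return tuple(None if v is None else v * factor for v in row)
```

The same function handles `(5,)` and `(1, 2, 3)` without the schema having to
name each column.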
I am against the idea of returning null and WARN (nearly as a rule). I think a
reasonable interpretation is always better than NULL (with WARN). I would only
advocate for an actual error that forces a user to rectify their code. This may
be where reasonable people disagree, but i think null rather than a tuple
reflecting the returned data is less expected.
The whole reporting of 'schema != data' could be improved tho. I am not sure of
the best way to reflect that anything "grey/WARN" is happening. It seems like
logging 1 line per encountered edge case is major overkill, and prone to
generate huge log output. We could count each WARN scenario and log/counter
that information to give a more succinct description of execution behavior that
a simple user can fix, and an advanced user can ignore judiciously. Possibly
more specific counters and only 1 warn per type per execution.
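
To make the "count it, warn once" idea concrete, here is a minimal sketch. The
class and the warning-kind names are hypothetical, not part of the patch or of
Pig's counter API; it only illustrates 1 log line per warning type plus a
per-type count at the end:

```python
from collections import Counter


class CoercionWarnings(object):
    """Hypothetical helper: count every grey-area coercion, log each kind once."""

    def __init__(self):
        self.counts = Counter()
        self.messages = []  # stands in for the real log output

    def warn(self, kind, detail):
        self.counts[kind] += 1
        # Only the first occurrence of each kind produces a log line.
        if self.counts[kind] == 1:
            self.messages.append(
                "WARN %s: %s (further occurrences are only counted)" % (kind, detail))

    def summary(self):
        # One succinct line per warning kind, suitable for end-of-job reporting.
        return ["%s: %d" % (kind, n) for kind, n in sorted(self.counts.items())]
```

A job that hits the same edge case a million times would then produce one WARN
line and one counter value instead of a million log lines.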
Pig schemas are often so.. imprecise, that i think best effort coercion is
useful, but i think a fine compromise would be to support only a specific set
of conversions that would be a subset of this patch, and still perform the
others b/c they are mostly intuitive and useful, with a WARN generated when one
is executed if we think it's too esoteric. We may draw lines in slightly
different places, but i tried to cover a fair number of cases in the test code,
which i think is a fair survey of expected coercions.
wrt JU.asBag, i think auto-tupling is a must. This is one of the most common
mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a
very common refrain, and just silly 99.9% of the time. Plus it's a bunch of
extra unnecessary objects that one must create, and causes a bit slower
execution for simple udfs.
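
For illustration, auto-tupling amounts to something like the sketch below. This
is plain python with a hypothetical `as_bag` helper, not the actual JU.asBag
code from the patch, and it ignores the embedded-bag disambiguation entirely:

```python
import re


def as_bag(values):
    """Hypothetical helper: wrap every non-tuple element in a 1-tuple."""
    return [v if isinstance(v, tuple) else (v,) for v in values]


def strsplittobag(content, regex):
    # split() returns a list of bare strings; the udf author should not
    # have to wrap each one by hand to satisfy a bag{tuple(chararray)} schema.
    return as_bag(re.compile(regex).split(content))
```

With this, `strsplittobag("a,b,c", ",")` yields a list of 1-tuples, and
elements that are already tuples pass through untouched.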
I'd have to re-read the code again to examine the edge cases. I do recall the
disambiguation for embedded bags being a pain to write and describe.
Documentation being the remaining concern. That said, i think it does something
reasonable and still executes faster than existing rigid code. Also in the code
is a decent synopsis of the disambiguations that are intended.
wrt skipping nulls: can you cite the line number? do you mean skipping null
bags? or null element/tuples when creating a bag? This might just be me not
understanding something properly. I thought bags didn't have null tuples, just
tuples with null elements?
wrt various types:
jython is fully capable of returning any jvm type. so that means anything
really.
I decided to cover the collections classes, lang classes, base types, and PY*
classes.
Jython is nice in that many classes implement the collections ifaces, but not
always as efficiently as using the python classes directly.
this is common in python/jython of course. not in udfs as of yet... b/c it
wasn't allowed. But i began doing it pretty quickly once it was possible.
> script UDF (jython) should utilize the intended output schema to more
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
> Key: PIG-1942
> URL: https://issues.apache.org/jira/browse/PIG-1942
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Woody Anderson
> Assignee: Woody Anderson
> Priority: Minor
> Labels: python, schema, udf
> Fix For: 0.10
>
> Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output
> schema is known, and it would be quite simple to implicitly promote the
> string element to a tupled element.
> also, a list/array/tuple/set etc. are all equally convertible to bag, and
> list/array/tuple are equally convertible to Tuple; this conversion can be
> done in a much less rigid way with the use of the schema.
> this allows much more facile re-use of existing python code and less memory
> overhead from intermediate re-conversion of object types.
> I have written the code to do this a while back as part of my version of the
> jython script framework, i'll isolate that and attach.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira