[
https://issues.apache.org/jira/browse/PIG-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072187#comment-13072187
]
Thejas M Nair commented on PIG-1942:
------------------------------------
bq. wrt schema has fewer fields than actual:
I think Pig schemas need to support partial schemas, or types for schemas
with a variable number of fields of a certain or unspecified type. But I think
it is better to add this feature to the schema itself than to do conversions
based on a schema that is not compatible. I also think it is a good thing to
check for schema consistency, so that users know when they make a mistake.
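A minimal sketch of the kind of consistency check meant here. The helper name
and the modelling of a schema as a plain list of type names are assumptions
made for illustration; neither is a Pig API.

```python
# Hypothetical helper, not part of Pig. A tuple schema is modelled as a plain
# list of type names purely for illustration.
def check_tuple_arity(schema_fields, value):
    """Return `value` as a tuple, or raise if its length differs from the schema."""
    if len(value) != len(schema_fields):
        raise ValueError(
            "schema declares %d field(s) but value has %d: %r"
            % (len(schema_fields), len(value), value)
        )
    return tuple(value)
```

With this, a UDF that returns two values against a one-field schema fails
loudly instead of being converted silently.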
bq. I am against the idea of returning null and WARN (nearly as a rule). I
think a reasonable interpretation is always better than NULL (with WARN). I
would only advocate for an actual error that forces a user to rectify their
code.
Returning null+WARN is the convention followed in load funcs like PigStorage
and in type conversion code. But I see that the situation in type conversion is
different because there is no reasonable interpretation if the type conversion
fails.
Does anybody else have opinions on this?
Re: logging 1 line per warning.
PigLogger.warn(..) can be used to aggregate the warnings.
Regarding auto-tupling, I agree that it is useful when -
1. output schema is a tuple
2. output schema is a bag of tuples with single fields.
But if it is a bag of tuples that have multiple fields, I think it makes
sense for the output value to have a list type representing the tuple.
{code}
If output schema is
{(int, int)}
I think the output value should look like - ((1,2),(3,4)). I don't see a need
to convert (1,2) into ((1,2)).
{code}
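The rule described above could be sketched roughly as follows. The function
name `to_bag` and the `tuple_arity` parameter (the number of fields in the
bag's tuple schema) are assumptions for illustration, and a Python list of
tuples stands in for a Pig bag.

```python
# Hypothetical sketch, not the patch's code: a list of tuples stands in for a
# Pig bag, and `tuple_arity` stands in for the bag's tuple schema.
def to_bag(items, tuple_arity):
    bag = []
    for item in items:
        if isinstance(item, (list, tuple)):
            # Already tuple-like: keep it as one tuple, do not wrap it again.
            bag.append(tuple(item))
        elif tuple_arity == 1:
            # Single-field tuple schema: auto-wrap the bare value.
            bag.append((item,))
        else:
            raise TypeError(
                "expected a %d-field tuple, got %r" % (tuple_arity, item)
            )
    return bag
```

Under these assumptions, `to_bag(["a", "b"], 1)` wraps each string in a
one-field tuple, while `to_bag([(1, 2), (3, 4)], 2)` leaves `(1, 2)` as it is
rather than turning it into `((1, 2),)`.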
I also have a concern that with auto-tupling, Python UDF users will have an
incorrect understanding of Pig bags of primitive types. They might not realize
that bags always contain tuples. How much of a performance difference did
you notice when adding tuple wrappers for fields in a bag? I am trying
to evaluate the option of providing utility libraries that Python UDFs can use
to convert to Pig types.
Regarding skipping nulls in JythonUtils.asBag, it is at line 491. I am not
sure whether Pig actually works with null tuples in a bag; I need to check
that.
{code}
if (it != null) {
    while (first == null && it.hasNext()) {
        first = it.next();
    }
}
{code}
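For readers more at home in Python, the loop above behaves roughly like the
sketch below. The function name is made up for illustration, and what
JythonUtils.asBag does with the skipped elements afterwards is not shown in
the snippet, so this only mirrors the loop itself.

```python
# Hypothetical Python rendering of the Java loop above; the name is an
# assumption for illustration only.
def first_non_null(iterable):
    """Return (first non-None element or None, the partially consumed iterator)."""
    it = iter(iterable)
    first = None
    for item in it:
        if item is not None:
            first = item
            break
    return first, it
```

Note that any leading None values have been consumed from the iterator by the
time the first non-None element is found, while later None values are still
in it.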
> script UDF (jython) should utilize the intended output schema to more
> directly convert Py objects to Pig objects
> ----------------------------------------------------------------------------------------------------------------
>
> Key: PIG-1942
> URL: https://issues.apache.org/jira/browse/PIG-1942
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Woody Anderson
> Assignee: Woody Anderson
> Priority: Minor
> Labels: python, schema, udf
> Fix For: 0.10
>
> Attachments: 1942.patch, 1942_with_junit.patch
>
>
> from https://issues.apache.org/jira/browse/PIG-1824
> {code}
> import re
> @outputSchema("y:bag{t:tuple(word:chararray)}")
> def strsplittobag(content,regex):
>     return re.compile(regex).split(content)
> {code}
> does not work because split returns a list of strings. However, the output
> schema is known, and it would be quite simple to implicitly promote the
> string element to a tupled element.
> Also, list/array/tuple/set etc. are all equally convertible to a bag, and
> list/array/tuple are equally convertible to a Tuple; this conversion can be
> done in a much less rigid way with the use of the schema.
> This allows much easier reuse of existing Python code and less memory
> overhead from intermediate re-conversion of object types.
> I wrote the code to do this a while back as part of my version of the
> Jython script framework; I'll isolate that and attach it.