[ https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172620#comment-14172620 ]
Cheolsoo Park commented on PIG-4227:
------------------------------------
[~daijy], sorry for breaking unit tests.
{quote}
I don't totally understand the issue in the description, is that because jython
adds tuple inside a list automatically but python does not?
{quote}
You're right that a Jython UDF usually doesn't return a list of Python tuples
but just a list of Python objects. In that case, Pig converts it to a bag of
tuples automatically by wrapping each object in a tuple. However, the Python
streaming UDF serializes it as a bag of non-tuples, and they're never wrapped
in tuples. The problem is that outputSchema is defined as something like
{{bag:\{tuple\:( chararray )\}}}, so the deserialization code skips bytes for
tuple delimiters that do not exist. That truncates 3 chars at the beginning
and 3 chars at the end of the value.
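To make the truncation concrete, here is a minimal Python sketch of the mismatch. The delimiter strings are taken from the issue description quoted at the end of this message; the function names {{encode_bag}} and {{decode_single_tuple_bag}} are hypothetical, and this is only an illustration under those assumptions, not Pig's actual serialization code.

```python
# Delimiters as described in the issue: '|' and '_' wrap each bracket.
BAG_START, BAG_END = "|{_", "|}_"
TUPLE_START, TUPLE_END = "|(_", "|)_"


def encode_bag(items, wrap_in_tuples):
    """Serialize a list of strings as a bag (hypothetical helper)."""
    if wrap_in_tuples:
        body = "".join(TUPLE_START + s + TUPLE_END for s in items)
    else:
        # What the streaming side is described as doing: no tuple delimiters.
        body = "".join(items)
    return BAG_START + body + BAG_END


def decode_single_tuple_bag(encoded):
    """Decode a bag assumed to hold exactly one single-field tuple.

    It skips the delimiter bytes by length, without checking that the
    tuple delimiters are really there (assumed behavior for illustration).
    """
    inner = encoded[len(BAG_START):-len(BAG_END)]
    return inner[len(TUPLE_START):-len(TUPLE_END)]


value = "[[BBC Worldwide]]"
print(decode_single_tuple_bag(encode_bag([value], wrap_in_tuples=True)))
# prints [[BBC Worldwide]]
print(decode_single_tuple_bag(encode_bag([value], wrap_in_tuples=False)))
# prints BC Worldwid
```

In the second call the decoder removes 3 characters of real data on each side, which matches the truncated output in the issue description.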
So the root cause is that Jython and Python streaming handle a Python list of
non-tuples differently. This makes it impossible to run the same UDF in the
two modes. With my patch, I can run the same UDF in both modes and get the
same result. For example, here is the diff of one of my UDFs before and after
my patch. This should clarify the difference:
{code}
34c34
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
44c44
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
49c49
< output.append(items[i]['id'])
---
> output.append(tuple([items[i]['id']]))
84c84
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
96c96
< output.append(recos[r]['id'])
---
> output.append(tuple([recos[r]['id']]))
101c101
< output.append(items[i]['id'])
---
> output.append(tuple([items[i]['id']]))
105c105
< return [-1]
---
> return [tuple([-1])]
{code}
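Every hunk in the diff is the same transformation: an element that is not a tuple gets wrapped in a single-field tuple. As a rough sketch, that wrapping could be written once as a helper; {{wrap_in_tuples}} is a hypothetical name used for illustration and is not part of the patch or of Pig.

```python
def wrap_in_tuples(items):
    """Return a new list in which every non-tuple element is a 1-tuple."""
    return [item if isinstance(item, tuple) else (item,) for item in items]


# Elements that are already tuples are left untouched.
print(wrap_in_tuples(["a", ("b",), -1]))
# prints [('a',), ('b',), (-1,)]
```

A list produced this way has the bag-of-tuples shape that a schema like {{bag:\{tuple\:( chararray )\}}} describes.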
> Streaming Python UDF handles bag outputs incorrectly
> ----------------------------------------------------
>
> Key: PIG-4227
> URL: https://issues.apache.org/jira/browse/PIG-4227
> Project: Pig
> Issue Type: Bug
> Reporter: Cheolsoo Park
> Assignee: Cheolsoo Park
> Fix For: 0.14.0
>
> Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code}
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_'
> and '\}' => '|\}\_'.
> But this is wrong because a bag must contain tuples, not chararrays; i.e., the
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.