[jira] [Commented] (PIG-4227) Streaming Python UDF handles bag outputs incorrectly

Cheolsoo Park (JIRA) Wed, 15 Oct 2014 10:17:51 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172620#comment-14172620
 ]


Cheolsoo Park commented on PIG-4227:
------------------------------------

[~daijy], sorry for breaking unit tests.
{quote}
I don't totally understand the issue in the description, is that because jython 
adds tuple inside a list automatically but python does not?
{quote}
You're right. A Jython udf usually returns a list of plain Python objects rather 
than a list of Python tuples, and Pig converts it to a bag of tuples 
automatically by wrapping each object in a tuple. The Python streaming udf, 
however, serializes the list as a bag of non-tuples, and the objects are never 
wrapped in tuples. The problem is that outputSchema is defined as something like 
{{bag:\{tuple\:( chararray )\}}}, so the deserialization code skips bytes for 
tuple delimiters that do not exist. That truncates 3 chars at the beginning and 
the end of the value.
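
To make the truncation concrete, here is a minimal sketch in plain Python. It is not Pig's actual serialization or deserialization code. The delimiter strings follow the encoding given in the issue description; the function names and the single-item restriction are assumptions made for illustration only.

```python
# Illustrative only: mimics the wrapped delimiters from the issue description.
BAG_START, BAG_END = "|{_", "|}_"
TUPLE_START, TUPLE_END = "|(_", "|)_"


def encode_single_item_bag(item, wrap_in_tuple):
    """Encode a bag holding one chararray, with or without tuple delimiters."""
    body = TUPLE_START + item + TUPLE_END if wrap_in_tuple else item
    return BAG_START + body + BAG_END


def naive_decode_single_item_bag(encoded):
    """Hypothetical decoder that trusts the schema bag:{tuple:(chararray)}.

    It skips the bag delimiters and then the tuple delimiters by length,
    without checking that the tuple delimiters are actually present.
    """
    inner = encoded[len(BAG_START):-len(BAG_END)]
    return inner[len(TUPLE_START):-len(TUPLE_END)]


value = "[[BBC Worldwide]]"
missing_tuple = encode_single_item_bag(value, wrap_in_tuple=False)
with_tuple = encode_single_item_bag(value, wrap_in_tuple=True)

print(naive_decode_single_item_bag(missing_tuple))  # BC Worldwid
print(naive_decode_single_item_bag(with_tuple))     # [[BBC Worldwide]]
```

When the tuple delimiters are missing, the decoder removes three characters of real data from each side, which matches the truncated output reported in the description.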

So the root cause is that Jython and Python streaming handle a Python list of 
non-tuples differently. This makes it impossible to run the same udf in the 
two modes. With my patch, I can run the same udf in both modes and get the 
same result. For example, here is the diff in one of my udfs before and after 
my patch. This should clarify the difference:
{code}
34c34
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
44c44
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
49c49
<                     output.append(items[i]['id'])
---
>                     output.append(tuple([items[i]['id']]))
84c84
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
96c96
<                             output.append(recos[r]['id'])
---
>                             output.append(tuple([recos[r]['id']]))
101c101
<                     output.append(items[i]['id'])
---
>                     output.append(tuple([items[i]['id']]))
105c105
<                 return [-1]
---
>                 return [tuple([-1])]
{code}
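
The explicit {{tuple([...])}} wrapping in the diff above can also be expressed as a small helper. This is only a sketch and not part of the patch; the name {{wrap_bag_items}} is hypothetical.

```python
def wrap_bag_items(items):
    """Return a list in which every element is a tuple.

    Bare values are wrapped in a one-element tuple; existing tuples are
    kept as they are. (Hypothetical helper, not part of Pig.)
    """
    return [item if isinstance(item, tuple) else (item,) for item in items]


# A list of bare ids becomes a list of one-field tuples.
print(wrap_bag_items(["id1", "id2"]))  # [('id1',), ('id2',)]
# The fallback value from the diff, [-1], becomes [(-1,)].
print(wrap_bag_items([-1]))            # [(-1,)]
```

A udf could call such a helper once on its result list instead of wrapping each element at every {{append}}.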

> Streaming Python UDF handles bag outputs incorrectly
> ----------------------------------------------------
>
>                 Key: PIG-4227
>                 URL: https://issues.apache.org/jira/browse/PIG-4227
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Cheolsoo Park
>            Assignee: Cheolsoo Park
>             Fix For: 0.14.0
>
>         Attachments: PIG-4227-1.patch
>
>
> I have a udf that generates different outputs when running as jython and 
> streaming python.
> {code:title=jython}
> {([[BBC Worldwide]])}
> {code} 
> {code:title=streaming python}
> {(BC Worldwid)}
> {code}
> The problem is that streaming python encodes a bag output incorrectly. For 
> this particular example, it serializes the output string as follows-
> {code}
> |{_[[BBC Worldwide]]|}_
> {code}
> where '|' and '\_' wrap bag delimiters '\{' and '\}'. i.e. '\{' => '|\{\_' 
> and '\}' => '|\}\_'.
> But this is wrong because bag must contain tuples not chararrays. i.e. the 
> correct encoding is as follows-
> {code}
> |{_|(_[[BBC Worldwide]]|)_|}_
> {code}
> where '|' and '_' wrap tuple delimiters '(' and ')' as well as bag delimiters.
> This results in truncated outputs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
