[ 
https://issues.apache.org/jira/browse/PIG-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1711:
-----------------------------------

    Assignee: Corinne Chandel  (was: Olga Natkovich)

Here is what we need to document:

Pig uses BinStorage to store and load data generated between Map-Reduce jobs. 
Occasionally, users also store their own data using BinStorage. Because this is 
a proprietary binary format, the original data is never in BinStorage; it is 
always a derivation of some other data.

We have seen several examples of users doing something like this:

a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();

And then later:

a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;

There is a problem with this sequence of events. The first script does not 
define data types and, as a result, the data is stored as a bytearray and a 
bag with a tuple that contains two bytearrays. The second script attempts to 
cast the bytearray to double; however, since the data originated from a 
different loader, Pig has no way to know the format of the bytearray or how to 
cast it to a different type. Pig 0.9 addresses this issue in two ways:

    * By giving a meaningful error message when the second script is executed: 
"ERROR 1118: Cannot convert bytes load from BinStorage"
    * By allowing the user to provide a converter to use during casting:

a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
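
Another option, in line with the advice to cast data before storing it, is to 
declare types when the data is first loaded, so that typed values rather than 
bytearrays are written to BinStorage. A minimal sketch, assuming id is a double 
and f is a chararray (the original script does not state the types):

{code}
-- first script, with types declared at load time
-- (double and chararray are assumptions made for illustration)
a = load 'b.txt' as (id:double, f:chararray);
b = group a by id;
store b into 'g' using BinStorage();
{code}

With the types fixed before the store, a later script that reads 'g' should not 
need to cast a bytearray of unknown origin.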


> Document BinStorage behaviour 
> ------------------------------
>
>                 Key: PIG-1711
>                 URL: https://issues.apache.org/jira/browse/PIG-1711
>             Project: Pig
>          Issue Type: Bug
>          Components: documentation
>    Affects Versions: 0.6.0, 0.7.0
>            Reporter: Viraj Bhat
>            Assignee: Corinne Chandel
>             Fix For: 0.9.0
>
>
> We need to document some features of BinStorage that can cause indeterminate 
> results.
> I have a Pig script of this type:
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie'  as reqcolumn;
> describe B;
> --B: {reqcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> So we need to write correct documentation on why this happens. One good 
> explanation seems to be:
> According to Alan:
> BinStorage should not track data lineage. In the case where Pig is using 
> BinStorage (or whatever) for moving data between MR jobs, Pig can figure 
> out the correct cast function to use and apply it. For cases such as the one 
> here, where users are storing data using BinStorage and then reading it in a 
> separate Pig Latin script (and thus losing the type information), it is the 
> user's responsibility to correctly cast the data before storing it in 
> BinStorage.

