[
https://issues.apache.org/jira/browse/PIG-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Olga Natkovich reassigned PIG-1711:
-----------------------------------
Assignee: Corinne Chandel (was: Olga Natkovich)
Here is what we need to document:
Pig uses BinStorage to store and load data generated between Map-Reduce jobs.
Occasionally, users also store their own data using BinStorage. Because this is
a proprietary binary format, the original data is never in BinStorage; it is
always a derivation of some other data.
We have seen several examples of users doing something like this:
a = load 'b.txt' as (id, f);
b = group a by id;
store b into 'g' using BinStorage();
And then later:
a = load 'g/part*' using BinStorage() as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
There is a problem with this sequence of events. The first script does not
define data types and, as a result, the data is stored as a bytearray and a
bag of tuples with two bytearrays each. The second script attempts to cast the
bytearray to double; however, since the data originated from a different
loader, BinStorage has no way to know the format of the bytearray or how to
cast it to a different type. Pig 0.9 addresses this issue in two different ways:
* By giving a meaningful error message when the second script is executed:
"ERROR 1118: Cannot convert bytes load from BinStorage"
* By allowing the user to provide a converter to use during casting.
a = load 'g/part*' using BinStorage('Utf8StorageConverter') as (id, d:bag{t:(v, s)});
b = foreach a generate (double)id, flatten(d);
dump b;
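A third option is to avoid the cast altogether by declaring types in the first
script, so that typed values rather than bytearrays are written to BinStorage.
The following is only a sketch: the types double and chararray are assumptions,
since the original scripts do not say what id and f actually contain.
{code}
-- First script: declare types at load time.
-- Assumption: id holds numeric text and f holds a string; adjust to your data.
a = load 'b.txt' as (id:double, f:chararray);
b = group a by id;
store b into 'g' using BinStorage();

-- Second script: declare the same types when reading the data back.
a = load 'g/part*' using BinStorage() as (id:double, d:bag{t:(v:double, s:chararray)});
b = foreach a generate id, flatten(d);
dump b;
{code}
With the types already stored in the data, the second script should need neither
the (double) cast nor a converter.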
> Document BinStorage behaviour
> ------------------------------
>
> Key: PIG-1711
> URL: https://issues.apache.org/jira/browse/PIG-1711
> Project: Pig
> Issue Type: Bug
> Components: documentation
> Affects Versions: 0.6.0, 0.7.0
> Reporter: Viraj Bhat
> Assignee: Corinne Chandel
> Fix For: 0.9.0
>
>
> We need to document some features of BinStorage that can cause indeterminate
> results.
> I have a Pig script of this type:
> {code}
> raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
> --filter out null columns
> A = filter raw by col1#'bcookie' is not null;
> B = foreach A generate col1#'bcookie' as reqcolumn;
> describe B;
> --B: {reqcolumn: bytearray}
> X = limit B 5;
> dump X;
> B = foreach A generate (chararray)col1#'bcookie' as convertedcol;
> describe B;
> --B: {convertedcol: chararray}
> X = limit B 5;
> dump X;
> {code}
> The first dump produces:
> (36co9b55onr8s)
> (36co9b55onr8s)
> (36hilul5oo1q1)
> (36hilul5oo1q1)
> (36l4cj15ooa8a)
> The second dump produces:
> ()
> ()
> ()
> ()
> ()
> So we need to write correct documentation on why this happens. One good
> explanation seems to be:
> According to Alan:
> BinStorage should not track data lineage. In the case where Pig is using
> BinStorage (or whatever) for moving data between MR jobs then Pig can figure
> out the correct cast function to use and apply it. For cases such as the one
> here where users are storing data using BinStorage and then in a separate Pig
> Latin script reading it (and thus losing the type information) it is the
> user's responsibility to correctly cast the data before storing it in
> BinStorage.