[
https://issues.apache.org/jira/browse/PIG-3902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993831#comment-13993831
]
Cheolsoo Park commented on PIG-3902:
------------------------------------
Great! I'll run unit tests with the new patch and commit it. Thank you Jacob!
> PigServer creates cycle
> -----------------------
>
> Key: PIG-3902
> URL: https://issues.apache.org/jira/browse/PIG-3902
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Jacob Perkins
> Assignee: Cheolsoo Park
> Fix For: 0.13.0
>
> Attachments: broken-plan.png, cycle.diff, multiquery-20140509.diff,
> multiquery-cycle.diff, multiquery-cycle.diff
>
>
> Under certain conditions PigServer creates a cycle in the logical plan.
> Consider the following pseudocode:
> {code}
> A = load from 'A' using F1;
> ...process...
> B = store X into 'B' using F2;
> C = load from 'B' using F3;
> ...process...
> D = store Y into 'A' using F1;
> {code}
> PigServer will, in ordinary cases, notice that an output path is equal to an
> input path, and, if there's no path from the input to the output, make the
> input a dependency of the output. However, PigServer orders the loads and
> stores arbitrarily during that logic. Sometimes, in the code above, C is
> correctly wired as a dependency of B and, since that creates a path from A to
> D, A won't be made a dependency of D and we're good. On occasion though, the
> ordering being arbitrary, A is wired as a dependency of D. That's no good. To
> be fair, it's not actually a cycle, since when A is wired to D, there's a
> path between C and B so the cycle won't actually get created. But it's still
> a broken plan.
> The offending PigServer code:
> https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693
> And here's some actual pig code that should reproduce the broken plan. Notice
> I had to use a store function that wouldn't check the output. If you're just
> using PigStorage this won't be reproducible since you can't write to the same
> location you read from in that case.
> {code}
> A = load '$A' as (line:chararray);
> A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
> store A into '$B';
> B = load '$B' as (token:chararray);
> B = filter B by SIZE(token) > 3;
> store B into '$A' using
> org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver',
> 'dbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
> {code}
> As far as a fix goes... I'd love some input. I've got some workarounds in
> mind for the specific use case that brought this up, but the general problem
> is more difficult.
> As an aside, there's other issues with the PigServer code referenced above.
> For example, it should almost certainly be using the full path (after
> LoadFunc/StoreFunc.relativeToAbsolutePath) no? Try storing to a relative path
> then loading from the absolute representation of that path in the same
> script... Also, why isn't it checking the FuncSpec as well as the location?
> Just trying to open up the discussion.
--
This message was sent by Atlassian JIRA
(v6.2#6252)