[
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitriy V. Ryaboy updated PIG-2014:
-----------------------------------
Release Note:
A new annotation, @Nondeterministic, is introduced to allow UDF authors to mark
their UDFs as non-deterministic.
A non-deterministic UDF is one that can produce different results when invoked
on the same input. Examples of non-deterministic UDFs include
getCurrentTime() and RANDOM.
Certain Pig optimizations depend on UDFs being deterministic. It is therefore
very important for correctness that non-deterministic UDFs be annotated as
such.
Status: Patch Available (was: Open)
> SAMPLE shouldn't be pushed up
> -----------------------------
>
> Key: PIG-2014
> URL: https://issues.apache.org/jira/browse/PIG-2014
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0, 0.10
> Reporter: Jacob Perkins
> Assignee: Dmitriy V. Ryaboy
> Fix For: 0.9.0
>
> Attachments: PIG-2014.2.patch, PIG-2014.patch
>
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray,
> weight:double);
> grouped = GROUP tfidf_all BY doc_id;
> vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token,
> weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains
> 1,428,280 records. The number of reduce output records should be exactly the
> number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce
> output records should be much closer to 22 or 23 (i.e., 0.0012 * 18,863 ≈ 22.6).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in
> front of the GROUP. It shouldn't push that filter,
> since the UDF is non-deterministic.
> Quick fix: If you add "-t PushUpFilter" to your command line when invoking
> Pig, this won't happen.
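
The effect described above can be reproduced outside Pig. The following is a minimal Python sketch, not Pig's actual implementation: it assumes SAMPLE behaves like a filter that keeps each item with a fixed probability, and all function names and data sizes are hypothetical. It compares sampling the grouped rows (the intended plan) with sampling the raw records before grouping (the pushed-up plan). In the reported case, sampling before the group would keep about 0.0012 * 1,428,280 ≈ 1,714 raw records, which is consistent with seeing 1,513 distinct doc_ids in the output.

```python
import random
from collections import defaultdict


def make_records(num_docs, tokens_per_doc):
    """Build synthetic (doc_id, token, weight) tuples; sizes are illustrative."""
    return [
        (f"doc{d}", f"tok{t}", 1.0)
        for d in range(num_docs)
        for t in range(tokens_per_doc)
    ]


def group_by_doc(records):
    """Stand-in for GROUP ... BY doc_id: one entry per distinct doc_id."""
    groups = defaultdict(list)
    for doc_id, token, weight in records:
        groups[doc_id].append((token, weight))
    return groups


def sample(items, fraction, rng):
    """Assumed model of SAMPLE: keep each item independently with probability `fraction`."""
    return [item for item in items if rng.random() <= fraction]


def sample_after_group(records, fraction, rng):
    """Intended plan: group first, then sample the grouped rows."""
    return sample(list(group_by_doc(records).items()), fraction, rng)


def sample_before_group(records, fraction, rng):
    """Pushed-up plan: sample raw records, then group whatever survives."""
    return list(group_by_doc(sample(records, fraction, rng)).items())


rng = random.Random(2014)
records = make_records(num_docs=2000, tokens_per_doc=50)
print("sample after group: ", len(sample_after_group(records, 0.01, rng)))
print("sample before group:", len(sample_before_group(records, 0.01, rng)))
```

With a deterministic predicate the two plans would select the same documents, which is why pushing a filter above a GROUP is normally safe. With a random predicate, each document gets one chance of survival per record rather than one per group, so far more groups appear in the output.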