Matthew Hayes commented on DATAFU-119:

Okay, that makes sense.  I can see how this would be useful then, particularly
for cases where most of the records will match exactly.  I'll take a look at
the code on Review Board.

> New UDF - TupleDiff
> -------------------
>                 Key: DATAFU-119
>                 URL: https://issues.apache.org/jira/browse/DATAFU-119
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production; we use it at PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated by position, but the tuples' schemas are used, if available, to 
> display friendlier results. If no schema is available, the output uses field 
> numbers.
> It should be used when you want a more fine-grained description of what has 
> changed than you get from 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
> Also, because DIFF takes two bags as its input, both bags must fit in memory. 
> This UDF takes only one pair of tuples at a time, so it can run on large 
> inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>       DEFINE TupleDiff datafu.pig.util.TupleDiff;
>       old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>       new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>       join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>       join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>       $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,<new tuple>
> missing,<original tuple>,
> changed field2 field4,<original tuple>,<new tuple>
> {noformat}
> The UDF takes a variable number of parameters: the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
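
The comparison described in the issue can be modelled in a few lines of Python. This is only a rough sketch of the behaviour as described, not the DataFu code (the UDF itself is Java). The names `tuple_diff` and `_changed_fields`, the schema representation (a list of names, with `(name, nested_schema)` pairs for embedded tuples), the 0-based field numbers used when no schema is given, and the choice to apply ignored fields at the top level only are all assumptions made for illustration. It also assumes "added" means the original tuple is null and "missing" means the new tuple is null.

```python
from typing import Optional


def _field_name(schema, i):
    """Name of field i from the schema, or its position when no name is known."""
    if schema is not None and i < len(schema):
        entry = schema[i]
        return entry[0] if isinstance(entry, tuple) else entry
    return str(i)  # assumption: fall back to 0-based field numbers


def _nested_schema(schema, i):
    """Schema of an embedded tuple at position i, if the schema describes one."""
    if schema is not None and i < len(schema) and isinstance(schema[i], tuple):
        return schema[i][1]
    return None


def _changed_fields(old, new, schema, ignored):
    """Compare two tuples position by position and list the fields that differ."""
    changed = []
    for i in range(max(len(old), len(new))):
        name = _field_name(schema, i)
        if name in ignored or str(i) in ignored:
            continue
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into embedded tuples; mark the boundary with parentheses.
            inner = _changed_fields(a, b, _nested_schema(schema, i), frozenset())
            if inner:
                changed.append(f"{name}({' '.join(inner)})")
        elif a != b:
            # Anything that is not a tuple is compared as a whole value.
            changed.append(name)
    return changed


def tuple_diff(old, new, schema=None, ignored=()) -> Optional[str]:
    """Return 'added', 'missing', 'changed <fields>', or None when nothing differs."""
    if old is None and new is None:
        return None
    if old is None:
        return "added"
    if new is None:
        return "missing"
    ignored_names = frozenset(str(f) for f in ignored)
    changed = _changed_fields(old, new, schema, ignored_names)
    return "changed " + " ".join(changed) if changed else None


if __name__ == "__main__":
    schema = ["id", "field2", "field3", "field4", "created"]
    old_row = (1, "a", 10, "x", "2015-01-01")
    new_row = (1, "b", 10, "y", "2015-02-02")
    print(tuple_diff(old_row, new_row, schema, ignored=["created"]))
```

Returning None for identical tuples mirrors the macro's final `FILTER ... BY tupleDiff IS NOT NULL` step, which keeps only the rows that differ.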

This message was sent by Atlassian JIRA
