[ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16533930#comment-16533930
 ] 

Matthew Hayes commented on DATAFU-129:
--------------------------------------

The macro and test look good to me.  Regarding the dependency on piggybank, 
this departs from the convention so far of keeping datafu-pig essentially 
self-contained: you don't need to download separate JARs and make sure they're 
on the classpath for the UDFs to work.  I'm not sure whether it is typical to 
have piggybank on the classpath.  I think the options we have are:
 # Require that piggybank be on the classpath to use this macro.
 # Include all of piggybank in the datafu JAR, moving it to a namespace like 
{{datafu.org.apache.pig.piggybank}} to avoid potential conflicts.
 # Same as #2 but only include {{ExtremalTupleByNthField}}.  The piggybank JAR 
is only 387 KB, whereas datafu is 2 MB.  Since both JARs are on the small side, 
I don't know that saving space is important enough to justify the potential 
complexity of including only this UDF.

My preference is #2, as it keeps datafu-pig self-contained, which is consistent 
with the current convention.  Thoughts?

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)
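
For readers unfamiliar with the operation, the dedup semantics described in the 
issue (group by a key, keep the record with the most recent ordering field) can 
be sketched outside Pig as follows.  This is only an illustrative Python sketch, 
not the macro from the patch: the function name, the dict-based rows, and the 
tie-breaking rule (first record seen wins) are all assumptions made for the 
example.

```python
def dedup(rows, row_key, order_field):
    """Keep one row per key: the one with the largest order_field value.

    Hypothetical illustration of the semantics only; the real macro is
    written in Pig and relies on ExtremalTupleByNthField.
    """
    latest = {}
    for row in rows:
        # Build the grouping key from one or more fields.
        key = tuple(row[k] for k in row_key)
        # Replace the stored row only if this one is strictly newer,
        # so ties keep the first record seen (an assumption).
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"id": 1, "name": "a", "updated": "2018-01-01"},
    {"id": 2, "name": "b", "updated": "2018-02-01"},
    {"id": 1, "name": "a2", "updated": "2018-03-01"},
]
result = dedup(rows, ["id"], "updated")
```

Here `result` holds two rows: for `id` 1 the record updated on 2018-03-01 
replaces the older one, and the row for `id` 2 is kept unchanged.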



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
