[jira] [Commented] (DATAFU-129) New macro - dedup

Matthew Hayes (JIRA) Wed, 10 Oct 2018 18:01:13 -0700


    [ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16645791#comment-16645791
 ]


Matthew Hayes commented on DATAFU-129:
--------------------------------------

Okay, I see.  My suggestion is to just copy and paste ExtremalTupleByNthField, 
as you attempted and verified as working.  Let's just make sure that the 
namespace is {{datafu.org.apache.pig.piggybank.ExtremalTupleByNthField}} or 
something similar.  This will achieve the result we want.  The only downside 
is that we aren't referencing the JAR, which we can look into later.  How does 
this sound?  This is basically a new option #4 :)

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129-bad.patch, DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
>   DEFINE dedup(relation, row_key, order_field) returns out {
> where:
>   relation - relation to dedup
>   row_key - field(s) for group by
>   order_field - the field for ordering (to find the most recent record)
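
For readers unfamiliar with the macro, the semantics described above (group by
the key fields, keep the record with the largest ordering value) can be
sketched in plain Python. This is only an illustration under stated
assumptions, not the macro's Pig implementation: the function name, the
dict-based rows, and the tie-breaking rule (first record seen wins) are all
hypothetical choices made for the example.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def dedup(
    rows: Iterable[Dict[str, Any]],
    row_key: Sequence[str],
    order_field: str,
) -> List[Dict[str, Any]]:
    """Keep one row per key: the one with the largest order_field value.

    Hypothetical illustration only; ties keep the first row encountered.
    """
    latest: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        # Build the grouping key from one or more key fields.
        key = tuple(row[field] for field in row_key)
        current = latest.get(key)
        # Replace the stored row only when this one is strictly newer.
        if current is None or row[order_field] > current[order_field]:
            latest[key] = row
    # Python dicts preserve insertion order, so keys come out in first-seen order.
    return list(latest.values())
```

Calling dedup(rows, ["id"], "updated") on a list of dicts would return one
dict per distinct id, namely the one with the greatest "updated" value.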



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
