[jira] [Commented] (DATAFU-129) New macro - dedup

Eyal Allweil (JIRA) Wed, 11 Jul 2018 04:34:12 -0700


    [ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16539941#comment-16539941
 ]


Eyal Allweil commented on DATAFU-129:
-------------------------------------

I agree the #2 makes the most sense. The only drawback that I can think of is 
that if someone wants to debug the autojarred classes, it's difficult to, 
because they appear like they have one package name in the source code, but in 
runtime they have a different one.

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (DATAFU-129) New macro - dedup

Reply via email to