[jira] [Commented] (DATAFU-129) New macro - dedup

Eyal Allweil (JIRA) Thu, 20 Sep 2018 07:28:16 -0700


    [ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622132#comment-16622132 ]


Eyal Allweil commented on DATAFU-129:
-------------------------------------

I'm attaching a patch that doesn't work, which I guess is a sign of my level 
of desperation. Basically, if I create a copy-pasted version of 
ExtremalTupleByNthField (let's call it DataFuExtremalTupleByNthField), 
everything works fine, both in the unit tests and on a cluster/VM.

But if I autojar it instead, as is done in the attached patch 
([^DATAFU-129-bad.patch]), the tests run fine, but on a cluster/VM it doesn't 
work, and I'm not sure why. In retrospect, I guess this is actually an 
implementation of #3, not #2, but I don't think that's the reason it doesn't 
work.

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-129-bad.patch, DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date-updated field).
>
> One thing to consider: the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies so that the test can run. While I feel that anyone using Pig 
> typically has PiggyBank on the classpath, this might not be true. Do we have 
> an alternative? (Maybe adding it to the jarjar?)
>
> The macro's definition looks as follows:
>
>     DEFINE dedup(relation, row_key, order_field) returns out {
>
>     relation    - relation to dedup
>     row_key     - field(s) for group by
>     order_field - the field for ordering (to find the most recent record)
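
As a rough illustration of the semantics described in the issue (not the macro's actual implementation, which is written in Pig and relies on ExtremalTupleByNthField), the following Python sketch keeps one record per key, choosing the one with the largest ordering value. The function name mirrors the macro, but the dict-based rows, the sample field names, and the tie-breaking rule (the first record seen wins) are assumptions made for this example.

```python
def dedup(relation, row_key, order_field):
    """Keep one row per key: the row with the largest order_field value.

    relation    -- iterable of dict rows (an assumption; Pig uses tuples)
    row_key     -- list of field names that identify a record
    order_field -- field name used to pick the most recent record
    """
    latest = {}
    for row in relation:
        key = tuple(row[field] for field in row_key)
        current = latest.get(key)
        # Strict '>' means ties keep the first row seen (an assumed rule).
        if current is None or row[order_field] > current[order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data: two versions of record 1, one of record 2.
rows = [
    {"id": 1, "name": "a", "updated": "2018-09-01"},
    {"id": 1, "name": "b", "updated": "2018-09-20"},
    {"id": 2, "name": "c", "updated": "2018-09-10"},
]
deduped = dedup(rows, ["id"], "updated")
```

Here `deduped` holds the "2018-09-20" version of record 1 and the single version of record 2. In Pig the equivalent would presumably be a GROUP BY on the key followed by an extremal-tuple selection per group, which is the step ExtremalTupleByNthField is used for.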



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
