[ 
https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549394#comment-16549394
 ] 

Eyal Allweil commented on DATAFU-127:
-------------------------------------

I'm fine with removing the dates version for simplicity's sake. Normally we 
limit even our samples by date, but it's not crucial and hopefully Pig can push 
the filter up if it's added after the macro.

> New macro - sample by keys
> --------------------------
>
>                 Key: DATAFU-127
>                 URL: https://issues.apache.org/jira/browse/DATAFU-127
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys, 
> with the schema of the larger table. One of the macros filters by dates, the 
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of 
> them will be returned (no deduplication is done). The results are returned 
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, so the key list 
> must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
> returns out {
> - table                               - table to sample
> - sample_set                          - a set of keys
> - join_key_table                      - join column name in the table
> - join_key_sample                     - join column name in the sample
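
To make the described semantics concrete, here is a minimal Python sketch of what the macro is said to return. It is not the Pig code from the attached patch. It assumes rows are plain dicts, that the keys in the sample set are distinct, and that rows with a null key never match (as in a Pig join). The function mirrors the macro's parameter names purely for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def sample_by_keys(table: Iterable[Row], sample_set: Iterable[Row],
                   join_key_table: str, join_key_sample: str) -> List[Row]:
    """Return every row of `table` whose key appears in `sample_set`."""
    # The small side is held fully in memory, which is the same constraint
    # a replicated join places on the key list.
    keys = {row[join_key_sample] for row in sample_set
            if row[join_key_sample] is not None}

    # Stream the large side once; all matching rows are kept (no deduplication),
    # and each row keeps the schema of the large table.
    matched = [row for row in table
               if row.get(join_key_table) is not None
               and row[join_key_table] in keys]

    # Results come back ordered by the key field. sorted() is stable, so rows
    # sharing a key keep their input order.
    return sorted(matched, key=lambda row: row[join_key_table])


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    big = [{"id": 3, "v": "c"}, {"id": 1, "v": "a"},
           {"id": 2, "v": "b"}, {"id": 1, "v": "a2"}]
    wanted = [{"k": 1}, {"k": 3}]
    for row in sample_by_keys(big, wanted, "id", "k"):
        print(row)
```

Because only the key set has to fit in memory, the large table can be arbitrarily big; the cost of that choice is the size limit on the key list noted in the description.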



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
