[ 
https://issues.apache.org/jira/browse/CRUNCH-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Reid updated CRUNCH-545:
--------------------------------
    Attachment: pre.dot.png
                post.dot.png
                CRUNCH-545.patch

Patch that writes all HFiles in a single job, regardless of how many 
column families are defined on the output table. It also adds a test for 
writing multiple column families in an HFile load.

See pre.dot.png for how writing data for an HTable with 3 column families 
looked before the patch, and post.dot.png for how it looks after the patch.

> Writing to HFiles starts a job per column family
> ------------------------------------------------
>
>                 Key: CRUNCH-545
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-545
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-545.patch, post.dot.png, pre.dot.png
>
>
> When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a 
> separate MR job is started for each column family defined on the table, 
> regardless of whether there is any data for that column family.
> Each of the column family jobs runs over the full set of Cells, filters for 
> the desired column family, and then partitions the data.
> For tables with multiple column families, it would be a lot more efficient to 
> sort/partition all of the data together, and then split it out per column 
> family afterwards.
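
The difference between the two approaches in the description above can be 
sketched in plain Java. This is only an illustration of the idea, not the code 
in CRUNCH-545.patch and not the Crunch or HBase API: the Cell class, the 
method names and the in-memory lists are all assumptions made for the example, 
standing in for MR jobs over real HBase cells.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SingleJobHFileSketch {

    // Hypothetical stand-in for an HBase cell: row key, column family, value.
    static final class Cell {
        final String row;
        final String family;
        final String value;

        Cell(String row, String family, String value) {
            this.row = row;
            this.family = family;
            this.value = value;
        }
    }

    // Behaviour described in the issue: one full pass (one "job") per
    // declared column family, even when a family has no data at all.
    static Map<String, List<Cell>> filterPerFamily(List<Cell> cells, List<String> families) {
        Map<String, List<Cell>> byFamily = new TreeMap<>();
        for (String family : families) {
            List<Cell> matching = new ArrayList<>();
            for (Cell c : cells) {          // scans the full set of cells each time
                if (c.family.equals(family)) {
                    matching.add(c);
                }
            }
            matching.sort(Comparator.comparing((Cell c) -> c.row));
            byFamily.put(family, matching);
        }
        return byFamily;
    }

    // Proposed behaviour: sort all cells together once, then split the
    // sorted stream out per column family in a single pass.
    static Map<String, List<Cell>> sortThenSplit(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparing((Cell c) -> c.row).thenComparing(c -> c.family));
        Map<String, List<Cell>> byFamily = new TreeMap<>();
        for (Cell c : sorted) {
            // Row order is preserved within each family's list.
            byFamily.computeIfAbsent(c.family, f -> new ArrayList<>()).add(c);
        }
        return byFamily;
    }
}
```

In this sketch, filterPerFamily scans the input once per declared family and 
produces an (empty) output even for families without data, while sortThenSplit 
sorts once and only produces outputs for families that actually occur.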



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
