Gabriel Reid created CRUNCH-545:
-----------------------------------
Summary: Writing to HFiles starts a job per column family
Key: CRUNCH-545
URL: https://issues.apache.org/jira/browse/CRUNCH-545
Project: Crunch
Issue Type: Improvement
Reporter: Gabriel Reid
Assignee: Gabriel Reid
When writing to HFiles via {{HFileUtils.writeToHFilesForIncrementalLoad}}, a
separate MR job is started for each column family defined for the table,
regardless of whether there is any data for that column family.

Each of these per-column-family jobs runs over the full set of Cells, filters
for its own column family, and then partitions the data.

For tables with multiple column families, it would be considerably more
efficient to sort/partition all of the data together, and then split it out per
column family afterwards.
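
To make the proposal concrete, here is a minimal in-memory sketch of the
"sort everything once, then split per column family" idea. It is plain Java and
uses neither the Crunch nor the HBase API: {{SimpleCell}}, {{FamilySplitSketch}}
and {{sortOnceThenSplit}} are hypothetical names invented for illustration, and
the String fields stand in for the byte arrays a real Cell carries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FamilySplitSketch {

    /** Hypothetical stand-in for an HBase Cell: row key, column family, qualifier. */
    static final class SimpleCell {
        final String row;
        final String family;
        final String qualifier;

        SimpleCell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /**
     * Sorts all cells in a single pass (row, then family, then qualifier) and
     * only afterwards groups them by column family. Families that have no
     * cells never appear in the result, so no work is done for them.
     */
    static Map<String, List<SimpleCell>> sortOnceThenSplit(List<SimpleCell> cells) {
        List<SimpleCell> sorted = new ArrayList<>(cells);
        sorted.sort(Comparator.comparing((SimpleCell c) -> c.row)
                .thenComparing(c -> c.family)
                .thenComparing(c -> c.qualifier));

        // TreeMap keeps the families in a deterministic (alphabetical) order.
        Map<String, List<SimpleCell>> byFamily = new TreeMap<>();
        for (SimpleCell cell : sorted) {
            // Iterating the sorted list keeps each family's cells in row order.
            byFamily.computeIfAbsent(cell.family, f -> new ArrayList<>()).add(cell);
        }
        return byFamily;
    }

    public static void main(String[] args) {
        List<SimpleCell> cells = Arrays.asList(
                new SimpleCell("row2", "a", "q1"),
                new SimpleCell("row1", "b", "q1"),
                new SimpleCell("row1", "a", "q1"));

        // Family "c" could be defined on the table, but it has no data here,
        // so it produces no output group at all.
        for (Map.Entry<String, List<SimpleCell>> e : sortOnceThenSplit(cells).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " cell(s)");
        }
    }
}
```

In an actual pipeline the sort and partition step would be a distributed
shuffle rather than an in-memory sort, so this only illustrates the shape of
the change: one pass over the data instead of one pass per column family.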