[ 
https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131166#comment-15131166
 ] 

Enis Soztutar commented on PHOENIX-1973:
----------------------------------------

Thanks Gabriel for taking a look. 
bq. I would think that the biggest reasons that there is such a difference 
between the map output and the final HFile output is that block encoding (and 
probably compression) are used on the final output HFiles, but block encoding 
is certainly not used in the intermediate output.
Yes, we were discussing this offline. We certainly cannot easily make use of 
efficient encoders like FAST_DIFF for map output, since as of today they 
require an HFile context. Also, for this use case we know that within the same 
row all cells share the exact same rowkey, and that column names are repeated 
across rows, so the compression from the HFile block encoders is not directly 
applicable. There is some work in HBase itself towards more generic codecs 
that could be used here, but those codecs have to be developed first. The 
attached patch is essentially a very specialized codec that applies only to 
the map output of the CSV bulk load. 
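To illustrate the idea of such a row-oriented encoding (this is a minimal 
sketch, not the attached patch; all names here are hypothetical): within one 
row the rowkey is written once, followed only by (qualifier, value) pairs, 
instead of repeating the rowkey and fixed fields in every cell.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a per-row codec for intermediate map output.
public class RowRunCodec {

    // Write the rowkey once, then length-prefixed (qualifier, value) pairs.
    static byte[] encode(byte[] rowKey, List<byte[][]> cells) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeShort(rowKey.length);      // rowkey emitted once per row
        out.write(rowKey);
        out.writeShort(cells.size());
        for (byte[][] qv : cells) {         // qv[0] = qualifier, qv[1] = value
            out.writeShort(qv[0].length);
            out.write(qv[0]);
            out.writeInt(qv[1].length);
            out.write(qv[1]);
        }
        out.flush();
        return bos.toByteArray();
    }

    // Size if every cell carried the full rowkey plus a fixed per-cell overhead.
    static int naiveSize(byte[] rowKey, List<byte[][]> cells, int perCellOverhead) {
        int size = 0;
        for (byte[][] qv : cells) {
            size += rowKey.length + qv[0].length + qv[1].length + perCellOverhead;
        }
        return size;
    }

    public static void main(String[] args) throws IOException {
        byte[] rowKey = "customer#0000000042".getBytes(StandardCharsets.UTF_8);
        List<byte[][]> cells = Arrays.asList(
            new byte[][]{"NAME".getBytes(StandardCharsets.UTF_8),
                         "Alice".getBytes(StandardCharsets.UTF_8)},
            new byte[][]{"CITY".getBytes(StandardCharsets.UTF_8),
                         "Paris".getBytes(StandardCharsets.UTF_8)},
            new byte[][]{"AGE".getBytes(StandardCharsets.UTF_8),
                         "31".getBytes(StandardCharsets.UTF_8)});
        int encoded = encode(rowKey, cells).length;
        // 13 bytes per cell is a rough stand-in for cf + ts + type + lengths
        int naive = naiveSize(rowKey, cells, 13);
        System.out.println("encoded=" + encoded + " naive=" + naive);
    }
}
```

Even on this tiny three-column row, writing the rowkey once roughly halves 
the intermediate size; the gap grows with rowkey length and column count.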

I think the data size explosion is due to the table name repeated per row, the 
row key repeated per column, and the fixed per-cell overhead (column family, 
timestamp, type, etc). 
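To put rough numbers on that repetition (illustrative figures, not 
measurements from the internal test):

```java
// Back-of-the-envelope cost of repeating the rowkey and fixed fields per cell.
public class RowKeyRepetition {
    public static void main(String[] args) {
        int rowKeyLen = 20;   // hypothetical rowkey size in bytes
        int columns = 10;     // hypothetical columns per row
        int perCell = 8 + 1;  // timestamp (8) + type (1), repeated in each cell
        int naive = columns * (rowKeyLen + perCell); // rowkey in every cell
        int once  = rowKeyLen + columns * perCell;   // rowkey once per row
        System.out.println(naive + " vs " + once);   // prints 290 vs 110
    }
}
```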

bq. Any idea if map output compression was enabled for the MR job used in your 
internal test?
I believe map output compression was not enabled. Sorry, I did not run the 
tests myself. I believe there would be good compression due to the repeating 
segments. However, since we already know the data format, we can do a better 
encoding ourselves without having to rely on map output compression. 
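For comparison, map output compression can be enabled per job with the 
standard Hadoop properties (the client jar name and paths below are 
placeholders):

```sh
hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
    -D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    --table EXAMPLE_TABLE --input /tmp/example.csv
```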

> Improve CsvBulkLoadTool performance by moving keyvalue construction from map 
> phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1973
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1973
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Sergey Soldatov
>             Fix For: 4.4.1
>
>         Attachments: PHOENIX-1973-1.patch
>
>
> It's similar to HBASE-8768. The only difference is that we need to write a 
> custom mapper and reducer in Phoenix. In the map phase we just need to derive 
> the row key from the primary key columns and write the full text of the line 
> as usual (to ensure sorting). In the reducer we need to produce the actual 
> key values by running the upsert query.
> This greatly reduces the amount of map output that has to be written to disk 
> and the data that needs to be transferred over the network.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)