[
https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sergey Soldatov updated PHOENIX-1973:
-------------------------------------
Attachment: PHOENIX-1973-2.patch
Changes compared to the initial patch:
1. Column names are indexed, and only the indexes are passed between mapper and reducer.
2. Removed the double iteration over the list of cells.
3. Added KeyValue generation for deleteColumn, which was missing from the initial patch.
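To illustrate item 1, here is a minimal sketch of how column names might be replaced by small integer indexes before the shuffle and restored on the reducer side. It is not taken from the patch: the function names, the sample columns, and the in-memory representation are all hypothetical.

```python
# Hypothetical sketch: none of these names come from the patch itself.

def build_column_index(column_names):
    """Assign each distinct column name a small integer, in first-seen order."""
    index = {}
    for name in column_names:
        if name not in index:
            index[name] = len(index)
    return index


def encode_row(row, column_index):
    """Mapper side: replace each column name with its index."""
    return [(column_index[name], value) for name, value in row.items()]


def decode_row(encoded, names_by_index):
    """Reducer side: restore column names from the shared index."""
    return {names_by_index[i]: value for i, value in encoded}


columns = ["ID", "NAME", "CITY"]  # assumed sample table columns
column_index = build_column_index(columns)
names_by_index = {i: name for name, i in column_index.items()}

row = {"ID": "1", "NAME": "alice", "CITY": "Oslo"}
encoded = encode_row(row, column_index)
decoded = decode_row(encoded, names_by_index)
```

The point of the sketch is only that the intermediate records carry integers rather than repeated column-name strings; how the index is actually shared between mapper and reducer in the patch is not shown here.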
As for performance:
I used a table with 10 columns and loaded 70 MB of CSV data (1.6M records).
Before:
Map-Reduce Framework
Map input records=1612800
Map output records=11289600
Map output bytes=1005017169
Map output materialized bytes=1027596399
Input split bytes=565
Combine input records=0
Combine output records=0
Reduce input groups=1612800
Reduce shuffle bytes=1027596399
Reduce input records=11289600
Reduce output records=11289600
Spilled Records=33868800
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=67210
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1116209152
After:
Map-Reduce Framework
Map input records=1612800
Map output records=1612800
Map output bytes=109913169
Map output materialized bytes=113138799
Input split bytes=565
Combine input records=0
Combine output records=0
Reduce input groups=1612800
Reduce shuffle bytes=113138799
Reduce input records=1612800
Reduce output records=11289600
Spilled Records=3225600
Shuffled Maps =5
Failed Shuffles=0
Merged Map outputs=5
GC time elapsed (ms)=66004
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1185415168
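For a quick comparison of the two runs, the sketch below computes before/after ratios for the counters that changed most. The numbers are copied from the counters above; the dictionary layout and the function name are just illustrative.

```python
# Counter values copied from the "Before" and "After" job output above.
before = {
    "Map output records": 11289600,
    "Map output bytes": 1005017169,
    "Reduce shuffle bytes": 1027596399,
    "Spilled Records": 33868800,
}
after = {
    "Map output records": 1612800,
    "Map output bytes": 109913169,
    "Reduce shuffle bytes": 113138799,
    "Spilled Records": 3225600,
}


def reduction_factors(before, after):
    """Return how many times smaller each counter is after the patch."""
    return {name: before[name] / after[name] for name in before}


factors = reduction_factors(before, after)
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x smaller")
```

Map output records drop by a factor of 7 and spilled records by a factor of 10.5, while map output bytes and shuffle bytes both shrink roughly ninefold. Reduce output records stay at 11289600 in both runs, so the same number of KeyValues is written in the end.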
> Improve CsvBulkLoadTool performance by moving keyvalue construction from map
> phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
> Key: PHOENIX-1973
> URL: https://issues.apache.org/jira/browse/PHOENIX-1973
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Rajeshbabu Chintaguntla
> Assignee: Sergey Soldatov
> Fix For: 4.4.1
>
> Attachments: PHOENIX-1973-1.patch, PHOENIX-1973-2.patch
>
>
> This is similar to HBASE-8768. The only difference is that we need to write a
> custom mapper and reducer in Phoenix. In the map phase we just need to get the
> row key from the primary key columns and write the full text of the line as
> usual (to ensure sorting). In the reducer we need to get the actual key values
> by running the upsert query.
> This greatly reduces the map output that has to be written to disk and the
> data that has to be transferred over the network.
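The approach described in the issue can be sketched as a small in-memory simulation: the mapper emits one record per CSV line (row key plus raw line), and the reducer expands each line into per-column key values. This is only an illustration under stated assumptions. The column layout, the key separator, and every function name are hypothetical, and a plain string split stands in for the upsert query that the real reducer would run.

```python
from collections import defaultdict

# Assumed sample schema; not taken from the issue.
COLUMNS = ["ID", "NAME", "CITY"]  # columns in CSV order
PK_COLUMNS = ["ID"]               # primary key columns


def map_line(line):
    """Map phase: emit (row key, raw CSV line), one record per input line."""
    fields = dict(zip(COLUMNS, line.split(",")))
    row_key = "\x00".join(fields[c] for c in PK_COLUMNS)
    return row_key, line


def reduce_lines(row_key, lines):
    """Reduce phase: expand each raw line into per-column key values.

    The real reducer would run an upsert query here; this stand-in just
    splits the line.
    """
    for line in lines:
        fields = dict(zip(COLUMNS, line.split(",")))
        for column, value in fields.items():
            if column not in PK_COLUMNS:
                yield (row_key, column, value)


def run_job(lines):
    """Simulate map, shuffle (group and sort by row key), and reduce."""
    map_records = [map_line(line) for line in lines]
    shuffled = defaultdict(list)
    for row_key, line in map_records:
        shuffled[row_key].append(line)
    key_values = []
    for row_key in sorted(shuffled):
        key_values.extend(reduce_lines(row_key, shuffled[row_key]))
    return len(map_records), key_values


map_output_count, key_values = run_job(["2,bob,Rome", "1,alice,Oslo"])
```

In this toy run only two records cross the shuffle, while four key values come out of the reducer, which mirrors the intent of moving key value construction out of the map phase.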
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)