[ 
https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Soldatov updated PHOENIX-1973:
-------------------------------------
    Attachment: PHOENIX-1973-2.patch

Changes compared to the initial patch:
1. Column names are indexed, and only the indexes are passed between the mapper and the reducer.
2. Removed the double iteration over the list of cells.
3. Added KV generation for deleteColumn, which was missing from the initial patch.
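
To illustrate item 1, here is a minimal sketch of the indexing idea in plain Java. It is not the code from the patch: the class name, the helper names and the sample column names are all hypothetical, and the real tool works with Hadoop and HBase types rather than strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the patch.
public class ColumnIndexSketch {

    // Assign each distinct column name a small integer, in first-seen order.
    static Map<String, Integer> buildIndex(List<String> columnNames) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String name : columnNames) {
            index.putIfAbsent(name, index.size());
        }
        return index;
    }

    // Reverse lookup for the receiving side: position i holds the name with index i.
    // This relies on LinkedHashMap keeping insertion order, which matches index order.
    static List<String> namesByIndex(Map<String, Integer> index) {
        return new ArrayList<>(index.keySet());
    }

    public static void main(String[] args) {
        // Sample column names, assumed for the example.
        List<String> columns = Arrays.asList("0:NAME", "0:AGE", "0:CITY");
        Map<String, Integer> index = buildIndex(columns);
        List<String> names = namesByIndex(index);

        // "Mapper" side: emit the small index instead of the full column name.
        int emitted = index.get("0:AGE");
        // "Reducer" side: recover the column name from the index.
        String recovered = names.get(emitted);
        System.out.println(emitted + " -> " + recovered);
    }
}
```

The point of the design is that the column name is shipped once (as part of the shared index) rather than once per cell, so each record carries only a small integer per column.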

As for performance:
I used a table with 10 columns and loaded 70 MB of CSV data (1.6M records).

Before:
       Map-Reduce Framework
                Map input records=1612800
                Map output records=11289600
                Map output bytes=1005017169
                Map output materialized bytes=1027596399
                Input split bytes=565
                Combine input records=0
                Combine output records=0
                Reduce input groups=1612800
                Reduce shuffle bytes=1027596399
                Reduce input records=11289600
                Reduce output records=11289600
                Spilled Records=33868800
                Shuffled Maps =5
                Failed Shuffles=0
                Merged Map outputs=5
                GC time elapsed (ms)=67210
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1116209152

After:
        Map-Reduce Framework
                Map input records=1612800
                Map output records=1612800
                Map output bytes=109913169
                Map output materialized bytes=113138799
                Input split bytes=565
                Combine input records=0
                Combine output records=0
                Reduce input groups=1612800
                Reduce shuffle bytes=113138799
                Reduce input records=1612800
                Reduce output records=11289600
                Spilled Records=3225600
                Shuffled Maps =5
                Failed Shuffles=0
                Merged Map outputs=5
                GC time elapsed (ms)=66004
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=1185415168
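
For a quick read of the two counter dumps, the before/after ratios can be computed as in the sketch below. The numbers are copied from the counters above; the class and method names are made up for the example.

```java
// Hypothetical helper for comparing the counters listed above.
public class CounterRatios {

    // How many times smaller the "after" value is than the "before" value.
    static double ratio(long before, long after) {
        return (double) before / (double) after;
    }

    public static void main(String[] args) {
        System.out.printf("Map output records:   %.2fx%n", ratio(11289600L, 1612800L));
        System.out.printf("Map output bytes:     %.2fx%n", ratio(1005017169L, 109913169L));
        System.out.printf("Reduce shuffle bytes: %.2fx%n", ratio(1027596399L, 113138799L));
        System.out.printf("Spilled records:      %.2fx%n", ratio(33868800L, 3225600L));
        System.out.printf("GC time elapsed:      %.2fx%n", ratio(67210L, 66004L));
    }
}
```

In short: the mapper now emits one record per input row instead of seven, map output and shuffle bytes drop by roughly 9x, spilled records drop by 10.5x, and GC time is essentially unchanged.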

> Improve CsvBulkLoadTool performance by moving keyvalue construction from map 
> phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1973
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1973
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Sergey Soldatov
>             Fix For: 4.4.1
>
>         Attachments: PHOENIX-1973-1.patch, PHOENIX-1973-2.patch
>
>
> It's similar to HBASE-8768. The only difference is that we need to write a custom mapper and 
> reducer in Phoenix. In the map phase we just need to get the row key from the primary key 
> columns and write the full text of the line as usual (to ensure sorting). In the 
> reducer we need to get the actual key values by running the upsert query.
> This basically reduces the amount of map output written to disk and the data that needs to be 
> transferred over the network.
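
The map-side step in the description above can be sketched roughly as follows. This is a simplified illustration under several assumptions: it uses a naive comma split with no quote handling, joins the primary key fields with a '|' separator instead of Phoenix's real row key encoding, and uses plain strings where the real job would use Hadoop writables. All names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical illustration only; not the mapper from the patch.
public class MapSideKeySketch {

    // Build a sort key from the primary key fields and pass the raw line through
    // unchanged, so that KeyValue construction can be deferred to the reducer.
    static Map.Entry<String, String> mapLine(String line, int[] pkPositions) {
        // Naive CSV split (assumption: no quoted or escaped commas).
        String[] fields = line.split(",", -1);
        StringBuilder rowKey = new StringBuilder();
        for (int i = 0; i < pkPositions.length; i++) {
            if (i > 0) {
                rowKey.append('|'); // placeholder separator, not Phoenix's encoding
            }
            rowKey.append(fields[pkPositions[i]]);
        }
        return new SimpleEntry<>(rowKey.toString(), line);
    }

    public static void main(String[] args) {
        // Sample row, with columns 0 and 2 assumed to form the primary key.
        Map.Entry<String, String> out = mapLine("42,alice,NY", new int[] {0, 2});
        System.out.println(out.getKey() + " => " + out.getValue());
    }
}
```

The mapper emits one small key plus the original line per row; the expansion into one KeyValue per column happens only after the shuffle.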



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
