[ https://issues.apache.org/jira/browse/PHOENIX-1973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15175888#comment-15175888 ]
Gabriel Reid commented on PHOENIX-1973:
---------------------------------------

I just took a look through it and gave it a test run on a single-node cluster. It appears to work as intended, with a much lighter shuffle, which is good. HFiles seem to be created correctly (i.e. the issue from before is gone), and things look to make sense in terms of the actual implementation.

There are a few small things that could use some touching up though:

* TargetTableRefFunctions#LOGICAN_NAMES_TO_JSON appears to be spelled incorrectly (should be LOGICAL_NAMES_TO_JSON), and I don't understand why it's a Function instead of just a static method (it matches the other Functions in that class, but the general use of Functions in that way makes no sense)
* FormatToKeyValueMapper should probably be renamed to accurately describe what it does, and the class-level javadoc should definitely be updated to explain what it does (i.e. it's not creating KeyValues any more)
* Pretty minor code format issues, such as a lack of correct spacing in FormatToKeyValueMapper#findIndex and elsewhere, and the use of wildcard imports in FormatToKeyValueMapper
* Minor nit, but why is TrustedByteArrayOutputStream being used in FormatToKeyValueMapper#writeAggregatedRow?

> Improve CsvBulkLoadTool performance by moving keyvalue construction from map phase to reduce phase
> --------------------------------------------------------------------------------------------------
>
>                 Key: PHOENIX-1973
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1973
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Rajeshbabu Chintaguntla
>            Assignee: Sergey Soldatov
>             Fix For: 4.7.0
>
>         Attachments: PHOENIX-1973-1.patch, PHOENIX-1973-2.patch, PHOENIX-1973-3.patch, PHOENIX-1973-4.patch, PHOENIX-1973-5.patch, PHOENIX-1973-6.patch, PHOENIX-1973-7.patch
>
>
> It's similar to HBASE-8768. The only difference is that we need to write a custom mapper and reducer in Phoenix.
> In the map phase we just need to extract the row key from the primary key columns and write the full text of the line as usual (to ensure sorting). In the reducer we then derive the actual key values by running the upsert query.
> This basically reduces the amount of map output that needs to be written to disk and transferred over the network.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
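The map/reduce split described in the quoted report can be sketched as follows. This is a minimal, Hadoop-free simulation, not the actual Phoenix implementation: the class name BulkLoadSketch and the assumption that the row key is simply the first CSV column are illustrative only (real Phoenix derives the row key from the table's primary key columns, and the reducer runs the upsert code path to build Cells for the HFiles). The point it shows is that the map side only emits (rowKey, rawLine), so the shuffle carries one small key plus the original line, and column expansion happens after sorting.

```java
import java.util.*;

// Hadoop-free sketch of the proposed bulk-load flow: map extracts only
// the row key and forwards the raw CSV line; reduce parses the line
// into per-column values after the sort. Names and the "first column
// is the row key" convention are illustrative assumptions.
public class BulkLoadSketch {

    // Map phase: emit (rowKey, rawLine). Row key assumed to be the
    // first CSV column for this sketch.
    static Map.Entry<String, String> map(String csvLine) {
        String rowKey = csvLine.split(",", 2)[0];
        return new AbstractMap.SimpleEntry<>(rowKey, csvLine);
    }

    // Shuffle: sort by row key. A TreeMap stands in for the MapReduce
    // framework's sort (and assumes unique row keys in this sketch).
    static SortedMap<String, String> shuffle(List<String> lines) {
        SortedMap<String, String> sorted = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, String> kv = map(line);
            sorted.put(kv.getKey(), kv.getValue());
        }
        return sorted;
    }

    // Reduce phase: only now expand each line into column values; this
    // is where Phoenix would build the actual Cells for the HFiles.
    static Map<String, String[]> reduce(SortedMap<String, String> shuffled) {
        Map<String, String[]> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : shuffled.entrySet()) {
            cells.put(e.getKey(), e.getValue().split(","));
        }
        return cells;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("k2,b,2", "k1,a,1");
        Map<String, String[]> out = reduce(shuffle(input));
        // Rows come out sorted by row key, and columns were only split
        // in reduce, so the shuffle never saw expanded KeyValues.
        System.out.println(out.keySet()); // [k1, k2]
    }
}
```

Because only the compact (rowKey, line) pairs cross the shuffle, the map output spilled to disk and moved over the network stays close to the input size, which is the performance win the issue describes.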