[ 
https://issues.apache.org/jira/browse/PHOENIX-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14352669#comment-14352669
 ] 

Gabriel Reid commented on PHOENIX-1711:
---------------------------------------

FWIW, my take on this topic in general is that the numbers are pretty much in 
line with what I would expect as far as where the work is being done (i.e. 18% 
of the time spent in parsing the input, and 39% of the time spent converting 
into Phoenix encoding). Seeing as those two tasks are the only real 
functionality performed by this tool, I think it's to be expected that they're 
taking up ~60% of the execution time. That being said, obviously making things 
faster is a good thing (as long as it doesn't come at the cost of breaking 
things).

Looking at the patch, I saw the following in 
{{org.apache.phoenix.mapreduce.CsvToKeyValueMapper#setup}}
{code}
        try {
            csvUpsertExecutor = buildUpsertExecutor(conf);
        } catch (SQLException e) {
            e.printStackTrace();
        }
{code}

We definitely want to throw that exception up the stack there and not just 
print the stack trace, as otherwise this is just going to lead to a NPE later.
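Something along these lines (a throwaway sketch of the wrap-and-rethrow pattern; the class name and the stand-in for {{buildUpsertExecutor}} are just for illustration, not the actual mapper code):
{code}
import java.sql.SQLException;

public class RethrowDemo {

    // Stand-in for buildUpsertExecutor(conf), which declares SQLException
    static Object buildUpsertExecutor() throws SQLException {
        throw new SQLException("connection failed");
    }

    static Object setup() {
        try {
            return buildUpsertExecutor();
        } catch (SQLException e) {
            // Wrap the checked exception so the task fails fast with the
            // real cause, instead of leaving the executor null and hitting
            // an NPE later in map()
            throw new RuntimeException("Error building upsert executor", e);
        }
    }

    public static void main(String[] args) {
        try {
            setup();
        } catch (RuntimeException e) {
            System.out.println(e.getCause() instanceof SQLException);
        }
    }
}
{code}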

I almost had the feeling that this patch is a combination of a couple of 
patches; could that be? Or are all the changes in there necessary? For example, 
is the change in PArrayDataType intended to be in this patch?

Also, considering that the optimization in this change is about speeding up the 
following (pseudo-code) calling pattern:
{code}
for listOfValues in input:
    for value in listOfValues:
        preparedStatement.setObject(value)
    preparedStatement.execute()
{code}

would it be possible to apply this fix so that users of the public APIs also 
take advantage of it? I can imagine that there are a lot of realtime ingest use 
cases where the same prepared statement is used over and over to ingest data, 
so I think it would be good if we could minimize the work done in 
(re-)compiling the statement each time there as well.
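Concretely, the client-side hot path I have in mind looks roughly like this (illustrative sketch only; the table and JDBC URL are made up, and it needs a live Phoenix connection to actually run):
{code}
// One PreparedStatement reused for many upserts: ideally the statement is
// compiled once, and only bind + execute happen per row.
try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
     PreparedStatement stmt = conn.prepareStatement(
             "UPSERT INTO my_table VALUES (?, ?)")) {
    conn.setAutoCommit(false);
    for (List<Object> row : rows) {
        for (int i = 0; i < row.size(); i++) {
            stmt.setObject(i + 1, row.get(i));
        }
        stmt.execute();
    }
    conn.commit();
}
{code}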

> Improve performance of CSV loader
> ---------------------------------
>
>                 Key: PHOENIX-1711
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1711
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: James Taylor
>         Attachments: PHOENIX-1711.patch
>
>
> Here is a break-up of percentage execution time for some of the steps in the 
> mapper:
> csvParser: 18%
> csvUpsertExecutor.execute(ImmutableList.of(csvRecord)): 39%
> PhoenixRuntime.getUncommittedDataIterator(conn, true): 9%
> while (uncommittedDataIterator.hasNext()): 15%
> Read IO & custom processing: 19%
> See details here: http://s.apache.org/6rl



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
