[jira] [Issue Comment Deleted] (SQOOP-2906) Optimization of AvroUtil.toAvroIdentifier

Attila Szabo (JIRA) Wed, 11 May 2016 07:21:26 -0700

     [ 
https://issues.apache.org/jira/browse/SQOOP-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Attila Szabo updated SQOOP-2906:
--------------------------------
    Comment: was deleted

(was: Hi Joeri,

I joined the Sqoop community only a few weeks ago, so I may not see all 
of the pitfalls, but let me raise a few suggestions/concerns:
Your fix seems to be okay, but I would suggest a few more processing-related 
changes:
- First of all, I would not do the conversion for every column name each 
time, but rather create a Map<String, String> holding the "original" vs. 
"converted" names, so that in most cases we would just look up the name in 
O(1) time rather than doing the conversion every time (even if it's now 
much faster and cheaper).
- I would also not convert those entry.getKey() values that get a hit in 
the schema (schema.getField returns non-null), as in that case they're 
already valid values, but maybe this optimization is negligible if you 
implement the first proposal.
- I was also considering doing the mapping in advance, before the import 
(after we've got the DB metadata and the Avro schema), but for non-RDBMS 
systems it might cause problems (e.g. different sets of columns for each 
row), so I'm not sure that would help. From an algorithmic/clean-code POV, 
though, that would be the cleanest solution if possible.
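
A minimal sketch of the first suggestion (the name cache), under stated 
assumptions: the class AvroIdentifierCache and the methods convert and 
toAvroIdentifierCached are hypothetical names, and convert is only a 
stand-in for the real AvroUtil.toAvroIdentifier, not a copy of it. The 
"AVRO_" prefix rule is likewise an assumption made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AvroIdentifierCache {
    // Cache of "original" column name -> "converted" Avro identifier.
    private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

    // Hypothetical stand-in for AvroUtil.toAvroIdentifier (assumption):
    // replaces every character outside [A-Za-z0-9_] with '_' and prefixes
    // "AVRO_" when the result is empty or starts with a digit.
    static String convert(String candidate) {
        StringBuilder sb = new StringBuilder(candidate.length() + 5);
        for (int i = 0; i < candidate.length(); i++) {
            char c = candidate.charAt(i);
            boolean valid = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9') || c == '_';
            sb.append(valid ? c : '_');
        }
        if (sb.length() == 0 || (sb.charAt(0) >= '0' && sb.charAt(0) <= '9')) {
            sb.insert(0, "AVRO_");
        }
        return sb.toString();
    }

    // Converts a given name at most once per cached entry; later calls
    // for the same name are a single map lookup.
    public static String toAvroIdentifierCached(String columnName) {
        return CACHE.computeIfAbsent(columnName, AvroIdentifierCache::convert);
    }
}
```

The cache only pays off if the same column names recur for every record, 
which is the usual case for RDBMS imports. The second suggestion (skipping 
names that already get a hit from schema.getField) could be added as a 
check before the lookup.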

What do you think about these suggestions?
My 2 cents,
Attila (Maugli))

> Optimization of AvroUtil.toAvroIdentifier
> -----------------------------------------
>
>                 Key: SQOOP-2906
>                 URL: https://issues.apache.org/jira/browse/SQOOP-2906
>             Project: Sqoop
>          Issue Type: Improvement
>            Reporter: Joeri Hermans
>            Assignee: Joeri Hermans
>              Labels: avro, hadoop, optimization
>         Attachments: diff.txt
>
>
> Hi all
> Our distributed profiler indicated some inefficiencies in the 
> AvroUtil.toAvroIdentifier method, more specifically, the use of Regex 
> patterns. This can be directly observed from the FlameGraph generated by this 
> profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg). 
> We implemented an optimization, and compared this with the original method. 
> On our testing machine, the optimization by itself is about 500% (on average) 
> more efficient compared to the original implementation. We have yet to test 
> how this optimization will influence the performance of user jobs.
> Any suggestions or remarks are welcome.
> Kind regards,
> Joeri
> https://github.com/apache/sqoop/pull/18
> Writeup:
> https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
