[
https://issues.apache.org/jira/browse/SQOOP-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Attila Szabo updated SQOOP-2906:
--------------------------------
Comment: was deleted
(was: Hi Joeri,
I joined the Sqoop community only a few weeks ago, so I may not see all
of the pitfalls, but let me raise a few suggestions/concerns:
Your fix seems to be okay, but I would suggest a few more changes
processing-wise:
- First of all, I would not do the conversion for all of the column names, but
rather create a Map<String, String> which would contain the "original" vs.
"converted" names, so in most cases we would just have to look up
the name in O(1) time, rather than doing the conversion every time (even if it's
now much faster and cheaper).
- I would also not convert those entry.getKey() values that get a hit in
the schema (schema.getField returns non-null), as in that case they're valid
values, but maybe this optimization is negligible if you implement the first
proposal.
- I was also considering doing the mapping in advance, before the import (after
we've got the DB metadata and the Avro schema), but for non-RDBMS systems it
might cause problems (e.g. different sets of columns for each row), so I'm not
sure that would help, but from an algorithmic/clean-code POV that would be the
cleanest solution if possible.
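To make the first proposal concrete, here is a minimal sketch of the lookup
map. Everything in it is an assumption for illustration only: the class
CachedAvroIdentifiers and its lookup/convert methods are hypothetical names,
not existing Sqoop code, and convert() is just a stand-in for
AvroUtil.toAvroIdentifier whose exact rules may well differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Sqoop): caches "original" -> "converted" names. */
final class CachedAvroIdentifiers {

  // Maps each original column name to its converted form.
  private static final Map<String, String> CACHE = new ConcurrentHashMap<>();

  private CachedAvroIdentifiers() { }

  /** Returns the converted name, running the conversion only on the first request. */
  static String lookup(String columnName) {
    return CACHE.computeIfAbsent(columnName, CachedAvroIdentifiers::convert);
  }

  // Stand-in for AvroUtil.toAvroIdentifier; the real conversion rules are assumed here.
  static String convert(String candidate) {
    StringBuilder sb = new StringBuilder(candidate.length() + 1);
    for (int i = 0; i < candidate.length(); i++) {
      char c = candidate.charAt(i);
      boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
      boolean digit = c >= '0' && c <= '9';
      // Keep ASCII letters, digits and underscores; replace anything else.
      sb.append(letter || digit || c == '_' ? c : '_');
    }
    // Prefix an underscore when the name is empty or starts with a digit.
    if (sb.length() == 0 || (sb.charAt(0) >= '0' && sb.charAt(0) <= '9')) {
      sb.insert(0, '_');
    }
    return sb.toString();
  }
}
```

With something like this, the per-record code would call
CachedAvroIdentifiers.lookup(entry.getKey()) instead of converting each time,
so a repeated column name costs one hash lookup. For sources with a different
set of columns per row the map simply grows as new names show up.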
Could you tell me what you think about these suggestions?
My 2 cents,
Attila (Maugli))
> Optimization of AvroUtil.toAvroIdentifier
> -----------------------------------------
>
> Key: SQOOP-2906
> URL: https://issues.apache.org/jira/browse/SQOOP-2906
> Project: Sqoop
> Issue Type: Improvement
> Reporter: Joeri Hermans
> Assignee: Joeri Hermans
> Labels: avro, hadoop, optimization
> Attachments: diff.txt
>
>
> Hi all,
> Our distributed profiler indicated some inefficiencies in the
> AvroUtil.toAvroIdentifier method, more specifically, the use of Regex
> patterns. This can be directly observed from the FlameGraph generated by this
> profiler (https://jhermans.web.cern.ch/jhermans/sqoop_avro_flamegraph.svg).
> We implemented an optimization, and compared this with the original method.
> On our testing machine, the optimization by itself is about 500% (on average)
> more efficient compared to the original implementation. We have yet to test
> how this optimization will influence the performance of user jobs.
> Any suggestions or remarks are welcome.
> Kind regards,
> Joeri
> https://github.com/apache/sqoop/pull/18
> Writeup:
> https://db-blog.web.cern.ch/blog/joeri-hermans/2016-04-hadoop-performance-troubleshooting-stack-tracing-introduction
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)