[
https://issues.apache.org/jira/browse/SQOOP-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15267641#comment-15267641
]
Ruslan Dautkhanov commented on SQOOP-2920:
------------------------------------------
Ran the JvmTop profiler for a couple of minutes on one of the Sqoop mappers
while the job was still running:
{quote}
http://code.google.com/p/jvmtop
Profiling PID 7246: org.apache.hadoop.mapred.YarnChild 10.20
92.92% ( 127.49s) unified_dim.setField0()
1.98% ( 2.71s) parquet.io.RecordReaderImplementation.read()
1.46% ( 2.00s) unified_dim.getFieldMap0()
1.33% ( 1.83s) unified_dim.setField()
0.58% ( 0.80s) parquet.column.impl.ColumnReaderImpl.readValue()
0.49% ( 0.67s) unified_dim.getFieldMap1()
0.37% ( 0.51s) ...quet.column.values.bitpacking.ByteBitPackingValuesRea()
0.34% ( 0.46s) com.cloudera.sqoop.lib.JdbcWritableBridge.writeString()
0.19% ( 0.26s) ...quet.column.impl.ColumnReaderImpl.writeCurrentValueTo()
0.09% ( 0.12s) ....cloudera.sqoop.lib.JdbcWritableBridge.writeBigDecima()
0.09% ( 0.12s) unified_dim.write1()
0.09% ( 0.12s) ...quet.column.values.rle.RunLengthBitPackingHybridDecod()
0.08% ( 0.10s) parquet.bytes.BytesUtils.readUnsignedVarInt()
{quote}
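Nearly all of the CPU is going into the generated record class's setField0().
Below is a minimal sketch of why that kind of method can dominate on a 700+
column table, assuming the generated class dispatches on the column name with a
chain of string comparisons, as Sqoop codegen typically does (the
WideRecordSketch class and COL_* names are hypothetical placeholders, not the
actual generated unified_dim code):

{code:java}
// Minimal, self-contained sketch (not actual Sqoop-generated code) of why a
// per-field, name-based setter becomes a hotspot on wide rows: each setField()
// call does a linear scan over the column names, so populating one record
// costs O(columns^2) string comparisons in total.
public class WideRecordSketch {

    static final int NUM_COLS = 700;  // roughly the width reported in this issue
    static final String[] COLS = new String[NUM_COLS];
    static final Object[] VALS = new Object[NUM_COLS];

    static {
        for (int i = 0; i < NUM_COLS; i++) {
            COLS[i] = "COL_" + i;
        }
    }

    // Mimics a generated setField0(): scan the column names until one matches.
    static void setField(String fieldName, Object fieldVal) {
        for (int i = 0; i < NUM_COLS; i++) {
            if (COLS[i].equals(fieldName)) {
                VALS[i] = fieldVal;
                return;
            }
        }
        throw new IllegalArgumentException("unknown column: " + fieldName);
    }

    public static void main(String[] args) {
        final int records = 10_000;
        long start = System.nanoTime();
        for (int r = 0; r < records; r++) {
            for (String col : COLS) {   // 700 calls x up to 700 equals() each
                setField(col, "v");
            }
        }
        System.out.printf("%d records in %.2f s%n",
                records, (System.nanoTime() - start) / 1e9);
    }
}
{code}

If field names were instead resolved through a name-to-index hash map (the
getFieldMap0()/getFieldMap1() frames in the profile above suggest such a map
exists in the generated code), each setField() call would be O(1) and record
assembly would be O(columns) rather than O(columns^2) per row.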
> sqoop performance deteriorates significantly on wide datasets; sqoop 100% on
> cpu
> --------------------------------------------------------------------------------
>
> Key: SQOOP-2920
> URL: https://issues.apache.org/jira/browse/SQOOP-2920
> Project: Sqoop
> Issue Type: Bug
> Components: connectors/oracle, hive-integration, metastore
> Affects Versions: 1.4.5
> Environment: - sqoop export on a very wide dataset (over 700 columns)
> - sqoop export to oracle
> - subset of columns is exported (using --columns argument)
> - parquet files
> - --table --hcatalog-database --hcatalog-table options are used
> Reporter: Ruslan Dautkhanov
> Priority: Critical
> Labels: columns, hive, oracle, performance
> Attachments: jstack.zip, top - sqoop mappers hog cpu.png
>
>
> We run sqoop export from our data lake to Oracle quite often.
> Whenever we sqoop "narrow" datasets, Oracle is the side with scalability
> issues: our 3-node all-flash Oracle RAC normally can't keep up with more
> than 45-55 sqoop mappers, and the map-reduce framework shows the sqoop
> mappers are lightly loaded.
> On wide datasets the picture is the opposite: Oracle shows 95% of sessions
> idle, waiting for new INSERTs, even when we go past a hundred mappers.
> Sqoop has serious scalability issues on very wide datasets. (Our company
> normally has very wide datasets.)
> For example, the last sqoop export started ~2.5 hours ago, and its 95
> mappers have already accumulated 1,065,858,760 ms of "CPU time spent"
> (per the map-reduce framework stats) -- roughly 1.07 million seconds of
> CPU time, or 11,219.57 seconds (~3.1 hours) per mapper. So the mappers are
> running at 100% CPU.
> I will also attach jstack files.
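As a sanity check on the numbers quoted above (taking the 95 mappers, ~2.5
hours of wall-clock time, and the 1,065,858,760 ms "CPU time spent" counter
from the description):
{quote}
1,065,858,760 ms / 95 mappers = 11,219,566 ms = 11,219.57 s of CPU per mapper
11,219.57 s / 3,600 s/h = ~3.1 h of CPU per mapper
{quote}
Accumulating ~3.1 hours of CPU time in ~2.5 hours of wall clock means each
mapper is saturating at least one full core on average, consistent with the
"top - sqoop mappers hog cpu.png" attachment.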
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)