Ran the JvmTop profiler for a couple of minutes on one of the sqoop mappers while the job was still running:
http://code.google.com/p/jvmtop
Profiling PID 7246: org.apache.hadoop.mapred.YarnChild 10.20

  92.92% (   127.49s) unified_dim.setField0()
   1.98% (     2.71s) parquet.io.RecordReaderImplementation.read()
   1.46% (     2.00s) unified_dim.getFieldMap0()
   1.33% (     1.83s) unified_dim.setField()
   0.58% (     0.80s) parquet.column.impl.ColumnReaderImpl.readValue()
   0.49% (     0.67s) unified_dim.getFieldMap1()
   0.37% (     0.51s) ...quet.column.values.bitpacking.ByteBitPackingValuesRea()
   0.34% (     0.46s) com.cloudera.sqoop.lib.JdbcWritableBridge.writeString()
   0.19% (     0.26s) ...quet.column.impl.ColumnReaderImpl.writeCurrentValueTo()
   0.09% (     0.12s) ....cloudera.sqoop.lib.JdbcWritableBridge.writeBigDecima()
   0.09% (     0.12s) unified_dim.write1()
   0.09% (     0.12s) ...quet.column.values.rle.RunLengthBitPackingHybridDecod()
   0.08% (     0.10s) parquet.bytes.BytesUtils.readUnsignedVarInt()

unified_dim in the profiling output above is the name of the target table in Oracle.

It looks like the setField0() method is the culprit. I'm not sure exactly which setField0() that is, but I found this code generation that might not be efficient for wider datasets:
https://github.com/anthonycorbacho/sqoop/blob/master/src/java/org/apache/sqoop/orm/ClassWriter.java#L755

If I understand correctly, for 700+ columns the generated code will contain 700+ "if" statements, so setting every field of every record becomes a linear scan of string comparisons. Perhaps that is it; I didn't find any other setField() methods there. Could anyone please have a look into this? (Sketches of what I think the generated code looks like, and a possible alternative, are below the quoted message.)

--
Ruslan Dautkhanov

On Mon, May 2, 2016 at 11:09 PM, Ruslan Dautkhanov <[email protected]> wrote:

> https://issues.apache.org/jira/browse/SQOOP-2920
>
> Has anybody experienced this problem before?
> Is there any known workaround?
>
> We sqoop export from the datalake to Oracle quite often.
> Every time we sqoop "narrow" datasets, Oracle has scalability issues:
> our 3-node all-flash Oracle RAC normally can't keep up with more than
> 45-55 sqoop mappers. The map-reduce framework shows the sqoop mappers
> are not heavily loaded.
>
> On wide datasets the picture is the opposite: Oracle shows 95% of
> sessions idle, waiting for new INSERTs, even when we go over a hundred
> mappers. Sqoop has serious scalability issues on very wide datasets.
> (Our company normally has very wide datasets.)
>
> For example, on the last sqoop export:
> Started ~2.5 hours ago, and 95 mappers have already accumulated
> CPU time spent (ms) 1,065,858,760
> (looking at this metric through map-reduce framework stats)
>
> That's over a million seconds of CPU time,
> or 11,219.57 seconds per mapper, which is roughly 3.11 hours of CPU
> time per mapper. So they are at 100% CPU.
>
>
> --
> Ruslan Dautkhanov
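For illustration, here is a minimal sketch of what I believe the generated record class looks like, based on the ClassWriter.java code linked above. The class body, column names, and types are invented; I'm also assuming the long if/else chain gets split into setField0(), setField1(), ... to stay under the JVM's 64 KB per-method bytecode limit, which would explain why setField0() shows up in the profile:

  // Hypothetical reconstruction of Sqoop's generated code -- not the
  // actual output. Column names and types are made up for illustration.
  public class unified_dim {

    private Integer col_1;
    private String col_2;
    // ... ~700 more generated fields ...

    public void setField(String __fieldName, Object __fieldVal) {
      if (!setField0(__fieldName, __fieldVal)) {
        throw new RuntimeException("No such field: " + __fieldName);
      }
    }

    // Each lookup is a linear scan of String.equals() calls; with N
    // columns, populating one full record costs O(N^2) comparisons.
    protected boolean setField0(String __fieldName, Object __fieldVal) {
      if ("col_1".equals(__fieldName)) {
        this.col_1 = (Integer) __fieldVal;
        return true;
      } else if ("col_2".equals(__fieldName)) {
        this.col_2 = (String) __fieldVal;
        return true;
      }
      // ... ~700 more else-if branches ...
      return false;
    }
  }

Back of the envelope: with ~700 columns, each setField() call averages ~350 string comparisons, so populating one record costs roughly 700 x 350 = ~245,000 equals() calls before a single byte goes to Oracle. That would fit the 92.92% of CPU the profile attributes to setField0().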

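And a hedged sketch of one possible fix (my idea, not an actual Sqoop patch): keep generating one setter per column, but dispatch through a HashMap so each setField() call is O(1) instead of a linear scan. FieldSetter and SETTERS are names I made up for the sketch:

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative alternative, not Sqoop code: O(1) field dispatch.
  public class unified_dim {

    private Integer col_1;
    private String col_2;
    // ... remaining generated fields ...

    private interface FieldSetter {
      void set(unified_dim record, Object value);
    }

    // Built once per class; one generated entry per column.
    private static final Map<String, FieldSetter> SETTERS =
        new HashMap<String, FieldSetter>();
    static {
      SETTERS.put("col_1", new FieldSetter() {
        public void set(unified_dim r, Object v) { r.col_1 = (Integer) v; }
      });
      SETTERS.put("col_2", new FieldSetter() {
        public void set(unified_dim r, Object v) { r.col_2 = (String) v; }
      });
      // ... ~700 more generated entries ...
    }

    public void setField(String fieldName, Object fieldVal) {
      FieldSetter setter = SETTERS.get(fieldName);
      if (setter == null) {
        throw new RuntimeException("No such field: " + fieldName);
      }
      setter.set(this, fieldVal);
    }
  }

That would drop the per-record cost from O(N^2) string comparisons to N hash lookups, which should matter most exactly on wide tables like ours.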