Ran the JvmTop profiler for a couple of minutes on one of the sqoop mappers while the job was still running:
http://code.google.com/p/jvmtop
Profiling PID 7246: org.apache.hadoop.mapred.YarnChild 10.20

  92.92% (   127.49s) unified_dim.setField0()
   1.98% (     2.71s) parquet.io.RecordReaderImplementation.read()
   1.46% (     2.00s) unified_dim.getFieldMap0()
   1.33% (     1.83s) unified_dim.setField()
   0.58% (     0.80s) parquet.column.impl.ColumnReaderImpl.readValue()
   0.49% (     0.67s) unified_dim.getFieldMap1()
   0.37% (     0.51s) ...quet.column.values.bitpacking.ByteBitPackingValuesRea()
   0.34% (     0.46s) com.cloudera.sqoop.lib.JdbcWritableBridge.writeString()
   0.19% (     0.26s) ...quet.column.impl.ColumnReaderImpl.writeCurrentValueTo()
   0.09% (     0.12s) ....cloudera.sqoop.lib.JdbcWritableBridge.writeBigDecima()
   0.09% (     0.12s) unified_dim.write1()
   0.09% (     0.12s) ...quet.column.values.rle.RunLengthBitPackingHybridDecod()
   0.08% (     0.10s) parquet.bytes.BytesUtils.readUnsignedVarInt()

unified_dim in the profiling output above is the name of the target table in Oracle.

It looks like the setField0() method is the culprit. I'm not sure exactly which setField0() that is, but I found this code generation that might not be efficient for wider datasets:
https://github.com/anthonycorbacho/sqoop/blob/master/src/java/org/apache/sqoop/orm/ClassWriter.java#L755

If I understand correctly, for 700+ columns the generated code will contain 700+ "if" statements, so setting every field of every record becomes a linear scan of string comparisons. Perhaps that is it; I didn't find any other setField() methods there. Could anyone please have a look into this? (Sketches of what I think the generated code looks like, and a possible alternative, are below the quoted message.)

--
Ruslan Dautkhanov

On Mon, May 2, 2016 at 11:09 PM, Ruslan Dautkhanov <[email protected]> wrote:

> https://issues.apache.org/jira/browse/SQOOP-2920
>
> Has anybody experienced this problem before?
> Is there any known workaround?
>
> We sqoop export from the datalake to Oracle quite often.
> Every time we sqoop "narrow" datasets, Oracle has scalability issues:
> our 3-node all-flash Oracle RAC normally can't keep up with more than
> 45-55 sqoop mappers. The map-reduce framework shows the sqoop mappers
> are not heavily loaded.
>
> On wide datasets the picture is the opposite: Oracle shows 95% of
> sessions idle, waiting for new INSERTs, even when we go over a hundred
> mappers. Sqoop has serious scalability issues on very wide datasets.
> (Our company normally has very wide datasets.)
>
> For example, on the last sqoop export:
> Started ~2.5 hours ago, and 95 mappers have already accumulated
> CPU time spent (ms) 1,065,858,760
> (looking at this metric through map-reduce framework stats)
>
> That's over a million seconds of CPU time,
> or 11,219.57 seconds per mapper, which is roughly 3.11 hours of CPU
> time per mapper. So they are at 100% CPU.
>
>
> --
> Ruslan Dautkhanov
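For illustration, here is a minimal sketch of what I believe the generated record class looks like, based on the ClassWriter.java code linked above. The class body, column names, and types are invented; I'm also assuming the long if/else chain gets split into setField0(), setField1(), ... to stay under the JVM's 64 KB per-method bytecode limit, which would explain why setField0() shows up in the profile:

  // Hypothetical reconstruction of Sqoop's generated code -- not the
  // actual output. Column names and types are made up for illustration.
  public class unified_dim {

    private Integer col_1;
    private String col_2;
    // ... ~700 more generated fields ...

    public void setField(String __fieldName, Object __fieldVal) {
      if (!setField0(__fieldName, __fieldVal)) {
        throw new RuntimeException("No such field: " + __fieldName);
      }
    }

    // Each lookup is a linear scan of String.equals() calls; with N
    // columns, populating one full record costs O(N^2) comparisons.
    protected boolean setField0(String __fieldName, Object __fieldVal) {
      if ("col_1".equals(__fieldName)) {
        this.col_1 = (Integer) __fieldVal;
        return true;
      } else if ("col_2".equals(__fieldName)) {
        this.col_2 = (String) __fieldVal;
        return true;
      }
      // ... ~700 more else-if branches ...
      return false;
    }
  }

Back of the envelope: with ~700 columns, each setField() call averages ~350 string comparisons, so populating one record costs roughly 700 x 350 = ~245,000 equals() calls before a single byte goes to Oracle. That would fit the 92.92% of CPU the profile attributes to setField0().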

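And a hedged sketch of one possible fix (my idea, not an actual Sqoop patch): keep generating one setter per column, but dispatch through a HashMap so each setField() call is O(1) instead of a linear scan. FieldSetter and SETTERS are names I made up for the sketch:

  import java.util.HashMap;
  import java.util.Map;

  // Illustrative alternative, not Sqoop code: O(1) field dispatch.
  public class unified_dim {

    private Integer col_1;
    private String col_2;
    // ... remaining generated fields ...

    private interface FieldSetter {
      void set(unified_dim record, Object value);
    }

    // Built once per class; one generated entry per column.
    private static final Map<String, FieldSetter> SETTERS =
        new HashMap<String, FieldSetter>();
    static {
      SETTERS.put("col_1", new FieldSetter() {
        public void set(unified_dim r, Object v) { r.col_1 = (Integer) v; }
      });
      SETTERS.put("col_2", new FieldSetter() {
        public void set(unified_dim r, Object v) { r.col_2 = (String) v; }
      });
      // ... ~700 more generated entries ...
    }

    public void setField(String fieldName, Object fieldVal) {
      FieldSetter setter = SETTERS.get(fieldName);
      if (setter == null) {
        throw new RuntimeException("No such field: " + fieldName);
      }
      setter.set(this, fieldVal);
    }
  }

That would drop the per-record cost from O(N^2) string comparisons to N hash lookups, which should matter most exactly on wide tables like ours.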