Ruslan Dautkhanov created SQOOP-2920:
----------------------------------------

             Summary: sqoop performance deteriorates significantly on wide 
datasets; sqoop 100% on cpu
                 Key: SQOOP-2920
                 URL: https://issues.apache.org/jira/browse/SQOOP-2920
             Project: Sqoop
          Issue Type: Bug
          Components: connectors/oracle, hive-integration, metastore
    Affects Versions: 1.4.5
         Environment: - sqoop export on a very wide dataset (over 700 columns)
- sqoop export to oracle
- a subset of columns is exported (using the --columns argument)
- parquet files
- --table --hcatalog-database --hcatalog-table options are used

            Reporter: Ruslan Dautkhanov
            Priority: Critical


We sqoop export from our data lake to Oracle quite often.
Whenever we sqoop "narrow" datasets, it is Oracle that has the scalability
issues: our 3-node all-flash Oracle RAC normally can't keep up with more than
45-55 sqoop mappers, while the MapReduce framework shows the sqoop mappers
themselves are lightly loaded.

On wide datasets, the picture is the opposite: Oracle shows ~95% of sessions
idle and waiting for new INSERTs, even when we go over a hundred mappers.
Sqoop has serious scalability issues on very wide datasets. (Our company's
datasets are normally very wide.)

For example, on the latest sqoop export:
started ~2.5 hours ago, and the 95 mappers have already accumulated
CPU time spent (ms)     1,065,858,760
(reading this metric from the MapReduce framework job counters)

That's over a million seconds of CPU time.

Or 11,219.57 seconds per mapper, which is roughly 3.11 hours of CPU time per
mapper in ~2.5 hours of wall-clock time. So the mappers are pegged at 100% CPU.
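
A quick sanity check of that per-mapper figure (a plain bc one-liner over the
counter value and mapper count quoted above; ms -> hours):

    $ echo "scale=2; 1065858760 / 95 / 1000 / 3600" | bc
    3.11

So each mapper burned more CPU-hours than the job's elapsed wall-clock hours.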

Will also attach jstack files.
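
For reference, a representative invocation of the kind described in the
environment section (the connection string, credentials path, schema/table
names, and column list below are placeholders, not the actual job's values):

    sqoop export \
      --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
      --username etl_user \
      --password-file /user/etl/.oracle_password \
      --table TARGET_SCHEMA.WIDE_TABLE \
      --columns "COL_1,COL_2,COL_3" \
      --hcatalog-database datalake \
      --hcatalog-table wide_table_parquet \
      --num-mappers 95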



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
