Dovy Paukstys created SQOOP-3445:
------------------------------------

             Summary: Spark with Sqoop and Kite - Parquet Mismatch in Command?
                 Key: SQOOP-3445
                 URL: https://issues.apache.org/jira/browse/SQOOP-3445
             Project: Sqoop
          Issue Type: Bug
          Components: sqoop2-kite-connector
    Affects Versions: 1.4.7
         Environment: System:
 * Debian 9
 * Hadoop 2.9
 * Spark 2.3

Installed Dependencies (JARs):
 * sqoop-1.4.7-hadoop260
 * kite-data-mapreduce-1.1.0
 * kite-hadoop-compatibility-1.1.0.jar
 * kite-data-crunch-1.1.0
 * kite-data-core-1.1.0
 * avro-tools-1.8.2.jar
 * mysql-connector-java-5.1.42
 * parquet-tools-1.8.3
            Reporter: Dovy Paukstys


Not sure if the error is deep in Sqoop or in Kite, so I 
cross-posted here: [https://github.com/kite-sdk/kite/issues/490].

I am reading from a MySQL database and trying to write out to Parquet. When 
writing to Avro there are no issues, but as soon as Kite is involved (Parquet) 
everything breaks. At first I had to manually add a number of JARs just to get 
the job to run, but that all seems resolved now.

Also, please note that I have tried various versions of the installed 
dependencies, downgrading and upgrading Sqoop accordingly.

When Sqoop is used without Kite (i.e. Avro, not Parquet) there are no issues. 
The moment the job runs the export to Parquet, everything fails. It seems 
like Kite may be the offender, but it may be in the Sqoop code for how Kite is 
invoked.

Error:
{code:java}
19/07/09 17:55:28 INFO mapreduce.Job: Job job_1562682312457_0020 failed with state FAILED due to: Job setup failed : java.lang.IllegalArgumentException: Parquet only supports generic and specific data models, type parameter must implement IndexedRecord
	at org.kitesdk.shaded.com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:96)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset.<init>(FileSystemDataset.java:128)
	at org.kitesdk.data.spi.filesystem.FileSystemDataset$Builder.build(FileSystemDataset.java:687)
	at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.load(FileSystemDatasetRepository.java:199)
	at org.kitesdk.data.Datasets.load(Datasets.java:108)
	at org.kitesdk.data.Datasets.load(Datasets.java:165)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.load(DatasetKeyOutputFormat.java:542)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.loadOrCreateJobDataset(DatasetKeyOutputFormat.java:569)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat.access$300(DatasetKeyOutputFormat.java:67)
	at org.kitesdk.data.mapreduce.DatasetKeyOutputFormat$MergeOutputCommitter.setupJob(DatasetKeyOutputFormat.java:369)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.handleJobSetup(CommitterEventHandler.java:255)
	at org.apache.hadoop.mapreduce.v2.app.commit.CommitterEventHandler$EventProcessor.run(CommitterEventHandler.java:235)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/07/09 17:55:28 INFO mapreduce.Job: Counters: 2
{code}
Again, it only fails on the final conversion to Parquet. I am not sure of the 
full details since the command runs inside a parallel process. Any direction 
would be appreciated.
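For context on what the exception is checking: the failing frame is a Guava-style precondition inside Kite's {{FileSystemDataset}}, which rejects any entity type that does not implement Avro's {{IndexedRecord}} when the dataset format is Parquet. The sketch below is a plain-Java paraphrase of that guard, not Kite's actual code (the local {{IndexedRecord}} marker interface and the class names are stand-ins for illustration):

```java
public class ParquetTypeCheck {
    // Stand-in for org.apache.avro.generic.IndexedRecord (assumption:
    // Kite's real check uses the Avro interface; a local marker is used
    // here so the sketch compiles without Avro on the classpath).
    interface IndexedRecord {}

    // A record type that satisfies the check, like Avro generic/specific records.
    static class AvroStyleRecord implements IndexedRecord {}

    // A plain POJO that would be rejected, as in the reported stack trace.
    static class PlainPojo {}

    // Paraphrase of the precondition that throws IllegalArgumentException
    // in FileSystemDataset's constructor when the format is Parquet.
    static void checkParquetType(Class<?> type) {
        if (!IndexedRecord.class.isAssignableFrom(type)) {
            throw new IllegalArgumentException(
                "Parquet only supports generic and specific data models, "
                + "type parameter must implement IndexedRecord");
        }
    }

    public static void main(String[] args) {
        checkParquetType(AvroStyleRecord.class); // passes silently
        try {
            checkParquetType(PlainPojo.class);   // throws, mirroring the report
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If this reading is right, the question becomes why the Sqoop/Kite path ends up with an entity type that is not an Avro record, since the Avro-only export works.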



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
