nsivabalan opened a new issue #5189:
URL: https://github.com/apache/hudi/issues/5189
**Describe the problem you faced**
From a user report: "I am trying to read a Hudi table and write to another Hudi table using Deltastreamer, and I am getting the following error."
Steps to reproduce:

1. Create the first Hudi table using `ConfluentAvroKafkaSource`.
2. Create the second table via `HoodieIncrSource`, consuming the output of the first table.
3. Create the third table via `HoodieIncrSource`, consuming the output of the second table.

The error occurs on incremental Deltastreamer runs of the third table.
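A plausible mechanism: the rows `HoodieIncrSource` pulls from the upstream table still carry Hudi's meta columns (`_hoodie_commit_time` through `_hoodie_partition_path` and `_hoodie_file_name`); once those land in the next table as ordinary data columns, the following incremental read adds Hudi's own meta columns on top and Spark's duplicate-column check rejects the schema. Below is a minimal sketch of a transformer that strips the meta columns before the write, assuming the Hudi 0.10.x `Transformer` interface; the class and package names are hypothetical:

```java
package com.example;

import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical workaround: drop every Hudi meta column coming out of the
// upstream table so it is never persisted downstream as a data column.
public class DropHoodieMetaFieldsTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    Dataset<Row> result = rowDataset;
    // HOODIE_META_COLUMNS lists _hoodie_commit_time, _hoodie_commit_seqno,
    // _hoodie_record_key, _hoodie_partition_path and _hoodie_file_name.
    // Dataset.drop() is a no-op for columns that are not present.
    for (String metaColumn : HoodieRecord.HOODIE_META_COLUMNS) {
      result = result.drop(metaColumn);
    }
    return result;
  }
}
```

Chained ahead of the existing transformers via `--transformer-class` (see the configs below), this would keep the meta columns out of each downstream table's data schema.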
stacktrace:
```
client token: N/A
diagnostics: User class threw exception:
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `_hoodie_partition_path`;
    at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:90)
    at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:70)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:440)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:286)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:286)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
    at com.navi.sources.HoodieIncrSource.fetchNextBatch(HoodieIncrSource.java:122)
    at org.apache.hudi.utilities.sources.RowSource.fetchNewData(RowSource.java:43)
    at org.apache.hudi.utilities.sources.Source.fetchNext(Source.java:76)
    at org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter.fetchNewDataInRowFormat(SourceFormatAdapter.java:95)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:388)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:283)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:728)
```
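The duplicate-column check fires inside Spark's `DataSource.resolveRelation` while `HoodieIncrSource` performs its incremental read of the second table, which suggests that table's persisted data schema already contains `_hoodie_partition_path` as a regular column, colliding with the meta column Hudi adds at read time. A minimal diagnostic sketch to confirm this, assuming the Hudi 0.10.x common classes are on the classpath (the class name and base-path argument are placeholders):

```java
import org.apache.avro.Schema;
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.TableSchemaResolver;

public class FindShadowedMetaColumns {
  public static void main(String[] args) throws Exception {
    // Point at the second table (the one the failing HoodieIncrSource reads).
    HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
        .setConf(new Configuration())
        .setBasePath(args[0])
        .build();
    // Pass false to exclude the meta fields Hudi normally injects; any
    // _hoodie_* field still present in the schema is stored as plain data
    // and will collide on the next incremental read.
    Schema dataSchema = new TableSchemaResolver(metaClient).getTableAvroSchema(false);
    dataSchema.getFields().stream()
        .filter(f -> f.name().startsWith("_hoodie_"))
        .forEach(f -> System.out.println("shadowed meta column: " + f.name()));
  }
}
```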
write configs:
```
spark-submit --master yarn \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,s3://***/jars/hudi-utilities-bundle_2.12-0.10.0.jar \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.executor.cores=3 --conf spark.driver.memory=4g \
  --conf spark.driver.memoryOverhead=800m --conf spark.executor.memoryOverhead=1800m \
  --conf spark.executor.memory=16g --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=6 --conf spark.scheduler.mode=FAIR \
  --conf spark.task.maxFailures=5 --conf spark.rdd.compress=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.hive.convertMetastoreParquet=false \
  --conf spark.yarn.max.executor.failures=5 \
  --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true \
  --conf spark.sql.catalogImplementation=hive \
  --deploy-mode cluster \
  s3://*****/jars/deltastreamer-addons-1.1-SNAPSHOT.jar \
  --hoodie-conf hoodie.parquet.compression.codec=snappy \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.num_instants=10 \
  --hoodie-conf hoodie.datasource.write.partitionpath.field= \
  --table-type COPY_ON_WRITE \
  --source-class com.navi.sources.HoodieIncrSource \
  --hoodie-conf hoodie.deltastreamer.source.hoodieincr.path=s3://*****/input_path \
  --hoodie-conf hoodie.metrics.on=true \
  --hoodie-conf hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY \
  --hoodie-conf hoodie.metrics.pushgateway.host=pushgateway.prod.navi-tech.in \
  --hoodie-conf hoodie.metrics.pushgateway.port=443 \
  --hoodie-conf hoodie.metrics.pushgateway.delete.on.shutdown=false \
  --hoodie-conf hoodie.metrics.pushgateway.job.name=*** \
  --hoodie-conf hoodie.metrics.pushgateway.random.job.name.suffix=false \
  --hoodie-conf hoodie.metrics.reporter.metricsname.prefix=hudi \
  --target-base-path s3://*****/output_path \
  --target-table some_table \
  --enable-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=db \
  --hoodie-conf hoodie.datasource.hive_sync.table=out_tbl \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.clustering.inline=true \
  --hoodie-conf hoodie.clustering.inline.max.commits=2 \
  --hoodie-conf hoodie.datasource.write.recordkey.field=contact_number_cleaned \
  --hoodie-conf hoodie.datasource.write.precombine.field=id \
  --hoodie-conf hoodie.datasource.clustering.inline.enable=true \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
  --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=134217728 \
  --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=273741824 \
  --source-ordering-field id \
  --hoodie-conf transformer.normalize.json.column=id \
  --hoodie-conf "hoodie.deltastreamer.transformer.sql=select id,col1,col2 from <SRC>" \
  --transformer-class com.custom.transform.ArrayJsonToStructTypeTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
```
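Note that the SQL transformer above already projects only `id,col1,col2`, so the colliding `_hoodie_partition_path` column was presumably persisted into the second table by an earlier run or by a job without that projection. If the drop-meta transformer sketched earlier were adopted, it would be prepended to the existing chain, for example (class name hypothetical):

```
--transformer-class com.example.DropHoodieMetaFieldsTransformer,com.custom.transform.ArrayJsonToStructTypeTransformer,org.apache.hudi.utilities.transform.SqlQueryBasedTransformer
```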
**Expected behavior**

Incremental Deltastreamer runs on the third table should succeed: `HoodieIncrSource` should not surface a schema containing a duplicate `_hoodie_partition_path` column when its upstream table was itself written by a `HoodieIncrSource` pipeline.
**Environment Description**
* Hudi version : 0.10.0 (per `hudi-utilities-bundle_2.12-0.10.0.jar` in the submit command)
* Spark version :
* Hive version :
* Hadoop version :
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) :
**Additional context**

The source class in the submit command, `com.navi.sources.HoodieIncrSource`, is a custom source (note the `com.navi` package in the stacktrace), not the stock `org.apache.hudi.utilities.sources.HoodieIncrSource`.