[ 
https://issues.apache.org/jira/browse/HUDI-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456701#comment-17456701
 ] 

sivabalan narayanan commented on HUDI-2943:
-------------------------------------------

I think this is what is happening: the fix we have put up works only when 
async clustering is enabled, not with inline clustering. Can you try async 
clustering and let us know how it goes? Once the pending clustering completes, 
regular commits should go through.

 

You need to pass this option to the spark-submit command when you start 
Deltastreamer:

--hoodie-conf hoodie.clustering.async.enabled=true
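
For illustration only, here is a rough sketch of where that option would sit 
in a Deltastreamer launch command. The jar path and properties file below are 
hypothetical placeholders, and your other existing Deltastreamer arguments are 
left out. Overriding hoodie.clustering.inline=false is an assumption (the 
reported config has it set to true), not something confirmed in this thread. 
Also check whether your job needs --continuous for async clustering to 
actually run; that depends on your setup. The snippet only assembles and 
prints the command so it can be reviewed before running it.

```shell
# Hypothetical placeholders: point these at your own bundle jar and props file.
HUDI_UTILITIES_JAR="${HUDI_UTILITIES_JAR:-/path/to/hudi-utilities-bundle.jar}"
PROPS_FILE="${PROPS_FILE:-/path/to/deltastreamer.properties}"

# Assemble the command as positional parameters.
# Add your existing Deltastreamer arguments (source, target path, table, ...)
# after the jar; they are omitted here.
set -- spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  "$HUDI_UTILITIES_JAR" \
  --props "$PROPS_FILE" \
  --hoodie-conf hoodie.clustering.async.enabled=true \
  --hoodie-conf hoodie.clustering.inline=false

# Print the command for review instead of executing it.
DELTASTREAMER_CMD="$*"
echo "$DELTASTREAMER_CMD"
```

Once the pending clustering has completed and a regular commit has succeeded, 
the two --hoodie-conf overrides can be dropped to return to the original 
inline clustering settings.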

Once the pending clustering completes and one regular commit succeeds, you can 
switch back to inline clustering. We will try to put in a fix for inline 
clustering as well. 

 

 

> Deltastreamer fails to continue with pending clustering after restart in 
> 0.10.0
> -------------------------------------------------------------------------------
>
>                 Key: HUDI-2943
>                 URL: https://issues.apache.org/jira/browse/HUDI-2943
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Harsha Teja Kanna
>            Priority: Major
>         Attachments: image-2021-12-08-15-10-02-420.png
>
>
> When inline clustering is on, Deltastreamer fails to restart with an 
> "upsert failed" exception if there is a pending clustering commit from a 
> previous run.
> Note: the workaround of running the clustering job with 
> --retry-last-failed-clustering-job works.
> Hudi version : 0.10.0
> Spark version : 3.1.2
> EMR : 6.4.0
> diagnostics: User class threw exception: 
> org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for commit 
> time 20211206081248919
> at 
> org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:62)
> at 
> org.apache.hudi.table.action.commit.SparkUpsertCommitActionExecutor.execute(SparkUpsertCommitActionExecutor.java:46)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:119)
> at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.upsert(HoodieSparkCopyOnWriteTable.java:103)
> at 
> org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:159)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:501)
> at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:306)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:193)
> at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:191)
> at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:511)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
> Caused by: org.apache.hudi.exception.HoodieClusteringUpdateException: Not 
> allowed to update the clustering file group 
> HoodieFileGroupId{partitionPath='', 
> fileId='39ca735d-1fc4-40f9-a314-93744642b38c-0'}. For pending clustering 
> operations, we are not going to support update for now.
> at 
> org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy.lambda$handleUpdate$0(SparkRejectUpdateStrategy.java:65)
> Config:
> hoodie.index.type=GLOBAL_SIMPLE
> hoodie.datasource.write.partitionpath.field=
> hoodie.datasource.write.precombine.field=updatedate
> hoodie.datasource.hive_sync.database=datalake
> hoodie.datasource.write.operation=upsert
> hoodie.datasource.hive_sync.table=hudi.prd.surveys
> hoodie.datasource.hive_sync.mode=hms
> hoodie.datasource.hive_sync.enable=false
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor
> hoodie.datasource.hive_sync.use_jdbc=false
> hoodie.datasource.write.recordkey.field=id
> hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
> hoodie.datasource.write.hive_style_partitioning=true
> hoodie.finalize.write.parallelism=256
> hoodie.deltastreamer.source.dfs.root=s3://datalake-bucket/raw/parquet/data/surveys/year=2021/month=12/day=06/hour=16
> hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector
> hoodie.parquet.max.file.size=134217728
> hoodie.parquet.small.file.limit=67108864
> hoodie.parquet.block.size=134217728
> hoodie.parquet.compression.codec=snappy
> hoodie.file.listing.parallelism=256
> hoodie.upsert.shuffle.parallelism=10
> hoodie.metadata.enable=false
> hoodie.metadata.clean.async=true
> hoodie.clustering.preserve.commit.metadata=true
> hoodie.clustering.inline.max.commits=1
> hoodie.clustering.inline=true
> hoodie.clustering.plan.strategy.target.file.max.bytes=134217728
> hoodie.clustering.plan.strategy.small.file.limit=67108864
> hoodie.clustering.plan.strategy.sort.columns=projectid
> hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy
> hoodie.clean.async=true
> hoodie.clean.automatic=true
> hoodie.cleaner.policy=KEEP_LATEST_COMMITS
> hoodie.cleaner.commits.retained=10
> hoodie.deltastreamer.transformer.sql=SELECT id, sid FROM <SRC> a



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
