maabkhan commented on issue #11971:
URL: https://github.com/apache/hudi/issues/11971#issuecomment-2367793940
@ad1happy2go
Spark configs passed (the remaining configs take their default values):
"sparkConf": {
"spark.local.dir": "/tmp/spark-local-dir-shuffle-f2086f4d",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.extensions":
"org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
"spark.hadoop.fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"spark.sql.caseSensitive": "false",
"spark.decommission.enabled": "true",
"spark.sql.adaptive.enabled": "true",
"spark.eventLog.rolling.enabled": "true",
"spark.dynamicAllocation.enabled": "true",
"spark.sql.catalog.spark_catalog":
"org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"spark.sql.catalogImplementation": "hive",
"spark.cleaner.periodicGC.interval": "1min",
"spark.storage.decommission.enabled": "true",
"spark.dynamicAllocation.maxExecutors": "200",
"spark.dynamicAllocation.minExecutors": "1",
"spark.kubernetes.allocation.batch.size": "10",
"spark.kubernetes.driver.requestTimeout": "30000",
"spark.sql.avro.datetimeRebaseModeInRead": "CORRECTED",
"spark.dynamicAllocation.initialExecutors": "1",
"spark.sql.avro.datetimeRebaseModeInWrite": "CORRECTED",
"spark.sql.execution.arrow.sparkr.enabled": "true",
"spark.kubernetes.driver.connectionTimeout": "30000",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.sql.parquet.datetimeRebaseModeInRead": "CORRECTED",
"spark.sql.legacy.pathOptionBehavior.enabled": "true",
"spark.sql.parquet.datetimeRebaseModeInWrite": "CORRECTED",
"spark.storage.decommission.rddBlocks.enabled": "true",
"spark.dynamicAllocation.executorAllocationRatio": "0.33",
"spark.dynamicAllocation.shuffleTracking.enabled": "True",
"spark.storage.decommission.shuffleBlocks.enabled": "true",
"spark.kubernetes.allocation.driver.readinessTimeout": "120s",
"spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "60",
"spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2"
}
"deps": {
"jars": [
"https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar",
"https://repo1.maven.org/maven2/org/apache/hive/hcatalog/hive-hcatalog-core/3.1.3/hive-hcatalog-core-3.1.3.jar"
]
}
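For reference, config maps like the `sparkConf` block above are typically flattened into `--conf key=value` pairs (plus `--jars`) when a job is launched via `spark-submit`. A minimal, illustrative sketch of that flattening (the dict below is a trimmed subset of the configs above; the `to_submit_args` helper is hypothetical, not part of the job):

```python
# Illustrative only: flatten a sparkConf map into spark-submit arguments.
# The keys/values mirror the config dump above; to_submit_args is a
# hypothetical helper, not code from the actual job.
spark_conf = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "spark.dynamicAllocation.enabled": "true",
}
jars = [
    "https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark3.4-bundle_2.12/0.14.0/hudi-spark3.4-bundle_2.12-0.14.0.jar",
]

def to_submit_args(conf, jars):
    """Build the spark-submit argument list for a conf map and jar list."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    if jars:
        args += ["--jars", ",".join(jars)]
    return args

print(" ".join(to_submit_args(spark_conf, jars)))
```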
Hudi configs passed (the rest take their default values):
```json
{
  "className": "org.apache.hudi",
  "hoodie.datasource.hive_sync.use_jdbc": "false",
  "hoodie.datasource.write.precombine.field": "dms_timestamp",
  "hoodie.datasource.write.recordkey.field": "uuid",
  "hoodie.table.name": "users",
  "hoodie.consistency.check.enabled": "false",
  "hoodie.datasource.hive_sync.table": "users",
  "hoodie.datasource.hive_sync.database": "luna_lazypay",
  "hoodie.datasource.hive_sync.enable": "true",
  "hoodie.datasource.hive_sync.mode": "hms",
  "hoodie.datasource.hive_sync.support_timestamp": "true",
  "hoodie.datasource.write.reconcile.schema": "true",
  "path": "s3a://refined-luna-prod/luna_lazypay/users/",
  "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.partitionpath.field": "year,month,day",
  "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "hoodie.datasource.hive_sync.partition_fields": "year,month,day",
  "hoodie.datasource.write.hive_style_partitioning": "true",
  "hoodie.upsert.shuffle.parallelism": 40,
  "hoodie.datasource.write.operation": "upsert",
  "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
  "hoodie.cleaner.commits.retained": 1
}
```
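To make the write path concrete: with the Spark datasource, options like these are usually passed straight to `DataFrame.write`. A minimal sketch, assuming a `SparkSession` and input DataFrame exist (the option map below is a trimmed subset of the dump above; the Spark call is shown as a comment since it needs a live session):

```python
# Trimmed subset of the Hudi write options above; keys and values are
# taken from this issue's config dump.
hudi_options = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "dms_timestamp",
    "hoodie.datasource.write.partitionpath.field": "year,month,day",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "1",
}

# With a SparkSession and a DataFrame `df` in scope, the upsert would
# typically look like:
#   (df.write.format("hudi")
#      .options(**hudi_options)
#      .mode("append")
#      .save("s3a://refined-luna-prod/luna_lazypay/users/"))
```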
These are the configs for the job detailed above, which reads from a table while that table is being updated. The source table my job reads is itself a Hudi table, updated by a similar Spark-Hudi job but with different configs.