[
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17484351#comment-17484351
]
sivabalan narayanan edited comment on HUDI-3335 at 1/30/22, 1:04 PM:
---------------------------------------------------------------------
Unfortunately, the metadata table is in an inconsistent state with respect to
the data table. The only option we have for now is for you to remove the entire
metadata directory and let your next writer bootstrap it from scratch. If
possible, though, we would like to understand how the problem occurred. Could
you mask any PII and share the dataset with us, or give us steps to reproduce
the problem? I understand that reproducible steps might be hard to come by.
Another option is to make a copy of your dataset, delete all data files
(.hoodie/partitions/*), and share that copy with us, so that we can inspect the
data timeline, the metadata timeline, the metadata table data, and the archival
of the data timeline. We may not be able to query the dataset, but we can
manually inspect these and see what happened.
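As a starting point for that manual inspection, here is a minimal sketch (not
an official Hudi tool) that compares two timeline listings. It assumes, as the
listings quoted in this issue suggest, that each completed instant on the data
timeline has a deltacommit with the same timestamp on the metadata timeline.
The function names and the two input files (plain listings of .hoodie and
.hoodie/metadata/.hoodie, one file name per line) are hypothetical.

```shell
#!/bin/sh
# Sketch only: report completed data-table instants that have no deltacommit
# with the same timestamp in the metadata timeline. Both arguments are
# hypothetical text files holding one timeline file name per line.

# Print timestamps of completed instants, i.e. <timestamp>.<action> files,
# skipping the *.inflight and *.requested state files.
completed_instants() {
  grep -E '^[0-9]+\.[a-z]+$' "$1" | grep -Ev '\.(inflight|requested)$' \
    | cut -d. -f1 | sort -u
}

# Print timestamps of completed deltacommits in a metadata timeline listing.
metadata_deltacommits() {
  grep -E '^[0-9]+\.deltacommit$' "$1" | cut -d. -f1 | sort -u
}

# Print data-table timestamps ($1) that are absent from the metadata
# timeline ($2). comm -23 keeps lines that appear only in the first input.
missing_in_metadata() {
  data_tmp=$(mktemp) && meta_tmp=$(mktemp) || return 1
  completed_instants "$1" > "$data_tmp"
  metadata_deltacommits "$2" > "$meta_tmp"
  comm -23 "$data_tmp" "$meta_tmp"
  rm -f "$data_tmp" "$meta_tmp"
}
```

After sourcing the functions, something like
`missing_in_metadata data_timeline.txt metadata_timeline.txt` would print the
candidate timestamps. Older metadata instants may already have been archived,
so a reported timestamp is a lead to inspect, not proof of an inconsistency.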
> Loading Hudi table fails with NullPointerException
> --------------------------------------------------
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 0.10.1
> Reporter: Harsha Teja Kanna
> Priority: Blocker
> Fix For: 0.11.0
>
>
> I have a COW table with the metadata table enabled. Loading it from a Spark
> query fails with a java.lang.NullPointerException.
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
> val df = spark.
> read.
> format("org.apache.hudi").
> option(HoodieMetadataConfig.ENABLE.key(), "true").
> option(DataSourceReadOptions.QUERY_TYPE.key(),
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
> load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
> read.
> format("org.apache.hudi").
> option(HoodieMetadataConfig.ENABLE.key(), "true").
> option(DataSourceReadOptions.QUERY_TYPE.key(),
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
> load(s"${basePath}/sessions/date=2022/01/25")
> df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query takes a very long time*
> val df = spark.
> read.
> format("org.apache.hudi").
> option(DataSourceReadOptions.QUERY_TYPE.key(),
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
> load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Loading files fails with this stacktrace:*
> at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
> at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
> at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
> at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
> at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
> at $anonfun$res3$1(<console>:46)
> at $anonfun$res3$1$adapted(<console>:40)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/sessions \
> --target-table sessions \
> --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
> --op INSERT \
> --hoodie-conf hoodie.clean.automatic=true \
> --hoodie-conf hoodie.cleaner.commits.retained=10 \
> --hoodie-conf hoodie.cleaner.policy=KEEP_LATEST_COMMITS \
> --hoodie-conf hoodie.clustering.inline=true \
> --hoodie-conf hoodie.clustering.inline.max.commits=5 \
> --hoodie-conf hoodie.clustering.plan.strategy.class=org.apache.hudi.client.clustering.plan.strategy.SparkRecentDaysClusteringPlanStrategy \
> --hoodie-conf hoodie.clustering.plan.strategy.max.num.groups=1000 \
> --hoodie-conf hoodie.clustering.plan.strategy.small.file.limit=268435456 \
> --hoodie-conf hoodie.clustering.plan.strategy.sort.columns=survey_dbid,session_dbid \
> --hoodie-conf hoodie.clustering.plan.strategy.target.file.max.bytes=536870912 \
> --hoodie-conf hoodie.clustering.preserve.commit.metadata=true \
> --hoodie-conf hoodie.datasource.hive_sync.database=datalake-hudi \
> --hoodie-conf hoodie.datasource.hive_sync.enable=false \
> --hoodie-conf hoodie.datasource.hive_sync.ignore_exceptions=true \
> --hoodie-conf hoodie.datasource.hive_sync.mode=hms \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=date \
> --hoodie-conf hoodie.datasource.hive_sync.table=sessions \
> --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
> --hoodie-conf hoodie.datasource.write.hive_style_partitioning=true \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator \
> --hoodie-conf hoodie.datasource.write.operation=insert \
> --hoodie-conf hoodie.datasource.write.partitionpath.field=date:TIMESTAMP \
> --hoodie-conf hoodie.datasource.write.precombine.field=timestamp \
> --hoodie-conf hoodie.datasource.write.recordkey.field=session_dbid,question_id,answer \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.input.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.timezone=GMT \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING \
> --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://datalake-hudi/raw/parquet/data/sessions/year=2022/month=01/day=26/hour=02 \
> --hoodie-conf hoodie.deltastreamer.source.input.selector=org.apache.hudi.utilities.sources.helpers.DFSPathSelector \
> --hoodie-conf "\"hoodie.deltastreamer.transformer.sql=SELECT question_id, answer, to_timestamp(timestamp) as timestamp, session_dbid, survey_dbid, date_format(to_timestamp(timestamp), 'yyyy/MM/dd') AS date FROM <SRC> a \"" \
> --hoodie-conf hoodie.file.listing.parallelism=256 \
> --hoodie-conf hoodie.finalize.write.parallelism=256 \
> --hoodie-conf hoodie.generate.consistent.timestamp.logical.for.key.generator=true \
> --hoodie-conf hoodie.insert.shuffle.parallelism=1000 \
> --hoodie-conf hoodie.metadata.enable=true \
> --hoodie-conf hoodie.metadata.metrics.enable=true \
> --hoodie-conf hoodie.metrics.cloudwatch.metric.prefix=emr.datalake.prd.insert.sessions \
> --hoodie-conf hoodie.metrics.on=false \
> --hoodie-conf hoodie.metrics.reporter.type=CLOUDWATCH \
> --hoodie-conf hoodie.parquet.block.size=536870912 \
> --hoodie-conf hoodie.parquet.compression.codec=snappy \
> --hoodie-conf hoodie.parquet.max.file.size=536870912 \
> --hoodie-conf hoodie.parquet.small.file.limit=268435456
>
> *Metadata Commits (.hoodie/metadata/.hoodie)*
> 20220125154001455002.clean
> 20220125154001455002.clean.inflight
> 20220125154001455002.clean.requested
> 20220125160751769002.clean
> 20220125160751769002.clean.inflight
> 20220125160751769002.clean.requested
> 20220125163020781002.clean
> 20220125163020781002.clean.inflight
> 20220125163020781002.clean.requested
> 20220125165722170002.clean
> 20220125165722170002.clean.inflight
> 20220125165722170002.clean.requested
> 20220125172016239002.clean
> 20220125172016239002.clean.inflight
> 20220125172016239002.clean.requested
> 20220125174427654002.clean
> 20220125174427654002.clean.inflight
> 20220125174427654002.clean.requested
> 20220125181218237002.clean
> 20220125181218237002.clean.inflight
> 20220125181218237002.clean.requested
> 20220125184343588002.clean
> 20220125184343588002.clean.inflight
> 20220125184343588002.clean.requested
> 20220125191038318002.clean
> 20220125191038318002.clean.inflight
> 20220125191038318002.clean.requested
> 20220125193445223002.clean
> 20220125193445223002.clean.inflight
> 20220125193445223002.clean.requested
> 20220125200741168002.clean
> 20220125200741168002.clean.inflight
> 20220125200741168002.clean.requested
> 20220125203814934002.clean
> 20220125203814934002.clean.inflight
> 20220125203814934002.clean.requested
> 20220125211447323002.clean
> 20220125211447323002.clean.inflight
> 20220125211447323002.clean.requested
> 20220125214421740002.clean
> 20220125214421740002.clean.inflight
> 20220125214421740002.clean.requested
> 20220125221009798002.clean
> 20220125221009798002.clean.inflight
> 20220125221009798002.clean.requested
> 20220125224319264002.clean
> 20220125224319264002.clean.inflight
> 20220125224319264002.clean.requested
> 20220125231128580002.clean
> 20220125231128580002.clean.inflight
> 20220125231128580002.clean.requested
> 20220125234345790002.clean
> 20220125234345790002.clean.inflight
> 20220125234345790002.clean.requested
> 20220126001130415002.clean
> 20220126001130415002.clean.inflight
> 20220126001130415002.clean.requested
> 20220126004341130002.clean
> 20220126004341130002.clean.inflight
> 20220126004341130002.clean.requested
> 20220126011114529002.clean
> 20220126011114529002.clean.inflight
> 20220126011114529002.clean.requested
> 20220126013648751002.clean
> 20220126013648751002.clean.inflight
> 20220126013648751002.clean.requested
> 20220126013859643.deltacommit
> 20220126013859643.deltacommit.inflight
> 20220126013859643.deltacommit.requested
> 20220126014254294.deltacommit
> 20220126014254294.deltacommit.inflight
> 20220126014254294.deltacommit.requested
> 20220126014516195.deltacommit
> 20220126014516195.deltacommit.inflight
> 20220126014516195.deltacommit.requested
> 20220126014711043.deltacommit
> 20220126014711043.deltacommit.inflight
> 20220126014711043.deltacommit.requested
> 20220126014808898.deltacommit
> 20220126014808898.deltacommit.inflight
> 20220126014808898.deltacommit.requested
> 20220126015008443.deltacommit
> 20220126015008443.deltacommit.inflight
> 20220126015008443.deltacommit.requested
> 20220126015119193.deltacommit
> 20220126015119193.deltacommit.inflight
> 20220126015119193.deltacommit.requested
> 20220126015119193001.commit
> 20220126015119193001.compaction.inflight
> 20220126015119193001.compaction.requested
> 20220126015653770.deltacommit
> 20220126015653770.deltacommit.inflight
> 20220126015653770.deltacommit.requested
> 20220126020011172.deltacommit
> 20220126020011172.deltacommit.inflight
> 20220126020011172.deltacommit.requested
> 20220126020405299.deltacommit
> 20220126020405299.deltacommit.inflight
> 20220126020405299.deltacommit.requested
> 20220126020405299002.clean
> 20220126020405299002.clean.inflight
> 20220126020405299002.clean.requested
> 20220126020813841.deltacommit
> 20220126020813841.deltacommit.inflight
> 20220126020813841.deltacommit.requested
> 20220126021002748.deltacommit
> 20220126021002748.deltacommit.inflight
> 20220126021002748.deltacommit.requested
> 20220126021231085.deltacommit
> 20220126021231085.deltacommit.inflight
> 20220126021231085.deltacommit.requested
> 20220126021429124.deltacommit
> 20220126021429124.deltacommit.inflight
> 20220126021429124.deltacommit.requested
> 20220126021445188.deltacommit
> 20220126021445188.deltacommit.inflight
> 20220126021445188.deltacommit.requested
> 20220126021949824.deltacommit
> 20220126021949824.deltacommit.inflight
> 20220126021949824.deltacommit.requested
> 20220126022154561.deltacommit
> 20220126022154561.deltacommit.inflight
> 20220126022154561.deltacommit.requested
> 20220126022154561001.commit
> 20220126022154561001.compaction.inflight
> 20220126022154561001.compaction.requested
> 20220126022523011.deltacommit
> 20220126022523011.deltacommit.inflight
> 20220126022523011.deltacommit.requested
> 20220126023054200.deltacommit
> 20220126023054200.deltacommit.inflight
> 20220126023054200.deltacommit.requested
> 20220126023530250.deltacommit
> 20220126023530250.deltacommit.inflight
> 20220126023530250.deltacommit.requested
> 20220126023530250002.clean
> 20220126023530250002.clean.inflight
> 20220126023530250002.clean.requested
> 20220126023637109.deltacommit
> 20220126023637109.deltacommit.inflight
> 20220126023637109.deltacommit.requested
> 20220126024028688.deltacommit
> 20220126024028688.deltacommit.inflight
> 20220126024028688.deltacommit.requested
> 20220126024137627.deltacommit
> 20220126024137627.deltacommit.inflight
> 20220126024137627.deltacommit.requested
> 20220126024720121.deltacommit
> 20220126024720121.deltacommit.inflight
> 20220126024720121.deltacommit.requested
> *Commits(.hoodie)*
> 20220125224502471.clean
> 20220125224502471.clean.inflight
> 20220125224502471.clean.requested
> 20220125225810828.clean
> 20220125225810828.clean.inflight
> 20220125225810828.clean.requested
> 20220125230125674.clean
> 20220125230125674.clean.inflight
> 20220125230125674.clean.requested
> 20220125230854957.clean
> 20220125230854957.clean.inflight
> 20220125230854957.clean.requested
> 20220125232236767.clean
> 20220125232236767.clean.inflight
> 20220125232236767.clean.requested
> 20220125232638588.clean
> 20220125232638588.clean.inflight
> 20220125232638588.clean.requested
> 20220125233355290.clean
> 20220125233355290.clean.inflight
> 20220125233355290.clean.requested
> 20220125234539672.clean
> 20220125234539672.clean.inflight
> 20220125234539672.clean.requested
> 20220125234944271.clean
> 20220125234944271.clean.inflight
> 20220125234944271.clean.requested
> 20220125235718218.clean
> 20220125235718218.clean.inflight
> 20220125235718218.clean.requested
> 20220126000225375.clean
> 20220126000225375.clean.inflight
> 20220126000225375.clean.requested
> 20220126000937875.clean
> 20220126000937875.clean.inflight
> 20220126000937875.clean.requested
> 20220126003307449.clean
> 20220126003307449.clean.inflight
> 20220126003307449.clean.requested
> 20220126003617137.clean
> 20220126003617137.clean.inflight
> 20220126003617137.clean.requested
> 20220126004518227.clean
> 20220126004518227.clean.inflight
> 20220126004518227.clean.requested
> 20220126005806798.clean
> 20220126005806798.clean.inflight
> 20220126005806798.clean.requested
> 20220126010011407.commit
> 20220126010011407.commit.requested
> 20220126010011407.inflight
> 20220126010227320.clean
> 20220126010227320.clean.inflight
> 20220126010227320.clean.requested
> 20220126010242754.replacecommit
> 20220126010242754.replacecommit.inflight
> 20220126010242754.replacecommit.requested
> 20220126010800207.commit
> 20220126010800207.commit.requested
> 20220126010800207.inflight
> 20220126010920192.clean
> 20220126010920192.clean.inflight
> 20220126010920192.clean.requested
> 20220126011114529.commit
> 20220126011114529.commit.requested
> 20220126011114529.inflight
> 20220126011230532.clean
> 20220126011230532.clean.inflight
> 20220126011230532.clean.requested
> 20220126011426028.commit
> 20220126011426028.commit.requested
> 20220126011426028.inflight
> 20220126011818299.commit
> 20220126011818299.commit.requested
> 20220126011818299.inflight
> 20220126012003045.clean
> 20220126012003045.clean.inflight
> 20220126012003045.clean.requested
> 20220126012240288.commit
> 20220126012240288.commit.requested
> 20220126012240288.inflight
> 20220126012443455.clean
> 20220126012443455.clean.inflight
> 20220126012443455.clean.requested
> 20220126012508460.replacecommit
> 20220126012508460.replacecommit.inflight
> 20220126012508460.replacecommit.requested
> 20220126013218816.commit
> 20220126013218816.commit.requested
> 20220126013218816.inflight
> 20220126013428875.clean
> 20220126013428875.clean.inflight
> 20220126013428875.clean.requested
> 20220126013648751.commit
> 20220126013648751.commit.requested
> 20220126013648751.inflight
> 20220126013859643.clean
> 20220126013859643.clean.inflight
> 20220126013859643.clean.requested
> 20220126014254294.commit
> 20220126014254294.commit.requested
> 20220126014254294.inflight
> 20220126014516195.clean
> 20220126014516195.clean.inflight
> 20220126014516195.clean.requested
> 20220126014711043.commit
> 20220126014711043.commit.requested
> 20220126014711043.inflight
> 20220126014808898.clean
> 20220126014808898.clean.inflight
> 20220126014808898.clean.requested
> 20220126015008443.commit
> 20220126015008443.commit.requested
> 20220126015008443.inflight
> 20220126015119193.replacecommit
> 20220126015119193.replacecommit.inflight
> 20220126015119193.replacecommit.requested
> 20220126015653770.commit
> 20220126015653770.commit.requested
> 20220126015653770.inflight
> 20220126020011172.commit
> 20220126020011172.commit.requested
> 20220126020011172.inflight
> 20220126020405299.commit
> 20220126020405299.commit.requested
> 20220126020405299.inflight
> 20220126020813841.commit
> 20220126020813841.commit.requested
> 20220126020813841.inflight
> 20220126021002748.clean
> 20220126021002748.clean.inflight
> 20220126021002748.clean.requested
> 20220126021231085.commit
> 20220126021231085.commit.requested
> 20220126021231085.inflight
> 20220126021429124.clean
> 20220126021429124.clean.inflight
> 20220126021429124.clean.requested
> 20220126021445188.replacecommit
> 20220126021445188.replacecommit.inflight
> 20220126021445188.replacecommit.requested
> 20220126021949824.commit
> 20220126021949824.commit.requested
> 20220126021949824.inflight
> 20220126022154561.clean
> 20220126022154561.clean.inflight
> 20220126022154561.clean.requested
> 20220126022523011.commit
> 20220126022523011.commit.requested
> 20220126022523011.inflight
> 20220126023054200.commit
> 20220126023054200.commit.requested
> 20220126023054200.inflight
> 20220126023530250.commit
> 20220126023530250.commit.requested
> 20220126023530250.inflight
> 20220126023637109.clean
> 20220126023637109.clean.inflight
> 20220126023637109.clean.requested
> 20220126024028688.commit
> 20220126024028688.commit.requested
> 20220126024028688.inflight
> 20220126024137627.replacecommit
> 20220126024137627.replacecommit.inflight
> 20220126024137627.replacecommit.requested
> 20220126024720121.commit
> 20220126024720121.commit.requested
> 20220126024720121.inflight
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)