PhantomHunt opened a new issue, #6936: URL: https://github.com/apache/hudi/issues/6936
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)? Yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We are upserting records into a non-partitioned table using the options below, primarily to test multi-modal indexing, on the latest Apache Hudi version (0.12.0) via AWS Glue. The job works fine when records are inserted for the first time, but from the second iteration onwards with the same records we get the error: `An error occurred while calling o130.save. java.lang.NullPointerException`.

**To Reproduce**

Steps to reproduce the behavior: upsert data into a non-partitioned Hudi table with the following options:

```python
hudi_write_options_no_partition = {
    "hoodie.table.name": noPartitionHudiTableName,
    "hoodie.datasource.write.recordkey.field": "VODSRID",
    "hoodie.datasource.write.table.name": noPartitionHudiTableName,
    "hoodie.datasource.write.precombine.field": "LastUpdatedOn",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

(
    df_DB_dept.write.format("org.apache.hudi")
    .option("hoodie.datasource.write.operation", "upsert")
    .options(**hudi_write_options_no_partition)
    .mode("append")
    .save(table_path)
)
```

**Expected behavior**

We are only able to insert fresh records; the error occurs from the second iteration onwards with the same set of records. We expect upserts to succeed from the second iteration onwards, which is not happening currently.

**Environment Description**

* Hudi version : 0.12.0
* AWS Glue version : Glue 3.0
* Spark version : 3.1
* Python version : 3
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : No

**Additional context**

When we remove the option `"hoodie.metadata.index.column.stats.enable": "true"`, upserts work properly for all iterations.
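For reference, this is a sketch of the configuration that works for us, identical to the one above except that the column stats entry is left out (`noPartitionHudiTableName` is a placeholder from our job):

```python
# Same write options as in "To Reproduce", minus column stats indexing.
hudi_write_options_no_partition = {
    "hoodie.table.name": noPartitionHudiTableName,
    "hoodie.datasource.write.recordkey.field": "VODSRID",
    "hoodie.datasource.write.table.name": noPartitionHudiTableName,
    "hoodie.datasource.write.precombine.field": "LastUpdatedOn",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    # "hoodie.metadata.index.column.stats.enable": "true",  # omitting this avoids the NPE
}
```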
**Stacktrace**

```
2022-10-12 11:14:05,442 ERROR [main] glueexceptionanalysis.GlueExceptionAnalysisListener (Logging.scala:logError(9)): [Glue Exception Analysis] {"Event":"GlueETLJobExceptionEvent","Timestamp":1665573245439,"Failure Reason":"Traceback (most recent call last):\n File \"/tmp/Indexing_test_5billion.py\", line 106, in <module>\n .save(table_path)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py\", line 1109, in save\n self._jwrite.save(path)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py\", line 1305, in __call__\n answer, self.gateway_client, self.target_id, self.name)\n File \"/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py\", line 111, in deco\n return f(*a, **kw)\n File \"/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py\", line 328, in get_return_value\n format(target_id, \".\", name), value)\npy4j.protocol.Py4JJavaError: An error occurred while calling o130.save.\n: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 654) (10.218.20.37 executor 1): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0\n\tat org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)\n\tat org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)\n\tat org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)\n\tat org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:337)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)\n\tat org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)\n\tat org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)\n\tat org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)\n\tat org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)\n\tat org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:335)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:337)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:131)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: org.apache.hudi.exception.HoodieAppendException: Failed while appending records to s3 PATH\n\tat org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:410)\n\tat org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:382)\n\tat org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:84)\n\tat org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)\n\t... 28 more\nCaused by: java.lang.NullPointerException\n\tat org.apache.hudi.avro.HoodieAvroUtils.convertValueForAvroLogicalTypes(HoodieAvroUtils.java:646)\n\tat org.apache.hudi.avro.HoodieAvroUtils.convertValueForSpecificDataTypes(HoodieAvroUtils.java:620)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$null$1(HoodieTableMetadataUtil.java:147)\n\tat java.util.ArrayList.forEach(ArrayList.java:1259)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$collectColumnRangeMetadata$2(HoodieTableMetadataUtil.java:142)\n\tat java.util.ArrayList.forEach(ArrayList.java:1259)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.collectColumnRangeMetadata(HoodieTableMetadataUtil.java:139)\n\tat org.apache.hudi.io.HoodieAppendHandle.processAppendResult(HoodieAppendHandle.java:363)\n\tat org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:405)\n\t... 31 more\n\nDriver stacktrace:\n\tat org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2465)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2414)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2413)\n\tat scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)\n\tat scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)\n\tat scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)\n\tat org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2413)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1124)\n\tat org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1124)\n\tat scala.Option.foreach(Option.scala:257)\n\tat org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1124)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2679)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2621)\n\tat org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2610)\n\tat org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)\n\tat org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:914)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2238)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2259)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2278)\n\tat org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)\n\tat org.apache.spark.rdd.RDD.count(RDD.scala:1253)\n\tat org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:696)\n\tat org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:338)\n\tat
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:183)\n\tat org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)\n\tat org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)\n\tat org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)\n\tat org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)\n\tat org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)\n\tat org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)\n\tat org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)\n\tat org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)\n\tat org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)\n\tat org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)\n\tat org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)\n\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)\n\tat org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)\n\tat org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)\n\tat org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)\n\tat org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)\n\tat org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\n\tat py4j.Gateway.invoke(Gateway.java:282)\n\tat py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:238)\n\tat java.lang.Thread.run(Thread.java:750)\nCaused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0\n\tat org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)\n\tat
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)\n\tat org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)\n\tat org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)\n\tat org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:337)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)\n\tat org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)\n\tat org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)\n\tat org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)\n\tat org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)\n\tat org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:335)\n\tat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)\n\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:337)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:131)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\t... 1 more\nCaused by: org.apache.hudi.exception.HoodieAppendException: Failed while appending records to S3 PATH\n\tat org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:410)\n\tat org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:382)\n\tat org.apache.hudi.table.action.deltacommit.BaseSparkDeltaCommitActionExecutor.handleUpdate(BaseSparkDeltaCommitActionExecutor.java:84)\n\tat org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)\n\t...
28 more\nCaused by: java.lang.NullPointerException\n\tat org.apache.hudi.avro.HoodieAvroUtils.convertValueForAvroLogicalTypes(HoodieAvroUtils.java:646)\n\tat org.apache.hudi.avro.HoodieAvroUtils.convertValueForSpecificDataTypes(HoodieAvroUtils.java:620)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$null$1(HoodieTableMetadataUtil.java:147)\n\tat java.util.ArrayList.forEach(ArrayList.java:1259)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.lambda$collectColumnRangeMetadata$2(HoodieTableMetadataUtil.java:142)\n\tat java.util.ArrayList.forEach(ArrayList.java:1259)\n\tat org.apache.hudi.metadata.HoodieTableMetadataUtil.collectColumnRangeMetadata(HoodieTableMetadataUtil.java:139)\n\tat org.apache.hudi.io.HoodieAppendHandle.processAppendResult(HoodieAppendHandle.java:363)\n\tat org.apache.hudi.io.HoodieAppendHandle.appendDataAndDeleteBlocks(HoodieAppendHandle.java:405)\n\t... 31 more\n","Stack Trace":[{ "Declaring Class": "get_return_value", "Method Name": "format(target_id, \".\", name), value)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", "Line Number": 328 },{ "Declaring Class": "deco", "Method Name": "return f(*a, **kw)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", "Line Number": 111 },{ "Declaring Class": "__call__", "Method Name": "answer, self.gateway_client, self.target_id, self.name)", "File Name": "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", "Line Number": 1305 },{ "Declaring Class": "save", "Method Name": "self._jwrite.save(path)", "File Name": "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", "Line Number": 1109 },{ "Declaring Class": "<module>", "Method Name": ".save(table_path)", "File Name": "/tmp/Indexing_test_5billion.py", "Line Number": 106 }],"Last Executed Line number":106,"script":"Indexing_test_5billion.py"}
```
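For completeness, here is a minimal self-contained sketch of the failing sequence. The sample data, table name, and S3 path below are placeholders, not our actual job; the options are the ones from "To Reproduce":

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-column-stats-npe").getOrCreate()

table_name = "repro_no_partition"              # placeholder
table_path = "s3://some-bucket/" + table_name  # placeholder

options = {
    "hoodie.table.name": table_name,
    "hoodie.datasource.write.recordkey.field": "VODSRID",
    "hoodie.datasource.write.table.name": table_name,
    "hoodie.datasource.write.precombine.field": "LastUpdatedOn",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
}

df = spark.createDataFrame(
    [("id-1", "2022-10-12 00:00:00"), ("id-2", "2022-10-12 00:00:00")],
    ["VODSRID", "LastUpdatedOn"],
)

# First iteration: fresh inserts succeed.
df.write.format("org.apache.hudi").options(**options).mode("append").save(table_path)

# Second iteration with the same records: fails with the
# NullPointerException shown in the stacktrace above.
df.write.format("org.apache.hudi").options(**options).mode("append").save(table_path)
```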
