prashanthpdesai opened a new issue #1695: URL: https://github.com/apache/hudi/issues/1695
**_Tips before filing an issue_**

- Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? yes
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

We are trying to use the GLOBAL_BLOOM index for a production use case. Before rolling it out, we tested it with sample data in spark-shell and hit the exception below with the following configuration.

In the first run, test3.csv is ingested successfully, creating 3 partitions from the `ts` column. In the second run we updated only the partition column (`ts`), keeping all other attributes the same (see test4.csv below). Instead of moving the records to the new partitions, the second run fails with the exception below.

**Run 1:**

Input csv file:

```
$ cat test3.csv
fanme,lname,ts,uuid
pd,desai1,2019-10-15,10
pp,sai,2019-10-14,11
pp,sai,2019-10-14,11
prabil,bal,2020-01-30,20
```

spark-shell:

```scala
scala> val table = "hudi_cow1"
scala> val basepath = "/datalake/globalndextest"
scala> val df3 = spark.read.option("header", "true").csv("/datalake/888/test3.csv")
scala> val dfh4 = df3.write.format("org.apache.hudi").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "ts").
     |   option("hoodie.index.type", "GLOBAL_BLOOM").
     |   option("hoodie.bloom.index.update.partition.path", "true").
     |   option(TABLE_NAME, table)
scala> dfh4.mode(Append).save(basepath)
```

Output:

```
scala> spark.read.parquet("/datalake/globalndextest/*").show(false)
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+------+------+----------+----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                       |fanme |lname |ts        |uuid|
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+------+------+----------+----+
|20200529115326     |20200529115326_0_1  |20                |2020-01-30            |86941ae8-a5b6-4d31-b12c-63ec2883a2d3-0_0-52-25756_20200529115326.parquet|prabil|bal   |2020-01-30|20  |
|20200529115326     |20200529115326_2_2  |10                |2019-10-15            |972ee074-942c-417c-93ca-08d18b4e5897-0_2-52-25758_20200529115326.parquet|pd    |desai1|2019-10-15|10  |
|20200529115326     |20200529115326_1_1  |11                |2019-10-14            |7df81054-b67b-402e-bc7e-c935a18ab3eb-0_1-52-25757_20200529115326.parquet|pp    |sai   |2019-10-14|11  |
+-------------------+--------------------+------------------+----------------------+------------------------------------------------------------------------+------+------+----------+----+
```

**Run 2:**

Input file:

```
$ cat test4.csv
fanme,lname,ts,uuid
pd,desai,2019-10-17,10
pp,sai,2019-10-18,11
rg,fg,2019-10-18,25
```

spark-shell (same write configuration as Run 1):

```scala
scala> val table = "hudi_cow1"
scala> val basepath = "/datalake/globalndextest"
scala> val df3 = spark.read.option("header", "true").csv("/datalake/888/test4.csv")
scala> val dfh4 = df3.write.format("org.apache.hudi").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "ts").
     |   option("hoodie.index.type", "GLOBAL_BLOOM").
     |   option("hoodie.bloom.index.update.partition.path", "true").
     |   option(TABLE_NAME, table)
scala> dfh4.mode(Append).save(basepath)
```

Exception:

```
Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :1
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:264)
	at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:428)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.org$apache$spark$executor$Executor$TaskRunner$$anonfun$$res$1(Executor.scala:412)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:419)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1359)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:430)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.NoSuchElementException: No value present in Option
	at org.apache.hudi.common.util.Option.get(Option.java:88)
	at org.apache.hudi.io.HoodieMergeHandle.<init>(HoodieMergeHandle.java:74)
	at org.apache.hudi.table.HoodieCopyOnWriteTable.getUpdateHandle(HoodieCopyOnWriteTable.java:220)
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdate(HoodieCopyOnWriteTable.java:177)
	at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:257)
```

**To Reproduce**

Steps to reproduce the behavior: see Run 1 and Run 2 above.

**Expected behavior**

Since the record key is unchanged and only the partition value is new, and `hoodie.bloom.index.update.partition.path` is set to `true`, we expect Hudi to delete the record from the previous partition and insert it into the new partition.

**Environment Description**

* Hudi version : 0.5.1
* Spark version : 2.2.1
* Hive version :
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Additional context**

**_In our case the same record key arrives with a different partition value in different runs; the record in the previous partition needs to be deleted and the incoming record ingested into the new partition._**

**Stacktrace**

See the exception under Run 2 above.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
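Editorial note: the partition-move semantics the report expects from `hoodie.bloom.index.update.partition.path=true` can be illustrated with a small, Hudi-independent Scala sketch. All names here (`Record`, `plan`, the `Op` hierarchy) are ours, not Hudi's; this only models the behavior described above, using the keys and partitions from Run 1 and Run 2.

```scala
// Hypothetical model of "update partition path" semantics in a global index.
// A global index maps record key -> partition currently holding that key.
object GlobalIndexSketch {
  case class Record(key: String, partition: String)

  sealed trait Op
  case class Delete(key: String, partition: String) extends Op
  case class Insert(rec: Record)                    extends Op
  case class Update(rec: Record)                    extends Op

  def plan(index: Map[String, String], incoming: Seq[Record]): Seq[Op] =
    incoming.flatMap { rec =>
      index.get(rec.key) match {
        case None                          => Seq(Insert(rec))        // brand-new key
        case Some(p) if p == rec.partition => Seq(Update(rec))        // same partition: plain update
        case Some(p)                       => Seq(Delete(rec.key, p), // key moved: delete from old partition,
                                                  Insert(rec))        // insert into the new one
      }
    }

  def main(args: Array[String]): Unit = {
    // Index state after Run 1, incoming records from Run 2 (test4.csv).
    val index    = Map("10" -> "2019-10-15", "11" -> "2019-10-14", "20" -> "2020-01-30")
    val incoming = Seq(Record("10", "2019-10-17"),
                       Record("11", "2019-10-18"),
                       Record("25", "2019-10-18"))
    plan(index, incoming).foreach(println)
  }
}
```

Under this model, key `10` should produce a delete in `2019-10-15` plus an insert into `2019-10-17`, key `11` a delete plus insert likewise, and the new key `25` a plain insert; instead, the actual upsert fails with the `NoSuchElementException` shown above.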
