lamber-ken edited a comment on pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#issuecomment-626196481
Hi @vinothchandar @nsivabalan, thanks for your review — all review comments have been
addressed.
### SYNC
| task | status |
| ---- | ---- |
| fix an implicit bug that caused input records to be duplicated | done, JUnit covered |
| implement fetchRecordLocation | done, JUnit covered |
| JUnit tests for TestHoodieBloomIndexV2 | done |
| JUnit test for TestHoodieBloomRangeInfoHandle | done |
| revert HoodieTimer | done |
| global HoodieBloomIndexV2 | done, JUnit covered |
### Test
```
// Launch spark-shell with the Hudi bundle.
// Tested with index types: BLOOM, GLOBAL_BLOOM, BLOOM_V2, GLOBAL_BLOOM_V2
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --jars `ls packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

import org.apache.spark.sql.functions._

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
val hudiOptions = Map[String, String](
  "hoodie.upsert.shuffle.parallelism" -> "10",
  "hoodie.datasource.write.recordkey.field" -> "name",
  "hoodie.datasource.write.partitionpath.field" -> "location",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.index.type" -> "BLOOM_V2"
)

// insert: two records with the same key, in different partitions
var datas = List(
  """{ "name": "kenken1", "ts": 1574297893836, "age": 123, "location": "2019-03-01"}""",
  """{ "name": "kenken1", "ts": 1574297893836, "age": 123, "location": "2019-03-02"}"""
)
val inputDF = spark.read.json(spark.sparkContext.parallelize(datas, 2))
inputDF.write.format("org.apache.hudi").
  options(hudiOptions).
  mode("Overwrite").
  save(basePath)
spark.read.format("org.apache.hudi").load(basePath + "/*/*").show()

// update: newer ts for the record in partition 2019-03-01
datas = List(
  """{ "name": "kenken1", "ts": 1574297893838, "age": 100, "location": "2019-03-01"}"""
)
val updateDF = spark.read.json(spark.sparkContext.parallelize(datas, 2))
updateDF.write.format("org.apache.hudi").
  options(hudiOptions).
  mode("Append").
  save(basePath)
spark.read.format("org.apache.hudi").load(basePath + "/*/*").show()
```
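The script above exercises only `BLOOM_V2`, while the comment at the top lists all four index types covered by this PR. A minimal variant for trying the global index — assuming only the `hoodie.index.type` value needs to change, with everything else in the script kept as-is:

```scala
// Same options map as above, swapping in the global V2 index.
// A global index matches the record key across all partitions,
// not just within the incoming record's partition path.
val hudiOptions = Map[String, String](
  "hoodie.upsert.shuffle.parallelism" -> "10",
  "hoodie.datasource.write.recordkey.field" -> "name",
  "hoodie.datasource.write.partitionpath.field" -> "location",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.precombine.field" -> "ts",
  "hoodie.index.type" -> "GLOBAL_BLOOM_V2" // or BLOOM, GLOBAL_BLOOM, BLOOM_V2
)
```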