[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989596455


   
   ## CI report:

   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989571491


   
   ## CI report:

   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989592033


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4119)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989586395


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] xiarixiaoyao commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


xiarixiaoyao commented on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989590875


   @hudi-bot run azure






[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989559491


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989586395


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[jira] [Commented] (HUDI-2323) Upsert of Case Class with single field causes SchemaParseException

2021-12-08 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456194#comment-17456194 ]

sivabalan narayanan commented on HUDI-2323:
---

There is a known bug in parquet that matches the issue reported here; a proposed fix is tracked at:

[https://github.com/apache/parquet-mr/pull/560]

> Upsert of Case Class with single field causes SchemaParseException
> --
>
> Key: HUDI-2323
> URL: https://issues.apache.org/jira/browse/HUDI-2323
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: Tyler Jackson
>Priority: Critical
>  Labels: schema, sev:critical
> Fix For: 0.11.0
>
> Attachments: HudiSchemaGenerationTest.scala
>
>
> Additional background information:
> Spark version 3.1.1
>  Scala version 2.12
>  Hudi version 0.8.0 (hudi-spark-bundle_2.12 artifact)
>  
> While testing a spark job in EMR of inserting and then upserting data for a 
> fairly complex nested case class structure, I ran into an issue that I was 
> having a hard time tracking down. It seems when part of the case class in the 
> dataframe to be written has a single field in it, the avro schema generation 
> fails with the following stacktrace, but only on the upsert:
> {noformat}
> 21/08/19 15:08:34 ERROR BoundedInMemoryExecutor: error producing records
> org.apache.avro.SchemaParseException: Can't redefine: array
>   at org.apache.avro.Schema$Names.put(Schema.java:1128)
>   at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
>   at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
>   at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
>   at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
>   at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
>   at org.apache.avro.Schema.toString(Schema.java:324)
>   at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
>   at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
>   at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:475)
>   at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
>   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
>   at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
>   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
>   at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
>   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
>   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
>   at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>   at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
>   at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  
> I am able to replicate the problem in my local IntelliJ setup using the test 
> that has been attached to this issue. The problem can be observed in the 
> DummyStepParent case class. Simply adding an additional field to the case 
> class eliminates the problem altogether (which is an acceptable workaround 
> for our purposes, but shouldn't ultimately be necessary).
> case class DummyObject (
>   fieldOne: String,
>   listTwo: Seq[String],
>   listThree: Seq[DummyChild],
>   listFour: Seq[DummyStepChild],
>   fieldFive: Boolean,
>   listSix: Seq[DummyParent],
>   listSeven: Seq[DummyCousin],
>   listEight: 

[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-989582099


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)
   * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN
   * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4118)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-989559372


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)
   * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN
   * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[jira] [Commented] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2021-12-08 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456189#comment-17456189 ]

sivabalan narayanan commented on HUDI-864:
--

[~rolandjohann]: Can you give it a try and let us know? Setting this along with the 
other hudi props should work with the latest master or with 0.10.0.

> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, Spark Integration, Writer Core
>Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
>Reporter: Roland Johann
>Priority: Critical
>  Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
>   at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 

[jira] [Commented] (HUDI-1079) Cannot upsert on schema with Array of Record with single field

2021-12-08 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456186#comment-17456186 ]

sivabalan narayanan commented on HUDI-1079:
---

Unfortunately, hudi does not do any schema fiddling in this regard. We just 
rely on parquet-avro to do the conversion for us, and apparently arrays of a 
struct type with just one field run into issues. But I also wonder why anyone 
(even your upstream datasets) would define a struct type with just one entry.

Anyway, we try our best not to re-write this logic internally and instead 
re-use the library as is. Let us know if you are still looking for a solution 
on this end.
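
A minimal sketch of the workaround noted on the related HUDI-2323 report above 
(hypothetical type and field names): padding the single-field element struct with a 
second field keeps parquet-avro away from the problematic single-field list layout.

{code:scala}
// Fails on upsert: the array element struct carries a single field.
case class Book(bookName: String)

// Workaround sketch: add a second (possibly unused) field so the list
// element is no longer a single-field group in the written schema.
case class PaddedBook(bookName: String, padding: Option[String] = None)

case class Reader(name: String, booksInterested: Seq[PaddedBook])
{code}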

> Cannot upsert on schema with Array of Record with single field
> --
>
> Key: HUDI-1079
> URL: https://issues.apache.org/jira/browse/HUDI-1079
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.9.0
> Environment: spark 2.4.4, local 
>Reporter: Adrian Tanase
>Priority: Critical
>  Labels: schema, sev:critical, user-support-issues
> Fix For: 0.11.0
>
>
> I am trying to trigger upserts on a table that has an array field with 
> records of just one field.
>  Here is the code to reproduce:
> {code:scala}
>   val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("SparkByExamples.com")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate();
>   // https://sparkbyexamples.com/spark/spark-dataframe-array-of-struct/
>   val arrayStructData = Seq(
> Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
> Row("Michael",List(Row("Java","XY",200),Row("Scala","XB",500))),
> Row("Robert",List(Row("Java","XZ",400),Row("Scala","XC",250))),
> Row("Washington",null)
>   )
>   val arrayStructSchema = new StructType()
>   .add("name",StringType)
>   .add("booksIntersted",ArrayType(
> new StructType()
>   .add("bookName",StringType)
> //  .add("author",StringType)
> //  .add("pages",IntegerType)
>   ))
> val df = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> {code}
> Running insert following by upsert will fail:
> {code:scala}
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Overwrite)
>   .save(basePath)
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Append)
>   .save(basePath)
> {code}
> If I create the books record with all the fields (at least 2), it works as 
> expected.
> The relevant part of the exception is this:
> {noformat}
> Caused by: java.lang.ClassCastException: required binary bookName (UTF8) is not a group
>   at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
>   at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>   at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
>   at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
>   at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
>   at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
>   at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
>   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
>   at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
>   at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
>   at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
>   at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
>   at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
>   at org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>   at 

[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989571491


   
   ## CI report:

   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989569735


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989569735


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989557070


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] guanziyue commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


guanziyue commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989568422


   @hudi-bot run azure






[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989559491


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989558237


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-989559372


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)
   * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN
   * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-989556962


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)
   * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4265:
URL: https://github.com/apache/hudi/pull/4265#issuecomment-989558237


   
   ## CI report:

   * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner

2021-12-08 Thread ASF GitHub Bot (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-2966:
-
Labels: pull-request-available  (was: )

> Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
> ---
>
> Key: HUDI-2966
> URL: https://issues.apache.org/jira/browse/HUDI-2966
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Add TaskCompletionListener for HoodieMergeOnReadRDD to close the logScanner 
> when the query is completed.
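
A minimal sketch of the idea (illustrative only, not the actual patch): Spark's 
TaskContext lets code running inside a task register a completion callback, which 
gives the RDD a place to close a scanner even when the iterator is never fully 
drained. The LogScanner class below is a hypothetical stand-in for Hudi's merged 
log scanner.

{code:scala}
import org.apache.spark.TaskContext

// Hypothetical stand-in for the log scanner held by the record iterator.
class LogScanner extends AutoCloseable {
  override def close(): Unit = println("log scanner closed")
}

def openScannerForTask(): LogScanner = {
  val scanner = new LogScanner
  // Fires when the task completes (success or failure), so the scanner
  // is released even if the query finishes before the iterator is drained.
  Option(TaskContext.get()).foreach { ctx =>
    ctx.addTaskCompletionListener[Unit](_ => scanner.close())
  }
  scanner
}
{code}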





[GitHub] [hudi] xiarixiaoyao opened a new pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.

2021-12-08 Thread GitBox


xiarixiaoyao opened a new pull request #4265:
URL: https://github.com/apache/hudi/pull/4265


   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

   ## What is the purpose of the pull request

   *(For example: This pull request adds quick-start document.)*

   ## Brief change log

   *(for example:)*
   - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

   ## Verify this pull request

   *(Please pick either of the following options)*

   This pull request is a trivial rework / code cleanup without any test coverage.

   *(or)*

   This pull request is already covered by existing tests, such as *(please describe tests)*.

   *(or)*

   This change added tests and can be verified as follows:

   *(example:)*

   - *Added integration tests for end-to-end.*
   - *Added HoodieClientWriteTest to verify the change.*
   - *Manually verified the change by running a job locally.*

   ## Committer checklist

   - [ ] Has a corresponding JIRA in PR title & commit
   - [ ] Commit message is descriptive of the change
   - [ ] CI is green
   - [ ] Necessary doc changes done or have another open PR
   - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   






[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989531356


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989557070


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-986559751


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4172:
URL: https://github.com/apache/hudi/pull/4172#issuecomment-989556962


   
   ## CI report:

   * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN
   * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045)
   * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] danny0405 commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service

2021-12-08 Thread GitBox


danny0405 commented on a change in pull request #4252:
URL: https://github.com/apache/hudi/pull/4252#discussion_r765464342



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java
##
@@ -141,35 +140,17 @@ public void start(Function<Boolean, Boolean> onShutdownCallback) {
   protected abstract Pair<CompletableFuture, ExecutorService> startService();
 
   /**
-   * A monitor thread is started which would trigger a callback if the service is shutdown.
+   * Add shutdown callback for the completable future.
    * 
-   * @param onShutdownCallback
+   * @param callback The callback
    */
-  private void monitorThreads(Function<Boolean, Boolean> onShutdownCallback) {
-    LOG.info("Submitting monitor thread !!");
-    Executors.newSingleThreadExecutor(r -> {
-      Thread t = new Thread(r, "Monitor Thread");
-      t.setDaemon(isRunInDaemonMode());

Review comment:
   Here the code news up a thread pool but never shuts it down; the core worker 
thread lives there forever, hence the leak ~
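
A tiny sketch of the leak pattern described above (illustrative, not the Hudi code): 
an ExecutorService created per call and never shut down keeps its non-daemon core 
thread alive after the submitted task finishes.

{code:scala}
import java.util.concurrent.Executors

def monitorOnce(work: Runnable): Unit = {
  val pool = Executors.newSingleThreadExecutor()
  pool.submit(work)
  // Without this call, the pool's single core thread stays alive
  // forever after `work` completes: that is the leak. shutdown()
  // lets the submitted task finish, then ends the thread.
  pool.shutdown()
}
{code}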








[GitHub] [hudi] danny0405 commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service

2021-12-08 Thread GitBox


danny0405 commented on a change in pull request #4252:
URL: https://github.com/apache/hudi/pull/4252#discussion_r765463984



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AsyncCleanerService.java
##
@@ -73,6 +73,8 @@ public static void waitForCompletion(AsyncCleanerService asyncCleanerService) {
       asyncCleanerService.waitForShutdown();
     } catch (Exception e) {
       throw new HoodieException("Error waiting for async cleaning to finish", e);
+    } finally {
+      asyncCleanerService.shutdown(false);
     }

Review comment:
   Actually no, it is the monitor thread pool that leaks!








[GitHub] [hudi] vinothchandar commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service

2021-12-08 Thread GitBox


vinothchandar commented on a change in pull request #4252:
URL: https://github.com/apache/hudi/pull/4252#discussion_r765451717



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AsyncCleanerService.java
##
@@ -73,6 +73,8 @@ public static void waitForCompletion(AsyncCleanerService asyncCleanerService) {
       asyncCleanerService.waitForShutdown();
     } catch (Exception e) {
       throw new HoodieException("Error waiting for async cleaning to finish", e);
+    } finally {
+      asyncCleanerService.shutdown(false);
     }

Review comment:
   Guess this is the leak 








[jira] [Resolved] (HUDI-1894) NULL values in timestamp column defaulted

2021-12-08 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan resolved HUDI-1894.
---

> NULL values in timestamp column defaulted 
> --
>
> Key: HUDI-1894
> URL: https://issues.apache.org/jira/browse/HUDI-1894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Eldhose Paul
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: schema, sev:critical, triaged
> Fix For: 0.9.0
>
>
> Reading timestamp column from hudi and underlying parquet files in spark 
> gives different results. 
> *hudi properties:*
> {code:java}
>  hdfs dfs -cat 
> /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>  
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = 
> spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala>  ji.filter($"id" === 
> 1237858).withColumn("inputfile", 
> input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate", 
> $"inputfile").show(false)
> +---+--+--+--+---+--+++
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate|archiveddate|inputfile
>   
>  |
> +---+--+--+--+---+--+++
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet
>|
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet%7C
> +---+--+--+--+---+--+++
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = 
> spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala> jih.filter($"id" === 
> 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate").show(false)
> +---+--+--+--+---+---+---+
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate |archiveddate   |
> 

[jira] [Updated] (HUDI-1894) NULL values in timestamp column defaulted

2021-12-08 Thread sivabalan narayanan (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sivabalan narayanan updated HUDI-1894:
--
Fix Version/s: 0.9.0

> NULL values in timestamp column defaulted 
> --
>
> Key: HUDI-1894
> URL: https://issues.apache.org/jira/browse/HUDI-1894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Eldhose Paul
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: schema, sev:critical, triaged
> Fix For: 0.9.0
>
>
> Reading timestamp column from hudi and underlying parquet files in spark 
> gives different results. 
> *hudi properties:*
> {code:java}
>  hdfs dfs -cat 
> /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>  
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = 
> spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala>  ji.filter($"id" === 
> 1237858).withColumn("inputfile", 
> input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate", 
> $"inputfile").show(false)
> +---+--+--+--+---+--+++
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate|archiveddate|inputfile
>   
>  |
> +---+--+--+--+---+--+++
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet
>|
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet%7C
> +---+--+--+--+---+--+++
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = 
> spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala> jih.filter($"id" === 
> 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate").show(false)
> +---+--+--+--+---+---+---+
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate |archiveddate   |
> 

[jira] [Commented] (HUDI-1894) NULL values in timestamp column defaulted

2021-12-08 Thread sivabalan narayanan (Jira)


[ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456152#comment-17456152 ]

sivabalan narayanan commented on HUDI-1894:
---

Since the issue is not reproducible with 0.9.0 and later, closing the jira. 

> NULL values in timestamp column defaulted 
> --
>
> Key: HUDI-1894
> URL: https://issues.apache.org/jira/browse/HUDI-1894
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Eldhose Paul
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: schema, sev:critical, triaged
>
> Reading timestamp column from hudi and underlying parquet files in spark 
> gives different results. 
> *hudi properties:*
> {code:java}
>  hdfs dfs -cat 
> /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties
> #Properties saved on Tue May 11 17:17:22 EDT 2021
> #Tue May 11 17:17:22 EDT 2021
> hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
> hoodie.table.name=jiraissue
> hoodie.archivelog.folder=archived
> hoodie.table.type=MERGE_ON_READ
> hoodie.table.version=1
> hoodie.timeline.layout.version=1
> {code}
>  
> *Reading directly from parquet using Spark:*
> {code:java}
> scala> val ji = 
> spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet")
> ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala>  ji.filter($"id" === 
> 1237858).withColumn("inputfile", 
> input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate", 
> $"inputfile").show(false)
> +---+--+--+--+---+--+++
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate|archiveddate|inputfile
>   
>  |
> +---+--+--+--+---+--+++
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet
>|
> |20210511171722 |20210511171722_7_13718|1237858.0 |   
>
> |832cf07f-637b-4a4c-ab08-6929554f003a-0_7-98-5106_20210511171722.parquet|null 
>  |null
> |hdfs://nameservice1/user/hive/warehouse/jira_expl.db/jiraissue_events/832cf07f-637b-4a4c-ab08-6929554f003a-0_8-1610-78711_20210511173615.parquet%7C
> +---+--+--+--+---+--+++
> {code}
> *Reading `hudi` using Spark:*
> {code:java}
> scala> val jih = 
> spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
> jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, 
> _hoodie_commit_seqno: string ... 49 more fields]scala> jih.filter($"id" === 
> 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", 
> $"_hoodie_record_key", $"_hoodie_partition_path", 
> $"_hoodie_file_name",$"resolutiondate", $"archiveddate").show(false)
> +---+--+--+--+---+---+---+
> |_hoodie_commit_time|_hoodie_commit_seqno  
> |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name  
> |resolutiondate |archiveddate   |
> 

[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989531356


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
   * b561916256de18ffca7093d2e8200ae02c945efc UNKNOWN

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989529213


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build






[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989529213


   
   ## CI report:

   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)

   Bot commands
   @hudi-bot supports the following commands:

   - `@hudi-bot run azure` re-run the last Azure build


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989527936


   
   ## CI report:
   
   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989527936


   
   ## CI report:
   
   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989526749


   
   ## CI report:
   
   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4264:
URL: https://github.com/apache/hudi/pull/4264#issuecomment-989526749


   
   ## CI report:
   
   * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jaehyeon-kim commented on issue #4244: [SUPPORT] DeltaStreamer doesn't create a Glue/Hive table if deploy mode is cluster due to HiveConnection Failure

2021-12-08 Thread GitBox


jaehyeon-kim commented on issue #4244:
URL: https://github.com/apache/hudi/issues/4244#issuecomment-989526615


   Found that the following config is necessary.
   
   ```
   hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://:1
   ```
   
   It defaults to `localhost:1` and it works in `client` mode because 
the driver program runs on the master node. However, it fails in `cluster` 
mode because the driver program runs on a core node.
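   
   A minimal sketch of setting that property programmatically, assuming the 
standard HiveServer2 port 10000 and a hypothetical master host name (both 
values are assumptions and depend on the actual cluster):
   
   ```
   import org.apache.hudi.common.config.TypedProperties;
   
   public class HiveSyncJdbcUrlExample {
     public static void main(String[] args) {
       TypedProperties props = new TypedProperties();
       // Point hive_sync at the node actually running HiveServer2, not localhost.
       props.setProperty("hoodie.datasource.hive_sync.jdbcurl",
           "jdbc:hive2://master-node.example.internal:10000");
       System.out.println(props.getProperty("hoodie.datasource.hive_sync.jdbcurl"));
     }
   }
   ```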


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] jaehyeon-kim closed issue #4244: [SUPPORT] DeltaStreamer doesn't create a Glue/Hive table if deploy mode is cluster due to HiveConnection Failure

2021-12-08 Thread GitBox


jaehyeon-kim closed issue #4244:
URL: https://github.com/apache/hudi/issues/4244


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2785) Create Trino setup in docker demo

2021-12-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2785:
-
Labels: pull-request-available  (was: )

> Create Trino setup in docker demo
> -
>
> Key: HUDI-2785
> URL: https://issues.apache.org/jira/browse/HUDI-2785
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] guanziyue opened a new pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …

2021-12-08 Thread GitBox


guanziyue opened a new pull request #4264:
URL: https://github.com/apache/hudi/pull/4264


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Fixes the problem described in https://issues.apache.org/jira/browse/HUDI-2875.
   
   
   ## Brief change log
   1. Add synchronized to HoodieParquetWriter to make sure it is thread safe 
for any possible future uses; a minimal sketch of the pattern follows this list.
   2. Add a graceful exit to BoundedInMemoryExecutor.
   3. First stop the BoundedInMemoryExecutor, then close the HoodieMergeHandle 
in SparkMergeHelper.
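   
   A self-contained sketch of the synchronization pattern from item 1, under the 
assumption that a writer shared across producer/consumer threads only needs its 
write and close paths serialized; `RecordWriter` and `SynchronizedRecordWriter` 
are illustrative names, not actual Hudi classes:
   
   ```
   import java.io.IOException;
   
   // Illustrative stand-in for a Parquet-style record writer.
   interface RecordWriter<R> extends AutoCloseable {
     void write(R record) throws IOException;
     @Override
     void close() throws IOException;
   }
   
   // Serializes write/close so one writer instance can be shared safely by
   // the producer and consumer threads of an in-memory queue executor.
   final class SynchronizedRecordWriter<R> implements RecordWriter<R> {
     private final RecordWriter<R> delegate;
   
     SynchronizedRecordWriter(RecordWriter<R> delegate) {
       this.delegate = delegate;
     }
   
     @Override
     public synchronized void write(R record) throws IOException {
       delegate.write(record);
     }
   
     @Override
     public synchronized void close() throws IOException {
       delegate.close();
     }
   }
   ```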
   
   ## Verify this pull request
   This pull request was manually verified by running a job locally.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (9c8ad0f -> 5ac9ce7)

2021-12-08 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 9c8ad0f  [HUDI-2665] Fix overflow of huge log file in 
HoodieLogFormatWriter (#3912)
 add 5ac9ce7  [MINOR] Fix Compile broken (#4263)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/common/functional/TestHoodieLogFormat.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[GitHub] [hudi] leesf merged pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


leesf merged pull request #4263:
URL: https://github.com/apache/hudi/pull/4263


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989492641


   
   ## CI report:
   
   * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989474190


   
   ## CI report:
   
   * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-989459744


   
   ## CI report:
   
   * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
 
   * a085e101422d1df36b94127e75e5d60716986e69 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-989486769


   
   ## CI report:
   
   * a085e101422d1df36b94127e75e5d60716986e69 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal

2021-12-08 Thread GitBox


xiarixiaoyao commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r765414682



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
 super();
 Configuration hadoopConf = new Configuration(conf);
-hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
   @codope  thanks for your review.
   Yes. If findSmallPrecisionDecimalType returns false, we need to respect 
the user's settings; if findSmallPrecisionDecimalType returns true, we need to 
ignore the user's choice, because that choice may lead to the failure of 
subsequent updates of the Hudi table.
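   
   A hedged sketch of what a schema check along these lines could look like; the 
precision threshold (18, the largest decimal precision Spark can encode in a 
long) and the recursion rules are assumptions, and the actual 
findSmallPrecisionDecimalType in the PR may differ:
   
   ```
   import org.apache.spark.sql.types.ArrayType;
   import org.apache.spark.sql.types.DataType;
   import org.apache.spark.sql.types.DecimalType;
   import org.apache.spark.sql.types.MapType;
   import org.apache.spark.sql.types.StructField;
   import org.apache.spark.sql.types.StructType;
   
   final class DecimalSchemaCheck {
     // Assumption: precision <= 18 is "small", i.e. Spark's non-legacy Parquet
     // writer would store it as INT32/INT64 rather than FIXED_LEN_BYTE_ARRAY,
     // which is what forcing the legacy format guards against.
     private static final int SMALL_PRECISION_THRESHOLD = 18;
   
     static boolean hasSmallPrecisionDecimal(DataType type) {
       if (type instanceof DecimalType) {
         return ((DecimalType) type).precision() <= SMALL_PRECISION_THRESHOLD;
       }
       if (type instanceof StructType) {
         for (StructField field : ((StructType) type).fields()) {
           if (hasSmallPrecisionDecimal(field.dataType())) {
             return true;
           }
         }
         return false;
       }
       if (type instanceof ArrayType) {
         return hasSmallPrecisionDecimal(((ArrayType) type).elementType());
       }
       if (type instanceof MapType) {
         MapType m = (MapType) type;
         return hasSmallPrecisionDecimal(m.keyType()) || hasSmallPrecisionDecimal(m.valueType());
       }
       return false;
     }
   }
   ```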




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4178: [HUDI-2901] Fixed the bug clustering jobs cannot running in parallel

2021-12-08 Thread GitBox


xiarixiaoyao commented on a change in pull request #4178:
URL: https://github.com/apache/hudi/pull/4178#discussion_r765412301



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##
@@ -95,8 +95,7 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, 
HoodieEngineContext
 .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
 clusteringPlan.getStrategy().getStrategyParams(),
 
Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-instantTime))
-.map(CompletableFuture::join);
+
instantTime)).collect(Collectors.toList()).stream().map(CompletableFuture::join);

Review comment:
   @alexeykudinkin,  here is my explanation:
   1) In the original code a single stream pipeline is used, and calling join in 
that same pipeline causes the threads to execute one by one.
   2) Here there are two stream operations. The join operation is in the second 
stream pipeline, so it does not cause the threads to execute one by one.
   
   We can verify this with the following code:
   public static void main(String[] args) {
       Integer[] test = new Integer[] {0, 2, 3};
       long time = System.currentTimeMillis();
       List<CompletableFuture<Integer>> ls =
           Arrays.stream(test).map(f -> CompletableFuture.supplyAsync(() -> {
               System.out.println(Thread.currentThread().getName());
               try {
                   Thread.sleep(1);
               } catch (InterruptedException e) {
               }
               return f;
           })).collect(Collectors.toList());
       ls.stream().map(f -> f.join()).collect(Collectors.toList());
       System.out.println(String.format("cost time: %s",
           (System.currentTimeMillis() - time) / 1000));
   }
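   
   The design point here is stream laziness: in a single pipeline each 
CompletableFuture would be created and joined one element at a time, while the 
intermediate collect(Collectors.toList()) forces every future to be created 
(and therefore started) before the second stream begins joining, so the 
clustering groups actually run concurrently.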
   
   
   
   
   
   You are an expert in this field; thank you for your patient guidance. If you 
still think there is a problem, I will modify the code. Thanks again.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4178: [HUDI-2901] Fixed the bug clustering jobs cannot running in parallel

2021-12-08 Thread GitBox


xiarixiaoyao commented on a change in pull request #4178:
URL: https://github.com/apache/hudi/pull/4178#discussion_r765412301



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java
##
@@ -95,8 +95,7 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, 
HoodieEngineContext
 .map(inputGroup -> runClusteringForGroupAsync(inputGroup,
 clusteringPlan.getStrategy().getStrategyParams(),
 
Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
-instantTime))
-.map(CompletableFuture::join);
+
instantTime)).collect(Collectors.toList()).stream().map(CompletableFuture::join);

Review comment:
   @alexeykudinkin,  here is my explanation:
   1) In the original code a single stream pipeline is used, and calling join in 
that same pipeline causes the threads to execute one by one.
   2) Here there are two stream operations. The join operation is in the second 
stream pipeline, so it does not cause the threads to execute one by one.
   
   public static void main(String[] args) {
       Integer[] test = new Integer[] {0, 2, 3};
       long time = System.currentTimeMillis();
       List<CompletableFuture<Integer>> ls =
           Arrays.stream(test).map(f -> CompletableFuture.supplyAsync(() -> {
               System.out.println(Thread.currentThread().getName());
               try {
                   Thread.sleep(1);
               } catch (InterruptedException e) {
               }
               return f;
           })).collect(Collectors.toList());
       ls.stream().map(f -> f.join()).collect(Collectors.toList());
       System.out.println(String.format("cost time: %s",
           (System.currentTimeMillis() - time) / 1000));
   }
   
   
   
   
   
   You are an expert in this field; thank you for your patient guidance. If you 
still think there is a problem, I will modify the code. Thanks again.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: updated writing pages with more details on delete and descibing the full write path (#4250)

2021-12-08 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 69e8906  updated writing pages with more details on delete and 
descibing the full write path (#4250)
69e8906 is described below

commit 69e8906193bda0bcf2f443e61ba78d71e4d64e1b
Author: Kyle Weller 
AuthorDate: Wed Dec 8 19:43:02 2021 -0800

updated writing pages with more details on delete and descibing the full 
write path (#4250)
---
 website/docs/write_operations.md | 62 ++-
 website/docs/writing_data.md | 93 +---
 2 files changed, 137 insertions(+), 18 deletions(-)

diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index 06176e5..ccdac23 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -5,16 +5,56 @@ toc: true
 last_modified_at:
 ---
 
-It may be helpful to understand the 3 different write operations provided by 
Hudi datasource or the delta streamer tool and how best to leverage them. These 
operations
-can be chosen/changed across each commit/deltacommit issued against the table.
+It may be helpful to understand the different write operations of Hudi and how 
best to leverage them. These operations
+can be chosen/changed across each commit/deltacommit issued against the table. 
See the [How To docs on Writing Data](/docs/writing_data) 
+to see more examples.
 
+## Operation Types
+### UPSERT 
+This is the default operation where the input records are first tagged as 
inserts or updates by looking up the index.
+The records are ultimately written after heuristics are run to determine how 
best to pack them on storage to optimize for things like file sizing.
+This operation is recommended for use-cases like database change capture where 
the input almost certainly contains updates. The target table will never show 
duplicates. 
 
-- **UPSERT** : This is the default operation where the input records are first 
tagged as inserts or updates by looking up the index.
-  The records are ultimately written after heuristics are run to determine how 
best to pack them on storage to optimize for things like file sizing.
-  This operation is recommended for use-cases like database change capture 
where the input almost certainly contains updates. The target table will never 
show duplicates.
-- **INSERT** : This operation is very similar to upsert in terms of 
heuristics/file sizing but completely skips the index lookup step. Thus, it can 
be a lot faster than upserts
-  for use-cases like log de-duplication (in conjunction with options to filter 
duplicates mentioned below). This is also suitable for use-cases where the 
table can tolerate duplicates, but just
-  need the transactional writes/incremental pull/storage management 
capabilities of Hudi.
-- **BULK_INSERT** : Both upsert and insert operations keep input records in 
memory to speed up storage heuristics computations faster (among other things) 
and thus can be cumbersome for
-  initial loading/bootstrapping a Hudi table at first. Bulk insert provides 
the same semantics as insert, while implementing a sort-based data writing 
algorithm, which can scale very well for several hundred TBs
-  of initial load. However, this just does a best-effort job at sizing files 
vs guaranteeing file sizes like inserts/upserts do.
+### INSERT
+This operation is very similar to upsert in terms of heuristics/file sizing 
but completely skips the index lookup step. Thus, it can be a lot faster than 
upserts
+for use-cases like log de-duplication (in conjunction with options to filter 
duplicates mentioned below). This is also suitable for use-cases where the 
table can tolerate duplicates, but just
+need the transactional writes/incremental pull/storage management capabilities 
of Hudi.
+
+### BULK_INSERT
+Both upsert and insert operations keep input records in memory to speed up 
storage heuristics computations faster (among other things) and thus can be 
cumbersome for
+initial loading/bootstrapping a Hudi table at first. Bulk insert provides the 
same semantics as insert, while implementing a sort-based data writing 
algorithm, which can scale very well for several hundred TBs
+of initial load. However, this just does a best-effort job at sizing files vs 
guaranteeing file sizes like inserts/upserts do.
+
+### DELETE
+Hudi supports implementing two types of deletes on data stored in Hudi tables, 
by enabling the user to specify a different record payload implementation.
+- **Soft Deletes** : Retain the record key and just null out the values for 
all the other fields.
+  This can be achieved by ensuring the appropriate fields are nullable in the 
table schema and simply upserting the table after setting these fields to null.
+- **Hard Deletes** : A stronger form of deletion is 

[GitHub] [hudi] bhasudha merged pull request #4250: [HUDI-2956] - Updating write docs for deletes and full write path description

2021-12-08 Thread GitBox


bhasudha merged pull request #4250:
URL: https://github.com/apache/hudi/pull/4250


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codope commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType

2021-12-08 Thread GitBox


codope commented on a change in pull request #4253:
URL: https://github.com/apache/hudi/pull/4253#discussion_r765408801



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
##
@@ -46,13 +53,32 @@
   public HoodieRowParquetWriteSupport(Configuration conf, StructType 
structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) {
 super();
 Configuration hadoopConf = new Configuration(conf);
-hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
writeConfig.parquetWriteLegacyFormatEnabled());
+hadoopConf.set("spark.sql.parquet.writeLegacyFormat", 
findSmallPrecisionDecimalType(structType) ? "true" : 
writeConfig.parquetWriteLegacyFormatEnabled());

Review comment:
   If `findSmallPrecisionDecimalType` returns false and 
`parquetWriteLegacyFormatEnabled` is set to true in the config, then 
`spark.sql.parquet.writeLegacyFormat` will be true. Is that the intention?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated (6706c1c -> d22db6c)

2021-12-08 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a change to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 6706c1c  [MINOR] Updating configs to reflect dynamodb based lock 
configs (#4262)
 add d22db6c  [HUDI-2827] - Docs for DeltaStreamer SchemaProviders, 
Sources, and Checkpoints (#4235)

No new revisions were added by this update.

Summary of changes:
 website/docs/hoodie_deltastreamer.md | 129 ++-
 1 file changed, 128 insertions(+), 1 deletion(-)


[GitHub] [hudi] leesf commented on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


leesf commented on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989477004


   > +1 once CI passes
   
   And can we add a workflow that forces contributors to rebase against the 
latest master, to avoid this problem? CC @xushiyan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on a change in pull request #4067: [HUDI-2763] Metadata table records key deduplication

2021-12-08 Thread GitBox


manojpec commented on a change in pull request #4067:
URL: https://github.com/apache/hudi/pull/4067#discussion_r765408100



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlockFactory.java
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log.block;
+
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hudi.common.config.HoodieMetadataConfig;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieMetadataHFileDataBlock;
+import org.apache.hudi.metadata.HoodieMetadataPayload;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+
+import java.util.List;
+import java.util.Map;
+
+public class HoodieLogBlockFactory {
+
+  /**
+   * Util method to get a data block for the requested type.
+   *
+   * @param logDataBlockFormat - Data block type
+   * @param recordList - List of records that goes in the data block
+   * @param header - data block header
+   * @param tableConfig- Table config
+   * @param metadataConfig - Metadata config
+   * @param tableBasePath  - Table base path
+   * @param populateMetaFields - Whether to populate meta fields in the record
+   * @return Data block of the requested type.
+   */
+  public static HoodieLogBlock getBlock(HoodieLogBlock.HoodieLogBlockType 
logDataBlockFormat,
+List<IndexedRecord> recordList,
+Map<HoodieLogBlock.HeaderMetadataType, String> header,
+HoodieTableConfig tableConfig, 
HoodieMetadataConfig metadataConfig,
+String tableBasePath, boolean 
populateMetaFields) {
+final boolean isMetadataKeyDeDuplicate = 
metadataConfig.getRecordKeyDeDuplicate()
+&& HoodieTableMetadata.isMetadataTable(tableBasePath);
+String keyField;
+if (populateMetaFields) {
+  keyField = (isMetadataKeyDeDuplicate
+  ? HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY : 
HoodieRecord.RECORD_KEY_METADATA_FIELD);
+} else {
+  keyField = tableConfig.getRecordKeyFieldProp();
+}
+return getBlock(logDataBlockFormat, recordList, header, keyField, 
isMetadataKeyDeDuplicate);
+  }
+
+  /**
+   * Util method to get a data block for the requested type.
+   *
+   * @param logDataBlockFormat   - Data block type
+   * @param recordList   - List of records that goes in the data 
block
+   * @param header   - data block header
+   * @param keyField - FieldId to get the key from the records
+   * @param isMetadataKeyDeDuplicate - Whether metadata key de duplication 
needed
+   * @return Data block of the requested type.
+   */
+  private static HoodieLogBlock getBlock(HoodieLogBlock.HoodieLogBlockType 
logDataBlockFormat, List<IndexedRecord> recordList,
+ 
Map<HoodieLogBlock.HeaderMetadataType, String> header, String keyField,
+ boolean isMetadataKeyDeDuplicate) {
+switch (logDataBlockFormat) {
+  case AVRO_DATA_BLOCK:
+return new HoodieAvroDataBlock(recordList, header, keyField);
+  case HFILE_DATA_BLOCK:
+if (isMetadataKeyDeDuplicate) {
+  return new HoodieMetadataHFileDataBlock(recordList, header, 
keyField);

Review comment:
   sure, I can pull the key deduplication logic up to a higher layer. Will post 
a revision. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


yihua commented on a change in pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#discussion_r765217559



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sort;
+
+import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.optimize.HilbertCurveUtils;
+import org.apache.hudi.optimize.ZOrderingUtil;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.Row$;
+import org.apache.spark.sql.hudi.execution.RangeSampleSort$;
+import org.apache.spark.sql.hudi.execution.ZorderingBinarySort;
+import org.apache.spark.sql.types.BinaryType;
+import org.apache.spark.sql.types.BinaryType$;
+import org.apache.spark.sql.types.BooleanType;
+import org.apache.spark.sql.types.ByteType;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DateType;
+import org.apache.spark.sql.types.DecimalType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.FloatType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.LongType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.ShortType;
+import org.apache.spark.sql.types.StringType;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.StructType$;
+import org.apache.spark.sql.types.TimestampType;
+import org.davidmoten.hilbert.HilbertCurve;
+import scala.collection.JavaConversions;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class SpaceCurveSortingHelper {
+
+  private static final Logger LOG = 
LogManager.getLogger(SpaceCurveSortingHelper.class);
+
+  /**
+   * Orders provided {@link Dataset} by mapping values of the provided list of 
columns
+   * {@code orderByCols} onto a specified space curve (Z-curve, Hilbert, etc)
+   *
+   * 
+   * NOTE: Only support base data-types: 
long,int,short,double,float,string,timestamp,decimal,date,byte.
+   *   This method is more effective than {@link 
#orderDataFrameBySamplingValues} leveraging
+   *   data sampling instead of direct mapping
+   *
+   * @param df Spark {@link Dataset} holding data to be ordered
+   * @param orderByCols list of columns to be ordered by
+   * @param targetPartitionCount target number of output partitions
+   * @param layoutOptStrategy target layout optimization strategy
+   * @return a {@link Dataset} holding data ordered by mapping tuple of values 
from provided columns
+   * onto a specified space-curve
+   */
+  public static Dataset<Row> orderDataFrameByMappingValues(
+  Dataset<Row> df,
+  HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy,
+  List<String> orderByCols,
+  int targetPartitionCount
+  ) {
+Map<String, StructField> columnsMap =
+Arrays.stream(df.schema().fields())
+.collect(Collectors.toMap(StructField::name, Function.identity()));
+
+List<String> checkCols =
+orderByCols.stream()
+.filter(columnsMap::containsKey)
+.collect(Collectors.toList());
+
+if (orderByCols.size() != checkCols.size()) {
+  LOG.error(String.format("Trying to ordering over a column(s) not present 
in the schema (%s); skipping", CollectionUtils.diff(orderByCols, checkCols)));
+  return df;
+}
+
+// In case when there's just one column to be ordered by, we can skip 
space-curve
+// ordering altogether (since it will match linear ordering anyway)
+if (orderByCols.size() == 1) {
+  String orderByColName = orderByCols.get(0);
+  LOG.debug(String.format("Single column to order by (%s), skipping 
space-curve ordering", orderByColName));
+
+  // TODO validate if 

[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


yihua commented on a change in pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#discussion_r765407715



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java
##
@@ -0,0 +1,260 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sort;
+
+import org.apache.hudi.common.util.CollectionUtils;
+import org.apache.hudi.config.HoodieClusteringConfig;
+import org.apache.hudi.optimize.HilbertCurveUtils;
+import org.apache.hudi.optimize.ZOrderingUtil;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.Row$;
+import org.apache.spark.sql.hudi.execution.RangeSampleSort$;
+import org.apache.spark.sql.hudi.execution.ZorderingBinarySort;
+import org.apache.spark.sql.types.BinaryType;
+import org.apache.spark.sql.types.BinaryType$;
+import org.apache.spark.sql.types.BooleanType;
+import org.apache.spark.sql.types.ByteType;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DateType;
+import org.apache.spark.sql.types.DecimalType;
+import org.apache.spark.sql.types.DoubleType;
+import org.apache.spark.sql.types.FloatType;
+import org.apache.spark.sql.types.IntegerType;
+import org.apache.spark.sql.types.LongType;
+import org.apache.spark.sql.types.Metadata;
+import org.apache.spark.sql.types.ShortType;
+import org.apache.spark.sql.types.StringType;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+import org.apache.spark.sql.types.StructType$;
+import org.apache.spark.sql.types.TimestampType;
+import org.davidmoten.hilbert.HilbertCurve;
+import scala.collection.JavaConversions;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+public class SpaceCurveSortingHelper {
+
+  private static final Logger LOG = 
LogManager.getLogger(SpaceCurveSortingHelper.class);
+
+  /**
+   * Orders provided {@link Dataset} by mapping values of the provided list of 
columns
+   * {@code orderByCols} onto a specified space curve (Z-curve, Hilbert, etc)
+   *
+   * 
+   * NOTE: Only support base data-types: 
long,int,short,double,float,string,timestamp,decimal,date,byte.
+   *   This method is more effective than {@link 
#orderDataFrameBySamplingValues} leveraging
+   *   data sampling instead of direct mapping
+   *
+   * @param df Spark {@link Dataset} holding data to be ordered
+   * @param orderByCols list of columns to be ordered by
+   * @param targetPartitionCount target number of output partitions
+   * @param layoutOptStrategy target layout optimization strategy
+   * @return a {@link Dataset} holding data ordered by mapping tuple of values 
from provided columns
+   * onto a specified space-curve
+   */
+  public static Dataset<Row> orderDataFrameByMappingValues(
+  Dataset<Row> df,
+  HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy,
+  List<String> orderByCols,
+  int targetPartitionCount
+  ) {
+Map<String, StructField> columnsMap =
+Arrays.stream(df.schema().fields())
+.collect(Collectors.toMap(StructField::name, Function.identity()));
+
+List<String> checkCols =
+orderByCols.stream()
+.filter(columnsMap::containsKey)
+.collect(Collectors.toList());
+
+if (orderByCols.size() != checkCols.size()) {
+  LOG.error(String.format("Trying to ordering over a column(s) not present 
in the schema (%s); skipping", CollectionUtils.diff(orderByCols, checkCols)));
+  return df;
+}

Review comment:
   Got it.  Sg.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha merged pull request #4235: [HUDI-2827] - Docs for DeltaStreamer SchemaProviders, Sources, and Checkpoints

2021-12-08 Thread GitBox


bhasudha merged pull request #4235:
URL: https://github.com/apache/hudi/pull/4235


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


yihua commented on a change in pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#discussion_r765213822



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
##
@@ -621,4 +495,5 @@ public static String createIndexMergeSql(
 String.format("%s.%s", newIndexTable, columns.get(0))
 );
   }
+

Review comment:
   nit: remove empty line?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


yihua commented on a change in pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#discussion_r765407226



##
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
##
@@ -79,13 +71,11 @@
 import java.util.stream.Collectors;
 import java.util.stream.StreamSupport;
 
-import scala.collection.JavaConversions;
-
 import static org.apache.hudi.util.DataTypeUtils.areCompatible;
 
-public class ZOrderingIndexHelper {
+public class ColumnStatsIndexHelper {

Review comment:
   Sg




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989474190


   
   ## CI report:
   
   * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989472984


   
   ## CI report:
   
   * 4b3807f2900ed4719457ed3738cb387006b088f6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


vinothchandar commented on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989473251


   +1 once CI passes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4263:
URL: https://github.com/apache/hudi/pull/4263#issuecomment-989472984


   
   ## CI report:
   
   * 4b3807f2900ed4719457ed3738cb387006b088f6 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf opened a new pull request #4263: [MINOR] Fix Compile broken

2021-12-08 Thread GitBox


leesf opened a new pull request #4263:
URL: https://github.com/apache/hudi/pull/4263


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   
   CC @guanziyue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2968:
-
Description: 
Allow to delete/update using non-pk fields
{code:java}
create table h0 (
  id int,
  name string,
  price double
) using hudi 
options (primaryKey = 'id');

update h0 set price = 10 where name = 'foo'; 
delete from h0 where name = 'foo';

{code}

  was:
Allow to delete/update a non-pk table.
{code:java}
create table h0 (
  id int,
  name string,
  price double
) using hudi;

delete from h0 where id = 10;
update h0 set price = 10 where id = 12;

{code}


> Support Delete/Update using non-pk fields
> -
>
> Key: HUDI-2968
> URL: https://issues.apache.org/jira/browse/HUDI-2968
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Allow to delete/update using non-pk fields
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi 
> options (primaryKey = 'id');
> update h0 set price = 10 where name = 'foo'; 
> delete from h0 where name = 'foo';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2968:
-
Priority: Blocker  (was: Critical)

> Support Delete/Update using non-pk fields
> -
>
> Key: HUDI-2968
> URL: https://issues.apache.org/jira/browse/HUDI-2968
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Allow to delete/update using non-pk fields
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi 
> options (primaryKey = 'id');
> update h0 set price = 10 where name = 'foo'; 
> delete from h0 where name = 'foo';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2968) Support Delete/Update using non-pk fields

2021-12-08 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-2968:


 Summary: Support Delete/Update using non-pk fields
 Key: HUDI-2968
 URL: https://issues.apache.org/jira/browse/HUDI-2968
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: pengzhiwei
Assignee: Yann Byron


Allow to delete/update a non-pk table.
{code:java}
create table h0 (
  id int,
  name string,
  price double
) using hudi;

delete from h0 where id = 10;
update h0 set price = 10 where id = 12;

{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2968:
-
Fix Version/s: 0.11.0

> Support Delete/Update using non-pk fields
> -
>
> Key: HUDI-2968
> URL: https://issues.apache.org/jira/browse/HUDI-2968
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: Yann Byron
>Priority: Critical
> Fix For: 0.11.0
>
>
> Allow to delete/update a non-pk table.
> {code:java}
> create table h0 (
>   id int,
>   name string,
>   price double
> ) using hudi;
> delete from h0 where id = 10;
> update h0 set price = 10 where id = 12;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#issuecomment-989469074


   
   ## CI report:
   
   * 42210d2837e5690b12baa7a50e8a2e6cea96aec2 UNKNOWN
   * 4db8444e25fee92d22f0f2e06d1cdff511b05eaf Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4108)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4106:
URL: https://github.com/apache/hudi/pull/4106#issuecomment-989445501


   
   ## CI report:
   
   * 42210d2837e5690b12baa7a50e8a2e6cea96aec2 UNKNOWN
   * 1b836c6b3457c9d07d0ac4cc684c5046062d5b47 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4107)
 
   * 4db8444e25fee92d22f0f2e06d1cdff511b05eaf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4108)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1912) Presto defaults to GenericHiveRecordCursor for all Hudi tables

2021-12-08 Thread Rajesh Mahindra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Mahindra updated HUDI-1912:
--
Status: Resolved  (was: Patch Available)

> Presto defaults to GenericHiveRecordCursor for all Hudi tables
> --
>
> Key: HUDI-1912
> URL: https://issues.apache.org/jira/browse/HUDI-1912
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Affects Versions: 0.7.0
>Reporter: satish
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.11.0
>
>
> See code here 
> https://github.com/prestodb/presto/blob/2ad67dcf000be86ebc5ff7732bbb9994c8e324a8/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L168
> Starting Hudi 0.7, HoodieInputFormat comes with 
> UseRecordReaderFromInputFormat annotation. As a result, we are skipping all 
> optimizations in parquet PageSource and using basic GenericHiveRecordCursor 
> which has several limitations:
> 1) No support for timestamp
> 2) No support for synthesized columns
> 3) No support for vectorized reading?
> Example errors we saw:
> Error#1
> {code}
> java.lang.IllegalStateException: column type must be regular
>   at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:507)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.(GenericHiveRecordCursor.java:167)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:79)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:449)
>   at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:177)
>   at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:63)
>   at 
> com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:80)
>   at 
> com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:231)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>   at 
> com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077)
>   at 
> com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162)
>   at 
> com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:545)
>   at 
> com.facebook.presto.$gen.Presto_0_247_17f857e20210506_210241_1.run(Unknown
>  Source)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>   at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>   at java.base/java.lang.Thread.run(Thread.java:834) 
> {code}
> Error#2
> {code}
> java.lang.ClassCastException: class org.apache.hadoop.io.LongWritable cannot 
> be cast to class org.apache.hadoop.hive.serde2.io.TimestampWritable 
> (org.apache.hadoop.io.LongWritable and 
> org.apache.hadoop.hive.serde2.io.TimestampWritable are in unnamed module of 
> loader com.facebook.presto.server.PluginClassLoader @5c4e86e7)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:25)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseLongColumn(GenericHiveRecordCursor.java:286)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.parseColumn(GenericHiveRecordCursor.java:550)
>   at 
> com.facebook.presto.hive.GenericHiveRecordCursor.isNull(GenericHiveRecordCursor.java:508)
>   at 
> com.facebook.presto.hive.HiveRecordCursor.isNull(HiveRecordCursor.java:233)
>   at 
> com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:112)
>   at 
> com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:251)
>   at com.facebook.presto.operator.Driver.processInternal(Driver.java:418)
>   at 
> com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301)
>   at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722)
>   at com.facebook.presto.operator.Driver.processFor(Driver.java:294)
>  

[hudi] branch asf-site updated: [MINOR] Updating configs to reflect dynamodb based lock configs (#4262)

2021-12-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 6706c1c  [MINOR] Updating configs to reflect dynamodb based lock 
configs (#4262)
6706c1c is described below

commit 6706c1c708bcddd1c46850f7ba2fd77949965989
Author: Sivabalan Narayanan 
AuthorDate: Wed Dec 8 22:12:32 2021 -0500

[MINOR] Updating configs to reflect dynamodb based lock configs (#4262)
---
 website/docs/configurations.md | 63 +-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/website/docs/configurations.md b/website/docs/configurations.md
index 9cad2d9..02363e7 100644
--- a/website/docs/configurations.md
+++ b/website/docs/configurations.md
@@ -4,7 +4,7 @@ keywords: [ configurations, default, flink options, spark, 
configs, parameters ]
 permalink: /docs/configurations.html
 summary: This page covers the different ways of configuring your job to 
write/read Hudi tables. At a high level, you can control behaviour at a few 
levels.
 toc: true
-last_modified_at: 2021-12-08T09:59:32.441
+last_modified_at: 2021-12-08T17:24:42.348
 ---
 
 This page covers the different ways of configuring your job to write/read Hudi 
tables. At a high level, you can control behaviour at a few levels.
@@ -1517,6 +1517,67 @@ Configurations that control aspects around writing, 
sizing, reading base and log
 
 ---
 
+### DynamoDB based Locks Configurations {#DynamoDB-based-Locks-Configurations}
+
+Configs that control the DynamoDB based locking mechanism required for 
concurrency control between writers to a Hudi table. Concurrency between 
Hudi's own table services is auto-managed internally.
+
+`Config Class`: org.apache.hudi.config.DynamoDbBasedLockConfig
+>  hoodie.write.lock.dynamodb.billing_mode
+> For DynamoDB based lock provider, the billing mode of the lock table; by 
default it is PAY_PER_REQUEST
+> **Default Value**: PAY_PER_REQUEST (Optional)
+> `Config Param: DYNAMODB_LOCK_BILLING_MODE`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.table
+> For DynamoDB based lock provider, the name of the DynamoDB table acting as the 
lock table
+> **Default Value**: N/A (Required)
+> `Config Param: DYNAMODB_LOCK_TABLE_NAME`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.region
+> For DynamoDB based lock provider, the region used in the endpoint for the Amazon 
DynamoDB service. Hudi first tries to read it from the AWS_REGION environment 
variable; if not found, it defaults to us-east-1
+> **Default Value**: us-east-1 (Optional)
+> `Config Param: DYNAMODB_LOCK_REGION`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.partition_key
+> For DynamoDB based lock provider, the partition key for the DynamoDB lock 
table. Each Hudi dataset should have its own unique key, so that concurrent writers 
to the same dataset refer to the same partition key. By default the Hudi table name 
is used as the partition key
+> **Default Value**: N/A (Required)
+> `Config Param: DYNAMODB_LOCK_PARTITION_KEY`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.write_capacity
+> For DynamoDB based lock provider, write capacity units when using 
PROVISIONED billing mode
+> **Default Value**: 10 (Optional)
+> `Config Param: DYNAMODB_LOCK_WRITE_CAPACITY`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.table_creation_timeout
+> For DynamoDB based lock provider, the maximum number of milliseconds to wait 
for creation of the DynamoDB lock table
+> **Default Value**: 60 (Optional)
+> `Config Param: DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT`
+> `Since Version: 0.10.0`
+
+---
+
+>  hoodie.write.lock.dynamodb.read_capacity
+> For DynamoDB based lock provider, read capacity units when using PROVISIONED 
billing mode
+> **Default Value**: 20 (Optional)
+> `Config Param: DYNAMODB_LOCK_READ_CAPACITY`
+> `Since Version: 0.10.0`
+
+---
+
 ### Metadata Configs {#Metadata-Configs}
 
 Configurations used by the Hudi Metadata Table. This table maintains the 
metadata about a given Hudi table (e.g., file listings) to avoid the overhead of 
accessing cloud storage during queries.
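
For readers wiring these configs up, here is a minimal PySpark sketch of a write guarded by the DynamoDB lock provider, assuming the hudi-aws bundle and AWS credentials are already on the classpath; the table name, record key, lock table, and `basePath` are hypothetical placeholders, not values taken from the commit above.

```python
# A minimal sketch, assuming `spark`, `df`, and `basePath` already exist.
# All names below are hypothetical placeholders.
hudi_lock_options = {
    'hoodie.table.name': 'my_table',
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.operation': 'upsert',
    # Multi-writer support requires optimistic concurrency control
    # and lazy cleaning of failed writes.
    'hoodie.write.concurrency.mode': 'optimistic_concurrency_control',
    'hoodie.cleaner.policy.failed.writes': 'LAZY',
    'hoodie.write.lock.provider':
        'org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider',
    # The DynamoDB lock configs documented above.
    'hoodie.write.lock.dynamodb.table': 'hudi-lock-table',
    'hoodie.write.lock.dynamodb.partition_key': 'my_table',
    'hoodie.write.lock.dynamodb.region': 'us-east-1',
    'hoodie.write.lock.dynamodb.billing_mode': 'PAY_PER_REQUEST',
}

df.write.format("hudi") \
    .options(**hudi_lock_options) \
    .mode("append") \
    .save(basePath)
```

Note that under PAY_PER_REQUEST billing, the read/write capacity configs above are ignored; they only apply when using the PROVISIONED billing mode.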


[GitHub] [hudi] nsivabalan merged pull request #4262: [MINOR] Updating configs to reflect dynamoDB based lock configs

2021-12-08 Thread GitBox


nsivabalan merged pull request #4262:
URL: https://github.com/apache/hudi/pull/4262


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #4260: [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study

2021-12-08 Thread GitBox


vinothchandar commented on pull request #4260:
URL: https://github.com/apache/hudi/pull/4260#issuecomment-989465646


   This is a good start. We can keep improving this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on pull request #4259: [HUDI-2962] Local process lock provider to guard single writer process with async table operations + Enabling metadata table

2021-12-08 Thread GitBox


manojpec commented on pull request #4259:
URL: https://github.com/apache/hudi/pull/4259#issuecomment-989463817


   The failed CI job passed in the re-run - 
   
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=4106&view=results


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2467) Delete data is not working with 0.9.0 and pySpark

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2467:
-
Parent: (was: HUDI-1658)
Issue Type: Bug  (was: Sub-task)

> Delete data is not working with 0.9.0 and pySpark
> -
>
> Key: HUDI-2467
> URL: https://issues.apache.org/jira/browse/HUDI-2467
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Phil Chen
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.11.0
>
>
> Following this spark guide:
> [https://hudi.apache.org/docs/quick-start-guide/]
> Everything works until deleting data.
> I'm using PySpark with Spark 3.1.2 and Python 3.9
> {code:java}
> # pyspark
> # fetch total records count
> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
> # fetch two records to be deleted
> ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
> # issue deletes
> hudi_delete_options = {
>   'hoodie.table.name': tableName,
>   'hoodie.datasource.write.recordkey.field': 'uuid',
>   'hoodie.datasource.write.partitionpath.field': 'partitionpath',
>   'hoodie.datasource.write.table.name': tableName,
>   'hoodie.datasource.write.operation': 'delete',
>   'hoodie.datasource.write.precombine.field': 'ts',
>   'hoodie.upsert.shuffle.parallelism': 2,
>   'hoodie.insert.shuffle.parallelism': 2
> }
> from pyspark.sql.functions import lit
> deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
> df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 
> 'partitionpath']).withColumn('ts', lit(0.0))
> df.write.format("hudi"). \  
>   options(**hudi_delete_options). \
>   mode("append"). \
>   save(basePath)
> # run the same read query as above.
> roAfterDeleteViewDF = spark. \
>   read. \
>   format("hudi"). \
>   load(basePath) 
> roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
> # fetch should return (total - 2) records
> spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count(){code}
> The count before delete is 10 and after delete is still 10 (expecting 8)
> {code:java}
> >>> df.show()
> +--------------------+--------------------+---+
> |                uuid|       partitionpath| ts|
> +--------------------+--------------------+---+
> |74bed794-c854-4aa...|americas/united_s...|0.0|
> |ce71c2dc-dedf-483...|americas/united_s...|0.0|
> +--------------------+--------------------+---+
> {code}
>  
> The 2 records to be deleted
> Note, the 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
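
As an editorial aside on the issue above: Hudi's delete documentation also describes a hard-delete path that upserts the keys with an empty payload instead of using `operation=delete`. Below is a minimal hedged sketch of that variant, reusing the `df` of (uuid, partitionpath, ts) rows built in the reproduction above; it is a sketch under those assumptions, not a confirmed fix for this bug.

```python
# A minimal sketch, assuming the same `df`, `tableName`, and `basePath`
# as in the issue's reproduction above.
hudi_hard_delete_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    # An empty payload causes the matching records to be deleted on merge.
    'hoodie.datasource.write.payload.class':
        'org.apache.hudi.common.model.EmptyHoodieRecordPayload',
}

df.write.format("hudi") \
    .options(**hudi_hard_delete_options) \
    .mode("append") \
    .save(basePath)
```

If the count is still unchanged after this, checking that the delete frame's keys and partition paths exactly match the stored `_hoodie_record_key` / `_hoodie_partition_path` values is a reasonable next step.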


[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-989459744


   
   ## CI report:
   
   * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
 
   * a085e101422d1df36b94127e75e5d60716986e69 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-989458305


   
   ## CI report:
   
   * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
 
   * a085e101422d1df36b94127e75e5d60716986e69 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from a bad no

2021-12-08 Thread GitBox


nsivabalan commented on issue #4055:
URL: https://github.com/apache/hudi/issues/4055#issuecomment-989459731


   @guanlisheng : Can you please file a new github issue with all details and 
CC me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-864:

Fix Version/s: (was: 0.11.0)

> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, Spark Integration
>Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
>Reporter: Roland Johann
>Priority: Critical
>  Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 
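
For anyone trying to reproduce HUDI-864, here is a minimal PySpark sketch of the array-of-struct shape from the report; the table name, key fields, and path are hypothetical placeholders, and per the report it is the second (merge) batch that raises the error.

```python
from pyspark.sql import Row

# A minimal sketch; all names and the path are hypothetical placeholders.
rows = [Row(id='a', ts=1, part='p1',
            categoryResults=[Row(categoryId='c1')])]
df = spark.createDataFrame(rows)

hudi_options = {
    'hoodie.table.name': 'schema_conflict_test',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'part',
    'hoodie.datasource.write.precombine.field': 'ts',
}

path = '/tmp/schema_conflict_test'
# The first batch (insert) succeeds; per the report, the second batch,
# which merges against the existing parquet files, is the one that
# fails with "optional binary ... is not a group".
df.write.format("hudi").options(**hudi_options).mode("overwrite").save(path)
df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```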

[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group

2021-12-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-864:

Component/s: Writer Core

> parquet schema conflict: optional binary  (UTF8) is not a group
> ---
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core, Spark Integration, Writer Core
>Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
>Reporter: Roland Johann
>Priority: Critical
>  Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryResults",
>   "type": {
> "type": "array",
> "elementType": {
>   "type": "struct",
>   "fields": [
> {
>   "name": "categoryId",
>   "type": "string",
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> },
> "containsNull": true
>   },
>   "nullable": true,
>   "metadata": {}
> }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] 
> commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error 
> upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: operation has failed
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at 
> org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at 
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at 

[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot commented on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-989458305


   
   ## CI report:
   
   * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
 
   * a085e101422d1df36b94127e75e5d60716986e69 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path

2021-12-08 Thread GitBox


hudi-bot removed a comment on pull request #4222:
URL: https://github.com/apache/hudi/pull/4222#issuecomment-986406347


   
   ## CI report:
   
   * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

2021-12-08 Thread GitBox


nsivabalan edited a comment on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674


   Looks like the support was only ever added to the datasource path, not to 
deltastreamer. I have filed a 
tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If 
either of you is interested in working towards it, let me know. I can guide 
you. We can get it in for 0.11. 
   Since we have a tracking jira, I will close the github issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

2021-12-08 Thread GitBox


nsivabalan commented on issue #3945:
URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674


   Looks like the support was only ever added to the datasource path, not to 
deltastreamer. I have filed a 
tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If you 
are interested in working towards it, let me know. I can guide you. We can get 
it in for 0.11. 
   Since we have a tracking jira, I will close the github issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

2021-12-08 Thread GitBox


nsivabalan closed issue #3945:
URL: https://github.com/apache/hudi/issues/3945


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2967) Support drop partition columns in deltastreamer

2021-12-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-2967:
--
Fix Version/s: 0.11.0

> Support drop partition columns in deltastreamer
> ---
>
> Key: HUDI-2967
> URL: https://issues.apache.org/jira/browse/HUDI-2967
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.11.0
>
>
> We added support to drop partition columns in the Spark datasource, but the 
> support has not been added to deltastreamer. Creating a ticket to add the 
> support. 
> [https://github.com/apache/hudi/commit/968927801470953f137368cf146778a7f01aa63f]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-2967) Support drop partition columns in deltastreamer

2021-12-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-2967:
-

Assignee: sivabalan narayanan

> Support drop partition columns in deltastreamer
> ---
>
> Key: HUDI-2967
> URL: https://issues.apache.org/jira/browse/HUDI-2967
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.11.0
>
>
> We added support to drop partition columns in the Spark datasource, but the 
> support has not been added to deltastreamer. Creating a ticket to add the 
> support. 
> [https://github.com/apache/hudi/commit/968927801470953f137368cf146778a7f01aa63f]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-2967) Support drop partition columns in deltastreamer

2021-12-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-2967:
-

 Summary: Support drop partition columns in deltastreamer
 Key: HUDI-2967
 URL: https://issues.apache.org/jira/browse/HUDI-2967
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: sivabalan narayanan


We added support to drop partition columns in the Spark datasource, but the support 
has not been added to deltastreamer. Creating a ticket to add the support. 

[https://github.com/apache/hudi/commit/968927801470953f137368cf146778a7f01aa63f]

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
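
For reference, the Spark datasource path where this option is already supported (per the commit linked in HUDI-2967 above) looks roughly like the following hedged sketch; the table name, fields, and path are hypothetical placeholders.

```python
# A minimal sketch of the Spark datasource path; all names and the
# path are hypothetical placeholders, and `df` is assumed to exist.
hudi_options = {
    'hoodie.table.name': 'events',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.precombine.field': 'version',
    'hoodie.datasource.write.partitionpath.field': 'event_date',
    # Keep the partition value in the storage path, but drop the
    # column from the written parquet files.
    'hoodie.datasource.write.drop.partition.columns': 'true',
}

df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save('/tmp/events')
```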


[GitHub] [hudi] guanlisheng edited a comment on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from

2021-12-08 Thread GitBox


guanlisheng edited a comment on issue #4055:
URL: https://github.com/apache/hudi/issues/4055#issuecomment-989456516


   hey @nsivabalan , thanks for the reply. 
   
   Very similar; the only difference is the dataset operation vs. a Spark 
SQL query. 
   An additional clue is that the transformer works well with HoodieIncrSource, 
while the issue happens with JsonKafkaSource.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (bd08470 -> 9c8ad0f)

2021-12-08 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from bd08470  [HUDI-2957] Shade kryo jar for flink bundle jar (#4251)
 add 9c8ad0f  [HUDI-2665] Fix overflow of huge log file in 
HoodieLogFormatWriter (#3912)

No new revisions were added by this update.

Summary of changes:
 .../common/table/log/HoodieLogFormatWriter.java| 12 +++--
 .../common/functional/TestHoodieLogFormat.java | 62 ++
 2 files changed, 71 insertions(+), 3 deletions(-)


[GitHub] [hudi] guanlisheng commented on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from a bad n

2021-12-08 Thread GitBox


guanlisheng commented on issue #4055:
URL: https://github.com/apache/hudi/issues/4055#issuecomment-989456516


   hey @nsivabalan , thanks for the reply. 
   
   Very similar; the only difference is the dataset operation vs. a Spark 
SQL query. 
   An additional clue is that the transformer works well with HoodieIncrSource, 
while the issue happens with JsonKafkaSource.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf merged pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter

2021-12-08 Thread GitBox


leesf merged pull request #3912:
URL: https://github.com/apache/hudi/pull/3912


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Limess opened a new issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)

2021-12-08 Thread GitBox


Limess opened a new issue #3945:
URL: https://github.com/apache/hudi/issues/3945


   **Describe the problem you faced**
   
   We're running a deltastreamer job into a new Hudi table.
   
   We have a partition column: `story_published_partition_date`, and we set 
`hoodie.datasource.write.drop.partition.columns=true`.
   
   When the execution completes, we observe that the partition column is still 
present in the parquet file's schema and data:
   
   ```shell
   parquet-tools show 
s3:///articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet
 --head 1 --columns story_published_partition_date --awsprofile signal-prod
   ℹ 
s3:///articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet
 => 
/var/folders/lx/83dtr4vx0cs87l55pwnwk760gq/T/tmp0ypmpxcw/f39767b2-3218-4ebe-9396-9549d6998c02.parquet
   
   +--------------------------------+
   | story_published_partition_date |
   |--------------------------------|
   | 2021-01-07T09:00:00Z           |
   +--------------------------------+
   ```
   
   Configuration:
   
   ```
   "Args": [
   "spark-submit",
   "--deploy-mode",
   "cluster",
   "--class",
   "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
   "--jars",
   
"/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar,/home/hadoop/extra-jars/spark-avro_2.12-3.0.1.jar,/home/hadoop/extra-jars/hudi-spark3-bundle_2.12-0.9.0.jar",
   
   "/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar",
   "--props",
   "/etc/hudi/conf/hudi-base.properties",
   "--table-type",
   "COPY_ON_WRITE",
   "--op",
   "UPSERT ",
   "--source-ordering-field",
   "version",
   "--source-class",
   "org.apache.hudi.utilities.sources.ParquetDFSSource",
   "--transformer-class",
   "org.apache.hudi.utilities.transform.SqlFileBasedTransformer",
   "--target-base-path",
   
"s3:///articles_hudi_copy_on_write_drop_partition_column_test/",
   "--target-table",
   "articles_hudi_copy_on_write_drop_partition_column_test",
   "--enable-hive-sync",
   "--hoodie-conf",
   
"hoodie.table.name=articles_hudi_copy_on_write_drop_partition_column_test",
   "--hoodie-conf",
   
"hoodie.deltastreamer.transformer.sql.file=/etc/hudi/conf/schema/documents_schema.sql",
   "--hoodie-conf",
   "hoodie.datasource.write.recordkey.field=id",
   "--hoodie-conf",
   "hoodie.datasource.write.precombine.field=version",
   "--hoodie-conf",
   "hoodie.bloom.index.prune.by.ranges=false",
   "--hoodie-conf",
   
"hoodie.datasource.write.partitionpath.field=story_published_partition_date",
   "--hoodie-conf",
   
"hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator",
   "--hoodie-conf",
   
"hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING",
   "--hoodie-conf",
   
"hoodie.deltastreamer.keygen.timebased.input.dateformat=-MM-dd'T'HH:mm:ssZ,-MM-dd'T'HH:mm:ss.SSSZ",
   "--hoodie-conf",
   
"hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=,",
   "--hoodie-conf",
   
"hoodie.deltastreamer.keygen.timebased.output.dateformat=-MM-dd",
   "--hoodie-conf",
   "hoodie.deltastreamer.keygen.timebased.output.timezone=UTC",
   "--hoodie-conf",
   
"hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "--hoodie-conf",
   "hoodie.datasource.write.hive_style_partitioning=true",
   "--hoodie-conf",
   "hoodie.datasource.write.drop.partition.columns=true",
   "--hoodie-conf",
   "hoodie.datasource.write.reconcile.schema=true",
   "--hoodie-conf",
   "hoodie.datasource.hive_sync.enable=true",
   "--hoodie-conf",
   "hoodie.datasource.hive_sync.database=articles",
   "--hoodie-conf",
   
"hoodie.datasource.hive_sync.table=articles_hudi_copy_on_write_drop_partition_column_test",
   "--hoodie-conf",
   
"hoodie.datasource.hive_sync.partition_fields=story_published_partition_date",
   "--hoodie-conf",
   
"hoodie.deltastreamer.source.dfs.root=s3:///firehose_received_date=2021-11-08/"
  
