[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989596455

## CI report:

* b561916256de18ffca7093d2e8200ae02c945efc Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)

Bot commands
@hudi-bot supports the following commands:
- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989571491 ## CI report: * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)
[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot commented on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989592033 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4119)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot removed a comment on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989586395 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)
[GitHub] [hudi] xiarixiaoyao commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
xiarixiaoyao commented on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989590875 @hudi-bot run azure
[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot removed a comment on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989559491 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)
[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot commented on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989586395 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)
[jira] [Commented] (HUDI-2323) Upsert of Case Class with single field causes SchemaParseException
[ https://issues.apache.org/jira/browse/HUDI-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456194#comment-17456194 ]

sivabalan narayanan commented on HUDI-2323:
-------------------------------------------

There is a known bug in parquet for the same issue reported: [https://github.com/apache/parquet-mr/pull/560]

> Upsert of Case Class with single field causes SchemaParseException
> ------------------------------------------------------------------
>
>                 Key: HUDI-2323
>                 URL: https://issues.apache.org/jira/browse/HUDI-2323
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.8.0
>            Reporter: Tyler Jackson
>            Priority: Critical
>              Labels: schema, sev:critical
>             Fix For: 0.11.0
>
>         Attachments: HudiSchemaGenerationTest.scala
>
> Additional background information:
> Spark version 3.1.1, Scala version 2.12, Hudi version 0.8.0 (hudi-spark-bundle_2.12 artifact)
>
> While testing a Spark job in EMR that inserts and then upserts data for a fairly complex nested case class structure, I ran into an issue that I was having a hard time tracking down. It seems that when part of the case class in the dataframe to be written has a single field in it, the Avro schema generation fails with the following stack trace, but only on the upsert:
>
> 21/08/19 15:08:34 ERROR BoundedInMemoryExecutor: error producing records
> org.apache.avro.SchemaParseException: Can't redefine: array
> 	at org.apache.avro.Schema$Names.put(Schema.java:1128)
> 	at org.apache.avro.Schema$NamedSchema.writeNameRef(Schema.java:562)
> 	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:690)
> 	at org.apache.avro.Schema$ArraySchema.toJson(Schema.java:805)
> 	at org.apache.avro.Schema$UnionSchema.toJson(Schema.java:882)
> 	at org.apache.avro.Schema$RecordSchema.fieldsToJson(Schema.java:716)
> 	at org.apache.avro.Schema$RecordSchema.toJson(Schema.java:701)
> 	at org.apache.avro.Schema.toString(Schema.java:324)
> 	at org.apache.avro.SchemaCompatibility.checkReaderWriterCompatibility(SchemaCompatibility.java:68)
> 	at org.apache.parquet.avro.AvroRecordConverter.isElementType(AvroRecordConverter.java:866)
> 	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:475)
> 	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
> 	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
> 	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
> 	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
> 	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> 	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> 	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> 	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> 	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> 	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> 	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
> 	at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
> 	at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>
> I am able to replicate the problem in my local IntelliJ setup using the test that has been attached to this issue. The problem can be observed in the DummyStepParent case class. Simply adding an additional field to the case class eliminates the problem altogether (which is an acceptable workaround for our purposes, but shouldn't ultimately be necessary).
>
> case class DummyObject (
>   fieldOne: String,
>   listTwo: Seq[String],
>   listThree: Seq[DummyChild],
>   listFour: Seq[DummyStepChild],
>   fieldFive: Boolean,
>   listSix: Seq[DummyParent],
>   listSeven: Seq[DummyCousin],
>   listEight:
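The "Can't redefine: array" failure above comes from Avro's rule that a named type may be defined only once within a schema; when parquet-avro converts an array whose element record has a single field, the generated list wrapper and the element can both end up carrying the same name. A rough, stdlib-only model of that registry rule (the `Names` class below is a simplified stand-in for `org.apache.avro.Schema$Names`, not the real API):

```python
# Simplified model of Avro's named-type registry: registering the same
# full name twice with a different shape raises, which is what surfaces
# as "SchemaParseException: Can't redefine: array" during the upsert read.

class SchemaParseException(Exception):
    pass

class Names:
    """Hypothetical stand-in for org.apache.avro.Schema$Names."""

    def __init__(self):
        self._seen = {}

    def put(self, full_name, schema):
        # Re-registering an identical definition is fine; a different
        # definition under the same name is a redefinition error.
        if full_name in self._seen and self._seen[full_name] != schema:
            raise SchemaParseException(f"Can't redefine: {full_name}")
        self._seen[full_name] = schema

names = Names()
# parquet-avro's list conversion can generate the name "array" for the
# list wrapper record...
names.put("array", {"type": "record", "fields": ["element"]})
# ...and, for a single-field element record, "array" again with a
# different shape, triggering the redefinition error.
try:
    names.put("array", {"type": "record", "fields": ["bookName"]})
except SchemaParseException as e:
    print(e)  # Can't redefine: array
```

This is only an illustration of why a *duplicate generated name* (not the user's data) produces the exception; the real fix tracked in apache/parquet-mr#560 changes how those names are generated.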
[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot commented on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-989582099 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4118)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot removed a comment on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-989559372 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 UNKNOWN
[jira] [Commented] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group
[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456189#comment-17456189 ]

sivabalan narayanan commented on HUDI-864:
------------------------------------------

[~rolandjohann]: Can you give it a try and let us know? Setting this with the other Hudi props should work with the latest master or with 0.10.0.

> parquet schema conflict: optional binary (UTF8) is not a group
> ---------------------------------------------------------------
>
>                 Key: HUDI-864
>                 URL: https://issues.apache.org/jira/browse/HUDI-864
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Common Core, Spark Integration, Writer Core
>    Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
>            Reporter: Roland Johann
>            Priority: Critical
>              Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
>     {
>       "name": "categoryResults",
>       "type": {
>         "type": "array",
>         "elementType": {
>           "type": "struct",
>           "fields": [
>             {
>               "name": "categoryId",
>               "type": "string",
>               "nullable": true,
>               "metadata": {}
>             }
>           ]
>         },
>         "containsNull": true
>       },
>       "nullable": true,
>       "metadata": {}
>     }
>   ]
> }
> {code}
> the second ingest batch throws this exception:
> {code}
> ERROR [Executor task launch worker for task 15] commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
> 	at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
> 	at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
> 	at org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
> 	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
> 	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
> 	at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
> 	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> 	at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
> 	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> 	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
> 	at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
> 	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
> 	at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
> 	at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
> 	at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
> 	at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
> 	at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:123)
> 	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> 	at
[jira] [Commented] (HUDI-1079) Cannot upsert on schema with Array of Record with single field
[ https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456186#comment-17456186 ]

sivabalan narayanan commented on HUDI-1079:
-------------------------------------------

Unfortunately, Hudi does not do any schema fiddling in this regard; we just rely on parquet-avro to do the conversion for us, and apparently arrays of a struct type with just one field run into issues. But I also wonder why someone (even your upstream datasets) would define a struct type with just one entry. Anyway, we try our best not to re-implement this logic internally and to re-use the library as is. Let us know if you are still looking for a solution on this end.

> Cannot upsert on schema with Array of Record with single field
> ---------------------------------------------------------------
>
>                 Key: HUDI-1079
>                 URL: https://issues.apache.org/jira/browse/HUDI-1079
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>    Affects Versions: 0.9.0
>        Environment: spark 2.4.4, local
>            Reporter: Adrian Tanase
>            Priority: Critical
>              Labels: schema, sev:critical, user-support-issues
>             Fix For: 0.11.0
>
> I am trying to trigger upserts on a table that has an array field with records of just one field.
>
> Here is the code to reproduce:
> {code:scala}
> val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("SparkByExamples.com")
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate()
>
> // https://sparkbyexamples.com/spark/spark-dataframe-array-of-struct/
> val arrayStructData = Seq(
>   Row("James", List(Row("Java","XX",120), Row("Scala","XA",300))),
>   Row("Michael", List(Row("Java","XY",200), Row("Scala","XB",500))),
>   Row("Robert", List(Row("Java","XZ",400), Row("Scala","XC",250))),
>   Row("Washington", null)
> )
>
> val arrayStructSchema = new StructType()
>   .add("name", StringType)
>   .add("booksIntersted", ArrayType(
>     new StructType()
>       .add("bookName", StringType)
>       // .add("author", StringType)
>       // .add("pages", IntegerType)
>   ))
>
> val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData), arrayStructSchema)
> {code}
> Running insert followed by upsert will fail:
> {code:scala}
> df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Overwrite)
>   .save(basePath)
>
> df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Append)
>   .save(basePath)
> {code}
> If I create the books record with all the fields (at least 2), it works as expected.
>
> The relevant part of the exception is this:
> {noformat}
> Caused by: java.lang.ClassCastException: required binary bookName (UTF8) is not a group
> 	at org.apache.parquet.schema.Type.asGroupType(Type.java:207)
> 	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
> 	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
> 	at org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
> 	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.<init>(AvroRecordConverter.java:536)
> 	at org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.<init>(AvroRecordConverter.java:486)
> 	at org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
> 	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:141)
> 	at org.apache.parquet.avro.AvroRecordConverter.<init>(AvroRecordConverter.java:95)
> 	at org.apache.parquet.avro.AvroRecordMaterializer.<init>(AvroRecordMaterializer.java:33)
> 	at org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
> 	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
> 	at org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156)
> 	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135)
> 	at org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
> 	at
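The "works with at least 2 fields" behavior reported above is consistent with how parquet-avro has to disambiguate list encodings: a repeated group with exactly one field could be either the element record itself or a legacy synthetic list wrapper, while a multi-field group can only be the element. A loose, stdlib-only sketch of that idea (this is an illustration of the ambiguity, not the actual `AvroRecordConverter.isElementType` implementation):

```python
# Hypothetical simplification: deciding whether a repeated parquet group
# is the list element itself (3-level encoding) or a legacy 2-level
# wrapper around the element. A single-field group is the ambiguous case.

def is_element_type(repeated_group_field_count: int) -> bool:
    """Return True if the repeated group is treated as the element record."""
    # More than one field: the group cannot be a synthetic one-field list
    # wrapper, so it must be the element record itself (unambiguous).
    return repeated_group_field_count > 1

# Single-field record (StructType with only "bookName"): ambiguous, and a
# wrong guess makes the converter expect a group where it finds
# "required binary bookName (UTF8)".
print(is_element_type(1))  # False
# Adding a second field (the workaround above) removes the ambiguity.
print(is_element_type(2))  # True
```

The takeaway is that the workaround does not change the data semantics; it only pushes the schema out of the ambiguous single-field shape that the converter's heuristic can misread.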
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989571491 ## CI report: * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117) Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989569735 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117)
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989569735 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115) Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4117)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989557070 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115)
[GitHub] [hudi] guanziyue commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
guanziyue commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989568422 @hudi-bot run azure
[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot commented on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989559491 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4116)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot removed a comment on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989558237 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot commented on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-989559372 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN * 81698b84b31a98aee5adecc18e5e9d2ad16e3a96 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot removed a comment on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-989556962 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
hudi-bot commented on pull request #4265: URL: https://github.com/apache/hudi/pull/4265#issuecomment-989558237 ## CI report: * 31c0fd46f52922816e1e023bd9b7c9873ae31fb6 UNKNOWN
[jira] [Updated] (HUDI-2966) Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner
[ https://issues.apache.org/jira/browse/HUDI-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-2966: - Labels: pull-request-available (was: ) > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner > --- > > Key: HUDI-2966 > URL: https://issues.apache.org/jira/browse/HUDI-2966 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Fix For: 0.11.0 > > > Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScanner when > the query is completed. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] xiarixiaoyao opened a new pull request #4265: [HUDI-2966] Add TaskCompletionListener for HoodieMergeOnReadRDD to close logScaner when the query finished.
xiarixiaoyao opened a new pull request #4265: URL: https://github.com/apache/hudi/pull/4265 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989531356 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989557070 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4115) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot removed a comment on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-986559751 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4172: [HUDI-2892][BUG]Pending Clustering may stain the ActiveTimeLine and lead to incomplete query results
hudi-bot commented on pull request #4172: URL: https://github.com/apache/hudi/pull/4172#issuecomment-989556962 ## CI report: * 8740fecc1391600a836eb6158c31ab7416f57eec UNKNOWN * f65e6971ee3e9659155c6b453dec2c60e22fd29d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4041) Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4045) * 03e9cacdf292f3bf719110807cf4ebe4067e92c0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service
danny0405 commented on a change in pull request #4252: URL: https://github.com/apache/hudi/pull/4252#discussion_r765464342 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/async/HoodieAsyncService.java ## @@ -141,35 +140,17 @@ public void start(Function onShutdownCallback) { protected abstract Pair startService(); /** - * A monitor thread is started which would trigger a callback if the service is shutdown. + * Add shutdown callback for the completable future. * - * @param onShutdownCallback + * @param callback The callback */ - private void monitorThreads(Function onShutdownCallback) { -LOG.info("Submitting monitor thread !!"); -Executors.newSingleThreadExecutor(r -> { - Thread t = new Thread(r, "Monitor Thread"); - t.setDaemon(isRunInDaemonMode()); Review comment: Here the code creates a new thread pool but never shuts it down; the core worker thread lives on forever, hence the leak ~ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
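[Editor's illustration] The leak pattern described above — an ExecutorService created inline with no reference kept, so shutdown() can never be called — can be sketched with plain JDK code. This is an illustrative sketch, not the actual HoodieAsyncService code; the names leakyMonitor and boundedMonitor are made up for the example:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class MonitorShutdownSketch {

    // Leaky pattern: no reference to the pool is kept, so shutdown() can
    // never be called and the pool's core thread stays alive for the life
    // of the JVM (unless it happens to be a daemon thread).
    static void leakyMonitor(Runnable onShutdown) {
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "Monitor Thread");
            t.setDaemon(true);
            return t;
        }).submit(onShutdown);
    }

    // Fixed pattern: keep the reference, run the callback, then release the
    // worker thread by shutting the pool down. Returns true once the pool
    // has actually terminated.
    static boolean boundedMonitor(Runnable onShutdown) throws InterruptedException {
        ExecutorService monitor = Executors.newSingleThreadExecutor();
        try {
            monitor.submit(onShutdown);
        } finally {
            monitor.shutdown(); // accept no new tasks, let the in-flight one finish
        }
        return monitor.awaitTermination(10, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(boundedMonitor(() -> System.out.println("callback fired")));
    }
}
```

The fix merged in this PR replaces the monitor thread with a CompletableFuture callback; the sketch only shows why the original pattern leaked a thread.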
[GitHub] [hudi] danny0405 commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service
danny0405 commented on a change in pull request #4252: URL: https://github.com/apache/hudi/pull/4252#discussion_r765463984 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AsyncCleanerService.java ## @@ -73,6 +73,8 @@ public static void waitForCompletion(AsyncCleanerService asyncCleanerService) { asyncCleanerService.waitForShutdown(); } catch (Exception e) { throw new HoodieException("Error waiting for async cleaning to finish", e); + } finally { +asyncCleanerService.shutdown(false); } Review comment: Actually no, it is the monitor thread pool that leaks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #4252: [HUDI-2959] Fix the thread leak of cleaning service
vinothchandar commented on a change in pull request #4252: URL: https://github.com/apache/hudi/pull/4252#discussion_r765451717 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/AsyncCleanerService.java ## @@ -73,6 +73,8 @@ public static void waitForCompletion(AsyncCleanerService asyncCleanerService) { asyncCleanerService.waitForShutdown(); } catch (Exception e) { throw new HoodieException("Error waiting for async cleaning to finish", e); + } finally { +asyncCleanerService.shutdown(false); } Review comment: Guess this is the leak -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (HUDI-1894) NULL values in timestamp column defaulted
[ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-1894. --- > NULL values in timestamp column defaulted > -- > > Key: HUDI-1894 > URL: https://issues.apache.org/jira/browse/HUDI-1894 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Eldhose Paul >Assignee: sivabalan narayanan >Priority: Critical > Labels: schema, sev:critical, triaged > Fix For: 0.9.0 > > > Reading timestamp column from hudi and underlying parquet files in spark > gives different results. > *hudi properties:* > {code:java} > hdfs dfs -cat > /user/hive/warehouse/jira_expl.db/jiraissue_events/.hoodie/hoodie.properties > #Properties saved on Tue May 11 17:17:22 EDT 2021 > #Tue May 11 17:17:22 EDT 2021 > hoodie.compaction.payload.class=org.apache.hudi.common.model.OverwriteWithLatestAvroPayload > hoodie.table.name=jiraissue > hoodie.archivelog.folder=archived > hoodie.table.type=MERGE_ON_READ > hoodie.table.version=1 > hoodie.timeline.layout.version=1 > {code} > > *Reading directly from parquet using Spark:* > {code:java} > scala> val ji = > spark.read.format("parquet").load("/user/hive/warehouse/jira_expl.db/jiraissue_events/*.parquet") > ji: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, > _hoodie_commit_seqno: string ... 
49 more fields]
scala> ji.filter($"id" === 1237858).withColumn("inputfile", input_file_name()).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate", $"archiveddate", $"inputfile").show(false)
(table output garbled in the archive; both rows returned for id 1237858, from the _7-98-5106_20210511171722 and _8-1610-78711_20210511173615 parquet files of file group 832cf07f-637b-4a4c-ab08-6929554f003a-0, show resolutiondate = null and archiveddate = null)
{code}
*Reading `hudi` using Spark:*
{code:java}
scala> val jih = spark.read.format("org.apache.hudi").load("/user/hive/warehouse/jira_expl.db/jiraissue_events")
jih: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 49 more fields]
scala> jih.filter($"id" === 1237858).select($"_hoodie_commit_time", $"_hoodie_commit_seqno", $"_hoodie_record_key", $"_hoodie_partition_path", $"_hoodie_file_name", $"resolutiondate", $"archiveddate").show(false)
(table output truncated in the archive)
{code}
[jira] [Updated] (HUDI-1894) NULL values in timestamp column defaulted
[ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-1894: -- Fix Version/s: 0.9.0
[jira] [Commented] (HUDI-1894) NULL values in timestamp column defaulted
[ https://issues.apache.org/jira/browse/HUDI-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17456152#comment-17456152 ] sivabalan narayanan commented on HUDI-1894: --- Since the issue is not reproducible with 0.9.0 and later, closing the jira.
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989531356 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) * b561916256de18ffca7093d2e8200ae02c945efc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989529213 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989529213 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989527936 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989527936 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4114) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot removed a comment on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989526749 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
hudi-bot commented on pull request #4264: URL: https://github.com/apache/hudi/pull/4264#issuecomment-989526749 ## CI report: * c0d4c1ca673451e77a10cc3058d3a1df18ffc81b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] jaehyeon-kim commented on issue #4244: [SUPPORT] DeltaStreamer doesn't create a Glue/Hive table if deploy mode is cluster due to HiveConnection Failure
jaehyeon-kim commented on issue #4244: URL: https://github.com/apache/hudi/issues/4244#issuecomment-989526615 Found the following config is necessary. ``` hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://<hive-server-host>:10000 ``` It defaults to `localhost:10000`, which works in `client` deploy mode because the driver program runs on the master node, but fails in `cluster` mode because the driver program runs on a core node. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] jaehyeon-kim closed issue #4244: [SUPPORT] DeltaStreamer doesn't create a Glue/Hive table if deploy mode is cluster due to HiveConnection Failure
jaehyeon-kim closed issue #4244: URL: https://github.com/apache/hudi/issues/4244 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-2785) Create Trino setup in docker demo
[ https://issues.apache.org/jira/browse/HUDI-2785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-2785: - Labels: pull-request-available (was: ) > Create Trino setup in docker demo > - > > Key: HUDI-2785 > URL: https://issues.apache.org/jira/browse/HUDI-2785 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Ethan Guo >Assignee: Ethan Guo >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] guanziyue opened a new pull request #4264: [HUDI-2785] Make HoodieParquetWriter Thread safe and memory executor …
guanziyue opened a new pull request #4264: URL: https://github.com/apache/hudi/pull/4264 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request Fixes the problem described in https://issues.apache.org/jira/browse/HUDI-2875. ## Brief change log 1. Add synchronized to HoodieParquetWriter to make sure it is thread safe for any possible future uses. 2. Add a graceful exit for BoundedInMemoryExecutor. 3. In SparkMergeHelper, stop the BoundedInMemoryExecutor first and then close the HoodieMergeHandle. ## Verify this pull request This pull request was manually verified by running a job locally. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
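[Editor's illustration] The first change listed above — serializing access to a non-thread-safe writer — can be sketched in isolation. This is a minimal stand-in, not the actual HoodieParquetWriter code; a guarded in-memory list plays the role of the Parquet file:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of guarding a non-thread-safe writer with synchronized
// methods, so concurrent producer threads cannot interleave writes or
// write after close.
public class SynchronizedWriterSketch {
    private final List<String> sink = new ArrayList<>(); // stands in for the Parquet file
    private boolean closed = false;

    public synchronized void write(String record) {
        if (closed) {
            throw new IllegalStateException("writer already closed");
        }
        sink.add(record);
    }

    public synchronized void close() {
        closed = true;
    }

    public synchronized int recordCount() {
        return sink.size();
    }

    public static void main(String[] args) throws InterruptedException {
        SynchronizedWriterSketch writer = new SynchronizedWriterSketch();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    writer.write("record");
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        writer.close();
        System.out.println(writer.recordCount()); // 4000: no writes were lost
    }
}
```

Without the synchronized keyword, the unguarded ArrayList would drop or corrupt entries under this four-thread load, which is the failure mode the PR guards against.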
[hudi] branch master updated (9c8ad0f -> 5ac9ce7)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from 9c8ad0f [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter (#3912) add 5ac9ce7 [MINOR] Fix Compile broken (#4263) No new revisions were added by this update. Summary of changes: .../java/org/apache/hudi/common/functional/TestHoodieLogFormat.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
[GitHub] [hudi] leesf merged pull request #4263: [MINOR] Fix Compile broken
leesf merged pull request #4263: URL: https://github.com/apache/hudi/pull/4263 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken
hudi-bot commented on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989492641 ## CI report: * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4263: [MINOR] Fix Compile broken
hudi-bot removed a comment on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989474190 ## CI report: * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot removed a comment on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-989459744 ## CI report: * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040) * a085e101422d1df36b94127e75e5d60716986e69 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot commented on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-989486769 ## CI report: * a085e101422d1df36b94127e75e5d60716986e69 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimal
xiarixiaoyao commented on a change in pull request #4253: URL: https://github.com/apache/hudi/pull/4253#discussion_r765414682 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java ## @@ -46,13 +53,32 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) { super(); Configuration hadoopConf = new Configuration(conf); -hadoopConf.set("spark.sql.parquet.writeLegacyFormat", writeConfig.parquetWriteLegacyFormatEnabled()); +hadoopConf.set("spark.sql.parquet.writeLegacyFormat", findSmallPrecisionDecimalType(structType) ? "true" : writeConfig.parquetWriteLegacyFormatEnabled()); Review comment: @codope thanks for your review. Yes, I think if findSmallPrecisionDecimalType returns false, we need to respect the user's settings; if findSmallPrecisionDecimalType returns true, we need to ignore the user's choice, because that choice may lead to failures in subsequent updates of the Hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
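For readers following along, the precedence rule described above can be sketched in isolation. The helper names and the precision-18 cutoff below are my assumptions standing in for the real StructType walk; the point is only the ternary's precedence: a small-precision decimal forces legacy format, otherwise the user's setting wins.

```java
public class WriteLegacyFormatRule {

    // Assumption: "small precision" means the decimal fits in a long
    // (precision <= 18), which Spark's standard Parquet mode writes as
    // int32/int64 rather than a fixed-length byte array.
    static boolean hasSmallPrecisionDecimal(int[] decimalPrecisions) {
        for (int p : decimalPrecisions) {
            if (p <= 18) {
                return true;
            }
        }
        return false;
    }

    // Mirrors the discussed ternary: force "true" when a small-precision
    // decimal is present; otherwise respect the user's configured value.
    static String resolveWriteLegacyFormat(int[] decimalPrecisions, String userSetting) {
        return hasSmallPrecisionDecimal(decimalPrecisions) ? "true" : userSetting;
    }

    public static void main(String[] args) {
        System.out.println(resolveWriteLegacyFormat(new int[]{10}, "false"));
        System.out.println(resolveWriteLegacyFormat(new int[]{30}, "false"));
    }
}
```

The first call forces legacy format despite the user's "false"; the second keeps the user's choice because no small-precision decimal is present.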
[GitHub] [hudi] xiarixiaoyao commented on a change in pull request #4178: [HUDI-2901] Fixed the bug clustering jobs cannot running in parallel
xiarixiaoyao commented on a change in pull request #4178: URL: https://github.com/apache/hudi/pull/4178#discussion_r765412301 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/MultipleSparkJobExecutionStrategy.java ## @@ -95,8 +95,7 @@ public MultipleSparkJobExecutionStrategy(HoodieTable table, HoodieEngineContext .map(inputGroup -> runClusteringForGroupAsync(inputGroup, clusteringPlan.getStrategy().getStrategyParams(), Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false), -instantTime)) -.map(CompletableFuture::join); + instantTime)).collect(Collectors.toList()).stream().map(CompletableFuture::join); Review comment: @alexeykudinkin, here is my explanation: 1) in the original code, a single stream pipeline is used; calling join in the same pipeline causes the tasks to execute one by one. 2) Here there are two stream operations. The join operation is in the second stream pipeline, so it does not cause the tasks to execute one by one. We can verify this with the following code: public static void main(String[] args) { Integer[] test = new Integer[] {0, 2, 3}; long time = System.currentTimeMillis(); List<CompletableFuture<Integer>> ls = Arrays.stream(test).map(f -> CompletableFuture.supplyAsync(() -> { System.out.println(Thread.currentThread().getName()); try { Thread.sleep(1); } catch (InterruptedException e) { } return f; })).collect(Collectors.toList()); ls.stream().map(f -> f.join()).collect(Collectors.toList()); System.out.println(String.format("cost time: %s", (System.currentTimeMillis() - time) / 1000)); } You are an expert in this field, thank you for your patient guidance. If you still think there is a problem, I will modify the code. Thanks again -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
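The two-pipeline argument can also be made measurable. The sketch below is my own construction, not the PR's code: it uses an explicit 3-thread pool so the collect-then-join variant provably overlaps the waits, while joining inside a single lazy pipeline serializes them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class JoinPipelineDemo {

    static int sleepAndReturn(int i) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return i;
    }

    // Runs three 100 ms tasks. With eager == false, streams are lazy, so each
    // future is joined right after it is created (one pipeline) and the waits
    // serialize. With eager == true, collect() forces all futures to start
    // before any join, so the waits overlap.
    static long run(ExecutorService pool, boolean eager) {
        long start = System.currentTimeMillis();
        List<Integer> inputs = Arrays.asList(0, 1, 2);
        if (eager) {
            List<CompletableFuture<Integer>> futures = inputs.stream()
                .map(i -> CompletableFuture.supplyAsync(() -> sleepAndReturn(i), pool))
                .collect(Collectors.toList());
            futures.forEach(CompletableFuture::join);
        } else {
            inputs.stream()
                .map(i -> CompletableFuture.supplyAsync(() -> sleepAndReturn(i), pool))
                .forEach(CompletableFuture::join); // join inside the same pipeline
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        long lazy = run(pool, false);  // roughly 3 x 100 ms, sequential
        long eager = run(pool, true);  // roughly 100 ms, overlapped
        System.out.println(lazy >= 290);
        System.out.println(eager < lazy);
        pool.shutdown();
    }
}
```

Both printed booleans should be true: the single-pipeline run takes about three sleep intervals, the collected run about one.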
[hudi] branch asf-site updated: updated writing pages with more details on delete and descibing the full write path (#4250)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 69e8906 updated writing pages with more details on delete and descibing the full write path (#4250) 69e8906 is described below commit 69e8906193bda0bcf2f443e61ba78d71e4d64e1b Author: Kyle Weller AuthorDate: Wed Dec 8 19:43:02 2021 -0800 updated writing pages with more details on delete and descibing the full write path (#4250) --- website/docs/write_operations.md | 62 ++- website/docs/writing_data.md | 93 +--- 2 files changed, 137 insertions(+), 18 deletions(-) diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md index 06176e5..ccdac23 100644 --- a/website/docs/write_operations.md +++ b/website/docs/write_operations.md @@ -5,16 +5,56 @@ toc: true last_modified_at: --- -It may be helpful to understand the 3 different write operations provided by Hudi datasource or the delta streamer tool and how best to leverage them. These operations -can be chosen/changed across each commit/deltacommit issued against the table. +It may be helpful to understand the different write operations of Hudi and how best to leverage them. These operations +can be chosen/changed across each commit/deltacommit issued against the table. See the [How To docs on Writing Data](/docs/writing_data) +to see more examples. +## Operation Types +### UPSERT +This is the default operation where the input records are first tagged as inserts or updates by looking up the index. +The records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. +This operation is recommended for use-cases like database change capture where the input almost certainly contains updates. The target table will never show duplicates. 
-- **UPSERT** : This is the default operation where the input records are first tagged as inserts or updates by looking up the index. - The records are ultimately written after heuristics are run to determine how best to pack them on storage to optimize for things like file sizing. - This operation is recommended for use-cases like database change capture where the input almost certainly contains updates. The target table will never show duplicates. -- **INSERT** : This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts - for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the table can tolerate duplicates, but just - need the transactional writes/incremental pull/storage management capabilities of Hudi. -- **BULK_INSERT** : Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for - initial loading/bootstrapping a Hudi table at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs - of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. +### INSERT +This operation is very similar to upsert in terms of heuristics/file sizing but completely skips the index lookup step. Thus, it can be a lot faster than upserts +for use-cases like log de-duplication (in conjunction with options to filter duplicates mentioned below). This is also suitable for use-cases where the table can tolerate duplicates, but just +need the transactional writes/incremental pull/storage management capabilities of Hudi. 
+ +### BULK_INSERT +Both upsert and insert operations keep input records in memory to speed up storage heuristics computations faster (among other things) and thus can be cumbersome for +initial loading/bootstrapping a Hudi table at first. Bulk insert provides the same semantics as insert, while implementing a sort-based data writing algorithm, which can scale very well for several hundred TBs +of initial load. However, this just does a best-effort job at sizing files vs guaranteeing file sizes like inserts/upserts do. + +### DELETE +Hudi supports implementing two types of deletes on data stored in Hudi tables, by enabling the user to specify a different record payload implementation. +- **Soft Deletes** : Retain the record key and just null out the values for all the other fields. + This can be achieved by ensuring the appropriate fields are nullable in the table schema and simply upserting the table after setting these fields to null. +- **Hard Deletes** : A stronger form of deletion is
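The soft-delete recipe in the committed docs (retain the record key, null out every other field, then upsert) can be illustrated without Hudi at all; the record shape and field names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SoftDeleteSketch {

    // A soft delete keeps the record key but nulls every other (nullable)
    // field, so upserting the result overwrites the old values while the
    // key remains queryable in the table.
    static Map<String, Object> softDelete(Map<String, Object> record, String keyField) {
        Map<String, Object> out = new HashMap<>();
        for (String field : record.keySet()) {
            out.put(field, field.equals(keyField) ? record.get(field) : null);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> rec = new HashMap<>();
        rec.put("uuid", "id-1");
        rec.put("rider", "rider-42");
        Map<String, Object> deleted = softDelete(rec, "uuid");
        System.out.println(deleted.get("uuid"));
        System.out.println(deleted.get("rider"));
    }
}
```

As the docs note, this only works if the non-key fields are nullable in the table schema.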
[GitHub] [hudi] bhasudha merged pull request #4250: [HUDI-2956] - Updating write docs for deletes and full write path description
bhasudha merged pull request #4250: URL: https://github.com/apache/hudi/pull/4250 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codope commented on a change in pull request #4253: [HUDI-2958] Automatically set spark.sql.parquet.writelegacyformat, when using bulkinsert to insert data which contains decimalType
codope commented on a change in pull request #4253: URL: https://github.com/apache/hudi/pull/4253#discussion_r765408801 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java ## @@ -46,13 +53,32 @@ public HoodieRowParquetWriteSupport(Configuration conf, StructType structType, BloomFilter bloomFilter, HoodieWriteConfig writeConfig) { super(); Configuration hadoopConf = new Configuration(conf); -hadoopConf.set("spark.sql.parquet.writeLegacyFormat", writeConfig.parquetWriteLegacyFormatEnabled()); +hadoopConf.set("spark.sql.parquet.writeLegacyFormat", findSmallPrecisionDecimalType(structType) ? "true" : writeConfig.parquetWriteLegacyFormatEnabled()); Review comment: If `findSmallPrecisionDecimalType` returns false and `parquetWriteLegacyFormatEnabled` is set to true in the config then `spark.sql.parquet.writeLegacyFormat` will be true. Is that the intention? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated (6706c1c -> d22db6c)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a change to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git. from 6706c1c [MINOR] Updating configs to reflect dynamodb based lock configs (#4262) add d22db6c [HUDI-2827] - Docs for DeltaStreamer SchemaProviders, Sources, and Checkpoints (#4235) No new revisions were added by this update. Summary of changes: website/docs/hoodie_deltastreamer.md | 129 ++- 1 file changed, 128 insertions(+), 1 deletion(-)
[GitHub] [hudi] leesf commented on pull request #4263: [MINOR] Fix Compile broken
leesf commented on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989477004 > +1 once CI passes And can we add a workflow to force contributors to rebase against the latest master to avoid the problem? CC @xushiyan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] manojpec commented on a change in pull request #4067: [HUDI-2763] Metadata table records key deduplication
manojpec commented on a change in pull request #4067: URL: https://github.com/apache/hudi/pull/4067#discussion_r765408100 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlockFactory.java ## @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.log.block; + +import org.apache.avro.generic.IndexedRecord; +import org.apache.hudi.common.config.HoodieMetadataConfig; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.table.HoodieTableConfig; +import org.apache.hudi.exception.HoodieException; +import org.apache.hudi.metadata.HoodieMetadataHFileDataBlock; +import org.apache.hudi.metadata.HoodieMetadataPayload; +import org.apache.hudi.metadata.HoodieTableMetadata; + +import java.util.List; +import java.util.Map; + +public class HoodieLogBlockFactory { + + /** + * Util method to get a data block for the requested type. 
+ * + * @param logDataBlockFormat - Data block type + * @param recordList - List of records that goes in the data block + * @param header - data block header + * @param tableConfig - Table config + * @param metadataConfig - Metadata config + * @param tableBasePath - Table base path + * @param populateMetaFields - Whether to populate meta fields in the record + * @return Data block of the requested type. + */ + public static HoodieLogBlock getBlock(HoodieLogBlock.HoodieLogBlockType logDataBlockFormat, +List<IndexedRecord> recordList, +Map<HoodieLogBlock.HeaderMetadataType, String> header, +HoodieTableConfig tableConfig, HoodieMetadataConfig metadataConfig, +String tableBasePath, boolean populateMetaFields) { +final boolean isMetadataKeyDeDuplicate = metadataConfig.getRecordKeyDeDuplicate() +&& HoodieTableMetadata.isMetadataTable(tableBasePath); +String keyField; +if (populateMetaFields) { + keyField = (isMetadataKeyDeDuplicate + ? HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY : HoodieRecord.RECORD_KEY_METADATA_FIELD); +} else { + keyField = tableConfig.getRecordKeyFieldProp(); +} +return getBlock(logDataBlockFormat, recordList, header, keyField, isMetadataKeyDeDuplicate); + } + + /** + * Util method to get a data block for the requested type. + * + * @param logDataBlockFormat - Data block type + * @param recordList - List of records that goes in the data block + * @param header - data block header + * @param keyField - Field id to get the key from the records + * @param isMetadataKeyDeDuplicate - Whether metadata key de-duplication is needed + * @return Data block of the requested type. 
+ */ + private static HoodieLogBlock getBlock(HoodieLogBlock.HoodieLogBlockType logDataBlockFormat, List<IndexedRecord> recordList, + Map<HoodieLogBlock.HeaderMetadataType, String> header, String keyField, + boolean isMetadataKeyDeDuplicate) { +switch (logDataBlockFormat) { + case AVRO_DATA_BLOCK: +return new HoodieAvroDataBlock(recordList, header, keyField); + case HFILE_DATA_BLOCK: +if (isMetadataKeyDeDuplicate) { + return new HoodieMetadataHFileDataBlock(recordList, header, keyField); Review comment: sure, I can pull the key deduplication logic to a higher layer. Will post a revision. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
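The key-field branching in the quoted factory reduces to a small rule. The sketch below uses literal strings ("key", "_hoodie_record_key") where the real code uses the constants HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY and HoodieRecord.RECORD_KEY_METADATA_FIELD, so treat those exact values as assumptions.

```java
public class KeyFieldRule {

    // Mirrors the branching in the getBlock method above: when meta fields are
    // populated, pick the dedup key column or the standard Hudi meta key
    // column; otherwise fall back to the table's configured record key field.
    static String resolveKeyField(boolean populateMetaFields,
                                  boolean metadataKeyDeDuplicate,
                                  String recordKeyFieldProp) {
        if (populateMetaFields) {
            return metadataKeyDeDuplicate ? "key" : "_hoodie_record_key";
        }
        return recordKeyFieldProp;
    }

    public static void main(String[] args) {
        System.out.println(resolveKeyField(true, true, "uuid"));
        System.out.println(resolveKeyField(true, false, "uuid"));
        System.out.println(resolveKeyField(false, true, "uuid"));
    }
}
```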
[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
yihua commented on a change in pull request #4106: URL: https://github.com/apache/hudi/pull/4106#discussion_r765217559 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java ## @@ -0,0 +1,260 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.sort; + +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.config.HoodieClusteringConfig; +import org.apache.hudi.optimize.HilbertCurveUtils; +import org.apache.hudi.optimize.ZOrderingUtil; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.Row$; +import org.apache.spark.sql.hudi.execution.RangeSampleSort$; +import org.apache.spark.sql.hudi.execution.ZorderingBinarySort; +import org.apache.spark.sql.types.BinaryType; +import org.apache.spark.sql.types.BinaryType$; +import org.apache.spark.sql.types.BooleanType; +import org.apache.spark.sql.types.ByteType; +import org.apache.spark.sql.types.DataType; +import org.apache.spark.sql.types.DateType; +import org.apache.spark.sql.types.DecimalType; +import org.apache.spark.sql.types.DoubleType; +import org.apache.spark.sql.types.FloatType; +import org.apache.spark.sql.types.IntegerType; +import org.apache.spark.sql.types.LongType; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.ShortType; +import org.apache.spark.sql.types.StringType; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.sql.types.StructType$; +import org.apache.spark.sql.types.TimestampType; +import org.davidmoten.hilbert.HilbertCurve; +import scala.collection.JavaConversions; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.function.Function; +import java.util.stream.Collectors; + +public class SpaceCurveSortingHelper { + + private static final Logger LOG = LogManager.getLogger(SpaceCurveSortingHelper.class); + + /** + * Orders provided {@link Dataset} by mapping values of the provided list of columns + * 
{@code orderByCols} onto a specified space curve (Z-curve, Hilbert, etc) + * + * + * NOTE: Only support base data-types: long,int,short,double,float,string,timestamp,decimal,date,byte. + * This method is more effective than {@link #orderDataFrameBySamplingValues} leveraging + * data sampling instead of direct mapping + * + * @param df Spark {@link Dataset} holding data to be ordered + * @param orderByCols list of columns to be ordered by + * @param targetPartitionCount target number of output partitions + * @param layoutOptStrategy target layout optimization strategy + * @return a {@link Dataset} holding data ordered by mapping tuple of values from provided columns + * onto a specified space-curve + */ + public static Dataset<Row> orderDataFrameByMappingValues( + Dataset<Row> df, + HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy, + List<String> orderByCols, + int targetPartitionCount + ) { +Map<String, StructField> columnsMap = +Arrays.stream(df.schema().fields()) +.collect(Collectors.toMap(StructField::name, Function.identity())); + +List<String> checkCols = +orderByCols.stream() +.filter(columnsMap::containsKey) +.collect(Collectors.toList()); + +if (orderByCols.size() != checkCols.size()) { + LOG.error(String.format("Trying to ordering over a column(s) not present in the schema (%s); skipping", CollectionUtils.diff(orderByCols, checkCols))); + return df; +} + +// In case when there's just one column to be ordered by, we can skip space-curve +// ordering altogether (since it will match linear ordering anyway) +if (orderByCols.size() == 1) { + String orderByColName = orderByCols.get(0); + LOG.debug(String.format("Single column to order by (%s), skipping space-curve ordering", orderByColName)); + + // TODO validate if
[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
yihua commented on a change in pull request #4106: URL: https://github.com/apache/hudi/pull/4106#discussion_r765407715 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/sort/SpaceCurveSortingHelper.java ## @@ -0,0 +1,260 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.sort; + +import org.apache.hudi.common.util.CollectionUtils; +import org.apache.hudi.config.HoodieClusteringConfig; +import org.apache.hudi.optimize.HilbertCurveUtils; +import org.apache.hudi.optimize.ZOrderingUtil; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.sql.Column; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.Row$; +import org.apache.spark.sql.hudi.execution.RangeSampleSort$; +import org.apache.spark.sql.hudi.execution.ZorderingBinarySort; +import org.apache.spark.sql.types.BinaryType; +import org.apache.spark.sql.types.BinaryType$; +import org.apache.spark.sql.types.BooleanType; +import org.apache.spark.sql.types.ByteType; +import org.apache.spark.sql.types.DataType; +import org.apache.spark.sql.types.DateType; +import org.apache.spark.sql.types.DecimalType; +import org.apache.spark.sql.types.DoubleType; +import org.apache.spark.sql.types.FloatType; +import org.apache.spark.sql.types.IntegerType; +import org.apache.spark.sql.types.LongType; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.ShortType; +import org.apache.spark.sql.types.StringType; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +import org.apache.spark.sql.types.StructType$; +import org.apache.spark.sql.types.TimestampType; +import org.davidmoten.hilbert.HilbertCurve; +import scala.collection.JavaConversions; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.function.Function; +import java.util.stream.Collectors; + +public class SpaceCurveSortingHelper { + + private static final Logger LOG = LogManager.getLogger(SpaceCurveSortingHelper.class); + + /** + * Orders provided {@link Dataset} by mapping values of the provided list of columns + * 
{@code orderByCols} onto a specified space curve (Z-curve, Hilbert, etc) + * + * + * NOTE: Only support base data-types: long,int,short,double,float,string,timestamp,decimal,date,byte. + * This method is more effective than {@link #orderDataFrameBySamplingValues} leveraging + * data sampling instead of direct mapping + * + * @param df Spark {@link Dataset} holding data to be ordered + * @param orderByCols list of columns to be ordered by + * @param targetPartitionCount target number of output partitions + * @param layoutOptStrategy target layout optimization strategy + * @return a {@link Dataset} holding data ordered by mapping tuple of values from provided columns + * onto a specified space-curve + */ + public static Dataset<Row> orderDataFrameByMappingValues( + Dataset<Row> df, + HoodieClusteringConfig.LayoutOptimizationStrategy layoutOptStrategy, + List<String> orderByCols, + int targetPartitionCount + ) { +Map<String, StructField> columnsMap = +Arrays.stream(df.schema().fields()) +.collect(Collectors.toMap(StructField::name, Function.identity())); + +List<String> checkCols = +orderByCols.stream() +.filter(columnsMap::containsKey) +.collect(Collectors.toList()); + +if (orderByCols.size() != checkCols.size()) { + LOG.error(String.format("Trying to ordering over a column(s) not present in the schema (%s); skipping", CollectionUtils.diff(orderByCols, checkCols))); + return df; +} Review comment: Got it. Sg. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
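The column-membership check in the quoted helper (filter the requested order-by columns through the schema, then compare sizes) can be distilled into a standalone sketch; the field names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OrderByColumnCheck {

    // Same validation idea as in the quoted SpaceCurveSortingHelper code:
    // keep only requested order-by columns that exist in the schema; any
    // leftovers are the missing columns that make the helper skip ordering.
    static List<String> missingColumns(Set<String> schemaFields, List<String> orderByCols) {
        List<String> missing = new ArrayList<>();
        for (String col : orderByCols) {
            if (!schemaFields.contains(col)) {
                missing.add(col);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> schema = new HashSet<>(Arrays.asList("ts", "uuid", "rider"));
        System.out.println(missingColumns(schema, Arrays.asList("ts", "uuid")));
        System.out.println(missingColumns(schema, Arrays.asList("ts", "driver")));
    }
}
```

An empty result means all requested columns are present; a non-empty one corresponds to the LOG.error-and-return-df branch.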
[GitHub] [hudi] bhasudha merged pull request #4235: [HUDI-2827] - Docs for DeltaStreamer SchemaProviders, Sources, and Checkpoints
bhasudha merged pull request #4235: URL: https://github.com/apache/hudi/pull/4235 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
yihua commented on a change in pull request #4106: URL: https://github.com/apache/hudi/pull/4106#discussion_r765213822 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java ## @@ -621,4 +495,5 @@ public static String createIndexMergeSql( String.format("%s.%s", newIndexTable, columns.get(0)) ); } + Review comment: nit: remove empty line? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
yihua commented on a change in pull request #4106: URL: https://github.com/apache/hudi/pull/4106#discussion_r765407226 ## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java ## @@ -79,13 +71,11 @@ import java.util.stream.Collectors; import java.util.stream.StreamSupport; -import scala.collection.JavaConversions; - import static org.apache.hudi.util.DataTypeUtils.areCompatible; -public class ZOrderingIndexHelper { +public class ColumnStatsIndexHelper { Review comment: Sg -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken
hudi-bot commented on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989474190 ## CI report: * 4b3807f2900ed4719457ed3738cb387006b088f6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4112) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4263: [MINOR] Fix Compile broken
hudi-bot removed a comment on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989472984 ## CI report: * 4b3807f2900ed4719457ed3738cb387006b088f6 UNKNOWN
[GitHub] [hudi] vinothchandar commented on pull request #4263: [MINOR] Fix Compile broken
vinothchandar commented on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989473251 +1 once CI passes
[GitHub] [hudi] hudi-bot commented on pull request #4263: [MINOR] Fix Compile broken
hudi-bot commented on pull request #4263: URL: https://github.com/apache/hudi/pull/4263#issuecomment-989472984 ## CI report: * 4b3807f2900ed4719457ed3738cb387006b088f6 UNKNOWN
[GitHub] [hudi] leesf opened a new pull request #4263: [MINOR] Fix Compile broken
leesf opened a new pull request #4263: URL: https://github.com/apache/hudi/pull/4263 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. CC @guanziyue
[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields
[ https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2968: - Description: Allow to delete/update using non-pk fields {code:java} create table h0 ( id int, name string, price double ) using hudi options (primaryKey = 'id'); update h0 set price = 10 where name = 'foo'; delete from h0 where name = 'foo'; {code} was: Allow to delete/update a non-pk table. {code:java} create table h0 ( id int, name string, price double ) using hudi; delete from h0 where id = 10; update h0 set price = 10 where id = 12; {code} > Support Delete/Update using non-pk fields > - > > Key: HUDI-2968 > URL: https://issues.apache.org/jira/browse/HUDI-2968 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: pengzhiwei >Assignee: Yann Byron >Priority: Critical > Fix For: 0.11.0 > > > Allow to delete/update using non-pk fields > {code:java} > create table h0 ( > id int, > name string, > price double > ) using hudi > options (primaryKey = 'id'); > update h0 set price = 10 where name = 'foo'; > delete from h0 where name = 'foo'; > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields
[ https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2968: - Priority: Blocker (was: Critical) > Support Delete/Update using non-pk fields > - > > Key: HUDI-2968 > URL: https://issues.apache.org/jira/browse/HUDI-2968 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: pengzhiwei >Assignee: Yann Byron >Priority: Blocker > Fix For: 0.11.0 > > > Allow to delete/update using non-pk fields > {code:java} > create table h0 ( > id int, > name string, > price double > ) using hudi > options (primaryKey = 'id'); > update h0 set price = 10 where name = 'foo'; > delete from h0 where name = 'foo'; > {code}
[jira] [Created] (HUDI-2968) Support Delete/Update using non-pk fields
Raymond Xu created HUDI-2968: Summary: Support Delete/Update using non-pk fields Key: HUDI-2968 URL: https://issues.apache.org/jira/browse/HUDI-2968 Project: Apache Hudi Issue Type: Sub-task Components: Spark Integration Reporter: pengzhiwei Assignee: Yann Byron Allow to delete/update a non-pk table. {code:java} create table h0 ( id int, name string, price double ) using hudi; delete from h0 where id = 10; update h0 set price = 10 where id = 12; {code}
[jira] [Updated] (HUDI-2968) Support Delete/Update using non-pk fields
[ https://issues.apache.org/jira/browse/HUDI-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2968: - Fix Version/s: 0.11.0 > Support Delete/Update using non-pk fields > - > > Key: HUDI-2968 > URL: https://issues.apache.org/jira/browse/HUDI-2968 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: pengzhiwei >Assignee: Yann Byron >Priority: Critical > Fix For: 0.11.0 > > > Allow to delete/update a non-pk table. > {code:java} > create table h0 ( > id int, > name string, > price double > ) using hudi; > delete from h0 where id = 10; > update h0 set price = 10 where id = 12; > {code}
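The behavior HUDI-2968 asks for can be sketched from PySpark as plain Spark SQL. This is a non-authoritative sketch: the `spark` session and Hudi SQL-extension setup are assumed and not shown, and the statements are taken directly from the ticket description.

```python
# Sketch of the HUDI-2968 ask: update and delete rows filtered on `name`,
# which is NOT the primary key (`id`). A SparkSession `spark` with the
# Hudi SQL extensions enabled is assumed.
ddl = """
create table h0 (
  id int,
  name string,
  price double
) using hudi
options (primaryKey = 'id')
"""

# Both statements predicate on a non-pk column:
non_pk_statements = [
    "update h0 set price = 10 where name = 'foo'",
    "delete from h0 where name = 'foo'",
]

# With a live session this would be driven as (not executed here):
# spark.sql(ddl)
# for stmt in non_pk_statements:
#     spark.sql(stmt)
```

Note that today's behavior (per the original description) only supports these statements when the predicate uses the primary key itself.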
[GitHub] [hudi] hudi-bot commented on pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
hudi-bot commented on pull request #4106: URL: https://github.com/apache/hudi/pull/4106#issuecomment-989469074 ## CI report: * 42210d2837e5690b12baa7a50e8a2e6cea96aec2 UNKNOWN * 4db8444e25fee92d22f0f2e06d1cdff511b05eaf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4108)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4106: [HUDI-2814] Make Z-index more generic Column-Stats Index
hudi-bot removed a comment on pull request #4106: URL: https://github.com/apache/hudi/pull/4106#issuecomment-989445501 ## CI report: * 42210d2837e5690b12baa7a50e8a2e6cea96aec2 UNKNOWN * 1b836c6b3457c9d07d0ac4cc684c5046062d5b47 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4107) * 4db8444e25fee92d22f0f2e06d1cdff511b05eaf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4108)
[jira] [Updated] (HUDI-1912) Presto defaults to GenericHiveRecordCursor for all Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Mahindra updated HUDI-1912: -- Status: Resolved (was: Patch Available) > Presto defaults to GenericHiveRecordCursor for all Hudi tables > -- > > Key: HUDI-1912 > URL: https://issues.apache.org/jira/browse/HUDI-1912 > Project: Apache Hudi > Issue Type: Sub-task > Components: Presto Integration >Affects Versions: 0.7.0 >Reporter: satish >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.11.0 > > > See code here > https://github.com/prestodb/presto/blob/2ad67dcf000be86ebc5ff7732bbb9994c8e324a8/presto-hive/src/main/java/com/facebook/presto/hive/parquet/ParquetPageSourceFactory.java#L168 > Starting Hudi 0.7, HoodieInputFormat comes with > UseRecordReaderFromInputFormat annotation. As a result, we are skipping all > optimizations in parquet PageSource and using basic GenericHiveRecordCursor > which has several limitations: > 1) No support for timestamp > 2) No support for synthesized columns > 3) No support for vectorized reading?
> Example errors we saw: > Error#1 > {code} > java.lang.IllegalStateException: column type must be regular > at > com.google.common.base.Preconditions.checkState(Preconditions.java:507) > at > com.facebook.presto.hive.GenericHiveRecordCursor.<init>(GenericHiveRecordCursor.java:167) > at > com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:79) > at > com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:449) > at > com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:177) > at > com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:63) > at > com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:80) > at > com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:231) > at com.facebook.presto.operator.Driver.processInternal(Driver.java:418) > at > com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301) > at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722) > at com.facebook.presto.operator.Driver.processFor(Driver.java:294) > at > com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1077) > at > com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:162) > at > com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:545) > at > com.facebook.presto.$gen.Presto_0_247_17f857e20210506_210241_1.run(Unknown > Source) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:834) > {code} > Error#2 > {code} > java.lang.ClassCastException:
class org.apache.hadoop.io.LongWritable cannot > be cast to class org.apache.hadoop.hive.serde2.io.TimestampWritable > (org.apache.hadoop.io.LongWritable and > org.apache.hadoop.hive.serde2.io.TimestampWritable are in unnamed module of > loader com.facebook.presto.server.PluginClassLoader @5c4e86e7) > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:39) > at > org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableTimestampObjectInspector.getPrimitiveJavaObject(WritableTimestampObjectInspector.java:25) > at > com.facebook.presto.hive.GenericHiveRecordCursor.parseLongColumn(GenericHiveRecordCursor.java:286) > at > com.facebook.presto.hive.GenericHiveRecordCursor.parseColumn(GenericHiveRecordCursor.java:550) > at > com.facebook.presto.hive.GenericHiveRecordCursor.isNull(GenericHiveRecordCursor.java:508) > at > com.facebook.presto.hive.HiveRecordCursor.isNull(HiveRecordCursor.java:233) > at > com.facebook.presto.spi.RecordPageSource.getNextPage(RecordPageSource.java:112) > at > com.facebook.presto.operator.TableScanOperator.getOutput(TableScanOperator.java:251) > at com.facebook.presto.operator.Driver.processInternal(Driver.java:418) > at > com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:301) > at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:722) > at com.facebook.presto.operator.Driver.processFor(Driver.java:294) >
[hudi] branch asf-site updated: [MINOR] Updating configs to reflect dynamodb based lock configs (#4262)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 6706c1c [MINOR] Updating configs to reflect dynamodb based lock configs (#4262) 6706c1c is described below commit 6706c1c708bcddd1c46850f7ba2fd77949965989 Author: Sivabalan Narayanan AuthorDate: Wed Dec 8 22:12:32 2021 -0500 [MINOR] Updating configs to reflect dynamodb based lock configs (#4262) --- website/docs/configurations.md | 63 +- 1 file changed, 62 insertions(+), 1 deletion(-) diff --git a/website/docs/configurations.md b/website/docs/configurations.md index 9cad2d9..02363e7 100644 --- a/website/docs/configurations.md +++ b/website/docs/configurations.md @@ -4,7 +4,7 @@ keywords: [ configurations, default, flink options, spark, configs, parameters ] permalink: /docs/configurations.html summary: This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels. toc: true -last_modified_at: 2021-12-08T09:59:32.441 +last_modified_at: 2021-12-08T17:24:42.348 --- This page covers the different ways of configuring your job to write/read Hudi tables. At a high level, you can control behaviour at few levels. @@ -1517,6 +1517,67 @@ Configurations that control aspects around writing, sizing, reading base and log --- +### DynamoDB based Locks Configurations {#DynamoDB-based-Locks-Configurations} + +Configs that control DynamoDB based locking mechanisms required for concurrency control between writers to a Hudi table. Concurrency between Hudi's own table services are auto managed internally. 
+ +`Config Class`: org.apache.hudi.config.DynamoDbBasedLockConfig > hoodie.write.lock.dynamodb.billing_mode > For DynamoDB based lock provider, by default it is PAY_PER_REQUEST mode > **Default Value**: PAY_PER_REQUEST (Optional) > `Config Param: DYNAMODB_LOCK_BILLING_MODE` > `Since Version: 0.10.0` + +--- + > hoodie.write.lock.dynamodb.table > For DynamoDB based lock provider, the name of the DynamoDB table acting as lock table > **Default Value**: N/A (Required) > `Config Param: DYNAMODB_LOCK_TABLE_NAME` > `Since Version: 0.10.0` + +--- + > hoodie.write.lock.dynamodb.region > For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service. Would try to first get it from the AWS_REGION environment variable; if not found, defaults to us-east-1 > **Default Value**: us-east-1 (Optional) > `Config Param: DYNAMODB_LOCK_REGION` > `Since Version: 0.10.0` + +--- + > hoodie.write.lock.dynamodb.partition_key > For DynamoDB based lock provider, the partition key for the DynamoDB lock table. Each Hudi dataset should have its own unique key so concurrent writers refer to the same partition key.
By default we use the Hudi table name specified to be the partition key +> **Default Value**: N/A (Required) +> `Config Param: DYNAMODB_LOCK_PARTITION_KEY` +> `Since Version: 0.10.0` + +--- + +> hoodie.write.lock.dynamodb.write_capacity +> For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode +> **Default Value**: 10 (Optional) +> `Config Param: DYNAMODB_LOCK_WRITE_CAPACITY` +> `Since Version: 0.10.0` + +--- + +> hoodie.write.lock.dynamodb.table_creation_timeout +> For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table +> **Default Value**: 60 (Optional) +> `Config Param: DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT` +> `Since Version: 0.10.0` + +--- + +> hoodie.write.lock.dynamodb.read_capacity +> For DynamoDB based lock provider, read capacity units when using PROVISIONED billing mode +> **Default Value**: 20 (Optional) +> `Config Param: DYNAMODB_LOCK_READ_CAPACITY` +> `Since Version: 0.10.0` + +--- + ### Metadata Configs {#Metadata-Configs} Configurations used by the Hudi Metadata Table. This table maintains the metadata about a given Hudi table (e.g file listings) to avoid overhead of accessing cloud storage, during queries.
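The DynamoDB lock configs documented in the commit above can be wired into a writer roughly as below. This is a minimal sketch under stated assumptions: the lock-provider class name is assumed to be the one shipped in the hudi-aws module, and the table/key values are placeholders; only the `hoodie.write.lock.dynamodb.*` keys come from the config reference itself.

```python
# Minimal sketch of applying the DynamoDB lock configs documented above.
# Values are illustrative placeholders; the provider class is an
# assumption (hudi-aws module), not part of the config listing.
dynamodb_lock_opts = {
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",  # assumed class
    "hoodie.write.lock.dynamodb.table": "hudi-locks",         # required, no default
    "hoodie.write.lock.dynamodb.partition_key": "my_table",   # defaults to the Hudi table name
    "hoodie.write.lock.dynamodb.region": "us-east-1",         # falls back to AWS_REGION
    "hoodie.write.lock.dynamodb.billing_mode": "PAY_PER_REQUEST",
}

# Under PROVISIONED billing, the capacity settings also take effect:
provisioned_overrides = {
    "hoodie.write.lock.dynamodb.billing_mode": "PROVISIONED",
    "hoodie.write.lock.dynamodb.read_capacity": "20",   # default 20
    "hoodie.write.lock.dynamodb.write_capacity": "10",  # default 10
}

# Passed to a Hudi DataFrame write (not executed here):
# df.write.format("hudi").options(**dynamodb_lock_opts).mode("append").save(base_path)
```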
[GitHub] [hudi] nsivabalan merged pull request #4262: [MINOR] Updating configs to reflect dynamoDB based lock configs
nsivabalan merged pull request #4262: URL: https://github.com/apache/hudi/pull/4262
[GitHub] [hudi] vinothchandar commented on pull request #4260: [HUDI-2821] - Docs for Metadata Table - added reference to vc's benchmark study
vinothchandar commented on pull request #4260: URL: https://github.com/apache/hudi/pull/4260#issuecomment-989465646 this is a good start. we can keep improving this.
[GitHub] [hudi] manojpec commented on pull request #4259: [HUDI-2962] Local process lock provider to guard single writer process with async table operations + Enabling metadata table
manojpec commented on pull request #4259: URL: https://github.com/apache/hudi/pull/4259#issuecomment-989463817 CI failed job passed in the re-run - https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=4106=results
[jira] [Updated] (HUDI-2467) Delete data is not working with 0.9.0 and pySpark
[ https://issues.apache.org/jira/browse/HUDI-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-2467: - Parent: (was: HUDI-1658) Issue Type: Bug (was: Sub-task) > Delete data is not working with 0.9.0 and pySpark > - > > Key: HUDI-2467 > URL: https://issues.apache.org/jira/browse/HUDI-2467 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Phil Chen >Priority: Critical > Labels: sev:critical > Fix For: 0.11.0 > > > Following this spark guide: > [https://hudi.apache.org/docs/quick-start-guide/] > Everything works until deleting data. > I'm using PySpark with Spark 3.1.2 and Python 3.9
{code:java}
# pyspark
# fetch total records count
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
# fetch two records to be deleted
ds = spark.sql("select uuid, partitionpath from hudi_trips_snapshot").limit(2)
# issue deletes
hudi_delete_options = {
  'hoodie.table.name': tableName,
  'hoodie.datasource.write.recordkey.field': 'uuid',
  'hoodie.datasource.write.partitionpath.field': 'partitionpath',
  'hoodie.datasource.write.table.name': tableName,
  'hoodie.datasource.write.operation': 'delete',
  'hoodie.datasource.write.precombine.field': 'ts',
  'hoodie.upsert.shuffle.parallelism': 2,
  'hoodie.insert.shuffle.parallelism': 2
}
from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
df.write.format("hudi"). \
  options(**hudi_delete_options). \
  mode("append"). \
  save(basePath)
# run the same read query as above.
roAfterDeleteViewDF = spark. \
  read. \
  format("hudi"). \
  load(basePath)
roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionpath from hudi_trips_snapshot").count()
{code}
> The count before delete is 10 and after delete is still 10 (expecting 8)
{code:java}
>>> df.show()
+--------------------+--------------------+---+
|       partitionpath|                uuid| ts|
+--------------------+--------------------+---+
|74bed794-c854-4aa...|americas/united_s...|0.0|
|ce71c2dc-dedf-483...|americas/united_s...|0.0|
+--------------------+--------------------+---+
{code}
> The 2 records to be deleted > Note, the
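One detail worth flagging in the report above: the delete records are written with ts = 0.0 while the live rows carry real ts values, so if a payload class that honors the precombine ordering is configured (e.g. DefaultHoodieRecordPayload), deletes with a lower ordering value can lose the comparison and be silently dropped. This is a hedged hypothesis, not a confirmed diagnosis. A sketch (plain-Python stand-ins for the DataFrame step, with made-up keys) of carrying each row's original ts forward instead:

```python
# Stand-in for ds.collect() with ts included; the key/partition values
# are made up, only the (uuid, partitionpath, ts) shape mirrors the report.
rows = [
    ("uuid-1", "part-a", 0.37),
    ("uuid-2", "part-b", 0.55),
]

# Keep each row's own ts (or anything >= it) on the delete record,
# instead of forcing ts to 0.0 as the report does via lit(0.0).
deletes = [(uuid, part, ts) for (uuid, part, ts) in rows]

# In PySpark this would become (not executed here):
# df = spark.createDataFrame(deletes, ['uuid', 'partitionpath', 'ts'])
# df.write.format("hudi").options(**hudi_delete_options) \
#     .mode("append").save(basePath)
```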
[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot commented on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-989459744 ## CI report: * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040) * a085e101422d1df36b94127e75e5d60716986e69 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4110)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot removed a comment on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-989458305 ## CI report: * 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040) * a085e101422d1df36b94127e75e5d60716986e69 UNKNOWN
[GitHub] [hudi] nsivabalan commented on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from a bad no
nsivabalan commented on issue #4055: URL: https://github.com/apache/hudi/issues/4055#issuecomment-989459731 @guanlisheng : Can you please file a new github issue with all details and CC me.
[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group
[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-864: Fix Version/s: (was: 0.11.0) > parquet schema conflict: optional binary (UTF8) is not a group > --- > > Key: HUDI-864 > URL: https://issues.apache.org/jira/browse/HUDI-864 > Project: Apache Hudi > Issue Type: Bug > Components: Common Core, Spark Integration >Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0 >Reporter: Roland Johann >Priority: Critical > Labels: sev:critical, user-support-issues > > When dealing with struct types like this > {code:json} > { > "type": "struct", > "fields": [ > { > "name": "categoryResults", > "type": { > "type": "array", > "elementType": { > "type": "struct", > "fields": [ > { > "name": "categoryId", > "type": "string", > "nullable": true, > "metadata": {} > } > ] > }, > "containsNull": true > }, > "nullable": true, > "metadata": {} > } > ] > } > {code} > The second ingest batch throws that exception: > {code} > ERROR [Executor task launch worker for task 15] > commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error > upserting bucketType UPDATE for partition :0 > org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: > org.apache.hudi.exception.HoodieException: operation has failed > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100) > at > org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76) > at > org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271) > at > 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:286) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:288) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:123) > at > 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) >
[jira] [Updated] (HUDI-864) parquet schema conflict: optional binary (UTF8) is not a group
[ https://issues.apache.org/jira/browse/HUDI-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-864: Component/s: Writer Core

> parquet schema conflict: optional binary (UTF8) is not a group
>
> Key: HUDI-864
> URL: https://issues.apache.org/jira/browse/HUDI-864
> Project: Apache Hudi
> Issue Type: Bug
> Components: Common Core, Spark Integration, Writer Core
> Affects Versions: 0.5.2, 0.6.0, 0.5.3, 0.7.0, 0.8.0, 0.9.0
> Reporter: Roland Johann
> Priority: Critical
> Labels: sev:critical, user-support-issues
>
> When dealing with struct types like this
> {code:json}
> {
>   "type": "struct",
>   "fields": [
>     {
>       "name": "categoryResults",
>       "type": {
>         "type": "array",
>         "elementType": {
>           "type": "struct",
>           "fields": [
>             {
>               "name": "categoryId",
>               "type": "string",
>               "nullable": true,
>               "metadata": {}
>             }
>           ]
>         },
>         "containsNull": true
>       },
>       "nullable": true,
>       "metadata": {}
>     }
>   ]
> }
> {code}
> The second ingest batch throws that exception:
> {code}
> ERROR [Executor task launch worker for task 15] commit.BaseCommitActionExecutor (BaseCommitActionExecutor.java:264) - Error upserting bucketType UPDATE for partition :0
> org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
>   at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdateInternal(CommitActionExecutor.java:100)
>   at org.apache.hudi.table.action.commit.CommitActionExecutor.handleUpdate(CommitActionExecutor.java:76)
>   at org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:73)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:258)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleInsertPartition(BaseCommitActionExecutor.java:271)
>   at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:104)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
>   at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
>   at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
>   at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
>   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at
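The "optional binary (UTF8) is not a group" message generally indicates that the Parquet schema resolved for an existing file treats a field as a primitive string where the reader (or the incoming batch) expects a nested group. For the struct-in-array schema above, the two conflicting shapes would look roughly like this in Parquet's message-schema notation; this is an illustrative sketch, not output captured from the affected files:

```
// Shape expected for an element of categoryResults (a group/struct):
optional group element {
  optional binary categoryId (UTF8);
}

// Shape actually resolved (a bare string), which fails with
// "optional binary (UTF8) is not a group" when accessed as a group:
optional binary element (UTF8);
```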
[GitHub] [hudi] hudi-bot commented on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot commented on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-989458305

## CI report:

* 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
* a085e101422d1df36b94127e75e5d60716986e69 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4222: [HUDI-2849] improve SparkUI job description for write path
hudi-bot removed a comment on pull request #4222: URL: https://github.com/apache/hudi/pull/4222#issuecomment-986406347

## CI report:

* 9b30498cb6205e12cb859dd8fe4654ec952f63e0 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4040)
[GitHub] [hudi] nsivabalan edited a comment on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)
nsivabalan edited a comment on issue #3945: URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674

Looks like this support was never added to the deltastreamer. I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If either of you is interested in working on it, let me know; I can guide you, and we can get it in for 0.11. Since we have a tracking JIRA, I will close the GitHub issue.
[GitHub] [hudi] nsivabalan commented on issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)
nsivabalan commented on issue #3945: URL: https://github.com/apache/hudi/issues/3945#issuecomment-989457674

Looks like this support was never added to the deltastreamer. I have filed a tracking ticket [here](https://issues.apache.org/jira/browse/HUDI-2967). If you are interested in working on it, let me know; I can guide you, and we can get it in for 0.11. Since we have a tracking JIRA, I will close the GitHub issue.
[GitHub] [hudi] nsivabalan closed issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)
nsivabalan closed issue #3945: URL: https://github.com/apache/hudi/issues/3945
[jira] [Updated] (HUDI-2967) Support drop partition columns in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-2967: Fix Version/s: 0.11.0

> Support drop partition columns in deltastreamer
>
> Key: HUDI-2967
> URL: https://issues.apache.org/jira/browse/HUDI-2967
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: sivabalan narayanan
> Priority: Major
> Fix For: 0.11.0
>
> We added support for dropping partition columns in the Spark datasource, but that support has not been added to the deltastreamer. Creating a ticket to add it.
> [https://github.com/apache/hudi/commit/968927801470953f137368cf146778a7f01aa63f]

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-2967) Support drop partition columns in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan reassigned HUDI-2967: Assignee: sivabalan narayanan
[jira] [Created] (HUDI-2967) Support drop partition columns in deltastreamer
sivabalan narayanan created HUDI-2967: Summary: Support drop partition columns in deltastreamer Key: HUDI-2967 URL: https://issues.apache.org/jira/browse/HUDI-2967 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: sivabalan narayanan
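For context, the flag in question already exists on the Spark datasource write path; HUDI-2967 asks for the same property to be honored by the deltastreamer. A minimal sketch of the relevant properties, using only names that appear elsewhere in this thread (the partitionpath value is the example column from issue #3945):

```properties
# Drop the partition column from the written data files.
# Honored by the Spark datasource writer (commit 9689278 linked above);
# HUDI-2967 tracks honoring it in HoodieDeltaStreamer as well.
hoodie.datasource.write.drop.partition.columns=true
hoodie.datasource.write.partitionpath.field=story_published_partition_date
```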
[GitHub] [hudi] guanlisheng edited a comment on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from
guanlisheng edited a comment on issue #4055: URL: https://github.com/apache/hudi/issues/4055#issuecomment-989456516

Hey @nsivabalan, thanks for the reply. Very similar; the only difference is the Dataset operation vs. Spark SQL. An additional clue is that the transformer works well with HoodieIncrSource, while the issue happens with JsonKafkaSource.
[hudi] branch master updated (bd08470 -> 9c8ad0f)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from bd08470 [HUDI-2957] Shade kryo jar for flink bundle jar (#4251)
add 9c8ad0f [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter (#3912)

No new revisions were added by this update.

Summary of changes:
 .../common/table/log/HoodieLogFormatWriter.java | 12 +++--
 .../common/functional/TestHoodieLogFormat.java  | 62 ++
 2 files changed, 71 insertions(+), 3 deletions(-)
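The merged change fixes an integer overflow in HoodieLogFormatWriter once a log file grows past Integer.MAX_VALUE bytes. As a rough illustration of that general failure mode (hypothetical names, not Hudi's actual code): tracking the file size in an `int` wraps negative past 2 GiB, while doing the same arithmetic in `long` does not.

```java
// Illustrative sketch only: the kind of int overflow that HUDI-2665 fixes.
// Class and method names are hypothetical, not Hudi's actual API.
public class LogSizeOverflow {

    // Buggy pattern: size arithmetic in int wraps negative past 2 GiB.
    static int intSize(int currentSize, int blockSize) {
        return currentSize + blockSize;
    }

    // Fixed pattern: carry out the arithmetic in long.
    static long longSize(long currentSize, long blockSize) {
        return currentSize + blockSize;
    }

    public static void main(String[] args) {
        int nearMax = Integer.MAX_VALUE - 10; // a log file just under 2 GiB
        System.out.println(intSize(nearMax, 1024));  // wraps to a negative value
        System.out.println(longSize(nearMax, 1024)); // correct positive size
    }
}
```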
[GitHub] [hudi] guanlisheng commented on issue #4055: [SUPPORT] Hudi with SqlQueryBasedTransformer fails-> spark error exit 134 or exit 143 in "isEmpty at DeltaSync.java:344" : Container from a bad n
guanlisheng commented on issue #4055: URL: https://github.com/apache/hudi/issues/4055#issuecomment-989456516

Hey @nsivabalan, thanks for the reply. Very similar; the only difference is the Dataset operation vs. Spark SQL. An additional clue is that the transformer works well with HoodieIncrSource, while such an incident happens with JsonKafkaSource.
[GitHub] [hudi] leesf merged pull request #3912: [HUDI-2665] Fix overflow of huge log file in HoodieLogFormatWriter
leesf merged pull request #3912: URL: https://github.com/apache/hudi/pull/3912
[GitHub] [hudi] Limess opened a new issue #3945: [SUPPORT] hoodie.datasource.write.drop.partition.columns not working as expected (deltastreamer)
Limess opened a new issue #3945: URL: https://github.com/apache/hudi/issues/3945

**Describe the problem you faced**

We're running a deltastreamer job into a new Hudi table. We have a partition column, `story_published_partition_date`, and we set `hoodie.datasource.write.drop.partition.columns=true`. When the execution completes, we observe that the partition column is still present in the parquet file, including its data:

```shell
parquet-tools show s3:///articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet --head 1 --columns story_published_partition_date --awsprofile signal-prod
ℹ s3:///articles_hudi_copy_on_write_drop_partition_column_test/story_published_partition_date=2021-01-07/b4eec094-ea1e-4b95-839f-648592eddb08-0_18-26-3681_20211108115747.parquet => /var/folders/lx/83dtr4vx0cs87l55pwnwk760gq/T/tmp0ypmpxcw/f39767b2-3218-4ebe-9396-9549d6998c02.parquet
+--------------------------------+
| story_published_partition_date |
|--------------------------------|
| 2021-01-07T09:00:00Z           |
+--------------------------------+
```

Configuration:

```
"Args": [
  "spark-submit",
  "--deploy-mode", "cluster",
  "--class", "org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer",
  "--jars", "/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar,/home/hadoop/extra-jars/spark-avro_2.12-3.0.1.jar,/home/hadoop/extra-jars/hudi-spark3-bundle_2.12-0.9.0.jar",
  "/home/hadoop/extra-jars/hudi-utilities-bundle_2.12-0.9.0.jar",
  "--props", "/etc/hudi/conf/hudi-base.properties",
  "--table-type", "COPY_ON_WRITE",
  "--op", "UPSERT ",
  "--source-ordering-field", "version",
  "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
  "--transformer-class", "org.apache.hudi.utilities.transform.SqlFileBasedTransformer",
  "--target-base-path", "s3:///articles_hudi_copy_on_write_drop_partition_column_test/",
  "--target-table", "articles_hudi_copy_on_write_drop_partition_column_test",
  "--enable-hive-sync",
  "--hoodie-conf", "hoodie.table.name=articles_hudi_copy_on_write_drop_partition_column_test",
  "--hoodie-conf", "hoodie.deltastreamer.transformer.sql.file=/etc/hudi/conf/schema/documents_schema.sql",
  "--hoodie-conf", "hoodie.datasource.write.recordkey.field=id",
  "--hoodie-conf", "hoodie.datasource.write.precombine.field=version",
  "--hoodie-conf", "hoodie.bloom.index.prune.by.ranges=false",
  "--hoodie-conf", "hoodie.datasource.write.partitionpath.field=story_published_partition_date",
  "--hoodie-conf", "hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator",
  "--hoodie-conf", "hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING",
  "--hoodie-conf", "hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ",
  "--hoodie-conf", "hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex=,",
  "--hoodie-conf", "hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy-MM-dd",
  "--hoodie-conf", "hoodie.deltastreamer.keygen.timebased.output.timezone=UTC",
  "--hoodie-conf", "hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor",
  "--hoodie-conf", "hoodie.datasource.write.hive_style_partitioning=true",
  "--hoodie-conf", "hoodie.datasource.write.drop.partition.columns=true",
  "--hoodie-conf", "hoodie.datasource.write.reconcile.schema=true",
  "--hoodie-conf", "hoodie.datasource.hive_sync.enable=true",
  "--hoodie-conf", "hoodie.datasource.hive_sync.database=articles",
  "--hoodie-conf", "hoodie.datasource.hive_sync.table=articles_hudi_copy_on_write_drop_partition_column_test",
  "--hoodie-conf", "hoodie.datasource.hive_sync.partition_fields=story_published_partition_date",
  "--hoodie-conf", "hoodie.deltastreamer.source.dfs.root=s3:///firehose_received_date=2021-11-08/"
```