[GitHub] [hudi] Sugamber commented on issue #2637: [SUPPORT] - Partial Update : update few columns of a table
Sugamber commented on issue #2637: URL: https://github.com/apache/hudi/issues/2637#issuecomment-812973626 @n3nash I'm waiting for the below pull request to be merged. Please let me know if we can close this once the pull request is merged. https://github.com/apache/hudi/pull/2666 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #2762: [HUDI-1716] Regenerating schema with default null values as applicable
nsivabalan commented on pull request #2762: URL: https://github.com/apache/hudi/pull/2762#issuecomment-812969426 @bvaradar : I see that SchemaConverters.toAvroType(...) is used in 3 places in the Hudi code base. 2 of those have been fixed as part of this patch. Wondering if we need to fix HoodieSparkBootstrapSchemaProvider as well.
[jira] [Updated] (HUDI-1716) rt view w/ MOR tables fails after schema evolution
[ https://issues.apache.org/jira/browse/HUDI-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1716:
Labels: pull-request-available sev:critical user-support-issues (was: sev:critical user-support-issues)

rt view w/ MOR tables fails after schema evolution

Key: HUDI-1716
URL: https://issues.apache.org/jira/browse/HUDI-1716
Project: Apache Hudi
Issue Type: Bug
Components: Storage Management
Reporter: sivabalan narayanan
Assignee: Aditya Tiwari
Priority: Major
Labels: pull-request-available, sev:critical, user-support-issues
Fix For: 0.9.0

Looks like the realtime view w/ a MOR table fails if the schema present in an existing log file is evolved to add a new field. There are no issues with writing, but reading fails.
More info: https://github.com/apache/hudi/issues/2675

Gist of the stack trace:

Caused by: org.apache.avro.AvroTypeException: Found hoodie.hudi_trips_cow.hudi_trips_cow_record, expecting hoodie.hudi_trips_cow.hudi_trips_cow_record, missing required field evolvedField
  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
  at org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
  at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
  at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
  at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
  at org.apache.hudi.common.table.log.block.HoodieAvroDataBlock.deserializeRecords(HoodieAvroDataBlock.java:165)
  at org.apache.hudi.common.table.log.block.HoodieDataBlock.createRecordsFromContentBytes(HoodieDataBlock.java:128)
  at org.apache.hudi.common.table.log.block.HoodieDataBlock.getRecords(HoodieDataBlock.java:106)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processDataBlock(AbstractHoodieLogRecordScanner.java:289)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:324)
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:252)
  ... 24 more
21/03/25 11:27:03 WARN TaskSetManager: Lost task 0.0 in stage 83.0 (TID 667, sivabala-c02xg219jgh6.attlocal.net, executor driver): org.apache.hudi.exception.HoodieException: Exception when reading log file
  at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:261)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.performScan(HoodieMergedLogRecordScanner.java:100)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:93)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:75)
  at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner$Builder.build(HoodieMergedLogRecordScanner.java:230)
  at org.apache.hudi.HoodieMergeOnReadRDD$.scanLog(HoodieMergeOnReadRDD.scala:328)
  at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.<init>(HoodieMergeOnReadRDD.scala:210)
  at org.apache.hudi.HoodieMergeOnReadRDD.payloadCombineFileIterator(HoodieMergeOnReadRDD.scala:200)
  at org.apache.hudi.HoodieMergeOnReadRDD.compute(HoodieMergeOnReadRDD.scala:77)

Logs from local run: https://gist.github.com/nsivabalan/656956ab313676617d84002ef8942198
Diff with which the above logs were generated: https://gist.github.com/nsivabalan/84dad29bc1ab567ebb6ee8c63b3969ec

Steps to reproduce in spark shell:
1. Create a MOR table w/ schema1.
2. Ingest (with schema1) until log files are created (verify via hudi-cli). It took me 2 batches of updates to see a log file.
3. Create a new schema2 with one additional field. Ingest a batch with schema2 that updates existing records.
4. Read the entire dataset.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
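The failure above follows from Avro's schema-resolution rule: a field present in the reader schema but absent from the writer (log-file) schema must carry a default, otherwise decoding fails with `missing required field`. A minimal Python sketch of that rule, with illustrative field names and a hypothetical helper (not Avro or Hudi code):

```python
# Illustrative sketch of Avro's reader/writer resolution rule that produces
# the "missing required field" error above; not actual Avro or Hudi code.
def check_reader_compat(writer_fields, reader_schema):
    """Every reader field missing from the writer schema needs a default."""
    for field in reader_schema["fields"]:
        if field["name"] not in writer_fields and "default" not in field:
            raise ValueError("missing required field " + field["name"])

writer_fields = {"uuid", "fare"}  # schema1: what the log block was written with
schema2 = {"fields": [
    {"name": "uuid", "type": "string"},
    {"name": "fare", "type": "double"},
    # evolvedField added with no default -> old log blocks can no longer decode
    {"name": "evolvedField", "type": ["string", "null"]},
]}

try:
    check_reader_compat(writer_fields, schema2)
except ValueError as e:
    print(e)  # missing required field evolvedField

# With a null-first union and an explicit null default, resolution succeeds:
schema2["fields"][2] = {"name": "evolvedField",
                        "type": ["null", "string"], "default": None}
check_reader_compat(writer_fields, schema2)  # no error
```

This is why writing succeeds but reading fails only once a log block written with the old schema must be decoded against the evolved schema.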
[GitHub] [hudi] nsivabalan opened a new pull request #2762: [HUDI-1716] Regenerating schema with default null values as applicable
nsivabalan opened a new pull request #2762: URL: https://github.com/apache/hudi/pull/2762

## What is the purpose of the pull request
For UNION schema types, Avro expects "null" to be the first entry for null to be considered the default value. But since Hudi leverages SchemaConverters.toAvroType(...) from the org.apache.spark:spark-avro_* library, converting a StructType to Avro results in "null" being the 2nd entry for UNION-typed schemas. Also, no default value is set in the Avro schema thus generated. This patch fixes the issue.

## Brief change log
- Fix the StructType-to-Avro schema conversion so that "null" is the first entry in UNION schema types, and add default values for the same.

## Verify this pull request
This change added tests and can be verified as follows:
- Added tests to HoodieSparkSqlWriterSuite. This test fails without the source-code fix for MOR tables and succeeds with the fix.

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
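The shape of the fix can be sketched as a post-processing pass over the converted schema. This is an illustrative Python sketch under my own assumptions (the helper name and schema dict are hypothetical; the actual patch works on Spark's SchemaConverters output in Scala):

```python
# Hypothetical sketch of the fix described in the PR text: rewrite each
# nullable UNION so "null" comes first and a null default is present,
# as Avro requires for null to act as the default value.
import json

def fix_nullable_unions(schema: dict) -> dict:
    for field in schema.get("fields", []):
        t = field["type"]
        if isinstance(t, list) and "null" in t and t[0] != "null":
            field["type"] = ["null"] + [x for x in t if x != "null"]
            field.setdefault("default", None)  # explicit null default
    return schema

# spark-avro style output: union with "null" second and no default
converted = {"type": "record", "name": "rec", "fields": [
    {"name": "evolvedField", "type": ["string", "null"]}]}
fixed = fix_nullable_unions(converted)
print(json.dumps(fixed["fields"][0]))
# -> {"name": "evolvedField", "type": ["null", "string"], "default": null}
```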
[GitHub] [hudi] ssdong commented on issue #2707: [SUPPORT] insert_ovewrite_table failed on archiving
ssdong commented on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812964282 btw, @jsbali, are you working on both of these tickets? https://issues.apache.org/jira/browse/HUDI-1739 https://issues.apache.org/jira/browse/HUDI-1740 Can I pick up one as I would be happy to contribute?
[GitHub] [hudi] rubenssoto commented on issue #2508: [SUPPORT] Error upserting bucketType UPDATE for partition
rubenssoto commented on issue #2508: URL: https://github.com/apache/hudi/issues/2508#issuecomment-812936268 Now I'm testing with Hudi 0.8.0-rc1.
[GitHub] [hudi] rubenssoto commented on issue #2508: [SUPPORT] Error upserting bucketType UPDATE for partition
rubenssoto commented on issue #2508: URL: https://github.com/apache/hudi/issues/2508#issuecomment-812936246 When I don't enable the row writer on bulk insert, I have no problems.
[GitHub] [hudi] rubenssoto commented on issue #2508: [SUPPORT] Error upserting bucketType UPDATE for partition
rubenssoto commented on issue #2508: URL: https://github.com/apache/hudi/issues/2508#issuecomment-812936183 Hello, sorry for my very late response, but I tried again and I have problems with only one table. Same table, same error:

21/04/03 22:09:32 WARN TaskSetManager: Lost task 0.0 in stage 31.0 (TID 1025, ip-10-0-49-182.us-west-2.compute.internal, executor 2): org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
  at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:288)
  at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$execute$ecf5068c$1(BaseSparkCommitActionExecutor.java:139)
  at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
  at org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:889)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:889)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
  at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1388)
  at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
  at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:127)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
  at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:102)
  at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:317)
  at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:308)
  at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:281)
  ... 28 more
Caused by: org.apache.hudi.exception.HoodieException: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
  at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
  at org.apache.hudi.table.action.commit.SparkMergeHelper.runMerge(SparkMergeHelper.java:100)
  ... 31 more
Caused by: java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: operation has failed
  at java.util.concurrent.FutureTask.report(FutureTask.java:122)
  at java.util.concurrent.FutureTask.get(FutureTask.java:192)
  at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
  ... 32 more
Caused by: org.apache.hudi.exception.HoodieException: operation has failed
  at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.throwExceptionIfFailed(BoundedInMemoryQueue.java:247)
  at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.readNextRecord(BoundedInMemoryQueue.java:226)
  at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.access$100(BoundedInMemoryQueue.java:52) at
[GitHub] [hudi] rubenssoto commented on pull request #2627: [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator
rubenssoto commented on pull request #2627: URL: https://github.com/apache/hudi/pull/2627#issuecomment-812904015 Hello guys, will the NonPartitionedExtractor of hudi sync work too?
[GitHub] [hudi] pratyakshsharma commented on issue #2760: [SUPPORT] Possibly Incorrect Documentation
pratyakshsharma commented on issue #2760: URL: https://github.com/apache/hudi/issues/2760#issuecomment-812891514 Filed a jira here for the same - https://issues.apache.org/jira/browse/HUDI-1760.
[jira] [Created] (HUDI-1760) Incorrect Documentation for HoodieWriteConfigs
Pratyaksh Sharma created HUDI-1760:

Summary: Incorrect Documentation for HoodieWriteConfigs
Key: HUDI-1760
URL: https://issues.apache.org/jira/browse/HUDI-1760
Project: Apache Hudi
Issue Type: Bug
Reporter: Pratyaksh Sharma

GH Issue - https://github.com/apache/hudi/issues/2760
[jira] [Commented] (HUDI-1741) Row Level TTL Support for records stored in Hudi
[ https://issues.apache.org/jira/browse/HUDI-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17314297#comment-17314297 ] Pratyaksh Sharma commented on HUDI-1741: Guess the same can be handled with this Jira - https://issues.apache.org/jira/browse/HUDI-349? [~vbalaji] [~shivnarayan]

Row Level TTL Support for records stored in Hudi

Key: HUDI-1741
URL: https://issues.apache.org/jira/browse/HUDI-1741
Project: Apache Hudi
Issue Type: New Feature
Components: Utilities
Reporter: Balaji Varadarajan
Priority: Major

E.g.: have records only updated last month.

GH: https://github.com/apache/hudi/issues/2743
[GitHub] [hudi] pratyakshsharma removed a comment on pull request #2744: [HUDI-1742]improve table level config priority
pratyakshsharma removed a comment on pull request #2744: URL: https://github.com/apache/hudi/pull/2744#issuecomment-812887127 Please add a small test case covering this scenario. :)
[GitHub] [hudi] pratyakshsharma commented on pull request #2744: [HUDI-1742]improve table level config priority
pratyakshsharma commented on pull request #2744: URL: https://github.com/apache/hudi/pull/2744#issuecomment-812887127 Please add a small test case covering this scenario. :)
[jira] [Updated] (HUDI-1742) improve table level config priority in HoodieMultiTableDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1742:
Labels: pull-request-available (was: )

improve table level config priority in HoodieMultiTableDeltaStreamer

Key: HUDI-1742
URL: https://issues.apache.org/jira/browse/HUDI-1742
Project: Apache Hudi
Issue Type: Wish
Components: DeltaStreamer
Reporter: NickYoung
Priority: Major
Labels: pull-request-available

I hope that when the table-level configuration file and the common configuration file have the same configuration, the table-level configuration is used. But currently, if the table-level configuration file and the common configuration file have the same configuration, the value in the common configuration file is adopted.
https://hudi.apache.org/blog/ingest-multiple-tables-using-hudi/
[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2744: [HUDI-1742]improve table level config priority
pratyakshsharma commented on a change in pull request #2744: URL: https://github.com/apache/hudi/pull/2744#discussion_r606681411

## File path: hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java

```
@@ -117,7 +117,9 @@ private void populateTableExecutionContextList(TypedProperties properties, Strin
       checkIfTableConfigFileExists(configFolder, fs, configFilePath);
       TypedProperties tableProperties = UtilHelpers.readConfig(fs, new Path(configFilePath), new ArrayList<>()).getConfig();
       properties.forEach((k, v) -> {
-        tableProperties.setProperty(k.toString(), v.toString());
+        if (tableProperties.get(k) == null) {
```

Review comment: Can you add a check for null and empty value as well?
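The intended precedence can be sketched like this. It is a hedged Python illustration of the merge logic under review, including the reviewer's suggested empty-value check; the function name and config keys are mine, not the actual Java code:

```python
# Illustrative sketch (not the Hudi implementation): table-level properties
# should win over the common properties, so a common value is copied in only
# when the table-level file does not already set the key. Empty values are
# treated as unset, per the review comment.
def merge_configs(common: dict, table: dict) -> dict:
    merged = dict(table)
    for k, v in common.items():
        if merged.get(k) in (None, ""):  # unset or empty -> take common value
            merged[k] = v
    return merged

common = {"hoodie.datasource.write.operation": "upsert",
          "hoodie.upsert.shuffle.parallelism": "20"}
table = {"hoodie.datasource.write.operation": "insert"}
merged = merge_configs(common, table)
print(merged)  # table-level "insert" wins; missing key falls back to common
```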
[GitHub] [hudi] ssdong edited a comment on issue #2707: [SUPPORT] insert_ovewrite_table failed on archiving
ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812856042 @satishkotha Thanks for the tips! In fact, I was also able to reproduce this issue locally on my machine: the `insert_overwrite_table` issue that @jsbali has raised tickets https://issues.apache.org/jira/browse/HUDI-1739 and https://issues.apache.org/jira/browse/HUDI-1740 against. For tracking, and maybe to help you test the fix in the future, I am pasting the script I used for reproducing the issue here.

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import java.util.UUID
import java.sql.Timestamp

val tableName = "hudi_date_mor"
val basePath = "" // <-- fill out this value to point to your local folder (absolute path)

val writeConfigs = Map(
  "hoodie.cleaner.incremental.mode" -> "true",
  "hoodie.insert.shuffle.parallelism" -> "20",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.clean.automatic" -> "false",
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
  "hoodie.keep.max.commits" -> "3",
  "hoodie.cleaner.commits.retained" -> "1",
  "hoodie.keep.min.commits" -> "2",
  "hoodie.compact.inline.max.delta.commits" -> "1"
)

val dateSMap: Map[Int, String] = Map(
  0 -> "2020-07-01",
  1 -> "2020-08-01",
  2 -> "2020-09-01"
)

val dateMap: Map[Int, Timestamp] = Map(
  0 -> Timestamp.valueOf("2010-07-01 11:00:15"),
  1 -> Timestamp.valueOf("2010-08-01 11:00:15"),
  2 -> Timestamp.valueOf("2010-09-01 11:00:15")
)

var seq = Seq(
  (0, "value", dateMap(0), dateSMap(0), UUID.randomUUID.toString)
)
for (i <- 501 to 1000) {
  seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3), UUID.randomUUID.toString)
}
val df = seq.toDF("id", "string_column", "timestamp_column", "date_string", "uuid")
```

Run the spark shell (the one taken from the hudi quick start page; I am using spark version `spark-3.0.1-bin-hadoop2.7`):

```
./spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```

Copy the above script in there and run `df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)` 4 times; on the fifth time, it throws the anticipated error:

```
Caused by: java.lang.IllegalArgumentException: Positive number of partitions required
```

Now, the thing is, we _still_ have to _manually_ delete the first commit file, which contains the empty `partitionToReplaceFileIds`; otherwise, it keeps throwing the `Positive number of partitions required` error. Setting `"hoodie.embed.timeline.server" -> "false"` _does_ help, as it forces the writer to refresh its timeline so we don't see the second error again, which is

```
java.io.FileNotFoundException: /.hoodie/20210403201659.replacecommit does not exist
```

However, `"hoodie.embed.timeline.server" -> "false"` appears not to be _quite_ necessary, since on the _6th_ write the writer is automatically refreshed with the _newest_ timeline, which puts all `*replacecommit` files back into a consistent state. If we fix the empty `partitionToReplaceFileIds` issue, we might not need to dig into the `replacecommit does not exist` issue anymore, since it is caused by the workaround of _manually_ deleting the empty commit file. Fixing it would fix everything from the start. However, I would still be curious to learn _why_ we need a `reset` of the timeline server within the `close` action on the `HoodieTableFileSystemView`. It appears unnecessary to me and could be removed if there is no strong reason behind it. After a bit of digging in that code, the `reset` within `close` was originally introduced in #600. I hope that helps you narrow down the scope a little bit. Maybe @bvaradar could explain it, if the memory is still fresh, since that PR is from about 2 years ago. Thanks.
[GitHub] [hudi] ssdong edited a comment on issue #2707: [SUPPORT] insert_ovewrite_table failed on archiving
ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812856042 @satishkotha Thanks for the tips! In fact, I was also able to reproduce this issue locally on my machine regarding the `insert_overwrite_table` issue that @jsbali has raised tickets https://issues.apache.org/jira/browse/HUDI-1739 and https://issues.apache.org/jira/browse/HUDI-1740 against with. For tracking and maybe helping you guys test the fix(in the future), I am pasting the script I used for reproducing the issue here. ``` import org.apache.hudi.QuickstartUtils._ import scala.collection.JavaConversions._ import org.apache.spark.sql.SaveMode._ import org.apache.hudi.DataSourceReadOptions._ import org.apache.hudi.DataSourceWriteOptions._ import org.apache.hudi.config.HoodieWriteConfig._ import java.util.UUID import java.sql.Timestamp val tableName = "hudi_date_mor" val basePath = "" < fill out this value to point to your local folder(in absolute path) val writeConfigs = Map( "hoodie.cleaner.incremental.mode" -> "true", "hoodie.insert.shuffle.parallelism" -> "20", "hoodie.upsert.shuffle.parallelism" -> "2", "hoodie.clean.automatic" -> "false", "hoodie.datasource.write.operation" -> "insert_overwrite_table", "hoodie.table.name" -> tableName, "hoodie.datasource.write.table.type" -> "MERGE_ON_READ", "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS", "hoodie.keep.max.commits" -> "3", "hoodie.cleaner.commits.retained" -> "1", "hoodie.keep.min.commits" -> "2", "hoodie.compact.inline.max.delta.commits" -> "1" ) val dateSMap: Map[Int, String] = Map( 0-> "2020-07-01", 1-> "2020-08-01", 2-> "2020-09-01", ) val dateMap: Map[Int, Timestamp] = Map( 0-> Timestamp.valueOf("2010-07-01 11:00:15"), 1-> Timestamp.valueOf("2010-08-01 11:00:15"), 2-> Timestamp.valueOf("2010-09-01 11:00:15"), ) var seq = Seq( (0, "value", dateMap(0), dateSMap(0), UUID.randomUUID.toString) ) for(i <- 501 to 1000) { seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3), 
UUID.randomUUID.toString) } val df = seq.toDF("id", "string_column", "timestamp_column", "date_string", "uuid") ``` Run the spark shell(the one taken from hudi quick start page and I am using spark version `spark-3.0.1-bin-hadoop2.7`): ``` ./spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' ``` Copy the above script in there and hit `df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)` 4 times and on the fifth time, it throws the anticipated ``` Caused by: java.lang.IllegalArgumentException: Positive number of partitions required` issue. ``` Now, the thing is, we _still_ have to _manually_ delete the first commit file, which contains the empty `partitionToReplaceFileIds`; otherwise, it would still keep throwing the `Positive number of partitions required issue. error.` The `"hoodie.embed.timeline.server" -> "false"` _does_ help as it forces the write to refresh its timeline so we wouldn't see the second error again, which is ``` java.io.FileNotFoundException: /.hoodie/20210403201659.replacecommit does not exist ``` However, it appears `"hoodie.embed.timeline.server" -> "false"` to be not _quite_ necessary since the _6th_ time we write, the writer is automatically being refreshed with the _newest_ timeline and it will put all `*replacecommit` files back to a status of integrity again. If we fix the empty `partitionToReplaceFileIds` issue, we might not need to dig into the `replacecommit does not exist` issue anymore since it is caused by the workaround of _manually_ deleting the empty commit file. It would fix everything from the start. However, I would still be curious to learn about _why_ we would need a `reset` of the timeline server within the `close` action upon the `HoodieTableFileSystemView`. It appears unnecessary to me and could be removed if there is no strong reason behind it. 
The `reset` within `close` was originally introduced in #600 after a bit of digging in that code. I hope that helps you narrow down the scope a little bit. Maybe @bvaradar could explain it if the memory is still fresh to you since that PR is about 2 years ago from now. Thanks. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ssdong edited a comment on issue #2707: [SUPPORT] insert_ovewrite_table failed on archiving
ssdong edited a comment on issue #2707: URL: https://github.com/apache/hudi/issues/2707#issuecomment-812856042

@satishkotha Thanks for the tips! In fact, I was also able to reproduce this `insert_overwrite_table` issue locally on my machine — the one @jsbali has raised tickets https://issues.apache.org/jira/browse/HUDI-1739 and https://issues.apache.org/jira/browse/HUDI-1740 against. For tracking, and to help you test the fix in the future, here is the script I used to reproduce the issue:

```
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import java.util.UUID
import java.sql.Timestamp

val tableName = "hudi_date_mor"
val basePath = "" // fill out this value to point to your local folder (absolute path)

val writeConfigs = Map(
  "hoodie.cleaner.incremental.mode" -> "true",
  "hoodie.insert.shuffle.parallelism" -> "20",
  "hoodie.upsert.shuffle.parallelism" -> "2",
  "hoodie.clean.automatic" -> "false",
  "hoodie.datasource.write.operation" -> "insert_overwrite_table",
  "hoodie.table.name" -> tableName,
  "hoodie.datasource.write.table.type" -> "MERGE_ON_READ",
  "hoodie.cleaner.policy" -> "KEEP_LATEST_FILE_VERSIONS",
  "hoodie.keep.max.commits" -> "3",
  "hoodie.cleaner.commits.retained" -> "1",
  "hoodie.keep.min.commits" -> "2",
  "hoodie.compact.inline.max.delta.commits" -> "1"
)

val dateSMap: Map[Int, String] = Map(
  0 -> "2020-07-01",
  1 -> "2020-08-01",
  2 -> "2020-09-01"
)

val dateMap: Map[Int, Timestamp] = Map(
  0 -> Timestamp.valueOf("2010-07-01 11:00:15"),
  1 -> Timestamp.valueOf("2010-08-01 11:00:15"),
  2 -> Timestamp.valueOf("2010-09-01 11:00:15")
)

var seq = Seq(
  (0, "value", dateMap(0), dateSMap(0), UUID.randomUUID.toString)
)
for (i <- 501 to 1000) {
  seq :+= (i, "value", dateMap(i % 3), dateSMap(i % 3), UUID.randomUUID.toString)
}
val df = seq.toDF("id", "string_column", "timestamp_column", "date_string", "uuid")
```

Run the spark shell (the one taken from the Hudi quick start page; I am using `spark-3.0.1-bin-hadoop2.7`):

```
./spark-shell --packages org.apache.hudi:hudi-spark-bundle_2.12:0.7.0,org.apache.spark:spark-avro_2.12:3.0.1 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```

Paste the above script in there and run `df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)` 4 times; on the fifth time, it throws the anticipated `Caused by: java.lang.IllegalArgumentException: Positive number of partitions required` error. Now, the thing is, we _still_ have to _manually_ delete the first commit file, which contains the empty `partitionToReplaceFileIds`; otherwise, it keeps throwing the `Positive number of partitions required` error.

Setting `"hoodie.embed.timeline.server" -> "false"` _does_ help, as it forces the writer to refresh its timeline, so we would not see the second error again, which is

```
java.io.FileNotFoundException: /.hoodie/20210403201659.replacecommit does not exist
```

However, `"hoodie.embed.timeline.server" -> "false"` appears to be not _quite_ necessary, since on the _6th_ write the writer is automatically refreshed with the _newest_ timeline and all `*replacecommit` files are put back into a consistent state. If we fix the empty `partitionToReplaceFileIds` issue, we might not need to dig into the `replacecommit does not exist` issue anymore, since the latter is only caused by the workaround of _manually_ deleting the empty commit file; fixing the former would fix everything from the start. However, I would still be curious to learn _why_ we need a `reset` of the timeline server within the `close` action on the `HoodieTableFileSystemView`. It appears unnecessary to me and could be removed if there is no strong reason behind it.

After a bit of digging in that code, it looks like the `reset` within `close` was originally introduced in #600. I hope that helps you narrow down the scope a little bit. Maybe @bvaradar could explain it if the memory is still fresh, since that PR is from about 2 years ago. Thanks.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
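For anyone scripting the reproduction rather than issuing the write command by hand, the repeated-overwrite step described above can be sketched as a loop. This is a minimal sketch, not part of the original report: it assumes the `df`, `writeConfigs`, and `basePath` values from the script above are already defined in the Spark shell, and it requires a running Spark session.

```
// Sketch only: issues the same insert_overwrite_table write five times in a row.
// Per the report above, the 5th write is expected to fail with
// "IllegalArgumentException: Positive number of partitions required".
import org.apache.spark.sql.SaveMode.Overwrite

(1 to 5).foreach { attempt =>
  println(s"write attempt #$attempt")
  df.write.format("hudi").options(writeConfigs).mode(Overwrite).save(basePath)
}
```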
[GitHub] [hudi] codecov-io edited a comment on pull request #2757: [HUDI-1757] Assigns the buckets by record key for Flink writer
codecov-io edited a comment on pull request #2757: URL: https://github.com/apache/hudi/pull/2757#issuecomment-812247500

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2757?src=pr=h1) Report
> Merging [#2757](https://codecov.io/gh/apache/hudi/pull/2757?src=pr=desc) (33aa3f7) into [master](https://codecov.io/gh/apache/hudi/commit/9804662bc8e17d6936c20326f17ec7c0360dcaf6?el=desc) (9804662) will **decrease** coverage by `42.74%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2757/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2757?src=pr=tree)

```diff
@@             Coverage Diff              @@
##             master   #2757       +/-   ##
============================================
- Coverage     52.12%   9.38%   -42.75%
+ Complexity     3646      48     -3598
============================================
  Files           480      54      -426
  Lines         22867    1993    -20874
  Branches       2417     236     -2181
============================================
- Hits          11920     187    -11733
+ Misses         9916    1793     -8123
+ Partials       1031      13     -1018
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.38% <ø> (-60.36%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2757?src=pr=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
| [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
| [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
| [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
| [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
| [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
| [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
| [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2757/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | `0.00% <0.00%>