[jira] [Created] (HUDI-3658) Add Hudi Uber Meetup on March 1st
Nishith Agarwal created HUDI-3658: - Summary: Add Hudi Uber Meetup on March 1st Key: HUDI-3658 URL: https://issues.apache.org/jira/browse/HUDI-3658 Project: Apache Hudi Issue Type: Task Reporter: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (HUDI-3256) Add Links to Hudi Meetup Jan 2022
Nishith Agarwal created HUDI-3256: - Summary: Add Links to Hudi Meetup Jan 2022 Key: HUDI-3256 URL: https://issues.apache.org/jira/browse/HUDI-3256 Project: Apache Hudi Issue Type: Task Reporter: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-1576) Add ability to perform archival synchronously
[ https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17476506#comment-17476506 ] Nishith Agarwal commented on HUDI-1576: --- [~guoyihua] Yes, the idea was to detach archiving from being inline and make it async. Even today, though, archiving happens after the "COMMIT" is successfully completed and the file has been created on disk, so introducing a new action is not needed. I think archival can just run async and keep archiving contents without creating any action, since that may be overkill. One side effect I see is that we still need a way to track the progress and activity of archiving on a table. Since the .archive folder has this history, it should be fine. That's my opinion. > Add ability to perform archival synchronously > - > > Key: HUDI-1576 > URL: https://issues.apache.org/jira/browse/HUDI-1576 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core > Reporter: Nishith Agarwal > Assignee: Ethan Guo > Priority: Blocker > Fix For: 0.11.0 > > > Currently, archival runs inline. We want to move archival to a table service > like cleaning, compaction, etc., and treat it like one. Of course, no new action will be introduced. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
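To sketch what the table-service-style archival discussed above could look like in practice: the property names below are illustrative, modeled on Hudi's other table-service toggles (cleaning, compaction), and should be verified against the actual archival configuration class before use.

```properties
# Illustrative sketch only: property names are modeled on Hudi's other
# table-service toggles and must be verified against the real archival config.
hoodie.archive.automatic=false   # detach archival from the inline write path
hoodie.archive.async=true        # run archival as an async table service
```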
[jira] [Commented] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer
[ https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425912#comment-17425912 ] Nishith Agarwal commented on HUDI-2275: --- [~dave_hagman] To ensure that the checkpoints from deltastreamer commits are carried over when a concurrent datasource spark job is running, one needs to enable the following configuration: [https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java#L371] Can you please check if you have enabled this config? > HoodieDeltaStreamerException when using OCC and a second concurrent writer > -- > > Key: HUDI-2275 > URL: https://issues.apache.org/jira/browse/HUDI-2275 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer, Spark Integration, Writer Core > Affects Versions: 0.9.0 > Reporter: Dave Hagman > Assignee: Sagar Sumit > Priority: Critical > Labels: sev:critical > Fix For: 0.10.0 > > > I am trying to utilize [Optimistic Concurrency > Control|https://hudi.apache.org/docs/concurrency_control] in order to allow > two writers to update a single table simultaneously. The two writers are: > * Writer A: Deltastreamer job consuming continuously from Kafka > * Writer B: A spark datasource-based writer that is consuming parquet files > out of S3 > * Table Type: Copy on Write > > After a few commits from each writer, the deltastreamer will fail with the > following exception: > > {code:java} > org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find > previous checkpoint. Please double check if this table was indeed built via > delta streamer. 
Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, > Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={ > "partitionToWriteStats" : { > ...{code} > > What appears to be happening is a lack of commit isolation between the two > writers. > Writer B (spark datasource writer) will land commits which are eventually > picked up by Writer A (Delta Streamer). This is an issue because the Delta > Streamer needs checkpoint information which the spark datasource of course > does not include in its commits. My understanding was that OCC was built for > this very purpose (among others). > OCC config for Delta Streamer: > {code:java} > hoodie.write.concurrency.mode=optimistic_concurrency_control > hoodie.cleaner.policy.failed.writes=LAZY > > hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider > hoodie.write.lock.zookeeper.url= > hoodie.write.lock.zookeeper.port=2181 > hoodie.write.lock.zookeeper.lock_key=writer_lock > hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code} > > OCC config for spark datasource: > {code:java} > // Multi-writer concurrency > .option("hoodie.cleaner.policy.failed.writes", "LAZY") > .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control") > .option( > "hoodie.write.lock.provider", > > org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName() > ) > .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost) > .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort) > .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock") > .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code} > h3. 
Steps to Reproduce: > * Start a deltastreamer job against some table Foo > * In parallel, start writing to the same table Foo using the spark datasource > writer > * Note that after a few commits from each, the deltastreamer is likely to > fail with the above exception when the datasource writer creates non-isolated > inflight commits > NOTE: I have not tested this with two of the same datasources (e.g. two > deltastreamer jobs) > NOTE 2: Another detail that may be relevant is that the two writers are on > completely different spark clusters, but I assumed this shouldn't be an issue > since we're locking using Zookeeper -- This message was sent by Atlassian Jira (v8.3.4#803005)
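For readers hitting the same failure: the HoodieWriteConfig line referenced in the comment above concerns carrying selected commit-metadata keys forward into subsequent commits, so a non-deltastreamer writer does not drop the checkpoint. A hedged sketch of enabling it follows; the exact property name and value are recalled from memory and must be confirmed against the linked HoodieWriteConfig source.

```properties
# Assumed property name -- verify against HoodieWriteConfig.java (L371 above).
# Tells the datasource writer to copy forward commit-metadata entries whose
# keys start with the deltastreamer checkpoint prefix, so the checkpoint
# survives commits made by the non-deltastreamer writer.
hoodie.write.meta.key.prefixes=deltastreamer.checkpoint.key
```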
[jira] [Commented] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17382691#comment-17382691 ] Nishith Agarwal commented on HUDI-2146: --- [~wenningd] I see that a conflict is thrown when both inserts are started simultaneously. Insert 1:
{code:java}
scala> df3.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
  .option("hoodie.write.lock.filesystem.path", "/tmp/")
  .option("hoodie.write.lock.hivemetastore.database", "test_db")
  .option("hoodie.write.lock.hivemetastore.table", "hudi_test")
  .option("hoodie.write.lock.hivemetastore.uris", "")
  .mode(SaveMode.Append)
  .save(tablePath)
21/07/18 01:38:55 WARN hudi.DataSourceWriteOptions$: hoodie.datasource.write.storage.type is deprecated and will be removed in a later release; Please use hoodie.datasource.write.table.type
org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: 
Cannot resolve conflicts for overlapping writes at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102) at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:68) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:62) at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:456) at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:183) at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121) at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:564) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:230) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676) at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
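For context on the ConcurrentModificationException above: SimpleConcurrentFileWritesConflictResolutionStrategy flags two concurrent writes as conflicting when they touch overlapping file groups, which is expected OCC behavior when both inserts land records in the same file groups. A minimal sketch of that idea (illustrative only, not Hudi's actual implementation; Python is used for brevity):

```python
def has_write_conflict(current_files, other_files):
    """Two concurrent writes conflict iff they wrote to a common file group."""
    return bool(set(current_files) & set(other_files))

# Both writers touched file group "fg-2", so one commit must be aborted.
conflict = has_write_conflict({"fg-1", "fg-2"}, {"fg-2", "fg-3"})  # True
# Writers on disjoint file groups can both commit under OCC.
ok = has_write_conflict({"fg-1"}, {"fg-3"})  # False
```

Under OCC the losing writer typically retries; the exception in the trace is the conflict signal rather than evidence of corruption.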
[jira] [Updated] (HUDI-1824) Spark Datasource V2/V1 (Dataset) integration with ORC
[ https://issues.apache.org/jira/browse/HUDI-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1824: -- Summary: Spark Datasource V2/V1 (Dataset) integration with ORC (was: Spark Integration with ORC) > Spark Datasource V2/V1 (Dataset) integration with ORC > -- > > Key: HUDI-1824 > URL: https://issues.apache.org/jira/browse/HUDI-1824 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: Teresa Kang >Assignee: manasa >Priority: Major > > Implement HoodieInternalRowOrcWriter for spark datasource integration with > ORC. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-765: - Fix Version/s: 0.9.0 > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: lamber-ken >Assignee: Teresa Kang >Priority: Major > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter
[ https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-764: - Status: Closed (was: Patch Available) > Implement HoodieOrcWriter > - > > Key: HUDI-764 > URL: https://issues.apache.org/jira/browse/HUDI-764 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Assignee: Teresa Kang >Priority: Critical > Labels: pull-request-available > > Implement HoodieOrcWriter > * Avro to ORC schema > * Write record in row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-765) Implement OrcReaderIterator
[ https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-765: - Status: Closed (was: Patch Available) > Implement OrcReaderIterator > --- > > Key: HUDI-765 > URL: https://issues.apache.org/jira/browse/HUDI-765 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: lamber-ken >Assignee: Teresa Kang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter
[ https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-764: Assignee: (was: Teresa Kang) > Implement HoodieOrcWriter > - > > Key: HUDI-764 > URL: https://issues.apache.org/jira/browse/HUDI-764 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Priority: Critical > Labels: pull-request-available > > Implement HoodieOrcWriter > * Avro to ORC schema > * Write record in row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter
[ https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-764: Assignee: Teresa Kang > Implement HoodieOrcWriter > - > > Key: HUDI-764 > URL: https://issues.apache.org/jira/browse/HUDI-764 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: lamber-ken >Assignee: Teresa Kang >Priority: Critical > Labels: pull-request-available > > Implement HoodieOrcWriter > * Avro to ORC schema > * Write record in row -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2159) Supporting Clustering and Metadata Table together
[ https://issues.apache.org/jira/browse/HUDI-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379423#comment-17379423 ] Nishith Agarwal commented on HUDI-2159: --- Thanks for the detailed analysis [~pwason]. I think it is definitely worth solving (1) for the 0.9.0 release. This is a legitimate situation that can surface; especially when users schedule ingestion at a lower frequency, there are more chances of such collisions. For (2), since it is more of a perf degradation in cases of failures, we can address this right after 0.9 by landing the tailing timeline based on completion time. > Supporting Clustering and Metadata Table together > - > > Key: HUDI-2159 > URL: https://issues.apache.org/jira/browse/HUDI-2159 > Project: Apache Hudi > Issue Type: Sub-task > Reporter: Prashant Wason > Assignee: Prashant Wason > Priority: Blocker > Fix For: 0.9.0 > > > I am testing clustering support for metadata enabled table and found a few > issues. > *Setup* > Pipeline 1: Ingestion pipeline with Metadata Table enabled. Runs every 30 > mins. > Pipeline 2: Clustering pipeline with long running jobs (3-4 hours) > Pipeline 3: Another clustering pipeline with long running jobs (3-4 hours) > > *Issue #1: Parallel commits on Metadata Table* > Assume the Clustering pipeline is completing T5.replacecommit and the ingestion > pipeline is completing T10.commit. The Metadata Table will be synced at an instant. > Now both the pipelines will call syncMetadataTable(), which will do the following: > # Find all un-synced instants from the dataset (T5, T6 ... T10) > # Read each instant and perform a deltacommit on the Metadata Table with the > same timestamp as the instant. > There is a chance that two processes perform a deltacommit at T5 on the > metadata table and one will fail (instant file already exists). This raises an > exception, which will be detected as a pipeline failure, leading to > false-positive alerts. 
> > *Issue #2: No archiving/rollback support for failed clustering operations* > If a clustering operation fails, it leaves a left-over > T5.replacecommit.inflight. There is no automated way to roll back or archive > these. Since clustering is in general a long-running operation and may be run > through multiple pipelines at the same time, automated rollback of left-over > inflights doesn't work, as we cannot be sure that the process is dead. > Metadata Table sync only works in completion order. So if > T5.replacecommit.inflight is left over, the Metadata Table will not sync beyond > T5, causing a large number of LogBlocks to pile up which will have to be > merged in memory, leading to deteriorating performance. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
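One way to avoid the false-positive failures described in Issue #1 is to make the metadata sync idempotent: if another pipeline has already delta-committed an instant, skip it instead of failing. The sketch below is illustrative only (the name sync_unsynced_instants is hypothetical, and Hudi's real syncMetadataTable() differs; a real implementation would also need locking around the existence check). Python is used for brevity:

```python
def sync_unsynced_instants(dataset_instants, synced_instants, apply_deltacommit):
    """Apply unsynced instants in timestamp order, skipping ones another
    pipeline already applied (instead of raising 'instant already exists')."""
    applied = []
    for ts in sorted(dataset_instants):
        if ts in synced_instants:
            continue  # another pipeline won the race for this instant
        apply_deltacommit(ts)
        applied.append(ts)
    return applied

# Both the ingestion and clustering pipelines can run this; an instant already
# synced by the other pipeline is simply skipped rather than failing the job.
```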
[jira] [Commented] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377455#comment-17377455 ] Nishith Agarwal commented on HUDI-2146: --- [~wenningd] Thanks for the detailed description. A few questions: # What is `df3` in your insert1 and insert2 workloads? Is it the same dataframe? Can you please paste the input for each insert workload? # Can you paste the output you expect vs. the output you see, to understand where the data loss is? > Concurrent writes loss data > > > Key: HUDI-2146 > URL: https://issues.apache.org/jira/browse/HUDI-2146 > Project: Apache Hudi > Issue Type: Bug > Reporter: Wenning Ding > Priority: Blocker > Fix For: 0.9.0 > > Attachments: image-2021-07-08-00-49-30-730.png > > > Reproduction steps: > Create a Hudi table: > {code:java} > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.spark.sql.SaveMode > import org.apache.hudi.AvroConversionUtils > val df = Seq( > (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), > (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), > (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), > (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2") > ).toDF("event_id", "event_name", "event_ts", "event_type") > var tableName = "hudi_test" > var tablePath = "s3://.../" + tableName > // write hudi dataset > df.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > 
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Overwrite) > .save(tablePath) > {code} > Perform two insert operations almost in the same time, each insertion > contains different data: > Insert 1: > {code:java} > val df3 = Seq( > (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), > (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), > (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), > (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") > ).toDF("event_id", "event_name", "event_ts", "event_type") > // update hudi dataset > df3.write.format("org.apache.hudi") >.option(HoodieWriteConfig.TABLE_NAME, tableName) >.option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) >.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) >.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") >.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") >.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") >.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") >.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) >.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") >.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false") >.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") >.option("hoodie.write.concurrency.mode", "optimistic_concurrency_control") >.option("hoodie.cleaner.policy.failed.writes", "LAZY") >.option("hoodie.write.lock.provider", > 
"org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider") >.option("hoodie.write.lock.zookeeper.url", "ip-***.ec2.internal") >.option("hoodie.write.lock.zookeeper.port", "2181") >.option("hoodie.write.lock.zookeeper.lock_key", tableName) >.option("hoodie.write.lock.zookeeper.base_path", "/occ_lock") >.mode(SaveMode.Append) >.save(tablePath) > {code} > Insert 2: > {code:java} > val df3 = Seq( > (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), > (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), > (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), > (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") > ).toDF("event_id", "event_name", "event_ts",
[jira] [Updated] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-2146: -- Fix Version/s: 0.9.0 > Concurrent writes loss data > > > Key: HUDI-2146 > URL: https://issues.apache.org/jira/browse/HUDI-2146 > Project: Apache Hudi > Issue Type: Bug > Reporter: Wenning Ding > Priority: Blocker > Fix For: 0.9.0 > > Attachments: image-2021-07-08-00-49-30-730.png
[jira] [Updated] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-2146: -- Priority: Blocker (was: Major) > Concurrent writes loss data > > > Key: HUDI-2146 > URL: https://issues.apache.org/jira/browse/HUDI-2146 > Project: Apache Hudi > Issue Type: Bug > Reporter: Wenning Ding > Priority: Blocker > > Attachments: image-2021-07-08-00-49-30-730.png
[jira] [Created] (HUDI-2091) Add Uber's grafana dashboard to OSS
Nishith Agarwal created HUDI-2091: - Summary: Add Uber's grafana dashboard to OSS Key: HUDI-2091 URL: https://issues.apache.org/jira/browse/HUDI-2091 Project: Apache Hudi Issue Type: New Feature Components: metrics Reporter: Nishith Agarwal Assignee: Prashant Wason cc [~vinoth] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1537) Move validation of file listings to something that happens before each write
[ https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367807#comment-17367807 ] Nishith Agarwal commented on HUDI-1537: --- This logic is being removed. Additionally, falling back to file listing has been removed -> https://github.com/apache/hudi/pull/3079 > Move validation of file listings to something that happens before each write > > > Key: HUDI-1537 > URL: https://issues.apache.org/jira/browse/HUDI-1537 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > The current way of checking has issues dealing with log files and inflight > files. Code has comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1542) Fix Flaky test : TestHoodieMetadata#testSync
[ https://issues.apache.org/jira/browse/HUDI-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367808#comment-17367808 ] Nishith Agarwal commented on HUDI-1542: --- [~pwason] Will take this up next week. > Fix Flaky test : TestHoodieMetadata#testSync > > > Key: HUDI-1542 > URL: https://issues.apache.org/jira/browse/HUDI-1542 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Vinoth Chandar >Assignee: Prashant Wason >Priority: Blocker > Fix For: 0.9.0 > > > Only fails intermittently on CI. > {code} > [INFO] Running org.apache.hudi.metadata.TestHoodieBackedMetadata > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/travis/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > [WARN ] 2021-01-20 09:25:31,716 org.apache.spark.util.Utils - Your hostname, > localhost resolves to a loopback address: 127.0.0.1; using 10.30.0.81 instead > (on interface eth0) > [WARN ] 2021-01-20 09:25:31,725 org.apache.spark.util.Utils - Set > SPARK_LOCAL_IP if you need to bind to another address > [WARN ] 2021-01-20 09:25:32,412 org.apache.hadoop.util.NativeCodeLoader - > Unable to load native-hadoop library for your platform... 
using builtin-java > classes where applicable > [WARN ] 2021-01-20 09:25:36,645 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:25:36,700 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:26:30,250 > org.apache.hudi.client.AbstractHoodieWriteClient - Cannot find instant > 20210120092628 in the timeline, for rollback > [WARN ] 2021-01-20 09:26:45,980 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:26:46,568 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:26:46,580 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:27:27,853 > org.apache.hudi.client.AbstractHoodieWriteClient - Cannot find instant > 20210120092726 in the timeline, for rollback > [WARN ] 2021-01-20 09:27:43,037 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:27:46,017 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit3284615140376500245/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:28:05,357 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092805__replacecommit__REQUESTED] > [WARN ] 2021-01-20 09:28:05,887 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092805__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:06,312 org.apache.hudi.common.util.ClusteringUtils > - No 
content found in requested file for instant > [==>20210120092805__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:18,402 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:28:22,013 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit4284626513859445824/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:28:40,354 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__REQUESTED] > [WARN ] 2021-01-20 09:28:40,780 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:41,162 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__INFLIGHT] > =[ 605 seconds still running ]= > [ERROR] 2021-01-20 09:28:50,683 > org.apache.hudi.timeline.service.FileSystemViewHandler - Got runtime > exception
[jira] [Commented] (HUDI-1492) Handle DeltaWriteStat correctly for storage schemes that support appends
[ https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367806#comment-17367806 ] Nishith Agarwal commented on HUDI-1492: --- Confirmed with [~pwason] that this does not affect correctness of the metadata table. > Handle DeltaWriteStat correctly for storage schemes that support appends > > > Key: HUDI-1492 > URL: https://issues.apache.org/jira/browse/HUDI-1492 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinoth Chandar >Assignee: Prashant Wason >Priority: Blocker > Fix For: 0.9.0 > > > The current implementation simply uses > {code:java} > String pathWithPartition = hoodieWriteStat.getPath(); {code} > to write the metadata table. This is problematic if the delta write was > merely an append, and can technically add duplicate files into the metadata > table > (not sure if this is a problem per se, but filing a Jira to track and either > close/fix). > -- This message was sent by Atlassian Jira (v8.3.4#803005)
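The duplicate-files concern above can be illustrated with a small, self-contained sketch. All names here are hypothetical (this is not Hudi's actual metadata-table API): if each write stat's path is recorded verbatim, a log file appended to in two delta writes shows up twice, whereas collecting paths into a set per partition records it once.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch (not Hudi's actual API): collect write-stat paths
// into a Set per partition, so a log file that appears in two delta
// write stats (two appends) lands in the files index only once.
public class DedupeWriteStatPaths {

    // each writeStats entry: {partitionPath, filePath}
    public static Map<String, Set<String>> dedupe(List<String[]> writeStats) {
        Map<String, Set<String>> filesPerPartition = new HashMap<>();
        for (String[] stat : writeStats) {
            filesPerPartition
                .computeIfAbsent(stat[0], k -> new TreeSet<>())
                .add(stat[1]); // Set semantics drop the second append entry
        }
        return filesPerPartition;
    }

    public static void main(String[] args) {
        List<String[]> stats = Arrays.asList(
            new String[]{"2021/01/01", ".f1_20210101.log.1"}, // first append
            new String[]{"2021/01/01", ".f1_20210101.log.1"}, // second append, same file
            new String[]{"2021/01/01", "f2_20210101.parquet"});
        // the log file appears once despite two write stats
        System.out.println(dedupe(stats).get("2021/01/01").size()); // 2
    }
}
```

This matches the comment above: the duplication is a bookkeeping artifact of append-style writes, not a correctness issue, as long as the consumer of the index treats the entries as a set.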
[jira] [Assigned] (HUDI-1077) Integration tests to validate clustering
[ https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1077: - Assignee: satish > Integration tests to validate clustering > > > Key: HUDI-1077 > URL: https://issues.apache.org/jira/browse/HUDI-1077 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Blocker > Fix For: 0.9.0 > > > extend test-suite module to validate clustering -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
[ https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1839: -- Priority: Blocker (was: Major) > FSUtils getAllPartitions broken by NotSerializableException: > org.apache.hadoop.fs.Path > -- > > Key: HUDI-1839 > URL: https://issues.apache.org/jira/browse/HUDI-1839 > Project: Apache Hudi > Issue Type: Bug >Reporter: satish >Priority: Blocker > > FSUtils getAllPartitionPaths is expected to work if metadata table is enabled > or not. It can also be called inside spark context. But looks like we are > trying to improve parallelism and causing NotSerializableExceptions. There > are multiple callers using it within spark context (clustering/cleaner). > See stack trace below > 21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering > ApplicationMaster with FAILED (diag message: User class threw exception: > org.apache.hudi.exception.HoodieException: Error fetching partition paths > from metadata table > at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321) > at > org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67) > at > org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71) > at > org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56) > at > org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160) > at > org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873) > at > org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861) > at > com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111) > at 
com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690) > Caused by: org.apache.spark.SparkException: Job aborted due to stage > failure: Failed to serialize task 53, not attempting to retry it. Exception > during serialization: java.io.NotSerializableException: > org.apache.hadoop.fs.Path > Serialization stack: > - object not serializable (class: org.apache.hadoop.fs.Path, value: > hdfs://...) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, > type: class [Ljava.lang.Object;) > - object (class scala.collection.mutable.WrappedArray$ofRef, > WrappedArray(hdfs://...)) > - writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition) > - object (class org.apache.spark.rdd.ParallelCollectionPartition, > org.apache.spark.rdd.ParallelCollectionPartition@735) > - field (class: org.apache.spark.scheduler.ResultTask, name: partition, > type: interface org.apache.spark.Partition) > - object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0)) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2063) > at
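The root cause in the stack trace above is that org.apache.hadoop.fs.Path does not implement java.io.Serializable, so it cannot be shipped inside a Spark task; the usual workaround is to parallelize the String form of the paths and reconstruct Path objects on the executors. A minimal stand-alone sketch of the failure mode, with no Spark required — NonSerializablePath is a stand-in for the Hadoop class:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Demonstrates why serializing a collection of Path-like objects fails:
// Java serialization walks into the list's elements, and a type that does
// not implement Serializable triggers NotSerializableException.
public class SerializationDemo {

    // Stand-in for org.apache.hadoop.fs.Path, which is not Serializable.
    public static class NonSerializablePath {
        final String uri;
        public NonSerializablePath(String uri) { this.uri = uri; }
    }

    public static boolean canSerialize(Object o) {
        try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
            oos.writeObject(o);
            return true;
        } catch (IOException e) { // NotSerializableException extends IOException
            return false;
        }
    }

    public static void main(String[] args) {
        List<NonSerializablePath> paths =
            Collections.singletonList(new NonSerializablePath("hdfs://ns/part=0"));
        List<String> pathStrings = Collections.singletonList("hdfs://ns/part=0");
        System.out.println(canSerialize(new ArrayList<>(paths)));       // false
        System.out.println(canSerialize(new ArrayList<>(pathStrings))); // true
    }
}
```

In Spark terms: `sparkContext.parallelize(pathStrings)` works where parallelizing the Path objects aborts the stage, which is consistent with the `WrappedArray(hdfs://...)` element flagged in the serialization stack above.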
[jira] [Updated] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
[ https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1839: -- Fix Version/s: 0.9.0 > FSUtils getAllPartitions broken by NotSerializableException: > org.apache.hadoop.fs.Path > -- > (issue description and stack trace identical to the previous message)
[jira] [Commented] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path
[ https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367647#comment-17367647 ] Nishith Agarwal commented on HUDI-1839: --- [~pwason] Is this something that we have identified the root cause for ? cc [~uditme] [~satishkotha] > FSUtils getAllPartitions broken by NotSerializableException: > org.apache.hadoop.fs.Path > -- > (issue description and stack trace identical to the previous message)
[jira] [Updated] (HUDI-1047) Support asynchronize clustering in CoW mode
[ https://issues.apache.org/jira/browse/HUDI-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1047: -- Fix Version/s: 0.9.0 > Support asynchronize clustering in CoW mode > --- > > Key: HUDI-1047 > URL: https://issues.apache.org/jira/browse/HUDI-1047 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Blocker > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode
[ https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1048: -- Fix Version/s: 0.9.0 > Support Asynchronize clustering in MoR mode > --- > > Key: HUDI-1048 > URL: https://issues.apache.org/jira/browse/HUDI-1048 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Blocker > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode
[ https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1048: -- Priority: Blocker (was: Major) > Support Asynchronize clustering in MoR mode > --- > > Key: HUDI-1048 > URL: https://issues.apache.org/jira/browse/HUDI-1048 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test
[ https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1706: -- Priority: Blocker (was: Major) > Test flakiness w/ multiwriter test > -- > > Key: HUDI-1706 > URL: https://issues.apache.org/jira/browse/HUDI-1706 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: Nishith Agarwal >Priority: Blocker > > [https://api.travis-ci.com/v3/job/492130170/log.txt] > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test
[ https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1706: -- Fix Version/s: 0.9.0 > Test flakiness w/ multiwriter test > -- > (issue details identical to the previous message) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2026) Add documentation for GlobalDeleteKeyGenerator
Nishith Agarwal created HUDI-2026: - Summary: Add documentation for GlobalDeleteKeyGenerator Key: HUDI-2026 URL: https://issues.apache.org/jira/browse/HUDI-2026 Project: Apache Hudi Issue Type: Sub-task Reporter: Nishith Agarwal Assignee: sivabalan narayanan [https://github.com/apache/hudi/issues/3008] {code:java} - should hard delete records from hudi table with hive sync *** FAILED *** (24 seconds, 49 milliseconds) Cause: java.lang.NoSuchMethodException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator.<init>() [scalatest] at java.lang.Class.getConstructor0(Class.java:3110) [scalatest] at java.lang.Class.newInstance(Class.java:412) [scalatest] at org.apache.hudi.hive.HoodieHiveClient.<init>(HoodieHiveClient.java:98) [scalatest] at org.apache.hudi.hive.HiveSyncTool.<init>(HiveSyncTool.java:69) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436) [scalatest] at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497) [scalatest] at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222) [scalatest] at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145) [scalatest] at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) [scalatest] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) [scalatest] at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) [scalatest] at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) [scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) [scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) [scalatest] at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) [scalatest] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) [scalatest] at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) [scalatest] at org.apach {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x
[ https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363911#comment-17363911 ] Nishith Agarwal commented on HUDI-1975: --- [~vinaypatil18] I think there are 2 options: 1. Shade the dropwizard inside Hudi to let Hudi use 4.1.x. 2. Downgrade to 3.1.x and make changes for the workaround. To be able to answer this, can you dig into whether shading will help? (Does the prometheus package bring its own dropwizard, or is the environment expected to provide it?) Secondly, can you dig up when the 4.x upgrade was done and what the reason for it was? We can take a call then > Upgrade java-prometheus-client from 3.1.2 to 4.x > > > Key: HUDI-1975 > URL: https://issues.apache.org/jira/browse/HUDI-1975 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > > > Find more details here -> https://github.com/apache/hudi/issues/2774 -- This message was sent by Atlassian Jira (v8.3.4#803005)
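For option 1 above, a minimal maven-shade-plugin relocation sketch. This is illustrative only — Hudi's actual bundle poms and relocation patterns may differ — but it shows the mechanism: Dropwizard's `com.codahale.metrics` classes are moved under a Hudi-private package inside the bundle, so the copy Hudi compiles against cannot clash with whichever version the prometheus client or the runtime environment brings.

```xml
<!-- Hypothetical fragment of a bundle pom: relocate Hudi's bundled
     Dropwizard metrics classes so they cannot conflict with the
     version expected by the prometheus client at runtime. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.codahale.metrics.</pattern>
        <shadedPattern>org.apache.hudi.com.codahale.metrics.</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

Whether this actually helps hinges on the question raised in the comment: if the prometheus-to-dropwizard bridge must see the same MetricRegistry class that Hudi registers metrics into, relocation would break that link — which is exactly why checking how the dependency is provided comes first.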
[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-2003: -- Summary: Auto Compute Compression ratio for input data to output parquet/orc file size (was: Auto Compute Compression) > Auto Compute Compression ratio for input data to output parquet/orc file size > - > > Key: HUDI-2003 > URL: https://issues.apache.org/jira/browse/HUDI-2003 > Project: Apache Hudi > Issue Type: Bug > Components: Writer Core >Reporter: Vinay >Priority: Major > > Context: > Submitted a spark job to read 3-4B ORC records and wrote to Hudi format. > Creating the following table with all the runs that I had carried out based > on different options > > ||CONFIG ||Number of Files Created||Size of each file|| > |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB| > |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB| > |PARQUET_FILE_MAX_BYTES=1GB > COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before| > |PARQUET_FILE_MAX_BYTES=1GB > BULKINSERT_PARALLELISM=100|Same as before|Same as before| > |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB| > |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB| > Based on these runs, it seems that the compression ratio is off. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
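As a back-of-the-envelope check on the table above: the ratio of actual output file size to the configured max bytes is roughly constant (~0.16-0.17) across the 1GB, 4GB, and 6GB runs, which supports the reporter's reading that a fixed assumed compression ratio is being applied regardless of the data. The figures below come from the table; the interpretation is the reporter's.

```java
// Sanity check: actual output size divided by configured max bytes,
// using the numbers from the table (sizes in MB, targets in MB).
public class CompressionRatioCheck {

    public static double ratio(double actualMb, double targetMb) {
        return actualMb / targetMb;
    }

    public static void main(String[] args) {
        System.out.printf("1GB target -> 178MB actual:  %.2f%n", ratio(178, 1024));
        System.out.printf("4GB target -> 675MB actual:  %.2f%n", ratio(675, 4096));
        System.out.printf("6GB target -> 1012MB actual: %.2f%n", ratio(1012, 6144));
        // all three come out near 0.16-0.17, i.e. a constant factor
    }
}
```

A near-constant observed ratio suggests the writer sizes input records using one fixed input-to-parquet compression estimate, which is what an auto-computed ratio would replace.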
[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size
[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-2003: -- Issue Type: Improvement (was: Bug) > Auto Compute Compression ratio for input data to output parquet/orc file size > - > > (description and table identical to the previous message) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363067#comment-17363067 ] Nishith Agarwal commented on HUDI-1910: --- [~vinaypatil18] Yes, that makes sense, please go ahead. > Supporting Kafka based checkpointing for HoodieDeltaStreamer > > > Key: HUDI-1910 > URL: https://issues.apache.org/jira/browse/HUDI-1910 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Nishith Agarwal >Assignee: Vinay >Priority: Major > Labels: sev:normal, triaged > > HoodieDeltaStreamer currently supports commit metadata based checkpoint. Some > users have requested support for Kafka based checkpoints for freshness > auditing purposes. This ticket tracks any implementation for that. -- This message was sent by Atlassian Jira (v8.3.4#803005)
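Regarding the Kafka-based checkpoint discussed above: a rough sketch of formatting and parsing an offsets checkpoint. The "topic,partition:offset,..." shape mirrors the checkpoint style DeltaStreamer uses for Kafka sources, but the exact production format and the class and method names here are assumptions for illustration, not the implementation.

```java
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hedged sketch: round-trip a per-partition Kafka offsets map through a
// checkpoint string of the form "topic,partition:offset,partition:offset".
public class KafkaCheckpointFormat {

    public static String format(String topic, SortedMap<Integer, Long> offsets) {
        return topic + "," + offsets.entrySet().stream()
            .map(e -> e.getKey() + ":" + e.getValue())
            .collect(Collectors.joining(","));
    }

    public static SortedMap<Integer, Long> parse(String checkpoint) {
        String[] parts = checkpoint.split(",");
        SortedMap<Integer, Long> offsets = new TreeMap<>();
        for (int i = 1; i < parts.length; i++) { // parts[0] is the topic name
            String[] po = parts[i].split(":");
            offsets.put(Integer.parseInt(po[0]), Long.parseLong(po[1]));
        }
        return offsets;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Long> offsets = new TreeMap<>();
        offsets.put(0, 123L);
        offsets.put(1, 456L);
        String cp = format("events", offsets);
        System.out.println(cp);                        // events,0:123,1:456
        System.out.println(parse(cp).equals(offsets)); // true
    }
}
```

For the auditing use case in the ticket, committing these same offsets back to Kafka's consumer-group store would let standard lag-monitoring tools report freshness alongside the commit-metadata checkpoint.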
[jira] [Comment Edited] (HUDI-2005) Audit and remove references of fs.listStatus()
[ https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662 ] Nishith Agarwal edited comment on HUDI-2005 at 6/13/21, 10:54 PM: -- 1. org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping 2. org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants 3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles 4. org.apache.hudi.table.MarkerFiles#deleteMarkerDir was (Author: nishith29): 1. org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping 2. org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants 3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles > Audit and remove references of fs.listStatus() > -- > > Key: HUDI-2005 > URL: https://issues.apache.org/jira/browse/HUDI-2005 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
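The comment above enumerates direct `fs.listStatus()` call sites found so far. As a hedged sketch of how such an audit can be mechanized, a grep-style helper can scan source lines and report every direct call site for review (e.g. for replacement with metadata-table-backed listings). This is a standalone illustration, not part of Hudi.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative audit helper: find lines containing a direct .listStatus(
// call so each site can be reviewed and, where possible, replaced with a
// metadata-table lookup that avoids expensive DFS listings.
public class ListStatusAudit {

    /** Returns 1-indexed line numbers of direct listStatus call sites. */
    static List<Integer> findCallSites(List<String> sourceLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            if (sourceLines.get(i).contains(".listStatus(")) {
                hits.add(i + 1);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "FileStatus[] statuses = fs.listStatus(partitionPath);",
            "// a metadata-backed listing avoids the direct DFS call",
            "List<String> files = metadataListing(partitionPath);");
        System.out.println(ListStatusAudit.findCallSites(lines));
    }
}
```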
[jira] [Comment Edited] (HUDI-2005) Audit and remove references of fs.listStatus()
[ https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662 ] Nishith Agarwal edited comment on HUDI-2005 at 6/13/21, 10:49 PM: -- 1. org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping 2. org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants 3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles was (Author: nishith29): 1. org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping 2. org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants > Audit and remove references of fs.listStatus() > -- > > Key: HUDI-2005 > URL: https://issues.apache.org/jira/browse/HUDI-2005 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2005) Audit and remove references of fs.listStatus()
[ https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-2005: - Assignee: Prashant Wason (was: Nishith Agarwal) > Audit and remove references of fs.listStatus() > -- > > Key: HUDI-2005 > URL: https://issues.apache.org/jira/browse/HUDI-2005 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-2005) Audit and remove references of fs.listStatus()
[ https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662 ] Nishith Agarwal commented on HUDI-2005: --- 1. org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping 2. org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants > Audit and remove references of fs.listStatus() > -- > > Key: HUDI-2005 > URL: https://issues.apache.org/jira/browse/HUDI-2005 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2005) Audit and remove references of fs.listStatus()
Nishith Agarwal created HUDI-2005: - Summary: Audit and remove references of fs.listStatus() Key: HUDI-2005 URL: https://issues.apache.org/jira/browse/HUDI-2005 Project: Apache Hudi Issue Type: Sub-task Reporter: Nishith Agarwal Assignee: Nishith Agarwal -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)
[ https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1457: -- Summary: Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames) (was: Add multi writing to Hudi tables using DFS based locking) > Add multi writing to Hudi tables using DFS based locking (only HDFS atomic > renames) > --- > > Key: HUDI-1457 > URL: https://issues.apache.org/jira/browse/HUDI-1457 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking
[ https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1457: -- Summary: Add multi writing to Hudi tables using DFS based locking (was: Add parallel writing to Hudi tables using DFS based locking) > Add multi writing to Hudi tables using DFS based locking > > > Key: HUDI-1457 > URL: https://issues.apache.org/jira/browse/HUDI-1457 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1679) Add example to docker for optimistic lock use
[ https://issues.apache.org/jira/browse/HUDI-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal resolved HUDI-1679. --- Fix Version/s: 0.8.0 Resolution: Fixed > Add example to docker for optimistic lock use > - > > Key: HUDI-1679 > URL: https://issues.apache.org/jira/browse/HUDI-1679 > Project: Apache Hudi > Issue Type: Task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull
[ https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1623: -- Fix Version/s: 0.10.0 > Support start_commit_time & end_commit_times for serializable incremental pull > -- > > Key: HUDI-1623 > URL: https://issues.apache.org/jira/browse/HUDI-1623 > Project: Apache Hudi > Issue Type: New Feature > Components: Common Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.10.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1575) Early detection by periodically checking last written commit
[ https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1575: -- Summary: Early detection by periodically checking last written commit (was: Early detection, last written commit, also check if there are more commits, try to do resolution, and abort. ) > Early detection by periodically checking last written commit > > > Key: HUDI-1575 > URL: https://issues.apache.org/jira/browse/HUDI-1575 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1575) Early detection by periodically checking last written commit
[ https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1575: -- Description: Check if there are more commits, try to do resolution, and abort for a currently running job to avoid using up resources and running a concurrent job if we already found a commit that happened in the meantime > Early detection by periodically checking last written commit > > > Key: HUDI-1575 > URL: https://issues.apache.org/jira/browse/HUDI-1575 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > > Check if there are more commits, try to do resolution, and abort for a > currently running job to avoid using up resources and running a concurrent > job if we already found a commit that happened in the meantime -- This message was sent by Atlassian Jira (v8.3.4#803005)
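The early-detection idea described above can be sketched as a periodic check: a running writer compares the latest completed instants on the timeline with the instant it started from, and if a newer commit has landed it attempts conflict resolution (or aborts) immediately rather than spending resources to completion. The timeline is simplified here to sorted instant timestamps; this is not the Hudi timeline API.

```java
import java.util.List;

// Hedged sketch of early conflict detection for a concurrent writer:
// if any completed instant is newer than the instant we started from,
// another writer has committed in the meantime.
public class EarlyConflictCheck {

    /** True if some other writer completed a commit after ourStartInstant. */
    static boolean newerCommitExists(List<String> completedInstants, String ourStartInstant) {
        // Hudi-style instants are lexically ordered timestamps (yyyyMMddHHmmss).
        return completedInstants.stream().anyMatch(i -> i.compareTo(ourStartInstant) > 0);
    }

    public static void main(String[] args) {
        List<String> timeline = List.of("20210601120000", "20210601130000");
        // We started at 12:30; the 13:00 commit means a concurrent writer
        // finished first, so this job could resolve or abort early.
        System.out.println(EarlyConflictCheck.newerCommitExists(timeline, "20210601123000"));
    }
}
```

In practice the check would run on a timer inside the write client, trading a little timeline polling for not wasting a full job's worth of compute on a doomed commit.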
[jira] [Commented] (HUDI-944) Support more complete concurrency control when writing data
[ https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362627#comment-17362627 ] Nishith Agarwal commented on HUDI-944: -- Concurrent writing to HUDI tables is now supported. Closing this issue now. [~309637554] Let me know if there is something missing from your requirements so we can open a specific ticket under the concurrent writing umbrella ticket as a follow-up. > Support more complete concurrency control when writing data > > > Key: HUDI-944 > URL: https://issues.apache.org/jira/browse/HUDI-944 > Project: Apache Hudi > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.9.0 > > > Now hudi just support write、compaction concurrency control. But some scenario > need write concurrency control.Such as two spark job with different data > source ,need to write to the same hudi table. > I have two Proposal: > 1. first step :support write concurrency control on different partition > but now when two client write data to different partition, will meet these > error > a、Rolling back commits failed > b、instants version already exist > {code:java} > [2020-05-25 21:20:34,732] INFO Checking for file exists > ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight > (org.apache.hudi.common.table.timeline.HoodieActiveTimeline) > Exception in thread "main" org.apache.hudi.exception.HoodieIOException: > Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183) > at > 
org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > {code} > c、two client's archiving conflict > d、the read client meets "Unable to infer schema for Parquet. It must be > specified manually.;" > 2. second step:support insert、upsert、compaction concurrency control on > different isolation level such as Serializable、WriteSerializable. > hudi can design a mechanism to check the confict in > AbstractHoodieWriteClient.commit() > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-944) Support more complete concurrency control when writing data
[ https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal resolved HUDI-944. -- Fix Version/s: 0.8.0 Resolution: Fixed > Support more complete concurrency control when writing data > > > Key: HUDI-944 > URL: https://issues.apache.org/jira/browse/HUDI-944 > Project: Apache Hudi > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: liwei >Assignee: liwei >Priority: Major > Fix For: 0.9.0, 0.8.0 > > > Now hudi just support write、compaction concurrency control. But some scenario > need write concurrency control.Such as two spark job with different data > source ,need to write to the same hudi table. > I have two Proposal: > 1. first step :support write concurrency control on different partition > but now when two client write data to different partition, will meet these > error > a、Rolling back commits failed > b、instants version already exist > {code:java} > [2020-05-25 21:20:34,732] INFO Checking for file exists > ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight > (org.apache.hudi.common.table.timeline.HoodieActiveTimeline) > Exception in thread "main" org.apache.hudi.exception.HoodieIOException: > Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327) > at > org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183) > at > org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142) > at > org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88) > at > 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) > {code} > c、two client's archiving conflict > d、the read client meets "Unable to infer schema for Parquet. It must be > specified manually.;" > 2. second step:support insert、upsert、compaction concurrency control on > different isolation level such as Serializable、WriteSerializable. > hudi can design a mechanism to check the confict in > AbstractHoodieWriteClient.commit() > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1577) Document that multi-writer cannot be used within the same write client
[ https://issues.apache.org/jira/browse/HUDI-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1577: -- Fix Version/s: 0.9.0 > Document that multi-writer cannot be used within the same write client > -- > > Key: HUDI-1577 > URL: https://issues.apache.org/jira/browse/HUDI-1577 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1577) Document that multi-writer cannot be used within the same write client
[ https://issues.apache.org/jira/browse/HUDI-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1577: -- Priority: Blocker (was: Major) > Document that multi-writer cannot be used within the same write client > -- > > Key: HUDI-1577 > URL: https://issues.apache.org/jira/browse/HUDI-1577 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test
[ https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1706: -- Parent: HUDI-1456 Issue Type: Sub-task (was: Bug) > Test flakiness w/ multiwriter test > -- > > Key: HUDI-1706 > URL: https://issues.apache.org/jira/browse/HUDI-1706 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: sivabalan narayanan >Assignee: Nishith Agarwal >Priority: Major > > [https://api.travis-ci.com/v3/job/492130170/log.txt] > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1456) [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables
[ https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1456: -- Summary: [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables (was: [UMBRELLA] Concurrent Writing to Hudi tables) > [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables > -- > > Key: HUDI-1456 > URL: https://issues.apache.org/jira/browse/HUDI-1456 > Project: Apache Hudi > Issue Type: New Feature > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: hudi-umbrellas > Fix For: 0.9.0 > > Attachments: image-2020-12-14-09-48-46-946.png > > > This ticket tracks all the changes needed to support concurrency control for > Hudi tables. This work will be done in multiple phases. > # Parallel writing to Hudi tables support -> This feature will allow users > to have multiple writers mutate the tables without the ability to perform > concurrent update to the same file. > # Concurrency control at file/record level -> This feature will allow users > to have multiple writers mutate the tables with the ability to ensure > serializability at record level. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java
[ https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1698: -- Parent: HUDI-1456 Issue Type: Sub-task (was: Improvement) > Multiwriting for Flink / Java > - > > Key: HUDI-1698 > URL: https://issues.apache.org/jira/browse/HUDI-1698 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1047) Support asynchronize clustering in CoW mode
[ https://issues.apache.org/jira/browse/HUDI-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1047: -- Summary: Support asynchronize clustering in CoW mode (was: Support synchronize clustering in CoW mode) > Support asynchronize clustering in CoW mode > --- > > Key: HUDI-1047 > URL: https://issues.apache.org/jira/browse/HUDI-1047 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode
[ https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1048: -- Summary: Support Asynchronize clustering in MoR mode (was: Support synchronize clustering in MoR mode) > Support Asynchronize clustering in MoR mode > --- > > Key: HUDI-1048 > URL: https://issues.apache.org/jira/browse/HUDI-1048 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: leesf >Assignee: leesf >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1077) Integration tests to validate clustering
[ https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1077: -- Fix Version/s: 0.9.0 > Integration tests to validate clustering > > > Key: HUDI-1077 > URL: https://issues.apache.org/jira/browse/HUDI-1077 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Priority: Blocker > Fix For: 0.9.0 > > > extend test-suite module to validate clustering -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1353) Incremental timeline support for pending clustering operations
[ https://issues.apache.org/jira/browse/HUDI-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1353: -- Priority: Blocker (was: Major) > Incremental timeline support for pending clustering operations > -- > > Key: HUDI-1353 > URL: https://issues.apache.org/jira/browse/HUDI-1353 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Assignee: satish >Priority: Blocker > Labels: pull-request-available > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1468) incremental read support with clustering
[ https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1468: - Assignee: liwei > incremental read support with clustering > > > Key: HUDI-1468 > URL: https://issues.apache.org/jira/browse/HUDI-1468 > Project: Apache Hudi > Issue Type: Sub-task > Components: Incremental Pull >Affects Versions: 0.9.0 >Reporter: satish >Assignee: liwei >Priority: Blocker > Fix For: 0.9.0 > > > As part of clustering, metadata such as hoodie_commit_time changes for > records that are clustered. This is specific to > SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to > carry commit_time from original record to support incremental queries. > Also, incremental queries dont work with 'replacecommit' used by clustering > HUDI-1264. Change incremental query to work for replacecommits created by > Clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1468) incremental read support with clustering
[ https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1468: -- Priority: Blocker (was: Major) > incremental read support with clustering > > > Key: HUDI-1468 > URL: https://issues.apache.org/jira/browse/HUDI-1468 > Project: Apache Hudi > Issue Type: Sub-task > Components: Incremental Pull >Affects Versions: 0.9.0 >Reporter: satish >Priority: Blocker > Fix For: 0.9.0 > > > As part of clustering, metadata such as hoodie_commit_time changes for > records that are clustered. This is specific to > SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to > carry commit_time from original record to support incremental queries. > Also, incremental queries dont work with 'replacecommit' used by clustering > HUDI-1264. Change incremental query to work for replacecommits created by > Clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1077) Integration tests to validate clustering
[ https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1077: -- Priority: Blocker (was: Major) > Integration tests to validate clustering > > > Key: HUDI-1077 > URL: https://issues.apache.org/jira/browse/HUDI-1077 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: satish >Priority: Blocker > > extend test-suite module to validate clustering -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1482) async clustering for spark streaming
[ https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1482: -- Priority: Blocker (was: Major) > async clustering for spark streaming > > > Key: HUDI-1482 > URL: https://issues.apache.org/jira/browse/HUDI-1482 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1483) async clustering for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1483: -- Fix Version/s: 0.9.0 > async clustering for deltastreamer > -- > > Key: HUDI-1483 > URL: https://issues.apache.org/jira/browse/HUDI-1483 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1482) async clustering for spark streaming
[ https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1482: -- Fix Version/s: 0.9.0 > async clustering for spark streaming > > > Key: HUDI-1482 > URL: https://issues.apache.org/jira/browse/HUDI-1482 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: liwei >Assignee: liwei >Priority: Blocker > Fix For: 0.9.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1500) support incremental read clustering commit in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1500: -- Priority: Blocker (was: Major) > support incremental read clustering commit in deltastreamer > > > Key: HUDI-1500 > URL: https://issues.apache.org/jira/browse/HUDI-1500 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer >Reporter: liwei >Assignee: satish >Priority: Blocker > > now in DeltaSync.readFromSource() can not read last instant as replace > commit, such as clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1500) support incremental read clustering commit in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1500: -- Fix Version/s: 0.9.0 > support incremental read clustering commit in deltastreamer > > > Key: HUDI-1500 > URL: https://issues.apache.org/jira/browse/HUDI-1500 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer >Reporter: liwei >Assignee: satish >Priority: Blocker > Fix For: 0.9.0 > > > now in DeltaSync.readFromSource() can not read last instant as replace > commit, such as clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1483) async clustering for deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1483: -- Priority: Blocker (was: Major) > async clustering for deltastreamer > -- > > Key: HUDI-1483 > URL: https://issues.apache.org/jira/browse/HUDI-1483 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: liwei >Assignee: liwei >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1500) support incremental read clustering commit in deltastreamer
[ https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1500: - Assignee: satish > support incremental read clustering commit in deltastreamer > > > Key: HUDI-1500 > URL: https://issues.apache.org/jira/browse/HUDI-1500 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer >Reporter: liwei >Assignee: satish >Priority: Major > > now in DeltaSync.readFromSource() can not read last instant as replace > commit, such as clustering. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.
[ https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal reassigned HUDI-1937: - Assignee: liwei > When clustering fail, generating unfinished replacecommit timeline. > --- > > Key: HUDI-1937 > URL: https://issues.apache.org/jira/browse/HUDI-1937 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Affects Versions: 0.8.0 >Reporter: taylor liao >Assignee: liwei >Priority: Blocker > Fix For: 0.9.0 > > > When clustering fail, generating unfinished replacecommit. > Restart job will generate delta commit. if the commit contain clustering > group file, the task will fail. > "Not allowed to update the clustering file group %s > For pending clustering operations, we are not going to support update for > now." > Need to ensure that the unfinished replacecommit file is deleted, or perform > clustering first, and then generate delta commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.
[ https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1937: -- Parent: HUDI-1042 Issue Type: Sub-task (was: Bug) > When clustering fail, generating unfinished replacecommit timeline. > --- > > Key: HUDI-1937 > URL: https://issues.apache.org/jira/browse/HUDI-1937 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Affects Versions: 0.8.0 >Reporter: taylor liao >Priority: Blocker > Fix For: 0.9.0 > > > When clustering fail, generating unfinished replacecommit. > Restart job will generate delta commit. if the commit contain clustering > group file, the task will fail. > "Not allowed to update the clustering file group %s > For pending clustering operations, we are not going to support update for > now." > Need to ensure that the unfinished replacecommit file is deleted, or perform > clustering first, and then generate delta commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.
[ https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1937: -- Priority: Blocker (was: Critical) > When clustering fail, generating unfinished replacecommit timeline. > --- > > Key: HUDI-1937 > URL: https://issues.apache.org/jira/browse/HUDI-1937 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Affects Versions: 0.8.0 >Reporter: taylor liao >Priority: Blocker > > When clustering fail, generating unfinished replacecommit. > Restart job will generate delta commit. if the commit contain clustering > group file, the task will fail. > "Not allowed to update the clustering file group %s > For pending clustering operations, we are not going to support update for > now." > Need to ensure that the unfinished replacecommit file is deleted, or perform > clustering first, and then generate delta commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.
[ https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362624#comment-17362624 ] Nishith Agarwal commented on HUDI-1937: --- [~satish] [~309637554] Can one of you take a look at this? > When clustering fail, generating unfinished replacecommit timeline. > --- > > Key: HUDI-1937 > URL: https://issues.apache.org/jira/browse/HUDI-1937 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Affects Versions: 0.8.0 >Reporter: taylor liao >Priority: Critical > > When clustering fail, generating unfinished replacecommit. > Restart job will generate delta commit. if the commit contain clustering > group file, the task will fail. > "Not allowed to update the clustering file group %s > For pending clustering operations, we are not going to support update for > now." > Need to ensure that the unfinished replacecommit file is deleted, or perform > clustering first, and then generate delta commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.
[ https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1937: -- Fix Version/s: 0.9.0 > When clustering fail, generating unfinished replacecommit timeline. > --- > > Key: HUDI-1937 > URL: https://issues.apache.org/jira/browse/HUDI-1937 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Affects Versions: 0.8.0 >Reporter: taylor liao >Priority: Blocker > Fix For: 0.9.0 > > > When clustering fail, generating unfinished replacecommit. > Restart job will generate delta commit. if the commit contain clustering > group file, the task will fail. > "Not allowed to update the clustering file group %s > For pending clustering operations, we are not going to support update for > now." > Need to ensure that the unfinished replacecommit file is deleted, or perform > clustering first, and then generate delta commit. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
[ https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362623#comment-17362623 ] Nishith Agarwal commented on HUDI-1309: --- [~vbalaji] Is this something you still see ? > Listing Metadata unreadable in S3 as the log block is deemed corrupted > -- > > Key: HUDI-1309 > URL: https://issues.apache.org/jira/browse/HUDI-1309 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Blocker > > When running metadata list-partitions CLI command, I am seeing the below > messages and the partition list is empty. Was expecting 10K partitions. > > {code:java} > 36589 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning > log file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} > 36590 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block > in file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} with block size(3723305) running past EOF > 36684 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Log > HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} has a corrupted block at 14 > 44515 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block > in > 
HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} starts at 3723319 > 44566 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a > corrupt block in > s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 > 44567 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
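The "block size running past EOF" warnings above come from a sanity check of a block's declared length against the bytes remaining in the file. A minimal sketch of that check (the function and its signature are illustrative, not HoodieLogFileReader's actual code):

```python
def block_runs_past_eof(block_start, declared_size, file_len):
    """A log block is deemed corrupt when its declared size would read
    past the end of the file, as with the 3723305-byte block logged above."""
    return block_start + declared_size > file_len
```

Note that on S3 a stale or zero reported length (the log lines above show `fileLen=0`) makes every healthy block fail this check, which is one plausible reading of this report.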
[jira] [Commented] (HUDI-1537) Move validation of file listings to something that happens before each write
[ https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362622#comment-17362622 ] Nishith Agarwal commented on HUDI-1537: --- [~pwason] Is validation of file listing applicable? > Move validation of file listings to something that happens before each write > > > Key: HUDI-1537 > URL: https://issues.apache.org/jira/browse/HUDI-1537 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > The current way of checking has issues dealing with log files and inflight > files. The code has comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
[ https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1309: -- Priority: Blocker (was: Major) > Listing Metadata unreadable in S3 as the log block is deemed corrupted > -- > > Key: HUDI-1309 > URL: https://issues.apache.org/jira/browse/HUDI-1309 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Blocker > > When running metadata list-partitions CLI command, I am seeing the below > messages and the partition list is empty. Was expecting 10K partitions. > > {code:java} > 36589 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning > log file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} > 36590 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block > in file > HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} with block size(3723305) running past EOF > 36684 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Log > HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} has a corrupted block at 14 > 44515 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block > in > 
HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} starts at 3723319 > 44566 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a > corrupt block in > s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 > 44567 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1649) Bugs with Metadata Table in 0.7 release
[ https://issues.apache.org/jira/browse/HUDI-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362621#comment-17362621 ] Nishith Agarwal commented on HUDI-1649: --- [~pwason] Are you going to open a PR to address all of these issues together next week? > Bugs with Metadata Table in 0.7 release > --- > > Key: HUDI-1649 > URL: https://issues.apache.org/jira/browse/HUDI-1649 > Project: Apache Hudi > Issue Type: Sub-task >Affects Versions: 0.9.0 >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Blocker > Labels: sev:high, user-support-issues > Fix For: 0.9.0 > > > We have discovered the following issues while using the Metadata Table code > in production: > > *Issue 1: Automatic rollbacks during commit get a timestamp which is out of > order* > Suppose commit C1 failed. The next commit will try to roll back C1 > automatically. This will create the following two instants: C2.commit and R3.rollback. Hence, the rollback will have a timestamp greater than the commit which > occurs after it. > This is because of how the code is implemented in > AbstractHoodieWriteClient.startCommitWithTime(), where the timestamp of the > next commit is chosen before the timestamp of the rollback instant. > > *Issue 2: Syncing of rollbacks is not working* > Due to the above issue, syncing of rollbacks in the Metadata Table does not > work correctly. > Assume the timeline is as follows: > Dataset Timeline: C1 C2 C3 > Metadata Timeline: DC1 DC2 (dc=delta-commit) > > Suppose the next commit C4 fails. When C5 is attempted, C4 will be > automatically rolled back. Due to issue #1, the timelines will become as > follows: > Dataset Timeline: C1 C2 C3 C5 R6 > Metadata Timeline: DC1 DC2 > Now if the Metadata Table is synced (AbstractHoodieWriteClient.postCommit), > the code will end up processing C5 first and then R6, which will mean that the > files rolled back in R6 will be committed to the metadata table as deleted > files. 
There is logic within > HoodieTableMetadataUtils.processRollbackMetadata() to ignore R6 in this > scenario, but it does not work because of issue #1. > > *Issue #3: Rollback instants are deleted inline* > The current rollback code deletes older instants inline. The delete logic keeps the > oldest ten instants (hardcoded) and removes all more-recent rollback > instants. Furthermore, the deletion ONLY deletes the rollback.complete file and > does not remove the corresponding rollback.inflight files. > Hence, with many rollbacks the following timeline is possible > Timeline: C1 C2 C3 C4 R5.inflight C5 C6 C7 ... > (there are 9 previous rollback instants to R5). > > *Issue #4: Metadata Table reader does not show correct view of the metadata* > Assume the timeline is as in Issue #3, with a leftover rollback.inflight > instant. Also assume that the metadata table is synced only till C4. The > MetadataTableWriter will not sync any more instants to the Metadata Table > since an incomplete instant is present next. > The same sync logic is also used by the MetadataReader to perform the > in-memory merge of the timeline. Hence, the reader will also not consider C5, C6 > and C7, thereby providing an incorrect and older view of the FileSlices and > FileGroups. > > Any future ingestion into this table MAY insert data into older versions of > the FileSlices, which will end up being data loss when queried. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
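Issue #1 above is easiest to see with the timestamps laid out: Hudi instant times are strings compared lexicographically, so a sync that replays the timeline in sorted order processes C5 before the rollback R6, even though R6 rolls back the earlier, failed C4. A small illustration (timestamps simplified, helper name hypothetical):

```python
def replay_order(instants):
    """Syncing replays (timestamp, action) pairs in timestamp order, so a
    rollback whose timestamp was allocated after the next commit's is
    applied only after that commit has already been synced."""
    return [action for _, action in sorted(instants)]

# C5's timestamp was allocated before R6's (the out-of-order allocation
# described in issue #1), so the rollback of C4 replays after C5.
timeline = [("20210106", "rollback"), ("20210105", "commit")]
```

This is why the guard in HoodieTableMetadataUtils.processRollbackMetadata() cannot see R6 as "older" than C5: by timestamp it is not.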
[jira] [Updated] (HUDI-1537) Move validation of file listings to something that happens before each write
[ https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1537: -- Priority: Blocker (was: Major) > Move validation of file listings to something that happens before each write > > > Key: HUDI-1537 > URL: https://issues.apache.org/jira/browse/HUDI-1537 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > The current way of checking has issues dealing with log files and inflight > files. The code has comments. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1542) Fix Flaky test : TestHoodieMetadata#testSync
[ https://issues.apache.org/jira/browse/HUDI-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1542: -- Priority: Blocker (was: Major) > Fix Flaky test : TestHoodieMetadata#testSync > > > Key: HUDI-1542 > URL: https://issues.apache.org/jira/browse/HUDI-1542 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Vinoth Chandar >Priority: Blocker > Fix For: 0.9.0 > > > Only fails intermittently on CI. > {code} > [INFO] Running org.apache.hudi.metadata.TestHoodieBackedMetadata > SLF4J: Class path contains multiple SLF4J bindings. > SLF4J: Found binding in > [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: Found binding in > [jar:file:/home/travis/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class] > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > explanation. > SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] > [WARN ] 2021-01-20 09:25:31,716 org.apache.spark.util.Utils - Your hostname, > localhost resolves to a loopback address: 127.0.0.1; using 10.30.0.81 instead > (on interface eth0) > [WARN ] 2021-01-20 09:25:31,725 org.apache.spark.util.Utils - Set > SPARK_LOCAL_IP if you need to bind to another address > [WARN ] 2021-01-20 09:25:32,412 org.apache.hadoop.util.NativeCodeLoader - > Unable to load native-hadoop library for your platform... 
using builtin-java > classes where applicable > [WARN ] 2021-01-20 09:25:36,645 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:25:36,700 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:26:30,250 > org.apache.hudi.client.AbstractHoodieWriteClient - Cannot find instant > 20210120092628 in the timeline, for rollback > [WARN ] 2021-01-20 09:26:45,980 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:26:46,568 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:26:46,580 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:27:27,853 > org.apache.hudi.client.AbstractHoodieWriteClient - Cannot find instant > 20210120092726 in the timeline, for rollback > [WARN ] 2021-01-20 09:27:43,037 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:27:46,017 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit3284615140376500245/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:28:05,357 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092805__replacecommit__REQUESTED] > [WARN ] 2021-01-20 09:28:05,887 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092805__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:06,312 org.apache.hudi.common.util.ClusteringUtils > - No 
content found in requested file for instant > [==>20210120092805__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:18,402 > org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system > instance used in previous test-run > [WARN ] 2021-01-20 09:28:22,013 > org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata table was not > found at path /tmp/junit4284626513859445824/dataset/.hoodie/metadata > [WARN ] 2021-01-20 09:28:40,354 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__REQUESTED] > [WARN ] 2021-01-20 09:28:40,780 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__INFLIGHT] > [WARN ] 2021-01-20 09:28:41,162 org.apache.hudi.common.util.ClusteringUtils > - No content found in requested file for instant > [==>20210120092840__replacecommit__INFLIGHT] > =[ 605 seconds still running ]= > [ERROR] 2021-01-20 09:28:50,683 > org.apache.hudi.timeline.service.FileSystemViewHandler - Got runtime > exception servicing request >
[jira] [Resolved] (HUDI-1962) Add a blog/docs for shuffle parallelism
[ https://issues.apache.org/jira/browse/HUDI-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal resolved HUDI-1962. --- Resolution: Fixed > Add a blog/docs for shuffle parallelism > -- > > Key: HUDI-1962 > URL: https://issues.apache.org/jira/browse/HUDI-1962 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1962) Add a blog/docs for shuffle parallelism
[ https://issues.apache.org/jira/browse/HUDI-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362619#comment-17362619 ] Nishith Agarwal commented on HUDI-1962: --- Added a FAQ -> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowtotuneshuffleparallelismofHudijobs? > Add a blog/docs for shuffle parallelism > -- > > Key: HUDI-1962 > URL: https://issues.apache.org/jira/browse/HUDI-1962 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1959) Add links to small file handling and clustering to the config section
[ https://issues.apache.org/jira/browse/HUDI-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362615#comment-17362615 ] Nishith Agarwal commented on HUDI-1959: --- Added a FAQ here -> https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles > Add links to small file handling and clustering to the config section > - > > Key: HUDI-1959 > URL: https://issues.apache.org/jira/browse/HUDI-1959 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > > Users are confused about how to ensure small files are not created during > ingestion > Small file handling isn't very clear to users, and they complain that > ingestion has slowed down > Clustering usage isn't clear, nor is how to use it with deltastreamer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1959) Add links to small file handling and clustering to the config section
[ https://issues.apache.org/jira/browse/HUDI-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal resolved HUDI-1959. --- Fix Version/s: 0.9.0 Resolution: Fixed > Add links to small file handling and clustering to the config section > - > > Key: HUDI-1959 > URL: https://issues.apache.org/jira/browse/HUDI-1959 > Project: Apache Hudi > Issue Type: Sub-task > Components: Docs >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.9.0 > > > Users are confused about how to ensure small files are not created during > ingestion > Small file handling isn't very clear to users, and they complain that > ingestion has slowed down > Clustering usage isn't clear, nor is how to use it with deltastreamer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-1960) Add documentation to be able to disable parquet configs
[ https://issues.apache.org/jira/browse/HUDI-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal resolved HUDI-1960. --- Fix Version/s: 0.9.0 Resolution: Fixed > Add documentation to be able to disable parquet configs > --- > > Key: HUDI-1960 > URL: https://issues.apache.org/jira/browse/HUDI-1960 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Fix For: 0.9.0 > > > https://github.com/apache/hudi/issues/2265 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1960) Add documentation to be able to disable parquet configs
[ https://issues.apache.org/jira/browse/HUDI-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362607#comment-17362607 ] Nishith Agarwal commented on HUDI-1960: --- Added a FAQ here -> https://cwiki.apache.org/confluence/display/HUDI/FAQ > Add documentation to be able to disable parquet configs > --- > > Key: HUDI-1960 > URL: https://issues.apache.org/jira/browse/HUDI-1960 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > > https://github.com/apache/hudi/issues/2265 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x
[ https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362604#comment-17362604 ] Nishith Agarwal commented on HUDI-1975: --- [~vinaypatil18] It looks like even the latest prometheus 0.11.x depends on 3.1.x, see here -> [https://github.com/prometheus/client_java/blob/master/simpleclient_dropwizard/pom.xml#L43] To fix the issue, we may have to try downgrading dropwizard to 3.x. Can you check what the side effects of doing this would be? We can discuss follow-up steps here. > Upgrade java-prometheus-client from 3.1.2 to 4.x > > > Key: HUDI-1975 > URL: https://issues.apache.org/jira/browse/HUDI-1975 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > > > Find more details here -> https://github.com/apache/hudi/issues/2774 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x
[ https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1975: -- Description: Find more details here -> https://github.com/apache/hudi/issues/2774 > Upgrade java-prometheus-client from 3.1.2 to 4.x > > > Key: HUDI-1975 > URL: https://issues.apache.org/jira/browse/HUDI-1975 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > > > Find more details here -> https://github.com/apache/hudi/issues/2774 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Deleted] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset
[ https://issues.apache.org/jira/browse/HUDI-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal deleted HUDI-1945: -- > Support Hudi to read from Kafka Consumer Group Offset > - > > Key: HUDI-1945 > URL: https://issues.apache.org/jira/browse/HUDI-1945 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinay >Assignee: Vinay >Priority: Major > > Currently, Hudi provides options to read from latest or earliest. We should > also provide users an option to read from the consumer group offset. > This change will be in `KafkaOffsetGen`, where we can add a method to support > this functionality -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362600#comment-17362600 ] Nishith Agarwal commented on HUDI-1910: --- [~vinaypatil18] Thanks for sharing your approach. The first level of configs in deltastreamer is meant for generic use-cases that apply to general-purpose ingestion activities. Whenever we want to add a specific use-case, we add configurable classes and then add a parameter, something like this -> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L158]. This has 2 advantages: 1. If you want to dynamically enable or disable such configs without code changes or a deployment, you can, as long as you keep your properties file updated with the configs. 2. It keeps users away from many custom configs (which most users might not care about) by not floating them as top-level configs in deltastreamer (the way you suggested). I think we should consider the first approach. > Supporting Kafka based checkpointing for HoodieDeltaStreamer > > > Key: HUDI-1910 > URL: https://issues.apache.org/jira/browse/HUDI-1910 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Nishith Agarwal >Assignee: Vinay >Priority: Major > Labels: sev:normal, triaged > > HoodieDeltaStreamer currently supports commit-metadata-based checkpoints. Some > users have requested support for Kafka based checkpoints for freshness > auditing purposes. This ticket tracks any implementation for that. -- This message was sent by Atlassian Jira (v8.3.4#803005)
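The pattern described in the comment — a use-case-specific toggle read from the deltastreamer properties file rather than a new top-level flag — can be sketched as below. The property key is hypothetical, named only to illustrate the approach; it is not an actual Hudi config.

```python
def checkpoint_from_kafka_group(props):
    """Hypothetical toggle in the deltastreamer properties file:
    'hoodie.deltastreamer.checkpoint.provider' is an illustrative key,
    not a real Hudi property."""
    return props.get("hoodie.deltastreamer.checkpoint.provider",
                     "commit-metadata") == "kafka-group"
```

Flipping the value in the properties file enables the behavior without a code change or redeployment, which is the first advantage listed in the comment above.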
[jira] [Updated] (HUDI-1909) Skip the commits with empty files for flink streaming reader
[ https://issues.apache.org/jira/browse/HUDI-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1909: -- Description: Log warnings instead of throwing to make the reader more robust. https://github.com/apache/hudi/issues/2950 was:Log warnings instead of throwing to make the reader more robust. > Skip the commits with empty files for flink streaming reader > > > Key: HUDI-1909 > URL: https://issues.apache.org/jira/browse/HUDI-1909 > Project: Apache Hudi > Issue Type: Improvement > Components: Flink Integration >Reporter: Danny Chen >Assignee: Vinay >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > Log warnings instead of throwing to make the reader more robust. > > https://github.com/apache/hudi/issues/2950 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1998) Provide a way to find list of commits through a pythonic API
Nishith Agarwal created HUDI-1998: - Summary: Provide a way to find list of commits through a pythonic API Key: HUDI-1998 URL: https://issues.apache.org/jira/browse/HUDI-1998 Project: Apache Hudi Issue Type: New Feature Components: Writer Core Reporter: Nishith Agarwal TimelineUtils is a Java API through which one can get the latest commit or instantiate HoodieActiveTimeline. Users are looking to do the same through a Python API: https://github.com/apache/hudi/issues/2987 -- This message was sent by Atlassian Jira (v8.3.4#803005)
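Pending a proper Python binding, the request can be approximated by scanning a table's `.hoodie` folder directly, since completed instants are plain files named `<timestamp>.<action>`. A rough sketch under that assumption (the helpers are illustrative, not an official API):

```python
import os

COMPLETED_ACTIONS = {".commit", ".deltacommit", ".replacecommit"}

def completed_instants(filenames):
    """Pick completed instant timestamps out of a .hoodie folder listing;
    .requested/.inflight files and hoodie.properties are skipped."""
    out = []
    for name in filenames:
        stem, ext = os.path.splitext(name)
        if ext in COMPLETED_ACTIONS and "." not in stem:
            out.append(stem)
    return sorted(out)

def latest_commit(table_path):
    """Latest completed instant of a table, roughly what TimelineUtils
    offers on the Java side."""
    names = os.listdir(os.path.join(table_path, ".hoodie"))
    instants = completed_instants(names)
    return instants[-1] if instants else None
```

This only covers listing; anything transactional still needs the Java client.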
[jira] [Created] (HUDI-1997) Fix hoodie.datasource.hive_sync.auto_create_database documentation
Nishith Agarwal created HUDI-1997: - Summary: Fix hoodie.datasource.hive_sync.auto_create_database documentation Key: HUDI-1997 URL: https://issues.apache.org/jira/browse/HUDI-1997 Project: Apache Hudi Issue Type: Bug Components: Docs Reporter: Nishith Agarwal Fix For: 0.9.0 hoodie.datasource.hive_sync.auto_create_database is documented as defaulting to true, but it actually defaults to false in 0.7 & 0.8 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server
[ https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359760#comment-17359760 ] Nishith Agarwal commented on HUDI-1138: --- Okay, thanks for sharing this info. > Re-implement marker files via timeline server > - > > Key: HUDI-1138 > URL: https://issues.apache.org/jira/browse/HUDI-1138 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.9.0 > > > Even if you can argue that RFC-15/consolidated metadata removes the need for > deleting partial files written due to Spark task failures/stage retries, it > will still leave extra files inside the table (and users will pay for them > every month), so we need the marker mechanism to be able to delete these > partial files. > Here we explore whether we can improve the current marker file mechanism, which > creates one marker file per data file written, by > delegating the createMarker() call to the driver/timeline server and having it > write marker metadata into a single file handle that is flushed for > durability guarantees. > > P.S.: I was tempted to think the Spark listener mechanism could help us deal with > failed tasks, but it has no guarantees. The writer job could die without > deleting a partial file, i.e. it can improve things, but can't provide > guarantees -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1827) Add ORC support in Bootstrap Op
[ https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358358#comment-17358358 ] Nishith Agarwal commented on HUDI-1827: --- [~manasaks] Your approach sounds good to me. For marking the baseFileFormat, you can pass the config like this: .option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "") You can add a new config here -> [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala] Once you've added it there, it will be passed down to the above datasource code. > Add ORC support in Bootstrap Op > --- > > Key: HUDI-1827 > URL: https://issues.apache.org/jira/browse/HUDI-1827 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: Teresa Kang >Assignee: manasa >Priority: Major > > SparkBootstrapCommitActionExecutor assumes parquet format right now, need to > support ORC as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-1827) Add ORC support in Bootstrap Op
[ https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358358#comment-17358358 ] Nishith Agarwal edited comment on HUDI-1827 at 6/7/21, 5:03 AM: [~manasaks] Your approach sounds good to me. For marking the baseFileFormat, you can pass the config like this: .option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "") You can add a new config here -> [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala] Once you've added it there, it will be passed down to the above datasource code. was (Author: nishith29): [~manasaks] Your approach sounds good to me. For marking the baseFileFormat, you can pass the config like this: .option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "") You can add a new config here -> [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala] Once you've added it there, it will be passed down to the above datasource code. > Add ORC support in Bootstrap Op > --- > > Key: HUDI-1827 > URL: https://issues.apache.org/jira/browse/HUDI-1827 > Project: Apache Hudi > Issue Type: Sub-task > Components: Storage Management >Reporter: Teresa Kang >Assignee: manasa >Priority: Major > > SparkBootstrapCommitActionExecutor assumes parquet format right now, need to > support ORC as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1977) Fix Hudi-CLI show table spark-sql
Nishith Agarwal created HUDI-1977: - Summary: Fix Hudi-CLI show table spark-sql Key: HUDI-1977 URL: https://issues.apache.org/jira/browse/HUDI-1977 Project: Apache Hudi Issue Type: Task Components: CLI Reporter: Nishith Agarwal https://github.com/apache/hudi/issues/2955 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1976) Upgrade hive, jackson, log4j, hadoop to remove vulnerability
Nishith Agarwal created HUDI-1976: - Summary: Upgrade hive, jackson, log4j, hadoop to remove vulnerability Key: HUDI-1976 URL: https://issues.apache.org/jira/browse/HUDI-1976 Project: Apache Hudi Issue Type: Task Components: Hive Integration Reporter: Nishith Agarwal [https://github.com/apache/hudi/issues/2827] [https://github.com/apache/hudi/issues/2826] [https://github.com/apache/hudi/issues/2824] [https://github.com/apache/hudi/issues/2823] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1976) Upgrade hive, jackson, log4j, hadoop to remove vulnerability
[ https://issues.apache.org/jira/browse/HUDI-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1976: -- Fix Version/s: 0.9.0 Priority: Blocker (was: Major) > Upgrade hive, jackson, log4j, hadoop to remove vulnerability > > > Key: HUDI-1976 > URL: https://issues.apache.org/jira/browse/HUDI-1976 > Project: Apache Hudi > Issue Type: Task > Components: Hive Integration >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > > > [https://github.com/apache/hudi/issues/2827] > [https://github.com/apache/hudi/issues/2826] > [https://github.com/apache/hudi/issues/2824] > [https://github.com/apache/hudi/issues/2823] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1592) Metadata listing fails for non partitioned dataset
[ https://issues.apache.org/jira/browse/HUDI-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1592: -- Fix Version/s: 0.9.0 Priority: Blocker (was: Major) > Metadata listing fails for non partitioned dataset > - > > Key: HUDI-1592 > URL: https://issues.apache.org/jira/browse/HUDI-1592 > Project: Apache Hudi > Issue Type: Bug > Components: Storage Management >Affects Versions: 0.7.0 >Reporter: sivabalan narayanan >Assignee: Prashant Wason >Priority: Blocker > Labels: sev:high, user-support-issues > Fix For: 0.9.0 > > > https://github.com/apache/hudi/issues/2507
[jira] [Updated] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x
[ https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1975: -- Fix Version/s: 0.9.0 Priority: Blocker (was: Major) > Upgrade java-prometheus-client from 3.1.2 to 4.x > > > Key: HUDI-1975 > URL: https://issues.apache.org/jira/browse/HUDI-1975 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > >
[jira] [Created] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x
Nishith Agarwal created HUDI-1975: - Summary: Upgrade java-prometheus-client from 3.1.2 to 4.x Key: HUDI-1975 URL: https://issues.apache.org/jira/browse/HUDI-1975 Project: Apache Hudi Issue Type: Task Reporter: Nishith Agarwal
[jira] [Updated] (HUDI-1974) Run pyspark and validate that it works correctly with all hudi versions
[ https://issues.apache.org/jira/browse/HUDI-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1974: -- Fix Version/s: 0.9.0 > Run pyspark and validate that it works correctly with all hudi versions > --- > > Key: HUDI-1974 > URL: https://issues.apache.org/jira/browse/HUDI-1974 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker > Fix For: 0.9.0 > >
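The validation task in HUDI-1974 could be driven by a small loop like the one below. This is a hypothetical sketch, not a prescribed procedure: it assumes Spark is installed, that the Hudi Spark bundle for each version is resolvable from Maven Central, and that `smoke_test.py` is a placeholder script that performs a tiny Hudi write/read round trip; the version list is illustrative.

```shell
# Hypothetical smoke test for HUDI-1974: launch pyspark with the Hudi Spark
# bundle for each version under test and run a small write/read script.
for v in 0.7.0 0.8.0 0.9.0; do
  pyspark \
    --packages "org.apache.hudi:hudi-spark-bundle_2.12:${v}" \
    --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
    < smoke_test.py   # placeholder script: write a few rows, read them back
done
```

The Kryo serializer conf mirrors the standard Hudi quickstart setup; the loop simply repeats the same pyspark session per bundle version so regressions surface as failures of the same script.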
[jira] [Updated] (HUDI-1974) Run pyspark and validate that it works correctly with all hudi versions
[ https://issues.apache.org/jira/browse/HUDI-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nishith Agarwal updated HUDI-1974: -- Priority: Blocker (was: Major) > Run pyspark and validate that it works correctly with all hudi versions > --- > > Key: HUDI-1974 > URL: https://issues.apache.org/jira/browse/HUDI-1974 > Project: Apache Hudi > Issue Type: Task >Reporter: Nishith Agarwal >Priority: Blocker >