[jira] [Created] (HUDI-3658) Add Hudi Uber Meetup on March 1st

2022-03-18 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-3658:
-

 Summary: Add Hudi Uber Meetup on March 1st
 Key: HUDI-3658
 URL: https://issues.apache.org/jira/browse/HUDI-3658
 Project: Apache Hudi
  Issue Type: Task
Reporter: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3256) Add Links to Hudi Meetup Jan 2022

2022-01-16 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-3256:
-

 Summary: Add Links to Hudi Meetup Jan 2022
 Key: HUDI-3256
 URL: https://issues.apache.org/jira/browse/HUDI-3256
 Project: Apache Hudi
  Issue Type: Task
Reporter: Nishith Agarwal








[jira] [Commented] (HUDI-1576) Add ability to perform archival synchronously

2022-01-14 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476506#comment-17476506
 ] 

Nishith Agarwal commented on HUDI-1576:
---

[~guoyihua] Yes, the idea was to detach archiving from being inline and make it 
async. Even today, though, archiving happens only after the "COMMIT" has 
successfully completed and the file has been created on disk, so introducing a 
new action is not needed. I think archival can simply run asynchronously and 
keep archiving contents without creating any new action, since that would be 
overkill. One side effect is that we still need a way to track the progress and 
activity of archiving on a table; since the .archive folder holds this history, 
that should be fine. That's my opinion.
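The shape of that idea can be illustrated with a minimal sketch. Note this is an assumption-laden illustration, not Hudi's implementation: {{archiveIfRequired}} below is a hypothetical stand-in for the real archival entry point, and the executor wiring is invented for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: run archival asynchronously after a commit completes, with no new
// timeline action. Progress would be reflected only by the .archive folder;
// `archiveIfRequired` is a hypothetical stand-in for the real archival logic.
public class AsyncArchivalSketch {
  private final ExecutorService archiveExecutor = Executors.newSingleThreadExecutor();
  private final AtomicInteger archivedBatches = new AtomicInteger();

  // Called only once the COMMIT has completed and its file exists on disk,
  // matching the ordering described in the comment above.
  public Future<?> onCommitCompleted(String instantTime) {
    return archiveExecutor.submit(() -> archiveIfRequired(instantTime));
  }

  private void archiveIfRequired(String instantTime) {
    // Real code would trim the active timeline and append to .archive here.
    archivedBatches.incrementAndGet();
  }

  public int archivedCount() {
    return archivedBatches.get();
  }

  public static void main(String[] args) throws Exception {
    AsyncArchivalSketch sketch = new AsyncArchivalSketch();
    sketch.onCommitCompleted("20220301120000").get(); // block until async archival finishes
    sketch.archiveExecutor.shutdown();
    System.out.println(sketch.archivedCount());
  }
}
```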

> Add ability to perform archival synchronously
> -
>
> Key: HUDI-1576
> URL: https://issues.apache.org/jira/browse/HUDI-1576
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Currently, archival runs inline. We want to move archival to a table service 
> like cleaning, compaction, etc., and treat it the same way. Of course, no new 
> action will be introduced. 
>  





[jira] [Commented] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer

2021-10-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425912#comment-17425912
 ] 

Nishith Agarwal commented on HUDI-2275:
---

[~dave_hagman] To ensure that the checkpoints from deltastreamer commits are 
carried over while a concurrent datasource Spark job is running, one needs to 
enable the following configuration: 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java#L371]

Can you please check whether you have enabled this config?
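For reference, a minimal sketch of what setting that config could look like. The key name {{hoodie.write.meta.key.prefixes}} and the checkpoint-key prefix used below are assumptions inferred from HoodieWriteConfig; verify both against the Hudi version you are running.

```java
import java.util.Properties;

// Sketch: configuring commit-metadata carry-over for a multi-writer setup,
// so the deltastreamer checkpoint survives commits made by other writers.
// Config key names here are assumptions, not verified against a release.
public class CheckpointCarryOverSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Carry over commit-metadata entries whose key starts with this prefix
    // (assumed to be the deltastreamer checkpoint key) from the latest commit.
    props.setProperty("hoodie.write.meta.key.prefixes", "deltastreamer.checkpoint.key");
    props.setProperty("hoodie.write.concurrency.mode", "optimistic_concurrency_control");
    System.out.println(props.getProperty("hoodie.write.meta.key.prefixes"));
  }
}
```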

 

 

> HoodieDeltaStreamerException when using OCC and a second concurrent writer
> --
>
> Key: HUDI-2275
> URL: https://issues.apache.org/jira/browse/HUDI-2275
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Affects Versions: 0.9.0
>Reporter: Dave Hagman
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
>  I am trying to utilize [Optimistic Concurrency 
> Control|https://hudi.apache.org/docs/concurrency_control] in order to allow 
> two writers to update a single table simultaneously. The two writers are:
>  * Writer A: Deltastreamer job consuming continuously from Kafka
>  * Writer B: A spark datasource-based writer that is consuming parquet files 
> out of S3
>  * Table Type: Copy on Write
>  
> After a few commits from each writer the deltastreamer will fail with the 
> following exception:
>  
> {code:java}
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find 
> previous checkpoint. Please double check if this table was indeed built via 
> delta streamer. Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, 
> Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={
>  "partitionToWriteStats" : {
>  ...{code}
>  
> What appears to be happening is a lack of commit isolation between the two 
> writers: Writer B (spark datasource writer) lands commits that are eventually 
> picked up by Writer A (Delta Streamer). This is an issue because the Delta 
> Streamer needs checkpoint information, which the spark datasource of course 
> does not include in its commits. My understanding was that OCC was built for 
> this very purpose (among others). 
> OCC config for Delta Streamer:
> {code:java}
> hoodie.write.concurrency.mode=optimistic_concurrency_control
>  hoodie.cleaner.policy.failed.writes=LAZY
>  
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
>  hoodie.write.lock.zookeeper.url=
>  hoodie.write.lock.zookeeper.port=2181
>  hoodie.write.lock.zookeeper.lock_key=writer_lock
>  hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code}
>  
> OCC config for spark datasource:
> {code:java}
> // Multi-writer concurrency
>  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
>  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>  .option(
>  "hoodie.write.lock.provider",
>  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName()
>  )
>  .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost)
>  .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort)
>  .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock")
>  .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code}
> h3. Steps to Reproduce:
>  * Start a deltastreamer job against some table Foo
>  * In parallel, start writing to the same table Foo using spark datasource 
> writer
>  * Note that after a few commits from each the deltastreamer is likely to 
> fail with the above exception when the datasource writer creates non-isolated 
> inflight commits
> NOTE: I have not tested this with two of the same datasources (ex. two 
> deltastreamer jobs)
> NOTE 2: Another detail that may be relevant is that the two writers are on 
> completely different spark clusters but I assumed this shouldn't be an issue 
> since we're locking using Zookeeper



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2146) Concurrent writes loss data

2021-07-17 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17382691#comment-17382691
 ] 

Nishith Agarwal commented on HUDI-2146:
---

[~wenningd] I see that a conflict is thrown when both inserts are started 
simultaneously:
insert 1
{code:java}
scala> df3.write.format("org.apache.hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
  .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider")
  .option("hoodie.write.lock.filesystem.path", "/tmp/")
  .option("hoodie.write.lock.hivemetastore.database", "test_db")
  .option("hoodie.write.lock.hivemetastore.table", "hudi_test")
  .option("hoodie.write.lock.hivemetastore.uris", "")
  .mode(SaveMode.Append)
  .save(tablePath)
21/07/18 01:38:55 WARN hudi.DataSourceWriteOptions$: hoodie.datasource.write.storage.type is deprecated and will be removed in a later release; Please use hoodie.datasource.write.table.type
org.apache.hudi.exception.HoodieWriteConflictException: java.util.ConcurrentModificationException: Cannot resolve conflicts for overlapping writes
  at org.apache.hudi.client.transaction.SimpleConcurrentFileWritesConflictResolutionStrategy.resolveConflict(SimpleConcurrentFileWritesConflictResolutionStrategy.java:102)
  at org.apache.hudi.client.utils.TransactionUtils.lambda$resolveWriteConflictIfAny$0(TransactionUtils.java:68)
  at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
  at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
  at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
  at org.apache.hudi.client.utils.TransactionUtils.resolveWriteConflictIfAny(TransactionUtils.java:62)
  at org.apache.hudi.client.SparkRDDWriteClient.preCommit(SparkRDDWriteClient.java:456)
  at org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:183)
  at org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:121)
  at org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:564)
  at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:230)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:163)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  

[jira] [Updated] (HUDI-1824) Spark Datasource V2/V1 (Dataset) integration with ORC

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1824:
--
Summary: Spark Datasource V2/V1 (Dataset) integration with ORC  (was: 
Spark Integration with ORC)

> Spark Datasource V2/V1 (Dataset) integration with ORC
> --
>
> Key: HUDI-1824
> URL: https://issues.apache.org/jira/browse/HUDI-1824
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: Teresa Kang
>Assignee: manasa
>Priority: Major
>
> Implement HoodieInternalRowOrcWriter for spark datasource integration with 
> ORC.





[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-765:
-
Fix Version/s: 0.9.0

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Major
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-764) Implement HoodieOrcWriter

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-764:
-
Status: Closed  (was: Patch Available)

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Critical
>  Labels: pull-request-available
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row





[jira] [Updated] (HUDI-765) Implement OrcReaderIterator

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-765:
-
Status: Closed  (was: Patch Available)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Major
>






[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-764:


Assignee: (was: Teresa Kang)

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row





[jira] [Assigned] (HUDI-764) Implement HoodieOrcWriter

2021-07-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-764:


Assignee: Teresa Kang

> Implement HoodieOrcWriter
> -
>
> Key: HUDI-764
> URL: https://issues.apache.org/jira/browse/HUDI-764
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: Teresa Kang
>Priority: Critical
>  Labels: pull-request-available
>
> Implement HoodieOrcWriter
> * Avro to ORC schema
> * Write record in row





[jira] [Commented] (HUDI-2159) Supporting Clustering and Metadata Table together

2021-07-12 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379423#comment-17379423
 ] 

Nishith Agarwal commented on HUDI-2159:
---

Thanks for the detailed analysis [~pwason]. I think it is definitely worth 
solving (1) for the 0.9.0 release. This is a legitimate situation that can 
surface, especially since users scheduling ingestion at a lower frequency 
increases the chances of such collisions.

For (2), since it is more of a perf degradation in cases of failures, we can 
address it right after 0.9 by landing timeline tailing based on completion 
time. 
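Issue (1) above boils down to two pipelines racing to apply the same un-synced instant to the metadata table. A minimal sketch of the idempotent-sync idea, where the losing pipeline treats "instant already exists" as benign rather than as a failure (the {{Set}} below is a hypothetical stand-in for the metadata table's timeline, not Hudi code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: when two pipelines race to deltacommit the same instant (e.g. T5)
// on the metadata table, the loser should observe that the instant already
// exists and treat it as success, avoiding false-positive pipeline failures.
public class MetadataSyncSketch {
  // Hypothetical stand-in for the metadata table's set of synced instants.
  private final Set<String> appliedInstants = ConcurrentHashMap.newKeySet();

  // Returns true if this call performed the deltacommit, false if another
  // pipeline already synced the same instant (benign, not an error).
  public boolean syncInstant(String instantTime) {
    return appliedInstants.add(instantTime);
  }

  public static void main(String[] args) {
    MetadataSyncSketch sync = new MetadataSyncSketch();
    boolean first = sync.syncInstant("T5");
    boolean second = sync.syncInstant("T5"); // concurrent duplicate attempt
    System.out.println(first + " " + second);
  }
}
```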

> Supporting Clustering and Metadata Table together
> -
>
> Key: HUDI-2159
> URL: https://issues.apache.org/jira/browse/HUDI-2159
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.9.0
>
>
> I am testing clustering support for metadata enabled table and found a few 
> issues.
> *Setup*
> Pipeline 1: Ingestion pipeline with Metadata Table enabled. Runs every 30 
> mins. 
> Pipeline 2: Clustering pipeline with long running jobs (3-4 hours)
> Pipeline 3: Another clustering pipeline with long running jobs (3-4 hours)
>  
> *Issue #1: Parallel commits on Metadata Table*
> Assume the Clustering pipeline is completing T5.replacecommit and the 
> ingestion pipeline is completing T10.commit, and the Metadata Table was last 
> synced at an earlier instant. Now both pipelines will call 
> syncMetadataTable(), which will do the following:
>  # Find all un-synced instants from the dataset (T5, T6 ... T10)
>  # Read each instant and perform a deltacommit on the Metadata Table with the 
> same timestamp as the instant.
> There is a chance that two processes perform a deltacommit at T5 on the 
> metadata table, and one will fail (instant file already exists). This raises 
> an exception that is detected as a pipeline failure, leading to 
> false-positive alerts.
>  
> *Issue #2: No archiving/rollback support for failed clustering operations*
> If a clustering operation fails, it leaves a left-over 
> T5.replacecommit.inflight. There is no automated way to rollback or archive 
> these. Since clustering is a long running operation in general and may be run 
> through multiple pipelines at the same time, automated rollback of left-over 
> inflights doesn't work, as we cannot be sure that the process is dead.
> Metadata Table sync only works in completion order. So if 
> T5.replacecommit.inflight is left over, the Metadata Table will not sync 
> beyond T5, causing a large number of LogBlocks to pile up, which will have to 
> be merged in memory, leading to deteriorating performance.
>  





[jira] [Commented] (HUDI-2146) Concurrent writes loss data

2021-07-08 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377455#comment-17377455
 ] 

Nishith Agarwal commented on HUDI-2146:
---

[~wenningd] Thanks for the detailed description. A few questions:
 # What is `df3` in your insert1 and insert2 workloads? Is it the same 
dataframe? Can you please paste the input for each insert workload?
 # Can you paste the output you expect vs. the output you see, to help 
understand where the data loss is?

> Concurrent writes loss data 
> 
>
> Key: HUDI-2146
> URL: https://issues.apache.org/jira/browse/HUDI-2146
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Blocker
> Fix For: 0.9.0
>
> Attachments: image-2021-07-08-00-49-30-730.png
>
>
> Reproduction steps:
> Create a Hudi table:
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> import org.apache.hudi.AvroConversionUtils
> val df = Seq(
>   (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
>   (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_test"
> var tablePath = "s3://.../" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> {code}
> Perform two insert operations almost in the same time, each insertion 
> contains different data:
> Insert 1:
> {code:java}
> val df3 = Seq(
>   (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"),
>   (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"),
>   (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"),
>   (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>.option("hoodie.cleaner.policy.failed.writes", "LAZY")
>.option("hoodie.write.lock.provider", 
> "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider")
>.option("hoodie.write.lock.zookeeper.url", "ip-***.ec2.internal")
>.option("hoodie.write.lock.zookeeper.port", "2181")
>.option("hoodie.write.lock.zookeeper.lock_key", tableName)
>.option("hoodie.write.lock.zookeeper.base_path", "/occ_lock")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Insert 2:
> {code:java}
> val df3 = Seq(
>   (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"),
>   (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"),
>   (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"),
>   (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", 

[jira] [Updated] (HUDI-2146) Concurrent writes loss data

2021-07-08 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-2146:
--
Fix Version/s: 0.9.0

> Concurrent writes loss data 
> 
>
> Key: HUDI-2146
> URL: https://issues.apache.org/jira/browse/HUDI-2146
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Blocker
> Fix For: 0.9.0
>
> Attachments: image-2021-07-08-00-49-30-730.png
>
>

[jira] [Updated] (HUDI-2146) Concurrent writes loss data

2021-07-08 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-2146:
--
Priority: Blocker  (was: Major)

> Concurrent writes loss data 
> 
>
> Key: HUDI-2146
> URL: https://issues.apache.org/jira/browse/HUDI-2146
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Blocker
> Attachments: image-2021-07-08-00-49-30-730.png
>
>
[jira] [Created] (HUDI-2091) Add Uber's grafana dashboard to OSS

2021-06-28 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-2091:
-

 Summary: Add Uber's grafana dashboard to OSS
 Key: HUDI-2091
 URL: https://issues.apache.org/jira/browse/HUDI-2091
 Project: Apache Hudi
  Issue Type: New Feature
  Components: metrics
Reporter: Nishith Agarwal
Assignee: Prashant Wason


cc [~vinoth]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1537) Move validation of file listings to something that happens before each write

2021-06-22 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367807#comment-17367807
 ] 

Nishith Agarwal commented on HUDI-1537:
---

This logic is being removed. Additionally, falling back to file listing has 
been removed -> https://github.com/apache/hudi/pull/3079

> Move validation of file listings to something that happens before each write
> 
>
> Key: HUDI-1537
> URL: https://issues.apache.org/jira/browse/HUDI-1537
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Current way of checking, has issues dealing with log files and inflight 
> files. Code has comments. 





[jira] [Commented] (HUDI-1542) Fix Flaky test : TestHoodieMetadata#testSync

2021-06-22 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367808#comment-17367808
 ] 

Nishith Agarwal commented on HUDI-1542:
---

[~pwason] Will take this up next week.

> Fix Flaky test : TestHoodieMetadata#testSync
> 
>
> Key: HUDI-1542
> URL: https://issues.apache.org/jira/browse/HUDI-1542
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Only fails intermittently on CI.
> {code}
> [INFO] Running org.apache.hudi.metadata.TestHoodieBackedMetadata
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [WARN ] 2021-01-20 09:25:31,716 org.apache.spark.util.Utils  - Your hostname, 
> localhost resolves to a loopback address: 127.0.0.1; using 10.30.0.81 instead 
> (on interface eth0)
> [WARN ] 2021-01-20 09:25:31,725 org.apache.spark.util.Utils  - Set 
> SPARK_LOCAL_IP if you need to bind to another address
> [WARN ] 2021-01-20 09:25:32,412 org.apache.hadoop.util.NativeCodeLoader  - 
> Unable to load native-hadoop library for your platform... using builtin-java 
> classes where applicable
> [WARN ] 2021-01-20 09:25:36,645 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:25:36,700 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:30,250 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092628 in the timeline, for rollback
> [WARN ] 2021-01-20 09:26:45,980 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:26:46,568 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:46,580 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:27:27,853 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092726 in the timeline, for rollback
> [WARN ] 2021-01-20 09:27:43,037 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:27:46,017 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3284615140376500245/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:05,357 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:05,887 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:06,312 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:18,402 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:28:22,013 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit4284626513859445824/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:40,354 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:40,780 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:41,162 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> =[ 605 seconds still running ]=
> [ERROR] 2021-01-20 09:28:50,683 
> org.apache.hudi.timeline.service.FileSystemViewHandler  - Got runtime 
> exception 

[jira] [Commented] (HUDI-1492) Handle DeltaWriteStat correctly for storage schemes that support appends

2021-06-22 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367806#comment-17367806
 ] 

Nishith Agarwal commented on HUDI-1492:
---

Confirmed with [~pwason] that this does not affect correctness of the metadata 
table. 
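One defensive option, if the duplicate entries ever do turn out to matter, is to collapse the paths reported by the write stats into a set before updating the metadata table, so a delta write that merely appended to an existing log file contributes a single entry. A minimal stand-alone sketch with hypothetical names (not Hudi's actual code):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class MetadataPathDedup {
    // Collapse write-stat paths into unique entries; an append that re-reports
    // the same log file must not produce a duplicate metadata record.
    static List<String> uniquePaths(List<String> writeStatPaths) {
        return List.copyOf(new LinkedHashSet<>(writeStatPaths));
    }

    public static void main(String[] args) {
        // Two delta writes appended to the same log file in one partition.
        List<String> stats = Arrays.asList(
            "2021/06/22/.f1_20210622.log.1",
            "2021/06/22/.f1_20210622.log.1",   // same file, second append
            "2021/06/22/f2_1-0-1_20210622.parquet");
        System.out.println(uniquePaths(stats));
    }
}
```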

> Handle DeltaWriteStat correctly for storage schemes that support appends
> 
>
> Key: HUDI-1492
> URL: https://issues.apache.org/jira/browse/HUDI-1492
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Current implementation simply uses the
> {code:java}
> String pathWithPartition = hoodieWriteStat.getPath(); {code}
> to write the metadata table. this is problematic, if the delta write was 
> merely an append. and can technically add duplicate files into the metadata 
> table 
> (not sure if this is a problem per se. but filing a Jira to track and either 
> close/fix ) 
>  





[jira] [Assigned] (HUDI-1077) Integration tests to validate clustering

2021-06-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1077:
-

Assignee: satish

> Integration tests to validate clustering
> 
>
> Key: HUDI-1077
> URL: https://issues.apache.org/jira/browse/HUDI-1077
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> extend test-suite module to validate clustering





[jira] [Updated] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path

2021-06-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1839:
--
Priority: Blocker  (was: Major)

> FSUtils getAllPartitions broken by NotSerializableException: 
> org.apache.hadoop.fs.Path
> --
>
> Key: HUDI-1839
> URL: https://issues.apache.org/jira/browse/HUDI-1839
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: satish
>Priority: Blocker
>
> FSUtils getAllPartitionPaths is expected to work whether the metadata table is 
> enabled or not. It can also be called inside a Spark context. But it looks like 
> attempts to improve parallelism are causing NotSerializableExceptions. There 
> are multiple callers using it within the Spark context (clustering/cleaner).
> See stack trace below
> 21/04/20 17:28:44 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with FAILED (diag message: User class threw exception: 
> org.apache.hudi.exception.HoodieException: Error fetching partition paths 
> from metadata table
>  at org.apache.hudi.common.fs.FSUtils.getAllPartitionPaths(FSUtils.java:321)
>  at 
> org.apache.hudi.table.action.cluster.strategy.PartitionAwareClusteringPlanStrategy.generateClusteringPlan(PartitionAwareClusteringPlanStrategy.java:67)
>  at 
> org.apache.hudi.table.action.cluster.SparkClusteringPlanActionExecutor.createClusteringPlan(SparkClusteringPlanActionExecutor.java:71)
>  at 
> org.apache.hudi.table.action.cluster.BaseClusteringPlanActionExecutor.execute(BaseClusteringPlanActionExecutor.java:56)
>  at 
> org.apache.hudi.table.HoodieSparkCopyOnWriteTable.scheduleClustering(HoodieSparkCopyOnWriteTable.java:160)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClusteringAtInstant(AbstractHoodieWriteClient.java:873)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.scheduleClustering(AbstractHoodieWriteClient.java:861)
>  at 
> com.uber.data.efficiency.hudi.HudiRewriter.rewriteDataUsingHudi(HudiRewriter.java:111)
>  at com.uber.data.efficiency.hudi.HudiRewriter.main(HudiRewriter.java:50)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
>  Caused by: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Failed to serialize task 53, not attempting to retry it. Exception 
> during serialization: java.io.NotSerializableException: 
> org.apache.hadoop.fs.Path
>  Serialization stack:
>  - object not serializable (class: org.apache.hadoop.fs.Path, value: 
> hdfs://...)
>  - element of array (index: 0)
>  - array (class [Ljava.lang.Object;, size 1)
>  - field (class: scala.collection.mutable.WrappedArray$ofRef, name: array, 
> type: class [Ljava.lang.Object;)
>  - object (class scala.collection.mutable.WrappedArray$ofRef, 
> WrappedArray(hdfs://...))
>  - writeObject data (class: org.apache.spark.rdd.ParallelCollectionPartition)
>  - object (class org.apache.spark.rdd.ParallelCollectionPartition, 
> org.apache.spark.rdd.ParallelCollectionPartition@735)
>  - field (class: org.apache.spark.scheduler.ResultTask, name: partition, 
> type: interface org.apache.spark.Partition)
>  - object (class org.apache.spark.scheduler.ResultTask, ResultTask(1, 0))
>  at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1904)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1892)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1891)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>  at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1891)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:935)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:935)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2125)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2074)
>  at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2063)
>  at 

[jira] [Updated] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path

2021-06-22 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1839:
--
Fix Version/s: 0.9.0

> FSUtils getAllPartitions broken by NotSerializableException: 
> org.apache.hadoop.fs.Path
> --
>
> Key: HUDI-1839
> URL: https://issues.apache.org/jira/browse/HUDI-1839
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> FSUtils getAllPartitionPaths is expected to work if metadata table is enabled 
> or not. It can also be called inside spark context. But looks like we are 
> trying to improve parallelism and causing NotSerializableExceptions. There 
> are multiple callers using it within spark context (clustering/cleaner).

[jira] [Commented] (HUDI-1839) FSUtils getAllPartitions broken by NotSerializableException: org.apache.hadoop.fs.Path

2021-06-22 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367647#comment-17367647
 ] 

Nishith Agarwal commented on HUDI-1839:
---

[~pwason] Is this something we have identified the root cause for?

cc [~uditme] [~satishkotha]
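For reference, the failure mode is plain Java serialization, not Hudi-specific: org.apache.hadoop.fs.Path does not implement Serializable, so it cannot be shipped inside a Spark task closure or parallelized directly. The usual workaround is to ship the path as a String and rebuild the Path on the executors. A minimal stand-alone sketch (FakePath is a hypothetical stand-in for Hadoop's Path, so the snippet runs without Hadoop on the classpath):

```java
import java.io.*;

public class PathSerializationDemo {
    // Stand-in for org.apache.hadoop.fs.Path (hypothetical): not Serializable.
    static class FakePath {
        final String uri;
        FakePath(String uri) { this.uri = uri; }
    }

    // Try to Java-serialize an object, the same mechanism Spark uses for tasks.
    static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FakePath p = new FakePath("hdfs://namenode/table/2021/04/20");
        System.out.println(isSerializable(p));      // the Path-like object fails
        System.out.println(isSerializable(p.uri));  // the String form serializes fine
    }
}
```

Applied to getAllPartitionPaths, that would presumably mean parallelizing partition-path strings and constructing Path objects only inside the executor-side lambda.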

> FSUtils getAllPartitions broken by NotSerializableException: 
> org.apache.hadoop.fs.Path
> --
>
> Key: HUDI-1839
> URL: https://issues.apache.org/jira/browse/HUDI-1839
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> FSUtils getAllPartitionPaths is expected to work if metadata table is enabled 
> or not. It can also be called inside spark context. But looks like we are 
> trying to improve parallelism and causing NotSerializableExceptions. There 
> are multiple callers using it within spark context (clustering/cleaner).

[jira] [Updated] (HUDI-1047) Support asynchronize clustering in CoW mode

2021-06-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1047:
--
Fix Version/s: 0.9.0

> Support asynchronize clustering in CoW mode
> ---
>
> Key: HUDI-1047
> URL: https://issues.apache.org/jira/browse/HUDI-1047
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode

2021-06-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1048:
--
Fix Version/s: 0.9.0

> Support Asynchronize clustering in MoR mode
> ---
>
> Key: HUDI-1048
> URL: https://issues.apache.org/jira/browse/HUDI-1048
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode

2021-06-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1048:
--
Priority: Blocker  (was: Major)

> Support Asynchronize clustering in MoR mode
> ---
>
> Key: HUDI-1048
> URL: https://issues.apache.org/jira/browse/HUDI-1048
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
>






[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-06-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1706:
--
Priority: Blocker  (was: Major)

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Blocker
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  





[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-06-15 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1706:
--
Fix Version/s: 0.9.0

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  





[jira] [Created] (HUDI-2026) Add documentation for GlobalDeleteKeyGenerator

2021-06-15 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-2026:
-

 Summary: Add documentation for GlobalDeleteKeyGenerator
 Key: HUDI-2026
 URL: https://issues.apache.org/jira/browse/HUDI-2026
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Nishith Agarwal
Assignee: sivabalan narayanan


[https://github.com/apache/hudi/issues/3008]

 
{code:java}
 - should hard delete records from hudi table with hive sync *** FAILED *** (24 
seconds, 49 milliseconds)
Cause: java.lang.NoSuchMethodException: 
org.apache.hudi.keygen.GlobalDeleteKeyGenerator.()
[scalatest]   at java.lang.Class.getConstructor0(Class.java:3110)
[scalatest]   at java.lang.Class.newInstance(Class.java:412)
[scalatest]   at 
org.apache.hudi.hive.HoodieHiveClient.(HoodieHiveClient.java:98)
[scalatest]   at org.apache.hudi.hive.HiveSyncTool.(HiveSyncTool.java:69)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$.org$apache$hudi$HoodieSparkSqlWriter$$syncHive(HoodieSparkSqlWriter.scala:391)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:440)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$metaSync$2.apply(HoodieSparkSqlWriter.scala:436)
[scalatest]   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$.metaSync(HoodieSparkSqlWriter.scala:436)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:497)
[scalatest]   at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:222)
[scalatest]   at 
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
[scalatest]   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[scalatest]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[scalatest]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[scalatest]   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[scalatest]   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
[scalatest]   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[scalatest]   at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
[scalatest]   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[scalatest]   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[scalatest]   at org.apach
{code}
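The NoSuchMethodException above comes from reflective instantiation: the hive-sync path looks up a no-argument constructor while GlobalDeleteKeyGenerator only declares a parameterized one. The failure and the fix pattern can be reproduced stand-alone (PropsOnlyKeyGen is a hypothetical stand-in for such a key generator):

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

public class CtorLookupDemo {
    // Hypothetical key generator that only declares a props-based constructor.
    static class PropsOnlyKeyGen {
        PropsOnlyKeyGen(Properties props) { }
    }

    public static void main(String[] args) throws Exception {
        try {
            // No-arg lookup, as in the quoted stack trace: throws because the
            // class declares no public no-argument constructor.
            PropsOnlyKeyGen.class.getConstructor();
        } catch (NoSuchMethodException e) {
            System.out.println("no-arg constructor missing");
        }
        // The fix pattern: look up the constructor the class actually declares.
        Constructor<PropsOnlyKeyGen> c =
            PropsOnlyKeyGen.class.getDeclaredConstructor(Properties.class);
        System.out.println(c.getParameterCount());
    }
}
```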





[jira] [Commented] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-06-15 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363911#comment-17363911
 ] 

Nishith Agarwal commented on HUDI-1975:
---

[~vinaypatil18] I think there are 2 options: 
 # Shade dropwizard inside Hudi so that Hudi can use 4.1.x 
 # Downgrade to 3.1.x and make changes for the workaround 

To be able to answer this, can you dig into whether shading will help? (Does the 
prometheus package bring its own dropwizard, or is the environment expected to 
provide it?) Secondly, can you dig up when the 4.x upgrade was done and what the 
reason for it was.

 

We can take a call then
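On option 1, shading means relocating the dropwizard (com.codahale.metrics) packages inside the Hudi bundle jar, so the prometheus client can bring whichever dropwizard it needs without clashing with the environment's copy. A hypothetical maven-shade-plugin sketch (the shaded package name is illustrative, not Hudi's actual build config):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <!-- Rewrite dropwizard classes (and references to them) into a
               Hudi-private namespace inside the bundle jar. -->
          <relocation>
            <pattern>com.codahale.metrics</pattern>
            <shadedPattern>org.apache.hudi.com.codahale.metrics</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Note this only helps if the prometheus client links against the relocated copy; if it expects the environment to provide dropwizard, relocation alone will not resolve the conflict.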

> Upgrade java-prometheus-client from 3.1.2 to 4.x
> 
>
> Key: HUDI-1975
> URL: https://issues.apache.org/jira/browse/HUDI-1975
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Find more details here -> https://github.com/apache/hudi/issues/2774





[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2021-06-14 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-2003:
--
Summary: Auto Compute Compression ratio for input data to output 
parquet/orc file size  (was: Auto Compute Compression)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinay
>Priority: Major
>
> Context: 
> Submitted a Spark job that read 3-4B ORC records and wrote them out in Hudi 
> format. The following table lists all the runs carried out with different 
> options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it appears that the assumed compression ratio is off. 
>  
>  
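The numbers in the table are enough to estimate the effective compression ratio the writer should be assuming. A minimal sketch of that arithmetic (plain Python, illustrative only; it simply treats the configured max-bytes setting as a target and the observed file size as the outcome):

```python
# Observed runs from the table above: configured max file size (target)
# mapped to the actual file size seen on disk.
GB = 1024 ** 3
MB = 1024 ** 2
runs = {1 * GB: 178 * MB, 4 * GB: 675 * MB, 6 * GB: 1012 * MB}

def effective_ratio(target_bytes, actual_bytes):
    """Fraction of the configured byte budget that actually lands on disk."""
    return actual_bytes / target_bytes

for target, actual in runs.items():
    print(f"target={target // GB}GB actual={actual // MB}MB "
          f"ratio={effective_ratio(target, actual):.3f}")

# Back out the setting needed to hit a desired on-disk file size using the
# average observed ratio instead of a hard-coded compression assumption.
avg_ratio = sum(effective_ratio(t, a) for t, a in runs.items()) / len(runs)
desired = 1 * GB
suggested_max_bytes = desired / avg_ratio
print(f"to get ~1GB files, set max bytes to ~{suggested_max_bytes / GB:.1f}GB")
```

The observed ratio is stable at roughly 0.17 across the runs, which is consistent with the 6GB setting in the table producing ~1GB files; auto-computing this ratio is exactly what the ticket asks for.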





[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2021-06-14 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-2003:
--
Issue Type: Improvement  (was: Bug)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinay
>Priority: Major
>
> Context:
> Submitted a Spark job that read 3-4B ORC records and wrote them out in Hudi 
> format. The table below summarizes the runs I carried out with different 
> options.
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it appears that the assumed compression ratio is off. 
>  
>  





[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer

2021-06-14 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363067#comment-17363067
 ] 

Nishith Agarwal commented on HUDI-1910:
---

[~vinaypatil18] Yes, that makes sense, please go ahead.

> Supporting Kafka based checkpointing for HoodieDeltaStreamer
> 
>
> Key: HUDI-1910
> URL: https://issues.apache.org/jira/browse/HUDI-1910
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Nishith Agarwal
>Assignee: Vinay
>Priority: Major
>  Labels: sev:normal, triaged
>
> HoodieDeltaStreamer currently supports commit metadata based checkpoint. Some 
> users have requested support for Kafka based checkpoints for freshness 
> auditing purposes. This ticket tracks any implementation for that. 





[jira] [Comment Edited] (HUDI-2005) Audit and remove references of fs.listStatus()

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662
 ] 

Nishith Agarwal edited comment on HUDI-2005 at 6/13/21, 10:54 PM:
--

1. 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping

2. 
org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants

3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles

4. org.apache.hudi.table.MarkerFiles#deleteMarkerDir

 


was (Author: nishith29):
1. 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping

2. 
org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants

3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles

> Audit and remove references of fs.listStatus()
> --
>
> Key: HUDI-2005
> URL: https://issues.apache.org/jira/browse/HUDI-2005
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>






[jira] [Comment Edited] (HUDI-2005) Audit and remove references of fs.listStatus()

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662
 ] 

Nishith Agarwal edited comment on HUDI-2005 at 6/13/21, 10:49 PM:
--

1. 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping

2. 
org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants

3. org.apache.hudi.utilities.sources.helpers.DFSPathSelector#listEligibleFiles


was (Author: nishith29):
1. 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping

2. 
org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants

> Audit and remove references of fs.listStatus()
> --
>
> Key: HUDI-2005
> URL: https://issues.apache.org/jira/browse/HUDI-2005
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>






[jira] [Assigned] (HUDI-2005) Audit and remove references of fs.listStatus()

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-2005:
-

Assignee: Prashant Wason  (was: Nishith Agarwal)

> Audit and remove references of fs.listStatus()
> --
>
> Key: HUDI-2005
> URL: https://issues.apache.org/jira/browse/HUDI-2005
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>






[jira] [Commented] (HUDI-2005) Audit and remove references of fs.listStatus()

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362662#comment-17362662
 ] 

Nishith Agarwal commented on HUDI-2005:
---

1. 
org.apache.hudi.metadata.HoodieBackedTableMetadataWriter#getPartitionsToFilesMapping

2. 
org.apache.hudi.client.heartbeat.HoodieHeartbeatClient#getAllExistingHeartbeatInstants

> Audit and remove references of fs.listStatus()
> --
>
> Key: HUDI-2005
> URL: https://issues.apache.org/jira/browse/HUDI-2005
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>






[jira] [Created] (HUDI-2005) Audit and remove references of fs.listStatus()

2021-06-13 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-2005:
-

 Summary: Audit and remove references of fs.listStatus()
 Key: HUDI-2005
 URL: https://issues.apache.org/jira/browse/HUDI-2005
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal








[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking (only HDFS atomic renames)

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1457:
--
Summary: Add multi writing to Hudi tables using DFS based locking (only 
HDFS atomic renames)  (was: Add multi writing to Hudi tables using DFS based 
locking)

> Add multi writing to Hudi tables using DFS based locking (only HDFS atomic 
> renames)
> ---
>
> Key: HUDI-1457
> URL: https://issues.apache.org/jira/browse/HUDI-1457
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>






[jira] [Updated] (HUDI-1457) Add multi writing to Hudi tables using DFS based locking

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1457:
--
Summary: Add multi writing to Hudi tables using DFS based locking  (was: 
Add parallel writing to Hudi tables using DFS based locking)

> Add multi writing to Hudi tables using DFS based locking
> 
>
> Key: HUDI-1457
> URL: https://issues.apache.org/jira/browse/HUDI-1457
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>






[jira] [Resolved] (HUDI-1679) Add example to docker for optimistic lock use

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal resolved HUDI-1679.
---
Fix Version/s: 0.8.0
   Resolution: Fixed

> Add example to docker for optimistic lock use
> -
>
> Key: HUDI-1679
> URL: https://issues.apache.org/jira/browse/HUDI-1679
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Updated] (HUDI-1623) Support start_commit_time & end_commit_times for serializable incremental pull

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1623:
--
Fix Version/s: 0.10.0

> Support start_commit_time & end_commit_times for serializable incremental pull
> --
>
> Key: HUDI-1623
> URL: https://issues.apache.org/jira/browse/HUDI-1623
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.10.0
>
>






[jira] [Updated] (HUDI-1575) Early detection by periodically checking last written commit

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1575:
--
Summary: Early detection by periodically checking last written commit  
(was: Early detection, last written commit, also check if there are more 
commits, try to do resolution, and abort. )

> Early detection by periodically checking last written commit
> 
>
> Key: HUDI-1575
> URL: https://issues.apache.org/jira/browse/HUDI-1575
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>






[jira] [Updated] (HUDI-1575) Early detection by periodically checking last written commit

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1575:
--
Description: Periodically check whether newer commits have landed, attempt 
conflict resolution, and abort the currently running job early so it does not 
keep consuming resources once a conflicting commit has already completed in 
the meantime.

> Early detection by periodically checking last written commit
> 
>
> Key: HUDI-1575
> URL: https://issues.apache.org/jira/browse/HUDI-1575
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> Periodically check whether newer commits have landed, attempt conflict 
> resolution, and abort the currently running job early so it does not keep 
> consuming resources once a conflicting commit has already completed in the 
> meantime.
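The early-detection idea described above can be sketched in a few lines. This is illustrative Python only, not Hudi's API; the timeline is modeled as a plain list of completed instant timestamps, and `fetch_completed_instants` / `abort` are hypothetical hooks:

```python
def has_newer_commit(completed_instants, our_start_instant):
    """True if another writer completed a commit after we started writing."""
    return any(ts > our_start_instant for ts in completed_instants)

def early_abort_check(fetch_completed_instants, our_start_instant, abort):
    """Periodic early-detection hook a long-running writer could invoke.

    fetch_completed_instants: hypothetical callable returning the completed
    instant timestamps currently on the timeline.
    abort: callable invoked to stop the job before more resources are spent.
    Returns True when the job was aborted.
    """
    if has_newer_commit(fetch_completed_instants(), our_start_instant):
        abort("found a commit newer than ours on the timeline; aborting early")
        return True
    return False

# Usage: a writer that started at instant "20210613120000" polls the timeline.
messages = []
aborted = early_abort_check(
    lambda: ["20210613110000", "20210613123000"],  # second one landed later
    "20210613120000",
    messages.append,
)
```

Instant timestamps compare correctly as strings because they are fixed-width; the real implementation would also attempt resolution before aborting.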





[jira] [Commented] (HUDI-944) Support more complete concurrency control when writing data

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362627#comment-17362627
 ] 

Nishith Agarwal commented on HUDI-944:
--

Concurrent writing to Hudi tables is now supported, so I'm closing this issue. 
[~309637554] Let me know if there is something missing from your requirements 
so we can open a specific ticket under the concurrent-writing umbrella ticket 
as a follow-up. 

> Support more complete  concurrency control when writing data
> 
>
> Key: HUDI-944
> URL: https://issues.apache.org/jira/browse/HUDI-944
> Project: Apache Hudi
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.9.0
>
>
> Today Hudi only supports concurrency control between writing and compaction, 
> but some scenarios need multi-writer concurrency control, e.g. two Spark jobs 
> with different data sources that need to write to the same Hudi table.
> I have a two-step proposal:
> 1. First step: support write concurrency control across different partitions. 
> Today, when two clients write data to different partitions, they hit these 
> errors:
> a. Rolling back commits fails
> b. Instant version already exists
> {code:java}
>  [2020-05-25 21:20:34,732] INFO Checking for file exists 
> ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight 
> (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
>  Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  {code}
> c. The two clients' archiving conflicts
> d. The read client hits "Unable to infer schema for Parquet. It must be 
> specified manually."
> 2. Second step: support insert/upsert/compaction concurrency control at 
> different isolation levels such as Serializable and WriteSerializable. Hudi 
> can design a mechanism to check for conflicts in 
> AbstractHoodieWriteClient.commit()
>  
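The conflict check proposed for AbstractHoodieWriteClient.commit() essentially compares the file groups touched by overlapping writers. A minimal sketch of that optimistic check (illustrative Python, not Hudi's actual implementation; the metadata shape is a made-up `{partition: [file_group_ids]}` map):

```python
def files_touched(commit_metadata):
    """Flatten a {partition: [file_group_ids]} map into a set of pairs."""
    return {(part, fg) for part, fgs in commit_metadata.items() for fg in fgs}

def check_conflict(our_commit, concurrent_commits):
    """Optimistic check: the commit succeeds iff it touches no file group
    that a concurrently-completed commit also touched."""
    ours = files_touched(our_commit)
    for other in concurrent_commits:
        overlap = ours & files_touched(other)
        if overlap:
            raise RuntimeError(f"conflicting file groups: {sorted(overlap)}")

# Step 1 of the proposal: two writers on different partitions pass the check.
check_conflict({"2020/05/25": ["fg-1"]}, [{"2020/05/26": ["fg-2"]}])
```

Under this scheme the partition-level case (step 1) falls out for free, and record-level isolation (step 2) would refine what counts as an overlap.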





[jira] [Resolved] (HUDI-944) Support more complete concurrency control when writing data

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal resolved HUDI-944.
--
Fix Version/s: 0.8.0
   Resolution: Fixed

> Support more complete  concurrency control when writing data
> 
>
> Key: HUDI-944
> URL: https://issues.apache.org/jira/browse/HUDI-944
> Project: Apache Hudi
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.9.0, 0.8.0
>
>
> Today Hudi only supports concurrency control between writing and compaction, 
> but some scenarios need multi-writer concurrency control, e.g. two Spark jobs 
> with different data sources that need to write to the same Hudi table.
> I have a two-step proposal:
> 1. First step: support write concurrency control across different partitions. 
> Today, when two clients write data to different partitions, they hit these 
> errors:
> a. Rolling back commits fails
> b. Instant version already exists
> {code:java}
>  [2020-05-25 21:20:34,732] INFO Checking for file exists 
> ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight 
> (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
>  Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  {code}
> c. The two clients' archiving conflicts
> d. The read client hits "Unable to infer schema for Parquet. It must be 
> specified manually."
> 2. Second step: support insert/upsert/compaction concurrency control at 
> different isolation levels such as Serializable and WriteSerializable. Hudi 
> can design a mechanism to check for conflicts in 
> AbstractHoodieWriteClient.commit()
>  





[jira] [Updated] (HUDI-1577) Document that multi-writer cannot be used within the same write client

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1577:
--
Fix Version/s: 0.9.0

> Document that multi-writer cannot be used within the same write client
> --
>
> Key: HUDI-1577
> URL: https://issues.apache.org/jira/browse/HUDI-1577
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1577) Document that multi-writer cannot be used within the same write client

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1577:
--
Priority: Blocker  (was: Major)

> Document that multi-writer cannot be used within the same write client
> --
>
> Key: HUDI-1577
> URL: https://issues.apache.org/jira/browse/HUDI-1577
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Blocker
>






[jira] [Updated] (HUDI-1706) Test flakiness w/ multiwriter test

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1706:
--
Parent: HUDI-1456
Issue Type: Sub-task  (was: Bug)

> Test flakiness w/ multiwriter test
> --
>
> Key: HUDI-1706
> URL: https://issues.apache.org/jira/browse/HUDI-1706
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: sivabalan narayanan
>Assignee: Nishith Agarwal
>Priority: Major
>
> [https://api.travis-ci.com/v3/job/492130170/log.txt]
>  
>  
>  
>  





[jira] [Updated] (HUDI-1456) [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1456:
--
Summary: [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables  (was: 
[UMBRELLA] Concurrent Writing to Hudi tables)

> [UMBRELLA] Concurrent Writing (multiwriter) to Hudi tables
> --
>
> Key: HUDI-1456
> URL: https://issues.apache.org/jira/browse/HUDI-1456
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.9.0
>
> Attachments: image-2020-12-14-09-48-46-946.png
>
>
> This ticket tracks all the changes needed to support concurrency control for 
> Hudi tables. This work will be done in multiple phases. 
>  # Parallel writing to Hudi tables -> This feature will allow users to have 
> multiple writers mutate the tables, without the ability to perform concurrent 
> updates to the same file. 
>  # Concurrency control at file/record level -> This feature will allow users 
> to have multiple writers mutate the tables with the ability to ensure 
> serializability at the record level.





[jira] [Updated] (HUDI-1698) Multiwriting for Flink / Java

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1698:
--
Parent: HUDI-1456
Issue Type: Sub-task  (was: Improvement)

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>






[jira] [Updated] (HUDI-1047) Support asynchronize clustering in CoW mode

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1047:
--
Summary: Support asynchronize clustering in CoW mode  (was: Support 
synchronize clustering in CoW mode)

> Support asynchronize clustering in CoW mode
> ---
>
> Key: HUDI-1047
> URL: https://issues.apache.org/jira/browse/HUDI-1047
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>






[jira] [Updated] (HUDI-1048) Support Asynchronize clustering in MoR mode

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1048:
--
Summary: Support Asynchronize clustering in MoR mode  (was: Support 
synchronize clustering in MoR mode)

> Support Asynchronize clustering in MoR mode
> ---
>
> Key: HUDI-1048
> URL: https://issues.apache.org/jira/browse/HUDI-1048
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>






[jira] [Updated] (HUDI-1077) Integration tests to validate clustering

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1077:
--
Fix Version/s: 0.9.0

> Integration tests to validate clustering
> 
>
> Key: HUDI-1077
> URL: https://issues.apache.org/jira/browse/HUDI-1077
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> extend test-suite module to validate clustering





[jira] [Updated] (HUDI-1353) Incremental timeline support for pending clustering operations

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1353:
--
Priority: Blocker  (was: Major)

> Incremental timeline support for pending clustering operations
> --
>
> Key: HUDI-1353
> URL: https://issues.apache.org/jira/browse/HUDI-1353
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>






[jira] [Assigned] (HUDI-1468) incremental read support with clustering

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1468:
-

Assignee: liwei

> incremental read support with clustering
> 
>
> Key: HUDI-1468
> URL: https://issues.apache.org/jira/browse/HUDI-1468
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Affects Versions: 0.9.0
>Reporter: satish
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.9.0
>
>
> As part of clustering, metadata such as hoodie_commit_time changes for 
> records that are clustered. This is specific to the 
> SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to 
> carry the commit_time from the original record to support incremental queries.
> Also, incremental queries don't work with the 'replacecommit' action used by 
> clustering (HUDI-1264). Change incremental queries to work for replacecommits 
> created by clustering.
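The commit-time carry-over described above can be illustrated with a toy rewrite step. This is illustrative Python with records as plain dicts, not Hudi's record model; only the field name `_hoodie_commit_time` matches Hudi's metadata column:

```python
def rewrite_for_clustering(records, replace_instant, preserve_commit_time=True):
    """Rewrite records into a new file group; optionally keep the original
    _hoodie_commit_time so incremental pulls still see the right instant."""
    out = []
    for rec in records:
        new_rec = dict(rec)
        if not preserve_commit_time:
            # Stamping the replacecommit's instant makes old data look new.
            new_rec["_hoodie_commit_time"] = replace_instant
        out.append(new_rec)
    return out

records = [{"_hoodie_commit_time": "001", "key": "a"}]
kept = rewrite_for_clustering(records, "005")
lost = rewrite_for_clustering(records, "005", preserve_commit_time=False)
# An incremental pull for "commits after 001" correctly skips `kept`,
# but would wrongly re-emit `lost` as if it were fresh data.
```

This is the behavior gap the ticket describes: the bulk-insert-based clustering strategy currently takes the `lost` path.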





[jira] [Updated] (HUDI-1468) incremental read support with clustering

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1468:
--
Priority: Blocker  (was: Major)

> incremental read support with clustering
> 
>
> Key: HUDI-1468
> URL: https://issues.apache.org/jira/browse/HUDI-1468
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Incremental Pull
>Affects Versions: 0.9.0
>Reporter: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> As part of clustering, metadata such as hoodie_commit_time changes for 
> records that are clustered. This is specific to the 
> SparkBulkInsertBasedRunClusteringStrategy implementation. Figure out a way to 
> carry the commit_time from the original record to support incremental queries.
> Also, incremental queries don't work with the 'replacecommit' action used by 
> clustering (HUDI-1264). Change incremental queries to work for replacecommits 
> created by clustering.





[jira] [Updated] (HUDI-1077) Integration tests to validate clustering

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1077:
--
Priority: Blocker  (was: Major)

> Integration tests to validate clustering
> 
>
> Key: HUDI-1077
> URL: https://issues.apache.org/jira/browse/HUDI-1077
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Priority: Blocker
>
> extend test-suite module to validate clustering





[jira] [Updated] (HUDI-1482) async clustering for spark streaming

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1482:
--
Priority: Blocker  (was: Major)

> async clustering for spark streaming
> 
>
> Key: HUDI-1482
> URL: https://issues.apache.org/jira/browse/HUDI-1482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>






[jira] [Updated] (HUDI-1483) async clustering for deltastreamer

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1483:
--
Fix Version/s: 0.9.0

> async clustering for deltastreamer
> --
>
> Key: HUDI-1483
> URL: https://issues.apache.org/jira/browse/HUDI-1483
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1482) async clustering for spark streaming

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1482:
--
Fix Version/s: 0.9.0

> async clustering for spark streaming
> 
>
> Key: HUDI-1482
> URL: https://issues.apache.org/jira/browse/HUDI-1482
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1500) support incremental read clustering commit in deltastreamer

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1500:
--
Priority: Blocker  (was: Major)

> support incremental read clustering  commit in deltastreamer
> 
>
> Key: HUDI-1500
> URL: https://issues.apache.org/jira/browse/HUDI-1500
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: liwei
>Assignee: satish
>Priority: Blocker
>
> Currently, DeltaSync.readFromSource() cannot read the last instant when it is 
> a replacecommit, such as one created by clustering. 





[jira] [Updated] (HUDI-1500) support incremental read clustering commit in deltastreamer

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1500:
--
Fix Version/s: 0.9.0

> support incremental read clustering  commit in deltastreamer
> 
>
> Key: HUDI-1500
> URL: https://issues.apache.org/jira/browse/HUDI-1500
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: liwei
>Assignee: satish
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Currently, DeltaSync.readFromSource() cannot read the last instant when it is 
> a replacecommit, such as one created by clustering. 





[jira] [Updated] (HUDI-1483) async clustering for deltastreamer

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1483:
--
Priority: Blocker  (was: Major)

> async clustering for deltastreamer
> --
>
> Key: HUDI-1483
> URL: https://issues.apache.org/jira/browse/HUDI-1483
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>






[jira] [Assigned] (HUDI-1500) support incremental read clustering commit in deltastreamer

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1500:
-

Assignee: satish

> support incremental read clustering  commit in deltastreamer
> 
>
> Key: HUDI-1500
> URL: https://issues.apache.org/jira/browse/HUDI-1500
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: liwei
>Assignee: satish
>Priority: Major
>
> Currently, DeltaSync.readFromSource() cannot read the last instant when it is 
> a replacecommit, such as one created by clustering. 





[jira] [Assigned] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1937:
-

Assignee: liwei

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Assignee: liwei
>Priority: Blocker
> Fix For: 0.9.0
>
>
> When clustering fails, an unfinished replacecommit is left on the timeline.
>  Restarting the job will generate a delta commit; if that commit touches a 
> clustering file group, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1937:
--
Parent: HUDI-1042
Issue Type: Sub-task  (was: Bug)

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Priority: Blocker
> Fix For: 0.9.0
>
>
> When clustering fails, an unfinished replacecommit is left on the timeline.
>  Restarting the job will generate a delta commit; if that commit touches a 
> clustering file group, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1937:
--
Priority: Blocker  (was: Critical)

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Priority: Blocker
>
> When clustering fails, an unfinished replacecommit is left on the timeline.
>  Restarting the job will generate a delta commit; if that commit touches a 
> clustering file group, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362624#comment-17362624
 ] 

Nishith Agarwal commented on HUDI-1937:
---

[~satish] [~309637554] Can one of you take a look at this?

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Priority: Critical
>
> When clustering fails, an unfinished replacecommit is left on the timeline.
>  Restarting the job will generate a delta commit; if that commit touches a 
> clustering file group, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
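The failure mode above hinges on a pending replacecommit left on the timeline. As a rough illustration (not Hudi code): completed timeline actions live as files named `<instant>.<action>` under `.hoodie`, while pending ones carry `.requested`/`.inflight` suffixes, so a pre-flight check for stuck clustering could be sketched as:

```python
import os

def pending_replacecommits(hoodie_dir):
    """Return instant times that have a requested/inflight replacecommit
    but no completed replacecommit file -- i.e. failed or in-progress
    clustering that would block conflicting delta commits."""
    files = os.listdir(hoodie_dir)
    # Completed replacecommits end in exactly ".replacecommit".
    completed = {f.split(".")[0] for f in files if f.endswith(".replacecommit")}
    # Pending ones carry the extra .requested / .inflight suffix.
    pending = {f.split(".")[0] for f in files
               if f.endswith((".replacecommit.requested", ".replacecommit.inflight"))}
    return sorted(pending - completed)
```

A restarted writer could run such a check and either clean up the pending replacecommit or finish clustering before issuing delta commits, matching the two remedies the report suggests.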


[jira] [Updated] (HUDI-1937) When clustering fail, generating unfinished replacecommit timeline.

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1937:
--
Fix Version/s: 0.9.0

> When clustering fail, generating unfinished replacecommit timeline.
> ---
>
> Key: HUDI-1937
> URL: https://issues.apache.org/jira/browse/HUDI-1937
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.8.0
>Reporter: taylor liao
>Priority: Blocker
> Fix For: 0.9.0
>
>
> When clustering fails, an unfinished replacecommit is left on the timeline.
>  Restarting the job will generate a delta commit; if that commit touches a 
> clustering file group, the task will fail with:
>  "Not allowed to update the clustering file group %s
>  For pending clustering operations, we are not going to support update for 
> now."
>  We need to ensure that the unfinished replacecommit file is deleted, or perform 
> clustering first and then generate the delta commit.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362623#comment-17362623
 ] 

Nishith Agarwal commented on HUDI-1309:
---

[~vbalaji] Is this something you still see?

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Blocker
>
> When running the metadata list-partitions CLI command, I am seeing the below 
> messages and the partition list is empty. I was expecting 10K partitions.
>  
> {code:java}
>  36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
>  36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
>  36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
>  44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
>  44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
>  44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
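The "block size running past EOF" check in the log above boils down to comparing a block's length prefix against the bytes remaining in the file. A minimal sketch of that corruption test; the 8-byte big-endian size prefix is an assumption for illustration, not Hudi's exact log-block layout:

```python
import struct

def block_runs_past_eof(data, offset):
    """Flag a block as corrupt when its length prefix claims more bytes
    than remain in the file -- the condition the HoodieLogFileReader
    warning above reports as 'running past EOF'."""
    if offset + 8 > len(data):
        return True  # not even room for the size prefix
    (block_size,) = struct.unpack(">q", data[offset:offset + 8])
    return offset + 8 + block_size > len(data)
```

Note the reported `fileLen=0` alongside a 3.7 MB block size in the log: a stale file-length (common with S3 listings) makes every block appear to run past EOF, which is consistent with the reader deeming the metadata log corrupted.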


[jira] [Commented] (HUDI-1537) Move validation of file listings to something that happens before each write

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362622#comment-17362622
 ] 

Nishith Agarwal commented on HUDI-1537:
---

[~pwason] Is validation of file listings applicable?

> Move validation of file listings to something that happens before each write
> 
>
> Key: HUDI-1537
> URL: https://issues.apache.org/jira/browse/HUDI-1537
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Current way of checking, has issues dealing with log files and inflight 
> files. Code has comments. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1309:
--
Priority: Blocker  (was: Major)

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Blocker
>
> When running the metadata list-partitions CLI command, I am seeing the below 
> messages and the partition list is empty. I was expecting 10K partitions.
>  
> {code:java}
>  36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
>  36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
>  36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
>  44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
>  44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
>  44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1649) Bugs with Metadata Table in 0.7 release

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362621#comment-17362621
 ] 

Nishith Agarwal commented on HUDI-1649:
---

[~pwason] Are you going to open a PR to address all of these issues together 
next week?

> Bugs with Metadata Table in 0.7 release
> ---
>
> Key: HUDI-1649
> URL: https://issues.apache.org/jira/browse/HUDI-1649
> Project: Apache Hudi
>  Issue Type: Sub-task
>Affects Versions: 0.9.0
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> We have discovered the following issues while using the Metadata Table code 
> in production:
>  
> *Issue 1: Automatic rollbacks during commit get a timestamp which is out of 
> order*
> Suppose commit C1 failed. The next commit will try to roll back C1 
> automatically. This will create the following two instants, C2.commit and 
> R3.rollback. Hence, the rollback will have a timestamp > the commit which 
> occurs after it. 
> This is because of how the code is implemented in 
> AbstractHoodieWriteClient.startCommitWithTime() where the timestamp of the 
> next commit is chosen before the timestamp of the rollback instant.
>  
> *Issue 2: Syncing of rollbacks is not working*
> Due to the above HUDI issue, syncing of rollbacks in Metadata Table does not 
> work correctly. 
> Assume the timeline as follows: 
> Dataset Timeline: C1  C2. C3
> Metadata Timeline: DC1 DC2.  (dc=delta-commit)
>  
> Suppose the next commit C4 fails. When C5 is attempted, C4 will be 
> automatically rolled back. Due to issue #1, the timelines will become as 
> follows:
> Dataset Timeline: C1  C2. C3.  C5  R6 
> Metadata Timeline: DC1 DC2 
> Now if the Metadata Table is synced (AbstractHoodieWriteClient.postCommit), 
> the code will end up processing C5 first and then R6 which will mean that the 
> file rolled back in R6 will be committed to the metadata table as deleted 
> files. There is logic within 
> HoodieTableMetadataUtils.processRollbackMetadata() to ignore R6 in this 
> scenario but it does not work because of the issue #1.
>   
> *Issue #3: Rollback instants are deleted inline*
> The current rollback code deletes older instants inline. The delete logic keeps 
> oldest ten instants (hardcoded) and removes all more-recent rollback 
> instants. Furthermore, the deletion ONLY deletes the rollback.complete and 
> does not remove the corresponding rollback.inflight files. 
> Hence, with many rollbacks the following timeline is possible
> Timeline: C1. C2 C3 C4. R5.inflight C5 C6 C7 ...
> (there are 9 previous rollback instants to R5).
>  
> *Issue #4: Metadata Table reader does not show correct view of the metadata*
> Assume the timeline is as in Issue #3 with a leftover rollback.inflight 
> instant. Also assume that the metadata table is synced only till C4. The 
> MetadataTableWriter will not sync any more instants to the Metadata Table 
> since an incomplete instant is present next.
> The same sync logic is also used by the MetadataReader to perform the 
> in-memory merge of timeline. Hence, the reader will also not consider C5, C6 
> and C7 thereby providing an incorrect and older view of the FileSlices and 
> FileGroups. 
>  
> Any future ingestion into this table MAY insert data into older versions of 
> the FileSlices, which will surface as data loss when queried.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
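Issues #1 and #2 above can be made concrete with a toy timeline: because the next commit's timestamp is chosen before the rollback's, sorting instants by time makes the sync process C5 before R6, even though R6 undoes an older failed commit. A small sketch, with purely illustrative instant times (not real Hudi timestamps):

```python
def sync_order(instants):
    """Metadata-table sync applies completed instants in timestamp order.
    Because startCommitWithTime picks the new commit's time (C5) before the
    rollback's (R6), the sync sees the newer commit first and only then the
    rollback of an older commit -- the ordering that trips up
    processRollbackMetadata()."""
    return [action for _, action in sorted(instants)]

dataset_timeline = [
    ("001", "commit"),    # C1
    ("002", "commit"),    # C2
    ("003", "commit"),    # C3
    # C4 ("004") failed and left no completed instant
    ("005", "commit"),    # C5, timestamp chosen before the rollback's
    ("006", "rollback"),  # R6, rolls back C4 yet sorts after C5
]
```

Here `sync_order(dataset_timeline)` places the rollback last, so a sync that replays instants in that order would first record C5's files and then process a rollback whose target predates C5, exactly the sequence Issue #2 describes.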


[jira] [Updated] (HUDI-1537) Move validation of file listings to something that happens before each write

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1537:
--
Priority: Blocker  (was: Major)

> Move validation of file listings to something that happens before each write
> 
>
> Key: HUDI-1537
> URL: https://issues.apache.org/jira/browse/HUDI-1537
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Current way of checking, has issues dealing with log files and inflight 
> files. Code has comments. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1542) Fix Flaky test : TestHoodieMetadata#testSync

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1542:
--
Priority: Blocker  (was: Major)

> Fix Flaky test : TestHoodieMetadata#testSync
> 
>
> Key: HUDI-1542
> URL: https://issues.apache.org/jira/browse/HUDI-1542
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Only fails intermittently on CI.
> {code}
> [INFO] Running org.apache.hudi.metadata.TestHoodieBackedMetadata
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/travis/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.6.2/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> [WARN ] 2021-01-20 09:25:31,716 org.apache.spark.util.Utils  - Your hostname, 
> localhost resolves to a loopback address: 127.0.0.1; using 10.30.0.81 instead 
> (on interface eth0)
> [WARN ] 2021-01-20 09:25:31,725 org.apache.spark.util.Utils  - Set 
> SPARK_LOCAL_IP if you need to bind to another address
> [WARN ] 2021-01-20 09:25:32,412 org.apache.hadoop.util.NativeCodeLoader  - 
> Unable to load native-hadoop library for your platform... using builtin-java 
> classes where applicable
> [WARN ] 2021-01-20 09:25:36,645 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:25:36,700 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit6813339032540265368/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:30,250 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092628 in the timeline, for rollback
> [WARN ] 2021-01-20 09:26:45,980 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:26:46,568 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:26:46,580 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit5544286531112563801/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:27:27,853 
> org.apache.hudi.client.AbstractHoodieWriteClient  - Cannot find instant 
> 20210120092726 in the timeline, for rollback
> [WARN ] 2021-01-20 09:27:43,037 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:27:46,017 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit3284615140376500245/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:05,357 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:05,887 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:06,312 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092805__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:18,402 
> org.apache.hudi.testutils.HoodieClientTestHarness  - Closing file-system 
> instance used in previous test-run
> [WARN ] 2021-01-20 09:28:22,013 
> org.apache.hudi.metadata.HoodieBackedTableMetadata  - Metadata table was not 
> found at path /tmp/junit4284626513859445824/dataset/.hoodie/metadata
> [WARN ] 2021-01-20 09:28:40,354 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__REQUESTED]
> [WARN ] 2021-01-20 09:28:40,780 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> [WARN ] 2021-01-20 09:28:41,162 org.apache.hudi.common.util.ClusteringUtils  
> - No content found in requested file for instant 
> [==>20210120092840__replacecommit__INFLIGHT]
> =[ 605 seconds still running ]=
> [ERROR] 2021-01-20 09:28:50,683 
> org.apache.hudi.timeline.service.FileSystemViewHandler  - Got runtime 
> exception servicing request 
> 

[jira] [Resolved] (HUDI-1962) Add a blog/docs for shuffle paralelism

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal resolved HUDI-1962.
---
Resolution: Fixed

> Add a blog/docs for shuffle paralelism
> --
>
> Key: HUDI-1962
> URL: https://issues.apache.org/jira/browse/HUDI-1962
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1962) Add a blog/docs for shuffle paralelism

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362619#comment-17362619
 ] 

Nishith Agarwal commented on HUDI-1962:
---

Added a FAQ -> 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowtotuneshuffleparallelismofHudijobs?

> Add a blog/docs for shuffle paralelism
> --
>
> Key: HUDI-1962
> URL: https://issues.apache.org/jira/browse/HUDI-1962
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1959) Add links to small file handling and clustering to the config section

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362615#comment-17362615
 ] 

Nishith Agarwal commented on HUDI-1959:
---

Added a FAQ here -> 
https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoItoavoidcreatingtonsofsmallfiles

> Add links to small file handling and clustering to the config section
> -
>
> Key: HUDI-1959
> URL: https://issues.apache.org/jira/browse/HUDI-1959
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> Users are confused about how to ensure small files are not created during 
> ingestion.
> Small file handling isn't very clear to users, and they complain that 
> ingestion has slowed down.
> Clustering usage isn't clear, nor is how to use it with deltastreamer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1959) Add links to small file handling and clustering to the config section

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal resolved HUDI-1959.
---
Fix Version/s: 0.9.0
   Resolution: Fixed

> Add links to small file handling and clustering to the config section
> -
>
> Key: HUDI-1959
> URL: https://issues.apache.org/jira/browse/HUDI-1959
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.9.0
>
>
> Users are confused about how to ensure small files are not created during 
> ingestion.
> Small file handling isn't very clear to users, and they complain that 
> ingestion has slowed down.
> Clustering usage isn't clear, nor is how to use it with deltastreamer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1960) Add documentation to be able to disable parquet configs

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal resolved HUDI-1960.
---
Fix Version/s: 0.9.0
   Resolution: Fixed

> Add documentation to be able to disable parquet configs
> ---
>
> Key: HUDI-1960
> URL: https://issues.apache.org/jira/browse/HUDI-1960
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.9.0
>
>
> https://github.com/apache/hudi/issues/2265



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1960) Add documentation to be able to disable parquet configs

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362607#comment-17362607
 ] 

Nishith Agarwal commented on HUDI-1960:
---

Added a FAQ here -> https://cwiki.apache.org/confluence/display/HUDI/FAQ

> Add documentation to be able to disable parquet configs
> ---
>
> Key: HUDI-1960
> URL: https://issues.apache.org/jira/browse/HUDI-1960
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> https://github.com/apache/hudi/issues/2265



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362604#comment-17362604
 ] 

Nishith Agarwal commented on HUDI-1975:
---

[~vinaypatil18] It looks like even the latest prometheus 0.11.x depends on 
3.1.x, see here -> 
[https://github.com/prometheus/client_java/blob/master/simpleclient_dropwizard/pom.xml#L43]

 

To fix the issue, we may have to try downgrading dropwizard to 3.x. Can you 
check what the side effects of doing this would be? We can discuss follow-up steps here

> Upgrade java-prometheus-client from 3.1.2 to 4.x
> 
>
> Key: HUDI-1975
> URL: https://issues.apache.org/jira/browse/HUDI-1975
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Find more details here -> https://github.com/apache/hudi/issues/2774



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1975:
--
Description: Find more details here -> 
https://github.com/apache/hudi/issues/2774

> Upgrade java-prometheus-client from 3.1.2 to 4.x
> 
>
> Key: HUDI-1975
> URL: https://issues.apache.org/jira/browse/HUDI-1975
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Find more details here -> https://github.com/apache/hudi/issues/2774



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Deleted] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset

2021-06-13 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal deleted HUDI-1945:
--


> Support Hudi to read from Kafka Consumer Group Offset
> -
>
> Key: HUDI-1945
> URL: https://issues.apache.org/jira/browse/HUDI-1945
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Vinay
>Assignee: Vinay
>Priority: Major
>
> Currently, Hudi provides options to read from the latest or earliest offset. We 
> should also give users an option to read from the consumer group offset.
> This change will be in `KafkaOffsetGen` where we can add a method to support 
> this functionality



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer

2021-06-13 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17362600#comment-17362600
 ] 

Nishith Agarwal commented on HUDI-1910:
---

[~vinaypatil18] Thanks for sharing your approach. The first-level configs in 
deltastreamer are meant for generic use-cases that apply to general-purpose 
ingestion activities. Whenever we want to add a specific use-case, we add 
configurable classes and then add a parameter, something like this -> 
[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L158.]
 

 

This has two advantages:
 # If you want to dynamically enable or disable such configs without code 
changes or redeployment, you can do so by keeping your properties file 
updated with the configs. 
 # It keeps users away from many custom configs (which most users might not 
care about) by not floating them as top-level configs in deltastreamer (the 
way you suggested). 

I think we should consider the first approach.

> Supporting Kafka based checkpointing for HoodieDeltaStreamer
> 
>
> Key: HUDI-1910
> URL: https://issues.apache.org/jira/browse/HUDI-1910
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Nishith Agarwal
>Assignee: Vinay
>Priority: Major
>  Labels: sev:normal, triaged
>
> HoodieDeltaStreamer currently supports commit metadata based checkpoint. Some 
> users have requested support for Kafka based checkpoints for freshness 
> auditing purposes. This ticket tracks any implementation for that. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
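The pattern suggested in the comment, gating optional behaviour behind a property read from the user's properties file rather than a new top-level flag, can be sketched as follows; the property key shown is hypothetical, not an actual Hudi config:

```python
def get_bool_config(props, key, default=False):
    """Read an optional boolean from a properties dict, mirroring the
    KafkaOffsetGen-style pattern: behaviour toggles live in the properties
    file, so they can change without code changes or redeployment and
    without adding new top-level deltastreamer flags."""
    val = props.get(key)
    if val is None:
        return default
    return str(val).strip().lower() in ("true", "1", "yes")

# Hypothetical key for illustration only:
props = {"hoodie.deltastreamer.source.kafka.commit.offsets": "true"}
use_kafka_checkpoint = get_bool_config(
    props, "hoodie.deltastreamer.source.kafka.commit.offsets")
```

Users who never set the key get the default behaviour; users who need Kafka-based checkpointing flip one property.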


[jira] [Updated] (HUDI-1909) Skip the commits with empty files for flink streaming reader

2021-06-11 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1909:
--
Description: 
Log warnings instead of throwing to make the reader more robust.

 

https://github.com/apache/hudi/issues/2950

  was:Log warnings instead of throwing to make the reader more robust.


> Skip the commits with empty files for flink streaming reader
> 
>
> Key: HUDI-1909
> URL: https://issues.apache.org/jira/browse/HUDI-1909
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Vinay
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Log warnings instead of throwing to make the reader more robust.
>  
> https://github.com/apache/hudi/issues/2950



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1998) Provide a way to find list of commits through a pythonic API

2021-06-11 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1998:
-

 Summary: Provide a way to find list of commits through a pythonic 
API 
 Key: HUDI-1998
 URL: https://issues.apache.org/jira/browse/HUDI-1998
 Project: Apache Hudi
  Issue Type: New Feature
  Components: Writer Core
Reporter: Nishith Agarwal


TimelineUtils is a Java API through which one can get the latest commit or 
instantiate HoodieActiveTimeline. Users are looking to perform the same through 
a Python API.

 

https://github.com/apache/hudi/issues/2987
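Until such an API exists, a minimal Python sketch of the request is to recover completed commit times by listing instant files under the table's `.hoodie` directory. The file-naming pattern and helper below are assumptions for illustration, not an official Hudi API.

```python
# Sketch: Hudi keeps one instant file per action under <table>/.hoodie,
# e.g. "20210611093000.commit". Completed commits can be recovered by
# matching completed-action suffixes; in-flight files are ignored.
import os
import re

def list_completed_commits(hoodie_dir):
    """Return sorted commit times for completed write actions."""
    pattern = re.compile(r"^(\d+)\.(commit|deltacommit|replacecommit)$")
    commits = []
    for name in os.listdir(hoodie_dir):
        m = pattern.match(name)
        if m:
            commits.append(m.group(1))
    return sorted(commits)
```

The most recent element of the returned list plays the role of TimelineUtils' latest commit.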





[jira] [Created] (HUDI-1997) Fix hoodie.datasource.hive_sync.auto_create_database documentation

2021-06-11 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1997:
-

 Summary: Fix hoodie.datasource.hive_sync.auto_create_database 
documentation 
 Key: HUDI-1997
 URL: https://issues.apache.org/jira/browse/HUDI-1997
 Project: Apache Hudi
  Issue Type: Bug
  Components: Docs
Reporter: Nishith Agarwal
 Fix For: 0.9.0


hoodie.datasource.hive_sync.auto_create_database is documented as defaulting to 
true, but it actually defaults to false in 0.7 and 0.8.





[jira] [Commented] (HUDI-1138) Re-implement marker files via timeline server

2021-06-08 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359760#comment-17359760
 ] 

Nishith Agarwal commented on HUDI-1138:
---

Okay, thanks for sharing this info. 

> Re-implement marker files via timeline server
> -
>
> Key: HUDI-1138
> URL: https://issues.apache.org/jira/browse/HUDI-1138
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Even if you can argue that RFC-15/consolidated metadata removes the need for 
> deleting partial files written due to Spark task failures/stage retries, it 
> will still leave extra files inside the table (and users will pay for them 
> every month), so we need the marker mechanism to be able to delete these 
> partial files. 
> Here we explore whether we can improve the current marker file mechanism, 
> which creates one marker file per data file written, by delegating the 
> createMarker() call to the driver/timeline server and having it write marker 
> metadata into a single file handle that is flushed for durability guarantees.
>  
> P.S: I was tempted to think the Spark listener mechanism can help us deal with 
> failed tasks, but it has no guarantees: the writer job could die without 
> deleting a partial file, i.e. it can improve things, but can't provide 
> guarantees. 
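The single-file-handle idea above can be sketched as an append-only marker log kept by the timeline server. The `MarkerLog` class and file layout below are hypothetical illustrations of the proposal, not Hudi's implementation.

```python
# Sketch: one append-only log of marker entries instead of one marker file
# per data file. Each create_marker() call appends a line and fsyncs, giving
# the durability guarantee the ticket mentions.
import os

class MarkerLog:
    def __init__(self, path):
        self.path = path
        self._fh = open(path, "a")

    def create_marker(self, data_file):
        # One line per marker instead of one filesystem object per marker.
        self._fh.write(data_file + "\n")
        self._fh.flush()
        os.fsync(self._fh.fileno())

    def pending_markers(self):
        """Read back all markers, e.g. to find partial files to delete."""
        with open(self.path) as fh:
            return [line.strip() for line in fh if line.strip()]

    def close(self):
        self._fh.close()
```

Consolidating markers this way trades many small-object creations (expensive on cloud storage) for appends to a single handle owned by the driver/timeline server.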





[jira] [Commented] (HUDI-1827) Add ORC support in Bootstrap Op

2021-06-06 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358358#comment-17358358
 ] 

Nishith Agarwal commented on HUDI-1827:
---

[~manasaks] Your approach sounds good to me. To mark the baseFileFormat, you 
can pass the config like this: 

 

.option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "")

 

You can add a new config here -> 
[https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala]

 

Once you've added it there, it will be passed down to the above datasource 
code. 

> Add ORC support in Bootstrap Op
> ---
>
> Key: HUDI-1827
> URL: https://issues.apache.org/jira/browse/HUDI-1827
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: Teresa Kang
>Assignee: manasa
>Priority: Major
>
> SparkBootstrapCommitActionExecutor assumes parquet format right now, need to 
> support ORC as well.





[jira] [Comment Edited] (HUDI-1827) Add ORC support in Bootstrap Op

2021-06-06 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358358#comment-17358358
 ] 

Nishith Agarwal edited comment on HUDI-1827 at 6/7/21, 5:03 AM:


[~manasaks] Your approach sounds good to me. To mark the baseFileFormat, you 
can pass the config like this: 

.option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "") 

You can add a new config here -> 
[https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala]

Once you've added it there, it will be passed down to the above datasource 
code. 


was (Author: nishith29):
[~manasaks] Your approach sounds good to me. To mark the baseFileFormat, you 
can pass the config like this 

 

.option(HoodieBootstrapConfig.HOODIE_BASE_FILE_FORMAT_PROP_NAME, "")

 

You can add a new config here -> 
[https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala]

 

Once you've added it there, it will be passed down to the above datasource 
code. 

> Add ORC support in Bootstrap Op
> ---
>
> Key: HUDI-1827
> URL: https://issues.apache.org/jira/browse/HUDI-1827
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: Teresa Kang
>Assignee: manasa
>Priority: Major
>
> SparkBootstrapCommitActionExecutor assumes parquet format right now, need to 
> support ORC as well.





[jira] [Created] (HUDI-1977) Fix Hudi-CLI show table spark-sql

2021-06-05 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1977:
-

 Summary: Fix Hudi-CLI show table spark-sql 
 Key: HUDI-1977
 URL: https://issues.apache.org/jira/browse/HUDI-1977
 Project: Apache Hudi
  Issue Type: Task
  Components: CLI
Reporter: Nishith Agarwal


https://github.com/apache/hudi/issues/2955





[jira] [Created] (HUDI-1976) Upgrade hive, jackson, log4j, hadoop to remove vulnerability

2021-06-05 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1976:
-

 Summary: Upgrade hive, jackson, log4j, hadoop to remove 
vulnerability
 Key: HUDI-1976
 URL: https://issues.apache.org/jira/browse/HUDI-1976
 Project: Apache Hudi
  Issue Type: Task
  Components: Hive Integration
Reporter: Nishith Agarwal


[https://github.com/apache/hudi/issues/2827]

[https://github.com/apache/hudi/issues/2826]

[https://github.com/apache/hudi/issues/2824]

[https://github.com/apache/hudi/issues/2823]





[jira] [Updated] (HUDI-1976) Upgrade hive, jackson, log4j, hadoop to remove vulnerability

2021-06-05 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1976:
--
Fix Version/s: 0.9.0
 Priority: Blocker  (was: Major)

> Upgrade hive, jackson, log4j, hadoop to remove vulnerability
> 
>
> Key: HUDI-1976
> URL: https://issues.apache.org/jira/browse/HUDI-1976
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>
> [https://github.com/apache/hudi/issues/2827]
> [https://github.com/apache/hudi/issues/2826]
> [https://github.com/apache/hudi/issues/2824]
> [https://github.com/apache/hudi/issues/2823]





[jira] [Updated] (HUDI-1592) Metadata listing fails for non-partitioned dataset

2021-06-05 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1592:
--
Fix Version/s: 0.9.0
 Priority: Blocker  (was: Major)

> Metadata listing fails for non-partitioned dataset
> -
>
> Key: HUDI-1592
> URL: https://issues.apache.org/jira/browse/HUDI-1592
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Storage Management
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: sev:high, user-support-issues
> Fix For: 0.9.0
>
>
> https://github.com/apache/hudi/issues/2507





[jira] [Updated] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-06-05 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1975:
--
Fix Version/s: 0.9.0
 Priority: Blocker  (was: Major)

> Upgrade java-prometheus-client from 3.1.2 to 4.x
> 
>
> Key: HUDI-1975
> URL: https://issues.apache.org/jira/browse/HUDI-1975
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Created] (HUDI-1975) Upgrade java-prometheus-client from 3.1.2 to 4.x

2021-06-05 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1975:
-

 Summary: Upgrade java-prometheus-client from 3.1.2 to 4.x
 Key: HUDI-1975
 URL: https://issues.apache.org/jira/browse/HUDI-1975
 Project: Apache Hudi
  Issue Type: Task
Reporter: Nishith Agarwal








[jira] [Updated] (HUDI-1974) Run pyspark and validate that it works correctly with all hudi versions

2021-06-05 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1974:
--
Fix Version/s: 0.9.0

> Run pyspark and validate that it works correctly with all hudi versions
> ---
>
> Key: HUDI-1974
> URL: https://issues.apache.org/jira/browse/HUDI-1974
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
> Fix For: 0.9.0
>
>






[jira] [Updated] (HUDI-1974) Run pyspark and validate that it works correctly with all hudi versions

2021-06-05 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1974:
--
Priority: Blocker  (was: Major)

> Run pyspark and validate that it works correctly with all hudi versions
> ---
>
> Key: HUDI-1974
> URL: https://issues.apache.org/jira/browse/HUDI-1974
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Nishith Agarwal
>Priority: Blocker
>





