[jira] [Assigned] (HUDI-6762) Remove usages of MetadataRecordsGenerationParams

2024-04-03 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov reassigned HUDI-6762:
---

Assignee: Vova Kolmakov

> Remove usages of MetadataRecordsGenerationParams
> 
>
> Key: HUDI-6762
> URL: https://issues.apache.org/jira/browse/HUDI-6762
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vova Kolmakov
>Priority: Minor
> Fix For: 1.0.0
>
>
> MetadataRecordsGenerationParams is deprecated. We already rely on table 
> config for the enabled MDT partition types. See if we can remove this POJO.
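A minimal sketch of what relying on table config could look like, assuming the enabled MDT partitions are stored under the table config key {{hoodie.table.metadata.partitions}} as a comma-separated list (the key name and the parsing below are assumptions for illustration, not the actual Hudi API):

{code:java}
// Hypothetical sketch: derive the enabled MDT partition types from table config
// instead of threading MetadataRecordsGenerationParams through the call sites.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

class EnabledMetadataPartitions {
  // "hoodie.table.metadata.partitions" is assumed to be the table config key.
  static Set<String> fromTableConfig(Properties tableConfig) {
    String csv = tableConfig.getProperty("hoodie.table.metadata.partitions", "");
    Set<String> partitions = new LinkedHashSet<>();
    Arrays.stream(csv.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .forEach(partitions::add);
    return partitions;
  }
}
{code}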



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6762) Remove usages of MetadataRecordsGenerationParams

2024-04-03 Thread Vova Kolmakov (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vova Kolmakov updated HUDI-6762:

Status: In Progress  (was: Open)

> Remove usages of MetadataRecordsGenerationParams
> 
>
> Key: HUDI-6762
> URL: https://issues.apache.org/jira/browse/HUDI-6762
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Vova Kolmakov
>Priority: Minor
> Fix For: 1.0.0
>
>
> MetadataRecordsGenerationParams is deprecated. We already rely on table 
> config for the enabled MDT partition types. See if we can remove this POJO.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10947:
URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036200025

   
   ## CI report:
   
   * 06a18d985e2b13159bcca2c1639c1376e871e3f8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23096)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[I] [Feature Inquiry] index for randomized upserts [hudi]

2024-04-03 Thread via GitHub


pravin1406 opened a new issue, #10961:
URL: https://github.com/apache/hudi/issues/10961

   Hi Team,
   
   We are migrating our applications to 0.14.1 with Spark as the ingestion engine. 
We wanted to know how the newly available index types, BUCKET and RECORD_INDEX, 
perform with respect to randomized upserts/deletes. Is SIMPLE still the advised 
index type to use?
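   For reference, a minimal sketch of how the index type would be selected on the write path. The option keys shown (`hoodie.index.type`, `hoodie.metadata.record.index.enable`) are the writer configs as I understand them for 0.14.x; treat them as assumptions to verify against the config docs:
   
   ```java
   // Hypothetical Spark (Java) write, choosing the index type explicitly.
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   public class IndexChoiceExample {
     public static void write(Dataset<Row> df, String basePath) {
       df.write().format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.precombine.field", "ts")
           // SIMPLE / BLOOM / BUCKET / RECORD_INDEX -- assumed valid values for hoodie.index.type
           .option("hoodie.index.type", "RECORD_INDEX")
           // the record-level index lives in the metadata table, so it must be enabled there
           .option("hoodie.metadata.record.index.enable", "true")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```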


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10947:
URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036146639

   
   ## CI report:
   
   * 85cbde75f0f652274dc28f940cd0a159096b6aad Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23065)
 
   * 06a18d985e2b13159bcca2c1639c1376e871e3f8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23096)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7564] Fix HiveSyncConfig inconsistency [hudi]

2024-04-03 Thread via GitHub


TengHuo commented on PR #10951:
URL: https://github.com/apache/hudi/pull/10951#issuecomment-2036117075

   > We did some tests today and found out why 
`hoodie.datasource.hive_sync.support_timestamp=true` is needed.
   > 
   > When performing an alter schema, hive sync is performed via Spark's 
external catalogue in 
`org.apache.spark.sql.hudi.command.AlterTableCommand#commitWithSchema`, and Spark 
syncs TIMESTAMP types as TIMESTAMP.
   > 
   > ```scala
   > sparkSession.sessionState.catalog
   >   .externalCatalog
   >   .alterTableDataSchema(db, tableName, dataSparkSchema)
   > ```
   > 
   > If this is defaulted to `false`, after altering the schema (via spark-sql) 
of a table containing a `TIMESTAMP` column, the type on HMS will change from 
`LONG` back to `TIMESTAMP` (via spark's external catalogue API).
   > 
   > This will cause subsequent hive-syncs to fail when they try to sync 
`TIMESTAMP` as `LONG`, which is not ideal.
   > 
   > I think it's best that we ensure consistency with Spark. I will submit 
another PR to change the default back to `true`, and then add 
documentation there to explain why.
   > 
   > As for the trino/presto error, they will just have to fix it on their end.
   > 
   > # Conclusion
   > The discrepancy is due to Spark's external catalogue API, 
which syncs `TIMESTAMP` types as `TIMESTAMP` to Hive.
   > 
   > Given that Hudi has multiple entrypoints, it makes sense that this 
inconsistency was introduced via Spark.
   > 
   > While I am not sure why hive-sync-tool defaulted `support_timestamp` 
to `false`, I think it's best we just document this.
   
   In this case, cross-engine scenarios may be impacted when a Hudi Flink user uses 
the `TIMESTAMP` type, since Hive sync in a Flink pipeline will sync it as `LONG` by default.
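   For cross-engine setups, a hedged sketch of making this explicit in the writer options instead of relying on the engine-specific default. Only `hoodie.datasource.hive_sync.support_timestamp` is taken from this thread; the other keys are common hive-sync options and should be checked against the docs:
   
   ```java
   // Illustrative only: pin the hive-sync timestamp behaviour explicitly so Spark
   // and Flink pipelines sync TIMESTAMP columns the same way.
   import java.util.HashMap;
   import java.util.Map;
   
   public class HiveSyncTimestampExample {
     public static Map<String, String> hiveSyncOptions() {
       Map<String, String> options = new HashMap<>();
       options.put("hoodie.datasource.hive_sync.enable", "true");
       options.put("hoodie.datasource.hive_sync.mode", "hms");
       // Key discussed in this thread: true keeps TIMESTAMP columns synced as
       // TIMESTAMP (not LONG) in HMS, matching Spark's entrypoints.
       options.put("hoodie.datasource.hive_sync.support_timestamp", "true");
       return options;
     }
   }
   ```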


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10947:
URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036102345

   
   ## CI report:
   
   * 85cbde75f0f652274dc28f940cd0a159096b6aad Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23065)
 
   * 06a18d985e2b13159bcca2c1639c1376e871e3f8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


wombatu-kun commented on code in PR #10949:
URL: https://github.com/apache/hudi/pull/10949#discussion_r1550791467


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
 option("hoodie.schema.on.read.enable","true").
 option("hoodie.datasource.write.reconcile.schema","true").
 option(DataSourceWriteOptions.TABLE_NAME.key(), tableName).
+option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), 
classOf[OverwriteWithLatestAvroPayload].getName).

Review Comment:
   This test was written for `OverwriteWithLatestAvroPayload` and it checks the 
OverwriteWithLatest behavior, so I think it's better to leave it as is. Or maybe we 
could add the same test and make it pass using the new default payload. But...
   All other tests use the default payload and pass. Why is that not enough?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


danny0405 commented on code in PR #10949:
URL: https://github.com/apache/hudi/pull/10949#discussion_r1550775053


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
 option("hoodie.schema.on.read.enable","true").
 option("hoodie.datasource.write.reconcile.schema","true").
 option(DataSourceWriteOptions.TABLE_NAME.key(), tableName).
+option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), 
classOf[OverwriteWithLatestAvroPayload].getName).

Review Comment:
   I kind of think we should modify the test to make it pass using the new default 
payload.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-7573) Metadata Table Improvements

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen reassigned HUDI-7573:


Assignee: Danny Chen

> Metadata Table Improvements
> ---
>
> Key: HUDI-7573
> URL: https://issues.apache.org/jira/browse/HUDI-7573
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: metadata
>Reporter: Vinoth Chandar
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7510) Loosen the compaction scheduling and rollback check for MDT

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7510:
-
Epic Link: HUDI-7573  (was: HUDI-6640)

> Loosen the compaction scheduling and rollback check for MDT
> ---
>
> Key: HUDI-7510
> URL: https://issues.apache.org/jira/browse/HUDI-7510
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core, metadata, table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7552:
-
Epic Link: HUDI-7573  (was: HUDI-6640)

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.
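A rough sketch of the instant-time policy the criteria above describe. This is purely illustrative; the class and method names are not the actual Hudi internals:

{code:java}
// Illustrative pseudo-implementation of the simplification criteria.
import java.util.function.Supplier;

public class MdtInstantTimePolicy {

  // 1. Reuse the DT instant timestamp for the corresponding MDT delta_commit
  //    (no suffix, no extra commit).
  public String instantForDeltaCommit(String dataTableInstantTime) {
    return dataTableInstantTime;
  }

  // 2. Table services on the MDT (clean/compact/log_compact) get their own
  //    freshly generated timestamp instead of a suffixed DT timestamp.
  public String instantForTableService(Supplier<String> instantTimeGenerator) {
    return instantTimeGenerator.get();
  }
}
{code}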



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7552:
-
Remaining Estimate: 2m
 Original Estimate: 2m

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7552:
-
Remaining Estimate: 2h  (was: 2m)
 Original Estimate: 2h  (was: 2m)

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7552:
-
Story Points: 4

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7572:
-
Epic Link: HUDI-7573  (was: HUDI-6640)

> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> After the change to [loosen the compaction for 
> MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare case where the 
> same compaction instant time gets used for scheduling multiple times, so we 
> had better optimize the compactor to avoid generating an empty compaction plan.
> Note: although we have an active timeline check to avoid the repetitive 
> scheduling, there is still a small chance that the compaction has already been archived.
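A minimal sketch of the guard this implies when building the plan (names are illustrative placeholders, not the actual compactor code):

{code:java}
// Illustrative guard: skip scheduling when no file slice has log files to compact.
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

class CompactionPlanGuard {
  static <P> Optional<P> buildPlan(List<List<String>> logFilesPerFileSlice,
                                   Function<List<List<String>>, P> planBuilder) {
    boolean hasLogFiles = logFilesPerFileSlice.stream().anyMatch(logs -> !logs.isEmpty());
    if (!hasLogFiles) {
      // Nothing to compact: avoid generating an empty compaction plan, which could
      // otherwise be scheduled repeatedly with the same instant time.
      return Optional.empty();
    }
    return Optional.of(planBuilder.apply(logFilesPerFileSlice));
  }
}
{code}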



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7573) Metadata Table Improvements

2024-04-03 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-7573:


 Summary: Metadata Table Improvements
 Key: HUDI-7573
 URL: https://issues.apache.org/jira/browse/HUDI-7573
 Project: Apache Hudi
  Issue Type: Epic
  Components: metadata
Reporter: Vinoth Chandar
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HUDI-1698) Multiwriting for Flink / Java

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1698.
--

> Multiwriting for Flink / Java
> -
>
> Key: HUDI-1698
> URL: https://issues.apache.org/jira/browse/HUDI-1698
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: flink, writer-core
>Reporter: Nishith Agarwal
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7552:
-
Reviewers: Ethan Guo

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution

2024-04-03 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7045:
--
Story Points: 12  (was: 6)

> Fix new file format and reader for schema evolution
> ---
>
> Key: HUDI-7045
> URL: https://issues.apache.org/jira/browse/HUDI-7045
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> When this is implemented, parquet readers should not be created in 
> HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can 
> uncomment/add the code from this commit: 
> [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7572:
-
Epic Link: HUDI-6640

> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> After the change to [loosen the compaction for 
> MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare case where the 
> same compaction instant time gets used for scheduling multiple times, so we 
> had better optimize the compactor to avoid generating an empty compaction plan.
> Note: although we have an active timeline check to avoid the repetitive 
> scheduling, there is still a small chance that the compaction has already been archived.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-03 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833792#comment-17833792
 ] 

Vinoth Chandar commented on HUDI-6787:
--

If we change the class inheritance of the Hudi input formats, e.g. by 
subclassing MapredParquetInputFormat, Hive may generate unoptimized query or 
execution plans. 

Do we change this in this PR?

> Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> --
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7029) Enhance CREATE INDEX syntax for functional index

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7029:
--
Priority: Minor  (was: Major)

> Enhance CREATE INDEX syntax for functional index
> 
>
> Key: HUDI-7029
> URL: https://issues.apache.org/jira/browse/HUDI-7029
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Minor
> Fix For: 1.0.0
>
>
> Currently, a user can create an index using SQL as follows: 
> `create index idx_datestr on $tableName using column_stats(ts) 
> options(func='from_unixtime', format='yyyy-MM-dd')`
> Ideally, we would like to simplify this further as follows:
> `create index idx_datestr on $tableName using column_stats(from_unixtime(ts, 
> format='yyyy-MM-dd'))`



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7548) Close any gaps on Indexing (bloom index, col stats, agg_stats, record index with support for non-unique keys,

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7548:
--
Story Points: 22

> Close any gaps on  Indexing (bloom index, col stats, agg_stats, record index 
> with support for non-unique keys,
> --
>
> Key: HUDI-7548
> URL: https://issues.apache.org/jira/browse/HUDI-7548
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Vinoth Chandar
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> * reads through SQL
>  * writers through all supported means
>  * async index create/drop w/ multiple writers 
>  * Index updates are handled correctly
>  * flexible compaction 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7025) Merge Index and Functional Index Config

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7025:
--
Priority: Minor  (was: Major)

> Merge Index and Functional Index Config
> ---
>
> Key: HUDI-7025
> URL: https://issues.apache.org/jira/browse/HUDI-7025
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Minor
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> There is an {{INDEX}} sub-group name in `ConfigGroups`. Functional index configs 
> can be consolidated within that.
>  
> https://github.com/apache/hudi/pull/9872#discussion_r1377115549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7144) Support query for tables written as partitionBy but synced as non-partitioned

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7144:
--
Story Points: 2

> Support query for tables written as partitionBy but synced as non-partitioned
> -
>
> Key: HUDI-7144
> URL: https://issues.apache.org/jira/browse/HUDI-7144
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> In HUDI-7023, we added support to sync any table as a non-partitioned table and 
> yet be able to query it via Spark with the same performance benefits as a 
> partitioned table.
> This ticket extends the functionality end-to-end. If a user executes 
> `spark.write.format("hudi").options(options).partitionBy(partCol).save(basePath)`,
> then we should do logical partitioning, sync the table as non-partitioned to the 
> catalog, and still be able to query it efficiently.
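For illustration, a sketch of the end-to-end flow being described using the standard Spark Java API; only `partitionBy` plus a plain read are taken from the description, and the extra option keys are common writer configs added for completeness:

{code:java}
// Write with partitionBy(), then query; the table is expected to be synced to the
// catalog as non-partitioned while still pruning efficiently on partCol.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionByExample {
  public static void run(SparkSession spark, Dataset<Row> df, String basePath) {
    df.write().format("hudi")
        .option("hoodie.table.name", "my_table")
        .option("hoodie.datasource.write.recordkey.field", "id")
        .partitionBy("partCol")
        .mode(SaveMode.Append)
        .save(basePath);

    // The query should still benefit from partition pruning on partCol.
    spark.read().format("hudi").load(basePath)
        .where("partCol = '2024-04-03'")
        .show();
  }
}
{code}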



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6787:
-
Summary: Hive Integrate FileGroupReader with 
HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive  
(was: Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
RealtimeCompactedRecordReader for Hive)

> Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> --
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution

2024-04-03 Thread Jonathan Vexler (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Vexler updated HUDI-7045:
--
Story Points: 6

> Fix new file format and reader for schema evolution
> ---
>
> Key: HUDI-7045
> URL: https://issues.apache.org/jira/browse/HUDI-7045
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> When this is implemented, parquet readers should not be created in 
> HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can 
> uncomment/add the code from this commit: 
> [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7572:
-
Description: 
After the change to [loosen the compaction for 
MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare case where the 
same compaction instant time gets used for scheduling multiple times, so we had better 
optimize the compactor to avoid generating an empty compaction plan.

Note: although we have an active timeline check to avoid the repetitive 
scheduling, there is still a small chance that the compaction has already been archived.

  was:
After the change to loosen the compaction for MDT, there is a rare case where the same 
compaction instant time gets used for scheduling multiple times, so we had better 
optimize the compactor to avoid generating an empty compaction plan.

Note: although we have an active timeline check to avoid the repetitive 
scheduling, there is still a small chance that the compaction has already been archived.


> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> After the change to [loosen the compaction for 
> MDT|https://issues.apache.org/jira/browse/HUDI-7572], there is a rare case where the 
> same compaction instant time gets used for scheduling multiple times, so we 
> had better optimize the compactor to avoid generating an empty compaction plan.
> Note: although we have an active timeline check to avoid the repetitive 
> scheduling, there is still a small chance that the compaction has already been archived.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7146) Implement secondary index

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7146:
--
Status: In Progress  (was: Open)

> Implement secondary index
> -
>
> Key: HUDI-7146
> URL: https://issues.apache.org/jira/browse/HUDI-7146
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> # The secondary index schema should be flexible enough to accommodate various 
> kinds of secondary indexes. 
>  # Reuse the existing indexing framework as much as possible.
>  # Merge with the existing index config and introduce as few new configs as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7145) Support for grouping values for same key in HFile

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7145.
-
Resolution: Done

> Support for grouping values for same key in HFile
> -
>
> Key: HUDI-7145
> URL: https://issues.apache.org/jira/browse/HUDI-7145
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Hudi writes metadata table (MT) base files in HFile format. HFile stores 
> sorted key-value pairs. For the existing MT partitions, the key is guaranteed 
> to be unique. However, for the secondary index, it is very likely that the same 
> value of the secondary index field appears in multiple files.
> This ticket is to microbenchmark two approaches to storing the secondary index:
>  # Group all values for a key and then store key-value pairs where each value 
> in this pair is a collection. For example, say column c1 is the secondary 
> index column with values v1 in files f1, f2 and value v2 in file f2. Then this 
> approach means there are still just 2 keys, as follows: i) v1: [f1, f2] and ii) 
> v2: [f2].
>  # Since each key-value pair is unique as a whole, store each key-value 
> pair separately (still lexicographically sorted). So, in this approach, we 
> have 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
> The benchmark should capture the storage overhead and lookup latency of one 
> approach over the other.
>  
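A small sketch contrasting the two layouts for the example in the description, with plain Java collections standing in for the HFile writer (not the actual storage code):

{code:java}
// Approach 1: one entry per secondary key, value is the grouped file list.
// Approach 2: one entry per (secondary key, file) pair, keys may repeat.
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondaryIndexLayouts {
  static Map<String, List<String>> grouped() {
    Map<String, List<String>> entries = new LinkedHashMap<>();
    entries.put("v1", List.of("f1", "f2")); // 2 keys total
    entries.put("v2", List.of("f2"));
    return entries;
  }

  static List<SimpleEntry<String, String>> flat() {
    List<SimpleEntry<String, String>> entries = new ArrayList<>();
    entries.add(new SimpleEntry<>("v1", "f1")); // 3 entries total, sorted by key
    entries.add(new SimpleEntry<>("v1", "f2"));
    entries.add(new SimpleEntry<>("v2", "f2"));
    return entries;
  }
}
{code}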



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7570) Update RFC with details on API changes

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7570:
--
Status: In Progress  (was: Open)

> Update RFC with details on API changes
> --
>
> Key: HUDI-7570
> URL: https://issues.apache.org/jira/browse/HUDI-7570
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Given that the secondary index can have duplicate keys, the existing 
> `HoodieMergedLogRecordScanner` is insufficient to handle duplicates because 
> it depends on `ExternalSpillableMap`, which can only hold unique keys. The RFC 
> should clarify how the merged log record scanner will change. We should not 
> be leaking any details to the merge handle.
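A hedged sketch of the shape of the change being asked for: a multi-valued map rather than a unique-key map. This is plain in-memory Java for illustration only; the real scanner would still need the spillable behaviour:

{code:java}
// Illustrative only: secondary-index records keyed by a non-unique secondary key.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiValuedRecordBuffer<R> {
  private final Map<String, List<R>> records = new HashMap<>();

  // Unlike a unique-key map, inserting the same secondary key twice keeps both records.
  void add(String secondaryKey, R record) {
    records.computeIfAbsent(secondaryKey, k -> new ArrayList<>()).add(record);
  }

  List<R> get(String secondaryKey) {
    return records.getOrDefault(secondaryKey, List.of());
  }
}
{code}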



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7045:
-
Reviewers: Ethan Guo

> Fix new file format and reader for schema evolution
> ---
>
> Key: HUDI-7045
> URL: https://issues.apache.org/jira/browse/HUDI-7045
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jonathan Vexler
>Assignee: Jonathan Vexler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> When this is implemented, parquet readers should not be created in 
> HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can 
> uncomment/add the code from this commit: 
> [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7146) Implement secondary index

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7146:
--
Story Points: 8

> Implement secondary index
> -
>
> Key: HUDI-7146
> URL: https://issues.apache.org/jira/browse/HUDI-7146
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> # The secondary index schema should be flexible enough to accommodate various 
> kinds of secondary indexes. 
>  # Reuse the existing indexing framework as much as possible.
>  # Merge with the existing index config and introduce as few new configs as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7145) Support for grouping values for same key in HFile

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7145:
--
Story Points: 6

> Support for grouping values for same key in HFile
> -
>
> Key: HUDI-7145
> URL: https://issues.apache.org/jira/browse/HUDI-7145
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Hudi writes metadata table (MT) base files in HFile format. HFile stores 
> sorted key-value pairs. For the existing MT partitions, the key is guaranteed 
> to be unique. However, for the secondary index, it is very likely that the same 
> value of the secondary index field appears in multiple files.
> This ticket is to microbenchmark two approaches to storing the secondary index:
>  # Group all values for a key and then store key-value pairs where each value 
> in this pair is a collection. For example, say column c1 is the secondary 
> index column with values v1 in files f1, f2 and value v2 in file f2. Then this 
> approach means there are still just 2 keys, as follows: i) v1: [f1, f2] and ii) 
> v2: [f2].
>  # Since each key-value pair is unique as a whole, store each key-value 
> pair separately (still lexicographically sorted). So, in this approach, we 
> have 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
> The benchmark should capture the storage overhead and lookup latency of one 
> approach over the other.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7572:
-
Description: 
After the change to loosen the compaction for MDT, there is a rare case where the same 
compaction instant time gets used for scheduling multiple times, so we had better 
optimize the compactor to avoid generating an empty compaction plan.

Note: although we have an active timeline check to avoid the repetitive 
scheduling, there is still a small chance that the compaction has already been archived.

> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>
> After the change to loosen the compaction for MDT, there is a rare case where the same 
> compaction instant time gets used for scheduling multiple times, so we had better 
> optimize the compactor to avoid generating an empty compaction plan.
> Note: although we have an active timeline check to avoid the repetitive 
> scheduling, there is still a small chance that the compaction has already been archived.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7145) Support for grouping values for same key in HFile

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7145:
--
Status: Patch Available  (was: In Progress)

> Support for grouping values for same key in HFile
> -
>
> Key: HUDI-7145
> URL: https://issues.apache.org/jira/browse/HUDI-7145
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2
> Fix For: 1.0.0
>
>
> Hudi writes metadata table (MT) base files in HFile format. HFile stores 
> sorted key-value pairs. For the existing MT partitions, the key is guaranteed 
> to be unique. However, for the secondary index, it is very likely that the same 
> value of the secondary index field appears in multiple files.
> This ticket is to microbenchmark two approaches to storing the secondary index:
>  # Group all values for a key and then store key-value pairs where each value 
> in this pair is a collection. For example, say column c1 is the secondary 
> index column with values v1 in files f1, f2 and value v2 in file f2. Then this 
> approach means there are still just 2 keys, as follows: i) v1: [f1, f2] and ii) 
> v2: [f2].
>  # Since each key-value pair is unique as a whole, store each key-value 
> pair separately (still lexicographically sorted). So, in this approach, we 
> have 3 entries in the HFile: i) v1: f1, ii) v1: f2 and iii) v2: f2.
> The benchmark should capture the storage overhead and lookup latency of one 
> approach over the other.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7146) Implement secondary index

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7146:
--
Status: Open  (was: In Progress)

> Implement secondary index
> -
>
> Key: HUDI-7146
> URL: https://issues.apache.org/jira/browse/HUDI-7146
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> # The secondary index schema should be flexible enough to accommodate various 
> kinds of secondary indexes. 
>  # Reuse the existing indexing framework as much as possible.
>  # Merge with the existing index config and introduce as few new configs as possible.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7552:
-
Status: Patch Available  (was: In Progress)

> Remove the suffix for MDT table service instants
> 
>
> Key: HUDI-7552
> URL: https://issues.apache.org/jira/browse/HUDI-7552
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: core
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>
> We want to remove the MDT-specific design so that its behavior is in 
> sync with the DT.
>  
> The criteria for simplification:
> {code:java}
> 1. Use the instant timestamp from the DT to commit to the MDT as much as possible 
> for any delta_commit on the MDT.
> 2. For table services like cleaning, compaction and log_compaction, the 
> timestamp is auto-generated.
> 3. Avoid triggering multiple commits to the MDT for one DT action. {code}
> The async index instant suffix is kept because there is some validation 
> logic that needs special filtering on these instants; the suffix is kind of a 
> "tag" for filtering. We should refactor that out in the future if we have a 
> better solution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)
Danny Chen created HUDI-7572:


 Summary: Avoid to schedule empty compaction plan without log files
 Key: HUDI-7572
 URL: https://issues.apache.org/jira/browse/HUDI-7572
 Project: Apache Hudi
  Issue Type: Improvement
  Components: table-service
Reporter: Danny Chen
Assignee: Danny Chen
 Fix For: 1.0.0






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files

2024-04-03 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-7572:
-
Sprint: Sprint 2024-03-25

> Avoid to schedule empty compaction plan without log files
> -
>
> Key: HUDI-7572
> URL: https://issues.apache.org/jira/browse/HUDI-7572
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


wombatu-kun commented on code in PR #10949:
URL: https://github.com/apache/hudi/pull/10949#discussion_r1550714037


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
 option("hoodie.schema.on.read.enable","true").
 option("hoodie.datasource.write.reconcile.schema","true").
 option(DataSourceWriteOptions.TABLE_NAME.key(), tableName).
+option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), 
classOf[OverwriteWithLatestAvroPayload].getName).

Review Comment:
   Yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


danny0405 commented on code in PR #10949:
URL: https://github.com/apache/hudi/pull/10949#discussion_r1550687858


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala:
##
@@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase {
 option("hoodie.schema.on.read.enable","true").
 option("hoodie.datasource.write.reconcile.schema","true").
 option(DataSourceWriteOptions.TABLE_NAME.key(), tableName).
+option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), 
classOf[OverwriteWithLatestAvroPayload].getName).

Review Comment:
   So these changes are made only to make the test pass?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


danny0405 commented on code in PR #8338:
URL: https://github.com/apache/hudi/pull/8338#discussion_r1550685909


##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##
@@ -1134,6 +1136,11 @@ public PropertyBuilder set(Map props) {
   return this;
 }
 
+public PropertyBuilder setHoodieIndexConf(Properties hoodieIndexConf) {
+  this.hoodieIndexConf = hoodieIndexConf;
+  return this;

Review Comment:
   Currently the index config is not a table config.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


danny0405 commented on code in PR #8338:
URL: https://github.com/apache/hudi/pull/8338#discussion_r1550685079


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java:
##
@@ -171,6 +178,38 @@ public void testWriteMergeOnReadWithCompaction(String 
indexType) throws Exceptio
 testWriteToHoodie(conf, "mor_write_with_compact", 1, EXPECTED);
   }
 
+  @Test
+  public void testVerifyConsistencyOfBucketNum() throws Exception {
+String path = tempFile.getAbsolutePath();
+Configuration conf = TestConfigurations.getDefaultConf(path);
+conf.setString(FlinkOptions.INDEX_TYPE, "BUCKET");
+conf.setInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, 4);

Review Comment:
   Maybe we should move this test into `TestStreamWriteOperatorCoordinator`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError: org.apache.hudi.configuration.FlinkOptions [hudi]

2024-04-03 Thread via GitHub


danny0405 commented on issue #8366:
URL: https://github.com/apache/hudi/issues/8366#issuecomment-2035876822

   It looks like the hudi flink bundle jar is not correctly loaded in the 
classpath.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7564] Revert hive sync inconsistency and reason for it (#10959)

2024-04-03 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 398c9a23c84 [HUDI-7564] Revert hive sync inconsistency and reason for 
it (#10959)
398c9a23c84 is described below

commit 398c9a23c84a54aecfea8e6c7948f198785710c5
Author: voonhous 
AuthorDate: Thu Apr 4 08:41:39 2024 +0800

[HUDI-7564] Revert hive sync inconsistency and reason for it (#10959)
---
 .../main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala   | 4 +++-
 .../src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java  | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala
 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala
index 734afd79252..dbac496022f 100644
--- 
a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala
+++ 
b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala
@@ -480,7 +480,9 @@ trait ProvidesHoodieConfig extends Logging {
   hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS, 
props.getString(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key))
 }
 
hiveSyncConfig.setDefaultValue(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS,
 classOf[MultiPartKeysValueExtractor].getName)
-
hiveSyncConfig.setDefaultValue(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE,
 HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.defaultValue())
+// This is hardcoded to true to ensure consistency as Spark syncs 
TIMESTAMP types as TIMESTAMP by default
+// via Spark's externalCatalog API, which is used by 
AlterHoodieTableCommand.
+
hiveSyncConfig.setDefaultValue(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE,
 "true")
 if (hiveSyncConfig.useBucketSync())
   hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_BUCKET_SYNC_SPEC,
 
HiveSyncConfig.getBucketSpec(props.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD.key),
diff --git 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java
 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java
index 74cb90de020..8f31cae29bc 100644
--- 
a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java
+++ 
b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java
@@ -90,7 +90,8 @@ public class HiveSyncConfigHolder {
   .defaultValue("false")
   .markAdvanced()
   .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is 
converted to hive ‘timestamp’ type. "
-  + "Disabled by default for backward compatibility.");
+  + "Disabled by default for backward compatibility. \n"
+  + "NOTE: On Spark entrypoints, this is defaulted to TRUE");
   public static final ConfigProperty HIVE_TABLE_PROPERTIES = 
ConfigProperty
   .key("hoodie.datasource.hive_sync.table_properties")
   .noDefaultValue()



Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]

2024-04-03 Thread via GitHub


danny0405 merged PR #10959:
URL: https://github.com/apache/hudi/pull/10959


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7507:
-
Fix Version/s: (was: 1.0.0)

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts and chooses timestamp (x), while Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, then 
> if Job 1 runs before Job 2, Job 1 can create a compaction plan for all instant 
> times (up to (x)) that doesn’t include instant time (x-1). Later, Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if it's an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.
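
A minimal sketch of that last approach (take the table lock, assert that nothing newer exists on the timeline, otherwise fail with a retryable error). The types below are stand-ins for illustration, not the actual Hudi classes:

```scala
// Stand-in types for illustration only; not the real Hudi lock/timeline APIs.
trait TableLock { def acquire(): Unit; def release(): Unit }
trait Instant { def timestamp: String }
trait Timeline { def reloadedInstants(): Seq[Instant] } // completed and inflight

class RetryableConflictException(msg: String) extends RuntimeException(msg)

def createRequestedPlan(timeline: Timeline, lock: TableLock, instantTime: String)
                       (writeRequestedFile: () => Unit): Unit = {
  lock.acquire() // table lock held only around the check and the file creation
  try {
    // abort unless this writer still owns the greatest timestamp on the timeline
    if (timeline.reloadedInstants().exists(_.timestamp > instantTime)) {
      throw new RetryableConflictException(
        s"$instantTime is no longer the latest instant; regenerate the timestamp and retry")
    }
    writeRequestedFile() // create the .requested file while the lock is held
  } finally {
    lock.release()
  }
}
```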



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 1.0.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0, 1.0.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is compaction/log compaction, then 
> if Job 1 runs before Job 2 it can create a compaction plan for all instant 
> times (up to (x)) that doesn't include instant time (x-1). Later Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services

2024-04-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-7507:
--
Fix Version/s: 0.15.0

>  ongoing concurrent writers with smaller timestamp can cause issues with 
> table services
> ---
>
> Key: HUDI-7507
> URL: https://issues.apache.org/jira/browse/HUDI-7507
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: table-service
>Reporter: Krishen Bhan
>Priority: Major
> Fix For: 0.15.0
>
> Attachments: Flowchart (1).png, Flowchart.png
>
>
> Although HUDI operations hold a table lock when creating a .requested 
> instant, because HUDI writers do not generate a timestamp and create a 
> .requested plan in the same transaction, there can be a scenario where 
>  # Job 1 starts, chooses timestamp (x) , Job 2 starts and chooses timestamp 
> (x - 1)
>  # Job 1 schedules and creates requested file with instant timestamp (x)
>  # Job 2 schedules and creates requested file with instant timestamp (x-1)
>  # Both jobs continue running
> If one job is writing a commit and the other is a table service, this can 
> cause issues:
>  * 
>  ** If Job 2 is an ingestion commit and Job 1 is compaction/log compaction, then 
> if Job 1 runs before Job 2 it can create a compaction plan for all instant 
> times (up to (x)) that doesn't include instant time (x-1). Later Job 2 
> will create instant time (x-1), but the timeline will be in a corrupted state 
> since the compaction plan was supposed to include (x-1)
>  ** There is a similar issue with clean. If Job2 is a long-running commit 
> (that was stuck/delayed for a while before creating its .requested plan) and 
> Job 1 is a clean, then Job 1 can perform a clean that updates the 
> earliest-commit-to-retain without waiting for the inflight instant by Job 2 
> at (x-1) to complete. This causes Job2 to be "skipped" by clean.
> [Edit] I added a diagram to visualize the issue, specifically the second 
> scenario with clean
> !Flowchart (1).png!
>  
> One way this can be resolved is by combining the operations of generating 
> instant time and creating a requested file in the same HUDI table 
> transaction. Specifically, executing the following steps whenever any instant 
> (commit, table service, etc) is scheduled
>  # Acquire table lock
>  # Look at the latest instant C on the active timeline (completed or not). 
> Generate a timestamp after C
>  # Create the plan and requested file using this new timestamp ( that is 
> greater than C)
>  # Release table lock
> Unfortunately this has the following drawbacks
>  * Every operation must now hold the table lock when computing its plan, even 
> if its an expensive operation and will take a while
>  * Users of HUDI cannot easily set their own instant time of an operation, 
> and this restriction would break any public APIs that allow this
> -An alternate approach (suggested by- [~pwason] -) was to instead have all 
> operations including table services perform conflict resolution checks before 
> committing. For example, clean and compaction would generate their plan as 
> usual. But when creating a transaction to write a .requested file, right 
> before creating the file they should check if another lower timestamp instant 
> has appeared in the timeline. And if so, they should fail/abort without 
> creating the plan. Commit operations would also be updated/verified to have 
> similar check, before creating a .requested file (during a transaction) the 
> commit operation will check if a table service plan (clean/compact) with a 
> greater instant time has been created. And if so, would abort/fail. This 
> avoids the drawbacks of the first approach, but will lead to more transient 
> failures that users have to handle.-
>  
> An alternate approach is to have every operation abort creating a .requested 
> file unless it has the latest timestamp. Specifically, for any instant type, 
> whenever an operation is about to create a .requested plan on timeline, it 
> should take the table lock and assert that there are no other instants on 
> timeline (inflight or otherwise) that are greater than it. If that assertion 
> fails, then throw a retry-able conflict resolution exception.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7503:
-
Description: 
Some external workflow schedulers can accidentally (or) misbehave and schedule 
duplicate executions of the same compaction plan. We need a way to guard 
against this inside Hudi (vs user taking a lock externally). In such a world,  
2 instances of the job concurrently call 
`org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
compaction instant. 

This is since one writer might execute the instant and create an inflight, 
while the other writer sees the inflight and tries to roll it back before 
re-attempting to execute it (since it will assume said inflight was a 
previously failed compaction attempt).

This logic should be updated such that only one writer will actually execute 
the compaction plan at a time (and the others will fail/abort).

One approach is to use a transaction (base table lock) in conjunction with 
heartbeating, to ensure that the writer triggers a heartbeat before executing 
compaction, and any concurrent writers will use the heartbeat to check whether 
the compaction is currently being executed by another writer. Specifically , 
the compact API should execute the following steps
 # Get the instant to compact C (as usual)
 # Start a transaction
 # Checks if C has an active heartbeat, if so finish transaction and throw 
exception
 # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
has been started before by another job)
 # Finish transaction
 # Run the existing compact API logic on C 
 # If execution succeeds, clean up heartbeat file . If it fails do nothing (as 
the heartbeat will anyway be automatically expired later).

Note that this approach only holds the table lock temporarily, when 
checking/starting the heartbeat

Also, this flow can be applied to execution of clean plans and other table 
services

  was:
Currently it is not safe for 2+ writers to concurrently call 
`org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
compaction instant. This is since one writer might execute the instant and 
create an inflight, while the other writer sees the inflight and tries to roll 
it back before re-attempting to execute it (since it will assume said inflight 
was a previously failed compaction attempt).

This logic should be updated such that only one writer will actually execute 
the compaction plan at a time (and the others will fail/abort).

One approach is to use a transaction (base table lock) in conjunction with 
heartbeating, to ensure that the writer triggers a heartbeat before executing 
compaction, and any concurrent writers will use the heartbeat to check whether 
the compaction is currently being executed by another writer. Specifically , 
the compact API should execute the following steps
 # Get the instant to compact C (as usual)
 # Start a transaction
 # Checks if C has an active heartbeat, if so finish transaction and throw 
exception
 # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
has been started before by another job)
 # Finish transaction
 # Run the existing compact API logic on C 
 # If execution succeeds, clean up heartbeat file . If it fails do nothing (as 
the heartbeat will anyway be automatically expired later).

Note that this approach only holds the table lock temporarily, when 
checking/starting the heartbeat

Also, this flow can be applied to execution of clean plans and other table 
services


> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>Priority: Minor
>
> Some external workflow schedulers can accidentally (or) misbehave and 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is since one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat 

[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7503:
-
Fix Version/s: 0.15.0
   1.0.0

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>Priority: Minor
> Fix For: 0.15.0, 1.0.0
>
>
> Some external workflow schedulers can accidentally (or) misbehave and 
> schedule duplicate executions of the same compaction plan. We need a way to 
> guard against this inside Hudi (vs user taking a lock externally). In such a 
> world, 2 instances of the job concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. 
> This is since one writer might execute the instant and create an inflight, 
> while the other writer sees the inflight and tries to roll it back before 
> re-attempting to execute it (since it will assume said inflight was a 
> previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically , 
> the compact API should execute the following steps
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Checks if C has an active heartbeat, if so finish transaction and throw 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up heartbeat file . If it fails do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services
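
A rough sketch of the heartbeat guard laid out in the steps above; HeartbeatClient and TableLock here are stand-ins, not the real Hudi heartbeat or lock classes:

```scala
// Stand-in types for illustration only.
trait TableLock { def acquire(): Unit; def release(): Unit }
trait HeartbeatClient {
  def isActive(instant: String): Boolean
  def start(instant: String): Unit
  def stop(instant: String): Unit
}

def guardedCompact(lock: TableLock, heartbeats: HeartbeatClient, compactionInstant: String)
                  (runCompaction: String => Unit): Unit = {
  lock.acquire() // the lock is held only while checking/starting the heartbeat
  try {
    if (heartbeats.isActive(compactionInstant)) {
      // another writer is already executing this plan: abort instead of rolling it back
      throw new IllegalStateException(s"Compaction $compactionInstant is already being executed")
    }
    heartbeats.start(compactionInstant) // implicitly restarts a heartbeat left by an earlier attempt
  } finally {
    lock.release()
  }
  runCompaction(compactionInstant) // existing compact logic runs outside the lock
  heartbeats.stop(compactionInstant) // clean up on success; expiry covers the failure case
}
```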



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset

2024-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-7503:
-
Summary: Concurrent executions of table service plan should not corrupt 
dataset  (was: concurrent executions of table service plan should not corrupt 
dataset)

> Concurrent executions of table service plan should not corrupt dataset
> --
>
> Key: HUDI-7503
> URL: https://issues.apache.org/jira/browse/HUDI-7503
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: compaction, table-service
>Reporter: Krishen Bhan
>Priority: Minor
>
> Currently it is not safe for 2+ writers to concurrently call 
> `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same 
> compaction instant. This is since one writer might execute the instant and 
> create an inflight, while the other writer sees the inflight and tries to 
> roll it back before re-attempting to execute it (since it will assume said 
> inflight was a previously failed compaction attempt).
> This logic should be updated such that only one writer will actually execute 
> the compaction plan at a time (and the others will fail/abort).
> One approach is to use a transaction (base table lock) in conjunction with 
> heartbeating, to ensure that the writer triggers a heartbeat before executing 
> compaction, and any concurrent writers will use the heartbeat to check whether 
> the compaction is currently being executed by another writer. Specifically , 
> the compact API should execute the following steps
>  # Get the instant to compact C (as usual)
>  # Start a transaction
>  # Checks if C has an active heartbeat, if so finish transaction and throw 
> exception
>  # Start a heartbeat for C (this will implicitly re-start the heartbeat if it 
> has been started before by another job)
>  # Finish transaction
>  # Run the existing compact API logic on C 
>  # If execution succeeds, clean up heartbeat file . If it fails do nothing 
> (as the heartbeat will anyway be automatically expired later).
> Note that this approach only holds the table lock temporarily, when 
> checking/starting the heartbeat
> Also, this flow can be applied to execution of clean plans and other table 
> services



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035796882

   
   ## CI report:
   
   * 865526e2bb6d40e51fe7b72bb5313701efb6df19 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23095)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError: org.apache.hudi.configuration.FlinkOptions [hudi]

2024-04-03 Thread via GitHub


ankit0811 commented on issue #8366:
URL: https://github.com/apache/hudi/issues/8366#issuecomment-2035796844

   piggybacking on this issue
   @danny0405 I still see this exception thrown when running the listed example 
for flink version `1.15.2` and hudi version `0.14.0`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035703493

   
   ## CI report:
   
   * a20e9d4c236a04becc36724f22972c8eb925c15d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23080)
 
   * 865526e2bb6d40e51fe7b72bb5313701efb6df19 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23095)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10954:
URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035693521

   
   ## CI report:
   
   * a20e9d4c236a04becc36724f22972c8eb925c15d Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23080)
 
   * 865526e2bb6d40e51fe7b72bb5313701efb6df19 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive

2024-04-03 Thread Jonathan Vexler (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833719#comment-17833719
 ] 

Jonathan Vexler commented on HUDI-6787:
---

Use [https://github.com/apache/hudi/pull/5786] as a guide for testing hive3 with 
docker demo

> Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and 
> RealtimeCompactedRecordReader for Hive
> -
>
> Key: HUDI-6787
> URL: https://issues.apache.org/jira/browse/HUDI-6787
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Ethan Guo
>Assignee: Jonathan Vexler
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10960:
URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035289254

   
   ## CI report:
   
   * 5fb7cb53038a810807489ff17b52b8568b6925d5 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23094)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10960:
URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035161250

   
   ## CI report:
   
   * 5fb7cb53038a810807489ff17b52b8568b6925d5 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23094)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10960:
URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035142907

   
   ## CI report:
   
   * 5fb7cb53038a810807489ff17b52b8568b6925d5 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode

2024-04-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7571:
-
Labels: pull-request-available  (was: )

> Add api to get exception details in HoodieMetadataTableValidator with 
> ignoreFailed mode
> ---
>
> Key: HUDI-7571
> URL: https://issues.apache.org/jira/browse/HUDI-7571
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Major
>  Labels: pull-request-available
>
> When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure 
> and continues the validation. This jira aims to add an API to get the list of 
> exceptions and an API to check whether a validation exception was thrown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]

2024-04-03 Thread via GitHub


lokeshj1703 opened a new pull request, #10960:
URL: https://github.com/apache/hudi/pull/10960

   ### Change Logs
   
   When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure 
and continues the validation. This jira aims to add an API to get the list of 
exceptions and an API to check whether a validation exception was thrown.
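
   A rough sketch of what those accessors could look like; the method names are hypothetical, and since the real HoodieMetadataTableValidator is a Java class the final API may differ:

```scala
// Hypothetical error-tracking accessors for an ignoreFailed run; names are illustrative.
class HoodieValidationException(msg: String) extends RuntimeException(msg)

class ValidatorErrorTracking {
  private val throwables = scala.collection.mutable.ListBuffer.empty[Throwable]

  def recordError(t: Throwable): Unit = throwables += t // called instead of failing fast

  def getThrowables: List[Throwable] = throwables.toList // list of all collected exceptions

  def hasValidationFailure: Boolean = // true if any collected error is a validation error
    throwables.exists(_.isInstanceOf[HoodieValidationException])
}
```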
   
   ### Impact
   
   NA
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   NA
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode

2024-04-03 Thread Lokesh Jain (Jira)
Lokesh Jain created HUDI-7571:
-

 Summary: Add api to get exception details in 
HoodieMetadataTableValidator with ignoreFailed mode
 Key: HUDI-7571
 URL: https://issues.apache.org/jira/browse/HUDI-7571
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Lokesh Jain
Assignee: Lokesh Jain


When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure and 
continues the validation. This jira aims to add an API to get the list of exceptions 
and an API to check whether a validation exception was thrown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2035012033

   
   ## CI report:
   
   * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5432] Fixing rollback block handling in LogRecordReader [hudi]

2024-04-03 Thread via GitHub


nsivabalan commented on PR #7649:
URL: https://github.com/apache/hudi/pull/7649#issuecomment-2034923879

   this may not be valid anymore. 
   we made rollbacks eager and ensure we rollback any failed writes in MDT 
before starting a new commit. 
   
https://github.com/apache/hudi/blob/bf723f56cd0d379f951a5a2d535502f326d1bc78/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java#L1177
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5432] Fixing rollback block handling in LogRecordReader [hudi]

2024-04-03 Thread via GitHub


nsivabalan closed pull request #7649: [HUDI-5432] Fixing rollback block 
handling in LogRecordReader
URL: https://github.com/apache/hudi/pull/7649


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #8338:
URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034903046

   
   ## CI report:
   
   * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN
   * fbbfddc71d0aefd947dcb21bb412c12571f357d2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23092)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034893011

   
   ## CI report:
   
   * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077)
 
   * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7559] [1/n] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]

2024-04-03 Thread via GitHub


codope commented on code in PR #10947:
URL: https://github.com/apache/hudi/pull/10947#discussion_r1549935529


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala:
##
@@ -180,9 +157,40 @@ class RecordLevelIndexSupport(spark: SparkSession,
   }
 
   /**
-   * Return true if metadata table is enabled and record index metadata 
partition is available.
+   * Returns the attribute and literal pair given the operands of a binary 
operator. The pair is returned only if one of
+   * the operand is an attribute and other is literal. In other cases it 
returns an empty Option.
+   * @param expression1 - Left operand of the binary operator
+   * @param expression2 - Right operand of the binary operator
+   * @return Attribute and literal pair
*/
-  def isIndexAvailable: Boolean = {
-metadataConfig.enabled && 
metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX)
+  private def getAttributeLiteralTuple(expression1: Expression, expression2: 
Expression): Option[(AttributeReference, Literal)] = {
+expression1 match {
+  case attr: AttributeReference => expression2 match {
+case literal: Literal =>
+  Option.apply(attr, literal)
+case _ =>
+  Option.empty
+  }
+  case literal: Literal => expression2 match {
+case attr: AttributeReference =>
+  Option.apply(attr, literal)
+case _ =>
+  Option.empty
+  }
+  case _ => Option.empty
+}
+  }
+
+  /**
+   * Matches the configured simple record key with the input attribute name.
+   * @param attributeName The attribute name provided in the query
+   * @return true if input attribute name matches the configured simple record 
key
+   */
+  private def attributeMatchesRecordKey(attributeName: String, recordKeyOpt: 
Option[String]): Boolean = {

Review Comment:
   What was wrong with prev implementation which did call `getRecordKeyConfig` 
inside this method? Or is it just refactoring?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala:
##
@@ -180,9 +157,40 @@ class RecordLevelIndexSupport(spark: SparkSession,
   }
 
   /**
-   * Return true if metadata table is enabled and record index metadata 
partition is available.
+   * Returns the attribute and literal pair given the operands of a binary 
operator. The pair is returned only if one of
+   * the operand is an attribute and other is literal. In other cases it 
returns an empty Option.
+   * @param expression1 - Left operand of the binary operator
+   * @param expression2 - Right operand of the binary operator
+   * @return Attribute and literal pair
*/
-  def isIndexAvailable: Boolean = {
-metadataConfig.enabled && 
metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX)
+  private def getAttributeLiteralTuple(expression1: Expression, expression2: 
Expression): Option[(AttributeReference, Literal)] = {

Review Comment:
    can we unit test this method?



##
hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/hudi/TestRecordLevelIndexSupport.scala:
##
@@ -0,0 +1,60 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi
+
+import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField
+import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, 
FromUnixTime, Literal}
+import org.apache.spark.sql.types.StringType
+import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue}
+import org.junit.jupiter.api.Test
+
+import java.util.TimeZone
+
+class TestRecordLevelIndexSupport {
+  @Test
+  def testFilterQueryWithRecordKey(): Unit = {

Review Comment:
    Good that we have the test for equalTo. Can we also add some for In and Not In?
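
    A rough sketch of what an In case could look like; the exact signature and return type of `filterQueryWithRecordKey` are assumed here for illustration:

```scala
// Assumes the existing test imports plus org.apache.spark.sql.catalyst.expressions.In,
// and that RecordLevelIndexSupport.filterQueryWithRecordKey(filter, recordKeyOpt) returns an Option.
@Test
def testFilterQueryWithInFilter(): Unit = {
  val recordKeyField = HoodieMetadataField.RECORD_KEY_METADATA_FIELD.getFieldName
  val onRecordKey = In(AttributeReference(recordKeyField, StringType)(),
    Seq(Literal("key1"), Literal("key2")))
  // an In filter on the record key column should be usable for RLI pruning
  assertTrue(RecordLevelIndexSupport.filterQueryWithRecordKey(onRecordKey, Option(recordKeyField)).isDefined)

  // the same filter on any other column must be ignored by the record index
  val onOtherColumn = In(AttributeReference("driver", StringType)(),
    Seq(Literal("key1"), Literal("key2")))
  assertTrue(RecordLevelIndexSupport.filterQueryWithRecordKey(onOtherColumn, Option(recordKeyField)).isEmpty)
}
```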



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala:
##
@@ -143,22 +104,38 @@ class RecordLevelIndexSupport(spark: SparkSession,
 }
   }
 
+  /**
+   * Return true if metadata table is enabled and record index metadata 
partition is 

Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034872367

   
   ## CI report:
   
   * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077)
 
   * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2034872494

   
   ## CI report:
   
   * 66f7add237e807bc7ad7a870ee39f3c60762b728 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


wombatu-kun commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034830732

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]

2024-04-03 Thread via GitHub


wombatu-kun commented on code in PR #10942:
URL: https://github.com/apache/hudi/pull/10942#discussion_r1549898777


##
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/execution/bulkinsert/JavaGlobalSortPartitioner.java:
##
@@ -31,12 +32,21 @@
  *
  * @param  HoodieRecordPayload type
  */
-public class JavaGlobalSortPartitioner
-implements BulkInsertPartitioner>> {
+public class JavaGlobalSortPartitioner implements 
BulkInsertPartitioner>> {
+
+  public JavaGlobalSortPartitioner() {
+  }
+
+  /**
+   * Constructor to create as UserDefinedBulkInsertPartitioner class via 
reflection
+   * @param config HoodieWriteConfig

Review Comment:
   @danny0405 can you please assign this PR to @nsivabalan to summon him to 
this discussion?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034768935

   
   ## CI report:
   
   * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077)
 
   * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #8338:
URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034755617

   
   ## CI report:
   
   * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN
   * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091)
 
   * fbbfddc71d0aefd947dcb21bb412c12571f357d2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23092)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10949:
URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034744354

   
   ## CI report:
   
   * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077)
 
   * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #8338:
URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034733630

   
   ## CI report:
   
   * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN
   * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091)
 
   * fbbfddc71d0aefd947dcb21bb412c12571f357d2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-203479

   
   ## CI report:
   
   * 66f7add237e807bc7ad7a870ee39f3c60762b728 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23086)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10957:
URL: https://github.com/apache/hudi/pull/10957#issuecomment-2034721873

   
   ## CI report:
   
   * 66f7add237e807bc7ad7a870ee39f3c60762b728 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]

2024-04-03 Thread via GitHub


jonvex commented on code in PR #10954:
URL: https://github.com/apache/hudi/pull/10954#discussion_r1549758875


##
hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala:
##
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs.Path
+import org.apache.hadoop.mapreduce.lib.input.FileSplit
+import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
+import org.apache.parquet.filter2.compat.FilterCompat
+import org.apache.parquet.filter2.predicate.FilterApi
+import 
org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS
+import org.apache.parquet.hadoop.{ParquetFileReader, ParquetInputFormat, 
ParquetRecordReader}
+import org.apache.spark.TaskContext
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.avro.AvroDeserializer
+import org.apache.spark.sql.catalyst.InternalRow
+import 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection
+import org.apache.spark.sql.catalyst.expressions.{Cast, JoinedRow, UnsafeRow}
+import org.apache.spark.sql.catalyst.util.DateTimeUtils
+import org.apache.spark.sql.execution.datasources.{PartitionedFile, 
RecordReaderIterator}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.sources.Filter
+import org.apache.spark.sql.types.{AtomicType, StructField, StructType}
+import org.apache.spark.util.SerializableConfiguration
+
+import java.net.URI
+
+object Spark24HoodieParquetReader {
+
+  /**
+   * Get properties needed to read a parquet file
+   *
+   * @param vectorized true if vectorized reading is not prohibited due to 
schema, reading mode, etc
+   * @param sqlConfthe [[SQLConf]] used for the read
+   * @param optionspassed as a param to the file format
+   * @param hadoopConf some configs will be set for the hadoopConf
+   * @return map of properties needed for reading a parquet file
+   */
+  def getPropsForReadingParquet(vectorized: Boolean,
+sqlConf: SQLConf,
+options: Map[String, String],
+hadoopConf: Configuration): Map[String, 
String] = {
+//set hadoopconf
+hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, 
classOf[ParquetReadSupport].getName)
+hadoopConf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, 
sqlConf.sessionLocalTimeZone)
+hadoopConf.setBoolean(SQLConf.CASE_SENSITIVE.key, 
sqlConf.caseSensitiveAnalysis)
+hadoopConf.setBoolean(SQLConf.PARQUET_BINARY_AS_STRING.key, 
sqlConf.isParquetBinaryAsString)
+hadoopConf.setBoolean(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key, 
sqlConf.isParquetINT96AsTimestamp)
+
+Map(
+  "enableVectorizedReader" -> vectorized.toString,
+  "enableParquetFilterPushDown" -> sqlConf.parquetFilterPushDown.toString,
+  "pushDownDate" -> sqlConf.parquetFilterPushDownDate.toString,
+  "pushDownTimestamp" -> sqlConf.parquetFilterPushDownTimestamp.toString,
+  "pushDownDecimal" -> sqlConf.parquetFilterPushDownDecimal.toString,
+  "pushDownInFilterThreshold" -> 
sqlConf.parquetFilterPushDownInFilterThreshold.toString,
+  "pushDownStringStartWith" -> 
sqlConf.parquetFilterPushDownStringStartWith.toString,
+  "isCaseSensitive" -> sqlConf.caseSensitiveAnalysis.toString,
+  "timestampConversion" -> 
sqlConf.isParquetINT96TimestampConversion.toString,
+  "enableOffHeapColumnVector" -> 
sqlConf.offHeapColumnVectorEnabled.toString,
+  "capacity" -> sqlConf.parquetVectorizedReaderBatchSize.toString,
+  "returningBatch" -> sqlConf.parquetVectorizedReaderEnabled.toString,
+  "enableRecordFilter" -> sqlConf.parquetRecordFilterEnabled.toString,
+  "timeZoneId" -> sqlConf.sessionLocalTimeZone
+)
+  }
+
+  /**
+   * Read an individual parquet file
+   * Code from ParquetFileFormat#buildReaderWithPartitionValues from Spark 
v2.4.8 adapted here
+   *
+   * @param file parquet file to read
+   * @param 

Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10959:
URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034585539

   
   ## CI report:
   
   * 912797cefdd31067dde0c43e2b5c537d73d2b084 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7569:
--
Status: In Progress  (was: Open)

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting valid `In` expression/filter on record-key 
> has bugs. It tries to prune files assuming that `In` expression/filter can 
> reference only record-key column even when the `In` query is based on other 
> columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.
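
A minimal illustration of the intended guard, using Spark Catalyst expression types; the helper name and where it would live are hypothetical:

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, Expression, In, Literal}

// Hypothetical helper: only treat a filter as RLI-prunable when it targets the record key column.
def canPruneWithRecordIndex(filter: Expression, recordKey: String): Boolean = filter match {
  case In(attr: AttributeReference, values) =>
    attr.name.equalsIgnoreCase(recordKey) && values.forall(_.isInstanceOf[Literal])
  case EqualTo(attr: AttributeReference, _: Literal) => attr.name.equalsIgnoreCase(recordKey)
  case EqualTo(_: Literal, attr: AttributeReference) => attr.name.equalsIgnoreCase(recordKey)
  case _ => false // e.g. `driver in ('abc', 'xyz')` on a non-key column must not prune files
}
```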



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7569:
--
Status: Patch Available  (was: In Progress)

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting valid `In` expression/filter on record-key 
> has bugs. It tries to prune files assuming that `In` expression/filter can 
> reference only record-key column even when the `In` query is based on other 
> columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-7569) Fix wrong result while using RLI for pruning files

2024-04-03 Thread Sagar Sumit (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit closed HUDI-7569.
-
Resolution: Fixed

> Fix wrong result while using RLI for pruning files
> --
>
> Key: HUDI-7569
> URL: https://issues.apache.org/jira/browse/HUDI-7569
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinaykumar Bhat
>Assignee: Vinaykumar Bhat
>Priority: Major
>  Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Data skipping (pruning files) for RLI is supported only when the query 
> predicate has `EqualTo` or `In` expressions/filters on the record-key column. 
> However, the logic for detecting valid `In` expression/filter on record-key 
> has bugs. It tries to prune files assuming that `In` expression/filter can 
> reference only record-key column even when the `In` query is based on other 
> columns.
>  
> For example, a query of the form `select * from trips_table where driver in 
> ('abc', 'xyz')` has the potential to return wrong results if the record-key 
> for this table also has values 'abc' or 'xyz' for some rows of the table.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [I] [SUPPORT] Slashes in partition columns [hudi]

2024-04-03 Thread via GitHub


eshu commented on issue #10754:
URL: https://github.com/apache/hudi/issues/10754#issuecomment-2034546730

   @ad1happy2go It does not work in my example. Did you try it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



(hudi) branch master updated: [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation (#10778)

2024-04-03 Thread jonvex
This is an automated email from the ASF dual-hosted git repository.

jonvex pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new bf723f56cd0 [HUDI-7486] Classify schema exceptions when converting 
from avro to spark row representation (#10778)
bf723f56cd0 is described below

commit bf723f56cd0d379f951a5a2d535502f326d1bc78
Author: Jon Vexler 
AuthorDate: Wed Apr 3 08:50:12 2024 -0400

[HUDI-7486] Classify schema exceptions when converting from avro to spark 
row representation (#10778)

* make exceptions more specific

* use hudi avro exception

* Address review comments

* fix unnecessary changes

* add exception wrapping

* style

* address review comments

* remove . from config

* address review comments

* fix merge

* fix checkstyle

* Update 
hudi-common/src/main/java/org/apache/hudi/exception/HoodieRecordCreationException.java

Co-authored-by: Y Ethan Guo 

* Update 
hudi-common/src/main/java/org/apache/hudi/exception/HoodieAvroSchemaException.java

Co-authored-by: Y Ethan Guo 

* add javadoc to exception wrapper

-

Co-authored-by: Jonathan Vexler <=>
Co-authored-by: Y Ethan Guo 
---
 .../org/apache/hudi/AvroConversionUtils.scala  | 14 +--
 .../scala/org/apache/hudi/HoodieSparkUtils.scala   | 20 ++---
 .../hudi/util/ExceptionWrappingIterator.scala  | 44 +++
 .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 10 ++---
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java | 25 +++
 .../hudi/exception/HoodieAvroSchemaException.java  | 31 ++
 .../exception/HoodieRecordCreationException.java   | 32 ++
 .../org/apache/hudi/HoodieSparkSqlWriter.scala | 14 ---
 .../utilities/config/HoodieStreamerConfig.java |  7 
 .../apache/hudi/utilities/sources/RowSource.java   |  9 +++-
 .../utilities/streamer/HoodieStreamerUtils.java| 24 +++
 .../utilities/streamer/SourceFormatAdapter.java|  9 +++-
 .../hudi/utilities/sources/TestAvroDFSSource.java  |  3 +-
 .../hudi/utilities/sources/TestCsvDFSSource.java   |  3 +-
 .../hudi/utilities/sources/TestJsonDFSSource.java  | 49 +-
 .../utilities/sources/TestParquetDFSSource.java|  3 +-
 .../sources/AbstractDFSSourceTestBase.java |  7 +++-
 17 files changed, 257 insertions(+), 47 deletions(-)

diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
index 55877938f8c..95962d1ca44 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
@@ -23,6 +23,7 @@ import org.apache.avro.generic.GenericRecord
 import org.apache.avro.{JsonProperties, Schema}
 import org.apache.hudi.HoodieSparkUtils.sparkAdapter
 import org.apache.hudi.avro.AvroSchemaUtils
+import org.apache.hudi.exception.SchemaCompatibilityException
 import org.apache.hudi.internal.schema.HoodieSchemaException
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
@@ -58,9 +59,16 @@ object AvroConversionUtils {
*/
   def createInternalRowToAvroConverter(rootCatalystType: StructType, 
rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = {
 val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, 
rootAvroType, nullable)
-row => serializer
-  .serialize(row)
-  .asInstanceOf[GenericRecord]
+row => {
+  try {
+serializer
+  .serialize(row)
+  .asInstanceOf[GenericRecord]
+  } catch {
+case e: HoodieSchemaException => throw e
+case e => throw new SchemaCompatibilityException("Failed to convert 
spark record into avro record", e)
+  }
+}
   }
 
   /**
diff --git 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
index 03d977f6fc9..6de5de8842e 100644
--- 
a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
+++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala
@@ -18,25 +18,25 @@
 
 package org.apache.hudi
 
+import org.apache.avro.Schema
+import org.apache.avro.generic.GenericRecord
+import org.apache.hadoop.fs.Path
 import org.apache.hudi.HoodieConversionUtils.toScalaOption
 import org.apache.hudi.avro.{AvroSchemaUtils, HoodieAvroUtils}
 import org.apache.hudi.client.utils.SparkRowSerDe
 import org.apache.hudi.common.model.HoodieRecord
 import org.apache.hudi.hadoop.fs.CachingPath
-
-import org.apache.avro.Schema
-import 
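The file listing above shows a new `ExceptionWrappingIterator.scala`, whose diff is not included in this message. As a rough illustration of the idea (wrap an iterator and rethrow any failure through a caller-supplied classifier), a sketch might look like the following; the class name, constructor signature and the usage comment are assumptions, not the actual Hudi code.

{code:scala}
// Sketch only: delegate to the wrapped iterator and funnel every failure
// through a caller-supplied function so it can be reclassified (for example
// into a record-creation or schema exception).
class ExceptionWrappingIteratorSketch[T](wrapped: Iterator[T],
                                         wrapException: Throwable => Throwable) extends Iterator[T] {

  override def hasNext: Boolean =
    try wrapped.hasNext
    catch { case t: Throwable => throw wrapException(t) }

  override def next(): T =
    try wrapped.next()
    catch { case t: Throwable => throw wrapException(t) }
}

// Hypothetical usage: wrap a row iterator so failures surface with a clearer
// message instead of a bare runtime error bubbling out of the serializer.
// new ExceptionWrappingIteratorSketch(rows, t => new RuntimeException("failed to create record", t))
{code}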

Re: [PR] [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation [hudi]

2024-04-03 Thread via GitHub


jonvex merged PR #10778:
URL: https://github.com/apache/hudi/pull/10778


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #8338:
URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034415412

   
   ## CI report:
   
   * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN
   * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10959:
URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034405993

   
   ## CI report:
   
   * 912797cefdd31067dde0c43e2b5c537d73d2b084 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23090)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #8338:
URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034400310

   
   ## CI report:
   
   * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN
   * 8081f81d180126d9c407eac821dbfbd7f5ae28f2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16284)
 
   * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]

2024-04-03 Thread via GitHub


hudi-bot commented on PR #10959:
URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034389641

   
   ## CI report:
   
   * 912797cefdd31067dde0c43e2b5c537d73d2b084 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] - org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool [hudi]

2024-04-03 Thread via GitHub


ad1happy2go commented on issue #10361:
URL: https://github.com/apache/hudi/issues/10361#issuecomment-2034384415

   @limadiego Gentle ping here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Optimized code ComparableVersion [hudi]

2024-04-03 Thread via GitHub


ad1happy2go commented on issue #10933:
URL: https://github.com/apache/hudi/issues/10933#issuecomment-2034376325

   @balloon72 Did you get a chance to provide more details here?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [I] [SUPPORT] Insert and Delete in a Single Operation. [hudi]

2024-04-03 Thread via GitHub


ad1happy2go commented on issue #10958:
URL: https://github.com/apache/hudi/issues/10958#issuecomment-2034372822

   @lucianondolenc I don't think that is possible. If we use the 'upsert' 
operation type, it won't allow duplicates and will maintain uniqueness, and if we 
use the 'insert' operation type, it won't honor the '_hoodie_is_deleted' field. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
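Expanding on the comment above: when inserts and deletes need to land in one batch, the usual approach is to write everything with the `upsert` operation and mark the rows to remove with `_hoodie_is_deleted = true` (with the caveat that `upsert` enforces key uniqueness, so duplicate inserts are not possible). A minimal sketch follows; the table name, path and non-meta column names are illustrative assumptions.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-insert-delete-sketch").getOrCreate()
import spark.implicits._

// One batch mixing an insert/update and a delete: the row flagged with
// _hoodie_is_deleted = true is removed from the table, the other row is upserted.
val batch = Seq(
  ("id-1", "new-value", false),
  ("id-2", "ignored",   true)
).toDF("record_key", "value", "_hoodie_is_deleted")

batch.write.format("hudi")
  .option("hoodie.datasource.write.recordkey.field", "record_key")
  .option("hoodie.datasource.write.precombine.field", "value")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.table.name", "example_table")
  .mode("append")
  .save("/tmp/example_table")
{code}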



Re: [PR] [HUDI-7564] Fix HiveSyncConfig inconsistency [hudi]

2024-04-03 Thread via GitHub


voonhous commented on PR #10951:
URL: https://github.com/apache/hudi/pull/10951#issuecomment-2034289199

   @danny0405 PR to revert this change + add docs:
   
   https://github.com/apache/hudi/pull/10959


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]

2024-04-03 Thread via GitHub


voonhous opened a new pull request, #10959:
URL: https://github.com/apache/hudi/pull/10959

   ### Change Logs
   
   Reverting the hive-sync inconsistency as described in 
https://github.com/apache/hudi/pull/10951#issuecomment-2034230672.
   
   TL;DR: this inconsistency was introduced to ensure that hive-sync's behaviour 
is in line with Spark's externalCatalog table schema sync, which is used in 
AlterTableHoodieCommand.
   
   Hive-sync is used to create the _ro and _rt tables of MOR.
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   Updated the config description for 
`hoodie.datasource.hive_sync.support_timestamp` to document that this is an 
intended inconsistency.
   
   
   ### Contributor's checklist
   
   - [X] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
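Since the PR above only documents the behaviour rather than changing it, users who need the HMS type to stay consistent for downstream engines can still pin the config explicitly at write time. A minimal sketch follows (table name, path and column names are illustrative; per the Jira discussion, false syncs TIMESTAMP columns to HMS as LONG, true syncs them as TIMESTAMP).

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_timestamp

val spark = SparkSession.builder().appName("hive-sync-timestamp-sketch").getOrCreate()
import spark.implicits._

// Illustrative single-row frame with a TIMESTAMP column.
val df = Seq((1L, "2023-01-01")).toDF("int_col", "ts_str")
  .withColumn("timestamp_col", to_timestamp($"ts_str", "yyyy-MM-dd"))
  .drop("ts_str")

// Pin support_timestamp explicitly so hive sync registers the column type
// the way downstream engines (e.g. Trino/Presto) expect.
df.write.format("hudi")
  .option("hoodie.table.name", "timestamp_issue")
  .option("hoodie.datasource.write.recordkey.field", "int_col")
  .option("hoodie.datasource.hive_sync.enable", "true")
  .option("hoodie.datasource.hive_sync.support_timestamp", "false")
  .mode("append")
  .save("/tmp/timestamp_issue")
{code}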



[jira] [Commented] (HUDI-7564) Fix HiveSync configuration inconsistencies

2024-04-03 Thread voon (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833501#comment-17833501
 ] 

voon commented on HUDI-7564:


 
This discrepancy is due to Spark's external catalogue API, which syncs 
{{TIMESTAMP}} types as {{TIMESTAMP}} to Hive.
Given that Hudi has multiple entrypoints, it makes sense that Spark introduced 
this inconsistency.
While I am not sure why hive-sync-tool defaulted {{support_timestamp}} to 
{{false}}, I think it's best we just document this.

> Fix HiveSync configuration inconsistencies
> --
>
> Key: HUDI-7564
> URL: https://issues.apache.org/jira/browse/HUDI-7564
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> *hoodie.datasource.hive_sync.support_timestamp* is required to be *false* 
> such that *TIMESTAMP (MICROS)* columns will be synced onto HMS as *LONG* 
> types.
>  
> While this is not visible to hive-console/spark-sql console with the 
> {_}show-create-database{_}/{_}describe-table{_} command, HMS will store the 
> timestamp type as:
>  
> {code:java}
> support_timestamp=false LONG 
> support_timestamp=true  TIMESTAMP{code}
>  
> By overriding this to {*}true{*}, Trino/Presto queries will fail with this 
> error as it is reliant on HMS information:
> {code:java}
> Caused by: io.prestosql.jdbc.$internal.client.FailureInfo$FailureException: 
> Expected field to be long, actual timestamp(9) (field 0)
> at 
> io.trino.plugin.hive.GenericHiveRecordCursor.validateType(GenericHiveRecordCursor.java:569)
> at 
> io.trino.plugin.hive.GenericHiveRecordCursor.getLong(GenericHiveRecordCursor.java:274)
> at 
> io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:106)
> at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120)
> at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299)
> at io.trino.operator.Driver.processInternal(Driver.java:395)
> at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
> at io.trino.operator.Driver.tryWithLock(Driver.java:694)
> at io.trino.operator.Driver.process(Driver.java:290)
> at io.trino.operator.Driver.processForDuration(Driver.java:261)
> at 
> io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:911)
> at 
> io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:188)
> at 
> io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:569)
> at 
> io.trino.$gen.Trino_trino426_sql_hudi_di07_00120240326_074936_2.run(Unknown
>  Source)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> 2024-04-02 17:32:21 (UTC+8) INFO - Clear session property for connection.
> 2024-04-02 17:32:21 (UTC+8) ERROR- Task Execution failed with 
> CommonException: Query failed (#20240402_093220_06724_cg4jg): Expected field 
> to be long, actual timestamp(9) (field 0) {code}
> To demonstrate that the default support_timestamp config is not true via 
> spark-sql:
> {code:java}
> -- EXECUTE THESE QUERIES IN SPARK
> -- Create a table 
> create table if not exists dev_hudi.timestamp_issue (
>   int_col   bigint,
>   `timestamp_col` TIMESTAMP
> ) using hudi 
> tblproperties (
>   type = 'mor',
>   primaryKey = 'int_col'
>  );
> -- Perform an insert to trigger hive sync to create _ro and _rt tables 
> insert into dev_hudi.timestamp_issue select
>           1 as int_col,
>           to_timestamp('2023-01-01', '-MM-dd') as timestamp_col;
>           to_timestamp('2023-01-01', 'yyyy-MM-dd') as timestamp_col;
> select * from dev_hudi.timestamp_issue_rt;
> -- Set support_timestamp to it's supposed default value (false)
> set hoodie.datasource.hive_sync.support_timestamp=false;
> -- Perform an insert again (Will throw an error)
> insert into dev_hudi.timestamp_issue select
>           1 as int_col,
>           to_timestamp('2023-01-01', 'yyyy-MM-dd') as timestamp_col;{code}
> The last insert query will throw the error below, showing that 
> {*}support_timestamp{*}'s default value is {*}true{*}. 
> {code:java}
> Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception 
> when hive syncing timestamp_issue
>     at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:190)
>     at 
> org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58)
>     ... 64 more
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert 
> field Type from TIMESTAMP to bigint 

[jira] [Comment Edited] (HUDI-7564) Fix HiveSync configuration inconsistencies

2024-04-03 Thread voon (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833501#comment-17833501
 ] 

voon edited comment on HUDI-7564 at 4/3/24 11:10 AM:
-

This discrepancy is due to Spark's external catalogue API, which syncs 
{{TIMESTAMP}} types as {{TIMESTAMP}} to Hive.
Given that Hudi has multiple entrypoints, it makes sense that Spark introduced 
this inconsistency.
While I am not sure why hive-sync-tool defaulted {{support_timestamp}} to 
{{false}}, I think it's best we just document this.


was (Author: JIRAUSER294635):
 
h1.  
The reason for this discrepancy is due to Spark's external catalogue API, which 
syncs {{TIMESTAMP}} types as {{TIMESTAMP}} to hive.
Given that Hudi has multiple entrypoints, it make sense that Spark introduced 
this inconsistency.
While I am not sure why hive-sync-tool defaulted the {{support_timestamp}} as 
{{{}false{}}}, I think it's best we just document this.

> Fix HiveSync configuration inconsistencies
> --
>
> Key: HUDI-7564
> URL: https://issues.apache.org/jira/browse/HUDI-7564
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: voon
>Assignee: voon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.15.0
>
>
> *hoodie.datasource.hive_sync.support_timestamp* is required to be *false* 
> such that *TIMESTAMP (MICROS)* columns will be synced onto HMS as *LONG* 
> types.
>  
> While this is not visible to hive-console/spark-sql console with the 
> {_}show-create-database{_}/{_}describe-table{_} command, HMS will store the 
> timestamp type as:
>  
> {code:java}
> support_timestamp=false LONG 
> support_timestamp=true  TIMESTAMP{code}
>  
> By overriding this to {*}true{*}, Trino/Presto queries will fail with this 
> error as it is reliant on HMS information:
> {code:java}
> Caused by: io.prestosql.jdbc.$internal.client.FailureInfo$FailureException: 
> Expected field to be long, actual timestamp(9) (field 0)
> at 
> io.trino.plugin.hive.GenericHiveRecordCursor.validateType(GenericHiveRecordCursor.java:569)
> at 
> io.trino.plugin.hive.GenericHiveRecordCursor.getLong(GenericHiveRecordCursor.java:274)
> at 
> io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:106)
> at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120)
> at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299)
> at io.trino.operator.Driver.processInternal(Driver.java:395)
> at io.trino.operator.Driver.lambda$process$8(Driver.java:298)
> at io.trino.operator.Driver.tryWithLock(Driver.java:694)
> at io.trino.operator.Driver.process(Driver.java:290)
> at io.trino.operator.Driver.processForDuration(Driver.java:261)
> at 
> io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:911)
> at 
> io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:188)
> at 
> io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:569)
> at 
> io.trino.$gen.Trino_trino426_sql_hudi_di07_00120240326_074936_2.run(Unknown
>  Source)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> at java.base/java.lang.Thread.run(Thread.java:833)
> 2024-04-02 17:32:21 (UTC+8) INFO - Clear session property for connection.
> 2024-04-02 17:32:21 (UTC+8) ERROR- Task Execution failed with 
> CommonException: Query failed (#20240402_093220_06724_cg4jg): Expected field 
> to be long, actual timestamp(9) (field 0) {code}
> To demonstrate that the default support_timestamp config is not true via 
> spark-sql:
> {code:java}
> -- EXECUTE THESE QUERIES IN SPARK
> -- Create a table 
> create table if not exists dev_hudi.timestamp_issue (
>   int_col   bigint,
>   `timestamp_col` TIMESTAMP
> ) using hudi 
> tblproperties (
>   type = 'mor',
>   primaryKey = 'int_col'
>  );
> -- Perform an insert to trigger hive sync to create _ro and _rt tables 
> insert into dev_hudi.timestamp_issue select
>           1 as int_col,
>           to_timestamp('2023-01-01', 'yyyy-MM-dd') as timestamp_col;
> -- Execute a query to verify that data has been written
> select * from dev_hudi.timestamp_issue_rt;
> -- Set support_timestamp to it's supposed default value (false)
> set hoodie.datasource.hive_sync.support_timestamp=false;
> -- Perform an insert again (Will throw an error)
> insert into dev_hudi.timestamp_issue select
>           1 as int_col,
>           to_timestamp('2023-01-01', 'yyyy-MM-dd') as timestamp_col;{code}
> The last insert query will throw the error below, showing that 
> {*}support_timestamp{*}'s default value is {*}true{*}. 
> 
