[jira] [Assigned] (HUDI-6762) Remove usages of MetadataRecordsGenerationParams
[ https://issues.apache.org/jira/browse/HUDI-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vova Kolmakov reassigned HUDI-6762: --- Assignee: Vova Kolmakov > Remove usages of MetadataRecordsGenerationParams > > > Key: HUDI-6762 > URL: https://issues.apache.org/jira/browse/HUDI-6762 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vova Kolmakov >Priority: Minor > Fix For: 1.0.0 > > > MetadataRecordsGenerationParams is deprecated. We already rely on table > config for enabled mdt partition types. See if we can remove this POJO. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6762) Remove usages of MetadataRecordsGenerationParams
[ https://issues.apache.org/jira/browse/HUDI-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vova Kolmakov updated HUDI-6762: Status: In Progress (was: Open) > Remove usages of MetadataRecordsGenerationParams > > > Key: HUDI-6762 > URL: https://issues.apache.org/jira/browse/HUDI-6762 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vova Kolmakov >Priority: Minor > Fix For: 1.0.0 > > > MetadataRecordsGenerationParams is deprecated. We already rely on table > config for enabled mdt partition types. See if we can remove this POJO. -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]
hudi-bot commented on PR #10947: URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036200025 ## CI report: * 06a18d985e2b13159bcca2c1639c1376e871e3f8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23096) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[I] [Feature Inquiry] index for randomized upserts [hudi]
pravin1406 opened a new issue, #10961: URL: https://github.com/apache/hudi/issues/10961 Hi Team, We are migrating our applications to 0.14.1 with Spark as the ingestion engine. Just wanted to know how the newly available indexes, BUCKET and RECORD_INDEX, perform w.r.t. randomized upserts/deletes. Is SIMPLE indexing still the advised index to use? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
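For context on the question above, the two newer index types are driven by writer options. The snippet below is an illustrative sketch only (option key names are assumptions based on the 0.14.x configuration reference; verify against your Hudi version), showing how each index is typically enabled:

```python
# Illustrative sketch, not official Hudi documentation: typical writer
# options for the record-level index in Hudi 0.14.x (key names assumed).
record_index_opts = {
    "hoodie.index.type": "RECORD_INDEX",            # MDT-backed record index
    "hoodie.metadata.enable": "true",               # record index lives in the metadata table
    "hoodie.metadata.record.index.enable": "true",  # build the record_index MDT partition
}

# A BUCKET index, by contrast, hashes record keys into a fixed number of
# buckets and avoids per-record index lookups entirely:
bucket_index_opts = {
    "hoodie.index.type": "BUCKET",
    "hoodie.bucket.index.num.buckets": "256",  # example value, workload-dependent
}
```

These dicts would be passed as `.options(...)` on a Spark writer; for truly random upsert/delete keys, the record index trades metadata-table lookups for avoiding full-file scans, while BUCKET avoids lookups but fixes the bucket count up front.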
Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]
hudi-bot commented on PR #10947: URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036146639 ## CI report: * 85cbde75f0f652274dc28f940cd0a159096b6aad Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23065) * 06a18d985e2b13159bcca2c1639c1376e871e3f8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23096) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7564] Fix HiveSyncConfig inconsistency [hudi]
TengHuo commented on PR #10951: URL: https://github.com/apache/hudi/pull/10951#issuecomment-2036117075 > We did some tests today and found out why `hoodie.datasource.hive_sync.support_timestamp=true`. > > When an alter schema is performed and hive sync happens via Spark's external catalogue in `org.apache.spark.sql.hudi.command.AlterTableCommand#commitWithSchema`, Spark syncs TIMESTAMP types as TIMESTAMP. > > ```scala > sparkSession.sessionState.catalog > .externalCatalog > .alterTableDataSchema(db, tableName, dataSparkSchema) > ``` > > If this is defaulted to `false`, after altering the schema (via spark-sql) of a table containing a `TIMESTAMP` column, the type on HMS will change from `LONG` back to `TIMESTAMP` (via Spark's external catalogue API). > > This will cause subsequent hive-syncs to fail when they try to sync `TIMESTAMP` as `LONG`, which is not ideal. > > I think it's best that we ensure consistency with Spark; I will submit another PR to change the default back to `true`, and I will then add documentation there to explain why. > > As for the trino/presto error, they will just have to fix it on their end. > > # Conclusion > The reason for this discrepancy is Spark's external catalogue API, which syncs `TIMESTAMP` types as `TIMESTAMP` to Hive. > > Given that Hudi has multiple entrypoints, it makes sense that Spark introduced this inconsistency. > > While I am not sure why hive-sync-tool defaulted `support_timestamp` to `false`, I think it's best we just document this. In this case, cross-engine scenarios may be impacted when Hudi Flink users use the `TIMESTAMP` type: Hive sync in a Flink pipeline will sync it as `LONG` by default. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
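To make the discussion concrete, here is a minimal sketch of the sync option in question, expressed as plain key-value pairs (not a complete writer configuration; the surrounding keys are the standard hive-sync options):

```python
# Sketch: keeping support_timestamp at true so hive-sync writes TIMESTAMP
# columns to HMS as TIMESTAMP, matching what Spark's external catalog API
# does during ALTER TABLE.
hive_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    # With false, hive-sync writes TIMESTAMP as LONG; a later schema change
    # through Spark's catalog flips the HMS type back to TIMESTAMP, and
    # subsequent syncs then fail on the type mismatch described above.
    "hoodie.datasource.hive_sync.support_timestamp": "true",
}
```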
Re: [PR] [HUDI-7559] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]
hudi-bot commented on PR #10947: URL: https://github.com/apache/hudi/pull/10947#issuecomment-2036102345 ## CI report: * 85cbde75f0f652274dc28f940cd0a159096b6aad Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23065) * 06a18d985e2b13159bcca2c1639c1376e871e3f8 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
wombatu-kun commented on code in PR #10949: URL: https://github.com/apache/hudi/pull/10949#discussion_r1550791467 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { option("hoodie.schema.on.read.enable","true"). option("hoodie.datasource.write.reconcile.schema","true"). option(DataSourceWriteOptions.TABLE_NAME.key(), tableName). +option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), classOf[OverwriteWithLatestAvroPayload].getName). Review Comment: This test was written for `OverwriteWithLatestAvroPayload` and it checks OverwriteWithLatest behavior; I think it's better to let it be. Or maybe add the same test and make it pass using the new default payloads. But... all other tests use the default payload and pass. Why is that not enough? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
danny0405 commented on code in PR #10949: URL: https://github.com/apache/hudi/pull/10949#discussion_r1550775053 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { option("hoodie.schema.on.read.enable","true"). option("hoodie.datasource.write.reconcile.schema","true"). option(DataSourceWriteOptions.TABLE_NAME.key(), tableName). +option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), classOf[OverwriteWithLatestAvroPayload].getName). Review Comment: Kind of think we should modify the test to make it pass by using new default payloads. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-7573) Metadata Table Improvements
[ https://issues.apache.org/jira/browse/HUDI-7573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen reassigned HUDI-7573: Assignee: Danny Chen > Metadata Table Improvements > --- > > Key: HUDI-7573 > URL: https://issues.apache.org/jira/browse/HUDI-7573 > Project: Apache Hudi > Issue Type: Epic > Components: metadata >Reporter: Vinoth Chandar >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7510) Loosen the compaction scheduling and rollback check for MDT
[ https://issues.apache.org/jira/browse/HUDI-7510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7510: - Epic Link: HUDI-7573 (was: HUDI-6640) > Loosen the compaction scheduling and rollback check for MDT > --- > > Key: HUDI-7510 > URL: https://issues.apache.org/jira/browse/HUDI-7510 > Project: Apache Hudi > Issue Type: Improvement > Components: core, metadata, table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7552: - Epic Link: HUDI-7573 (was: HUDI-6640) > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Original Estimate: 2h > Remaining Estimate: 2h > > We want to remove the very specific design for MDT so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
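The three criteria quoted in the issue description can be sketched as a single decision function. This is a hypothetical illustration of the rules, not Hudi's actual API (all names here are made up):

```python
# Hypothetical sketch of the HUDI-7552 timestamp rules; not Hudi code.
MDT_TABLE_SERVICES = {"clean", "compaction", "log_compaction"}

def mdt_instant_time(dt_instant: str, mdt_action: str, generate_ts) -> str:
    """Pick the instant timestamp for a metadata-table (MDT) commit.

    Rules 1 and 3: a delta_commit on the MDT reuses the data table's (DT)
    instant timestamp, so one DT action maps to exactly one MDT commit.
    Rule 2: MDT table services get their own auto-generated timestamp.
    """
    if mdt_action in MDT_TABLE_SERVICES:
        return generate_ts()
    return dt_instant
```

Under these rules the old instant-suffix scheme becomes unnecessary for regular delta commits, which is the simplification the ticket is after.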
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7552: - Remaining Estimate: 2m Original Estimate: 2m > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Original Estimate: 2m > Remaining Estimate: 2m > > We want to remove the very specific design for MDT so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7552: - Remaining Estimate: 2h (was: 2m) Original Estimate: 2h (was: 2m) > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Original Estimate: 2h > Remaining Estimate: 2h > > We want to remove the very specific design for MDT so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7552: - Story Points: 4 > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We want to remove the very specific design for MDT so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7572: - Epic Link: HUDI-7573 (was: HUDI-6640) > Avoid to schedule empty compaction plan without log files > - > > Key: HUDI-7572 > URL: https://issues.apache.org/jira/browse/HUDI-7572 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > After the change to [loosen the compaction for > MDT|https://issues.apache.org/jira/browse/HUDI-7510], there is a rare case where the > same compaction instant time gets used for scheduling multiple times; we should > optimize the compactor to avoid generating empty compaction plans. > Note: although we have an active timeline check to avoid repetitive > scheduling, there is still a small chance the compaction has already been archived. -- This message was sent by Atlassian Jira (v8.20.10#820010)
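The optimization described in the ticket amounts to a guard before scheduling: only generate a compaction plan when some file slice actually has log files to merge. A sketch with made-up data structures (not the actual compactor code) follows:

```python
# Hypothetical sketch of the HUDI-7572 guard; file slices are modeled as
# plain dicts with a "log_files" list, which is not Hudi's real model.
def has_compaction_work(file_slices) -> bool:
    """Return True only if at least one file slice carries log files."""
    return any(fs.get("log_files") for fs in file_slices)

# If no slice has log files, the scheduler simply skips plan generation,
# so a reused instant time can never produce an empty compaction plan.
```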
[jira] [Created] (HUDI-7573) Metadata Table Improvements
Vinoth Chandar created HUDI-7573: Summary: Metadata Table Improvements Key: HUDI-7573 URL: https://issues.apache.org/jira/browse/HUDI-7573 Project: Apache Hudi Issue Type: Epic Components: metadata Reporter: Vinoth Chandar Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-1698) Multiwriting for Flink / Java
[ https://issues.apache.org/jira/browse/HUDI-1698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar resolved HUDI-1698. -- > Multiwriting for Flink / Java > - > > Key: HUDI-1698 > URL: https://issues.apache.org/jira/browse/HUDI-1698 > Project: Apache Hudi > Issue Type: New Feature > Components: flink, writer-core >Reporter: Nishith Agarwal >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7552: - Reviewers: Ethan Guo > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We want to remove the very specific design for MDT so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution
[ https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7045: -- Story Points: 12 (was: 6) > Fix new file format and reader for schema evolution > --- > > Key: HUDI-7045 > URL: https://issues.apache.org/jira/browse/HUDI-7045 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > When this is implemented, parquet readers should not be created in > HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can > uncomment/add the code from this commit: > [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7572: - Epic Link: HUDI-6640 > Avoid to schedule empty compaction plan without log files > - > > Key: HUDI-7572 > URL: https://issues.apache.org/jira/browse/HUDI-7572 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > After the change to [loosen the compaction for > MDT|https://issues.apache.org/jira/browse/HUDI-7510], there is a rare case where the > same compaction instant time gets used for scheduling multiple times; we should > optimize the compactor to avoid generating empty compaction plans. > Note: although we have an active timeline check to avoid repetitive > scheduling, there is still a small chance the compaction has already been archived. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833792#comment-17833792 ] Vinoth Chandar commented on HUDI-6787: -- If we change the class inheritance of the Hudi input formats, e.g. subclassing from MapredParquetInputFormat, Hive may generate unoptimized query or execution plans. Do we change this in this PR? > Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and > RealtimeCompactedRecordReader for Hive > -- > > Key: HUDI-6787 > URL: https://issues.apache.org/jira/browse/HUDI-6787 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7029) Enhance CREATE INDEX syntax for functional index
[ https://issues.apache.org/jira/browse/HUDI-7029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7029: -- Priority: Minor (was: Major) > Enhance CREATE INDEX syntax for functional index > > > Key: HUDI-7029 > URL: https://issues.apache.org/jira/browse/HUDI-7029 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Minor > Fix For: 1.0.0 > > > Currently, users can create an index using SQL as follows: > `create index idx_datestr on $tableName using column_stats(ts) > options(func='from_unixtime', format='yyyy-MM-dd')` > Ideally, we would like to simplify this further as follows: > `create index idx_datestr on $tableName using column_stats(from_unixtime(ts, > format='yyyy-MM-dd'))` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7548) Close any gaps on Indexing (bloom index, col stats, agg_stats, record index with support for non-unique keys,
[ https://issues.apache.org/jira/browse/HUDI-7548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7548: -- Story Points: 22 > Close any gaps on Indexing (bloom index, col stats, agg_stats, record index > with support for non-unique keys, > -- > > Key: HUDI-7548 > URL: https://issues.apache.org/jira/browse/HUDI-7548 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinoth Chandar >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > * reads through SQL > * writers through all supported means > * async index create/drop w/ multiple writers > * Index updates are handled correctly > * flexible compaction -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7025) Merge Index and Functional Index Config
[ https://issues.apache.org/jira/browse/HUDI-7025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7025: -- Priority: Minor (was: Major) > Merge Index and Functional Index Config > --- > > Key: HUDI-7025 > URL: https://issues.apache.org/jira/browse/HUDI-7025 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Minor > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > There is {{INDEX}} sub-group name in `ConfigGroups`. Functional index configs > can be consolidated within that. > > https://github.com/apache/hudi/pull/9872#discussion_r1377115549 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7144) Support query for tables written as partitionBy but synced as non-partitioned
[ https://issues.apache.org/jira/browse/HUDI-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7144: -- Story Points: 2 > Support query for tables written as partitionBy but synced as non-partitioned > - > > Key: HUDI-7144 > URL: https://issues.apache.org/jira/browse/HUDI-7144 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > In HUDI-7023, we added support to sync any table as a non-partitioned table and > yet be able to query via Spark with the same performance benefits as a > partitioned table. > This ticket extends the functionality end-to-end. If a user executes > `spark.write.format("hudi").options(options).partitionBy(partCol).save(basePath)`, > then do logical partitioning and sync as a non-partitioned table to the > catalog, yet be able to query efficiently. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-6787) Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-6787: - Summary: Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive (was: Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive) > Hive Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and > RealtimeCompactedRecordReader for Hive > -- > > Key: HUDI-6787 > URL: https://issues.apache.org/jira/browse/HUDI-6787 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution
[ https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Vexler updated HUDI-7045: -- Story Points: 6 > Fix new file format and reader for schema evolution > --- > > Key: HUDI-7045 > URL: https://issues.apache.org/jira/browse/HUDI-7045 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > When this is implemented, parquet readers should not be created in > HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can > uncomment/add the code from this commit: > [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7572: - Description: After the change to [loosen the compaction for MDT|https://issues.apache.org/jira/browse/HUDI-7510], there is a rare case where the same compaction instant time gets used for scheduling multiple times; we should optimize the compactor to avoid generating empty compaction plans. Note: although we have an active timeline check to avoid repetitive scheduling, there is still a small chance the compaction has already been archived. was: After the change to loosen the compaction for MDT, there is a rare case where the same compaction instant time gets used for scheduling multiple times; we should optimize the compactor to avoid generating empty compaction plans. Note: although we have an active timeline check to avoid repetitive scheduling, there is still a small chance the compaction has already been archived. > Avoid to schedule empty compaction plan without log files > - > > Key: HUDI-7572 > URL: https://issues.apache.org/jira/browse/HUDI-7572 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > After the change to [loosen the compaction for > MDT|https://issues.apache.org/jira/browse/HUDI-7510], there is a rare case where the > same compaction instant time gets used for scheduling multiple times; we should > optimize the compactor to avoid generating empty compaction plans. > Note: although we have an active timeline check to avoid repetitive > scheduling, there is still a small chance the compaction has already been archived. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7146) Implement secondary index
[ https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7146: -- Status: In Progress (was: Open) > Implement secondary index > - > > Key: HUDI-7146 > URL: https://issues.apache.org/jira/browse/HUDI-7146 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > # Secondary index schema should be flexible enough to accommodate various > kinds of secondary index. > # Reuse the existing indexing framework as much as possible. > # Merge with the existing index config and introduce as few configs as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7145) Support for grouping values for same key in HFile
[ https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7145. - Resolution: Done > Support for grouping values for same key in HFile > - > > Key: HUDI-7145 > URL: https://issues.apache.org/jira/browse/HUDI-7145 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Hudi writes metadata table (MT) base files in HFile format. HFile stores > sorted key-value pairs. For the existing MT partitions, the key is guaranteed > to be unique. However, for secondary index, it is very likely that the same > value of the secondary index field is in multiple files. > This ticket is to microbenchmark two approaches of storing secondary index: > # Group all values for a key and then store key-value pairs where each value > in this pair is a collection. For example, say column c1 is the secondary > index column with values v1 in files f1, f2 and value v2 in file f2. Then this > approach means there are still just 2 keys as follows: i) v1: [f1, f2] and ii) > v2: [f2]. > # Since each key-value pair is unique as a whole, store each key-value > pair separately (still lexicographically sorted). So, in this approach, we > have 3 entries in the hfile: i) v1: f1, ii) v1: f2 and iii) v2: f2. > The benchmark should capture the storage overhead and lookup latency of one > approach over the other. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
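The two candidate layouts described in HUDI-7145 can be mocked up with ordinary sorted structures standing in for HFile's sorted key-value pairs. This is purely illustrative, using the c1/v1/v2/f1/f2 example from the ticket:

```python
# Approach 1 (grouped): one entry per secondary-index key; the value is the
# list of file ids that contain that key.
def grouped_layout(pairs):
    out = {}
    for key, file_id in pairs:
        out.setdefault(key, []).append(file_id)
    return sorted(out.items())

# Approach 2 (flat): one entry per (secondary-key, file) pair, sorted on the
# composite key so all entries for a given key remain adjacent on disk.
def flat_layout(pairs):
    return sorted(pairs)

# v1 appears in files f1 and f2; v2 appears only in f2.
pairs = [("v1", "f1"), ("v2", "f2"), ("v1", "f2")]
# grouped_layout(pairs) -> [("v1", ["f1", "f2"]), ("v2", ["f2"])]    (2 keys)
# flat_layout(pairs)    -> [("v1", "f1"), ("v1", "f2"), ("v2", "f2")] (3 entries)
```

The benchmark trade-off follows directly: grouping shrinks the key count (less key storage, one seek per lookup) but makes values variable-sized, while the flat layout keeps fixed-shape entries at the cost of more keys and a range scan per lookup.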
[jira] [Updated] (HUDI-7570) Update RFC with details on API changes
[ https://issues.apache.org/jira/browse/HUDI-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7570: -- Status: In Progress (was: Open) > Update RFC with details on API changes > -- > > Key: HUDI-7570 > URL: https://issues.apache.org/jira/browse/HUDI-7570 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Given that a secondary index can have duplicate keys, the existing > `HoodieMergedLogRecordScanner` is insufficient to handle duplicates because > it depends on `ExternalSpillableMap`, which can only hold unique keys. The RFC > should clarify how the merged log record scanner will change. We should not > be leaking any details to the merge handle. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7045) Fix new file format and reader for schema evolution
[ https://issues.apache.org/jira/browse/HUDI-7045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7045: - Reviewers: Ethan Guo > Fix new file format and reader for schema evolution > --- > > Key: HUDI-7045 > URL: https://issues.apache.org/jira/browse/HUDI-7045 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Jonathan Vexler >Assignee: Jonathan Vexler >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > When this is implemented, parquet readers should not be created in > HoodieFileGroupReaderBasedParquetFileFormat. Additionally, we can > uncomment/add the code from this commit: > [https://github.com/apache/hudi/pull/10137/commits/b0b711e0c355320da652fa7f2d8669539873d4d6] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7146) Implement secondary index
[ https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7146: -- Story Points: 8 > Implement secondary index > - > > Key: HUDI-7146 > URL: https://issues.apache.org/jira/browse/HUDI-7146 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > # Secondary index schema should be flexible enough to accommodate various > kinds of secondary indexes. > # Reuse the existing framework for indexing as much as possible. > # Merge with the existing index config and introduce as few configs as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7145) Support for grouping values for same key in HFile
[ https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7145: -- Story Points: 6 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7572: - Description: After the change to loosen compaction for the MDT, there is a rare case where the same compaction instant time gets used for scheduling multiple times; we had better optimize the compactor to avoid generating empty compaction plans. Note: although we have an active timeline check to avoid repetitive scheduling, there is still a small chance that the compaction has already been archived. > Avoid to schedule empty compaction plan without log files > - > > Key: HUDI-7572 > URL: https://issues.apache.org/jira/browse/HUDI-7572 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > > After the change to loosen compaction for the MDT, there is a rare case where the same > compaction instant time gets used for scheduling multiple times; we had better > optimize the compactor to avoid generating empty compaction plans. > Note: although we have an active timeline check to avoid repetitive > scheduling, there is still a small chance that the compaction has already been archived. -- This message was sent by Atlassian Jira (v8.20.10#820010)
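The guard described in HUDI-7572 can be sketched as a pre-check in the compaction scheduler: file slices without log files are dropped before a plan is generated, and if nothing remains, no plan (and hence no empty-plan instant) is written. This is a simplified, hypothetical sketch, not the actual Hudi compactor:

```java
import java.util.*;

// Sketch for HUDI-7572 (hypothetical names, not Hudi's scheduler): skip writing
// a compaction plan when no file slice carries log files to compact.
public class CompactionPlanner {

    public record FileSlice(String fileId, int logFileCount) {}

    public static Optional<List<FileSlice>> scheduleCompaction(List<FileSlice> slices) {
        List<FileSlice> withLogs = new ArrayList<>();
        for (FileSlice s : slices) {
            if (s.logFileCount() > 0) {  // only slices with log files are compactable
                withLogs.add(s);
            }
        }
        // Empty result -> no requested plan is created, avoiding an empty-plan instant
        // even if the same instant time was handed to the scheduler twice.
        return withLogs.isEmpty() ? Optional.empty() : Optional.of(withLogs);
    }

    public static void main(String[] args) {
        List<FileSlice> noLogs = List.of(new FileSlice("f1", 0), new FileSlice("f2", 0));
        System.out.println(scheduleCompaction(noLogs).isPresent()); // false: nothing scheduled
    }
}
```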
[jira] [Updated] (HUDI-7145) Support for grouping values for same key in HFile
[ https://issues.apache.org/jira/browse/HUDI-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7145: -- Status: Patch Available (was: In Progress) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7146) Implement secondary index
[ https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7146: -- Status: Open (was: In Progress) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7552) Remove the suffix for MDT table service instants
[ https://issues.apache.org/jira/browse/HUDI-7552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7552: - Status: Patch Available (was: In Progress) > Remove the suffix for MDT table service instants > > > Key: HUDI-7552 > URL: https://issues.apache.org/jira/browse/HUDI-7552 > Project: Apache Hudi > Issue Type: Improvement > Components: core >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > We want to remove this very MDT-specific design so that its behavior is in > sync with the DT. > > The criteria for simplification: > {code:java} > 1. use the instant timestamp from DT to commit to the MDT as much as possible > for any delta_commit on MDT. > 2. for table services like cleaning, compaction and log_compaction, the > timestamp is auto-generated. > 3. avoid triggering multiple commits to MDT for one DT action. {code} > The async index instant suffix is kept because there is some validation > logic that needs special filtering on these instants; the suffix is kind of a > "tag" for filtering. We should refactor that out in the future if we have a > better solution. -- This message was sent by Atlassian Jira (v8.20.10#820010)
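The three criteria above can be sketched as a small timestamp-selection helper. The class and method names below are hypothetical; the only Hudi-specific assumption is the `yyyyMMddHHmmssSSS` commit-time pattern used for instant timestamps:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch of the HUDI-7552 criteria (hypothetical names): MDT delta commits reuse
// the data-table (DT) instant time; MDT table services auto-generate their own.
public class MdtInstantTimes {

    // Criterion 1: a delta_commit on the MDT reuses the DT instant timestamp as-is,
    // so one DT action maps to one MDT commit (criterion 3).
    public static String forDeltaCommit(String dataTableInstant) {
        return dataTableInstant;
    }

    // Criterion 2: clean/compaction/log_compaction on the MDT generate a fresh
    // timestamp in Hudi's commit-time pattern (assumed: yyyyMMddHHmmssSSS).
    public static String forTableService() {
        return LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS"));
    }

    public static void main(String[] args) {
        System.out.println(forDeltaCommit("20240404081500000")); // unchanged DT timestamp
        System.out.println(forTableService().length());          // 17-character timestamp
    }
}
```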
[jira] [Created] (HUDI-7572) Avoid to schedule empty compaction plan without log files
Danny Chen created HUDI-7572: Summary: Avoid to schedule empty compaction plan without log files Key: HUDI-7572 URL: https://issues.apache.org/jira/browse/HUDI-7572 Project: Apache Hudi Issue Type: Improvement Components: table-service Reporter: Danny Chen Assignee: Danny Chen Fix For: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7572) Avoid to schedule empty compaction plan without log files
[ https://issues.apache.org/jira/browse/HUDI-7572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-7572: - Sprint: Sprint 2024-03-25 > Avoid to schedule empty compaction plan without log files > - > > Key: HUDI-7572 > URL: https://issues.apache.org/jira/browse/HUDI-7572 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Danny Chen >Assignee: Danny Chen >Priority: Major > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
wombatu-kun commented on code in PR #10949: URL: https://github.com/apache/hudi/pull/10949#discussion_r1550714037 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { option("hoodie.schema.on.read.enable","true"). option("hoodie.datasource.write.reconcile.schema","true"). option(DataSourceWriteOptions.TABLE_NAME.key(), tableName). +option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), classOf[OverwriteWithLatestAvroPayload].getName). Review Comment: Yes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
danny0405 commented on code in PR #10949: URL: https://github.com/apache/hudi/pull/10949#discussion_r1550687858 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/ddl/TestSpark3DDL.scala: ## @@ -742,6 +744,8 @@ class TestSpark3DDL extends HoodieSparkSqlTestBase { option("hoodie.schema.on.read.enable","true"). option("hoodie.datasource.write.reconcile.schema","true"). option(DataSourceWriteOptions.TABLE_NAME.key(), tableName). +option(HoodieWriteConfig.WRITE_PAYLOAD_CLASS_NAME.key(), classOf[OverwriteWithLatestAvroPayload].getName). Review Comment: So these changes are made only to make the test pass? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
danny0405 commented on code in PR #8338: URL: https://github.com/apache/hudi/pull/8338#discussion_r1550685909 ## hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java: ## @@ -1134,6 +1136,11 @@ public PropertyBuilder set(Map props) { return this; } +public PropertyBuilder setHoodieIndexConf(Properties hoodieIndexConf) { + this.hoodieIndexConf = hoodieIndexConf; + return this; Review Comment: Currently the index config is not a table config. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
danny0405 commented on code in PR #8338: URL: https://github.com/apache/hudi/pull/8338#discussion_r1550685079 ## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/ITTestDataStreamWrite.java: ## @@ -171,6 +178,38 @@ public void testWriteMergeOnReadWithCompaction(String indexType) throws Exceptio testWriteToHoodie(conf, "mor_write_with_compact", 1, EXPECTED); } + @Test + public void testVerifyConsistencyOfBucketNum() throws Exception { +String path = tempFile.getAbsolutePath(); +Configuration conf = TestConfigurations.getDefaultConf(path); +conf.setString(FlinkOptions.INDEX_TYPE, "BUCKET"); +conf.setInteger(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, 4); Review Comment: Maybe we should move this test into `TestStreamWriteOperatorCoordinator`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError: org.apache.hudi.configuration.FlinkOptions [hudi]
danny0405 commented on issue #8366: URL: https://github.com/apache/hudi/issues/8366#issuecomment-2035876822 It looks like the hudi flink bundle jar is not correctly loaded in the classpath. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7564] Revert hive sync inconsistency and reason for it (#10959)
This is an automated email from the ASF dual-hosted git repository. danny0405 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 398c9a23c84 [HUDI-7564] Revert hive sync inconsistency and reason for it (#10959) 398c9a23c84 is described below commit 398c9a23c84a54aecfea8e6c7948f198785710c5 Author: voonhous AuthorDate: Thu Apr 4 08:41:39 2024 +0800 [HUDI-7564] Revert hive sync inconsistency and reason for it (#10959) --- .../main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala | 4 +++- .../src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java | 3 ++- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala index 734afd79252..dbac496022f 100644 --- a/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala +++ b/hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/ProvidesHoodieConfig.scala @@ -480,7 +480,9 @@ trait ProvidesHoodieConfig extends Logging { hiveSyncConfig.setValue(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS, props.getString(HoodieSyncConfig.META_SYNC_PARTITION_FIELDS.key)) } hiveSyncConfig.setDefaultValue(HoodieSyncConfig.META_SYNC_PARTITION_EXTRACTOR_CLASS, classOf[MultiPartKeysValueExtractor].getName) - hiveSyncConfig.setDefaultValue(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE, HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE.defaultValue()) +// This is hardcoded to true to ensure consistency as Spark syncs TIMESTAMP types as TIMESTAMP by default +// via Spark's externalCatalog API, which is used by AlterHoodieTableCommand. 
+ hiveSyncConfig.setDefaultValue(HiveSyncConfigHolder.HIVE_SUPPORT_TIMESTAMP_TYPE, "true") if (hiveSyncConfig.useBucketSync()) hiveSyncConfig.setValue(HiveSyncConfigHolder.HIVE_SYNC_BUCKET_SYNC_SPEC, HiveSyncConfig.getBucketSpec(props.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD.key), diff --git a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java index 74cb90de020..8f31cae29bc 100644 --- a/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java +++ b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncConfigHolder.java @@ -90,7 +90,8 @@ public class HiveSyncConfigHolder { .defaultValue("false") .markAdvanced() .withDocumentation("‘INT64’ with original type TIMESTAMP_MICROS is converted to hive ‘timestamp’ type. " - + "Disabled by default for backward compatibility."); + + "Disabled by default for backward compatibility. \n" + + "NOTE: On Spark entrypoints, this is defaulted to TRUE"); public static final ConfigProperty HIVE_TABLE_PROPERTIES = ConfigProperty .key("hoodie.datasource.hive_sync.table_properties") .noDefaultValue()
Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]
danny0405 merged PR #10959: URL: https://github.com/apache/hudi/pull/10959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7507: - Fix Version/s: (was: 1.0.0) > ongoing concurrent writers with smaller timestamp can cause issues with > table services > --- > > Key: HUDI-7507 > URL: https://issues.apache.org/jira/browse/HUDI-7507 > Project: Apache Hudi > Issue Type: Improvement > Components: table-service >Reporter: Krishen Bhan >Priority: Major > Fix For: 0.15.0 > > Attachments: Flowchart (1).png, Flowchart.png > > > Although HUDI operations hold a table lock when creating a .requested > instant, because HUDI writers do not generate a timestamp and create a > .requested plan in the same transaction, there can be a scenario where > # Job 1 starts and chooses timestamp (x); Job 2 starts and chooses timestamp > (x-1) > # Job 1 schedules and creates a requested file with instant timestamp (x) > # Job 2 schedules and creates a requested file with instant timestamp (x-1) > # Both jobs continue running > If one job is writing a commit and the other is a table service, this can > cause issues: > * > ** If Job 2 is an ingestion commit and Job 1 is a compaction/log compaction, then > if Job 1 runs before Job 2, it can create a compaction plan for all instant > times (up to (x)) that doesn’t include instant time (x-1). Later, Job 2 > will create instant time (x-1), but the timeline will be in a corrupted state > since the compaction plan was supposed to include (x-1) > ** There is a similar issue with clean. If Job 2 is a long-running commit > (that was stuck/delayed for a while before creating its .requested plan) and > Job 1 is a clean, then Job 1 can perform a clean that updates the > earliest-commit-to-retain without waiting for the inflight instant by Job 2 > at (x-1) to complete. This causes Job 2 to be "skipped" by clean. > [Edit] I added a diagram to visualize the issue, specifically the second > scenario with clean > !Flowchart (1).png!
> > One way this can be resolved is by combining the operations of generating > the instant time and creating a requested file in the same HUDI table > transaction. Specifically, executing the following steps whenever any instant > (commit, table service, etc.) is scheduled > # Acquire the table lock > # Look at the latest instant C on the active timeline (completed or not). > Generate a timestamp after C > # Create the plan and requested file using this new timestamp (that is > greater than C) > # Release the table lock > Unfortunately this has the following drawbacks > * Every operation must now hold the table lock when computing its plan, even > if it's an expensive operation that will take a while > * Users of HUDI cannot easily set their own instant time of an operation, > and this restriction would break any public APIs that allow this > -An alternate approach (suggested by- [~pwason] -) was to instead have all > operations including table services perform conflict resolution checks before > committing. For example, clean and compaction would generate their plan as > usual. But when creating a transaction to write a .requested file, right > before creating the file they should check if another lower timestamp instant > has appeared in the timeline. And if so, they should fail/abort without > creating the plan. Commit operations would also be updated/verified to have > a similar check: before creating a .requested file (during a transaction) the > commit operation will check if a table service plan (clean/compact) with a > greater instant time has been created. And if so, they would abort/fail. This > avoids the drawbacks of the first approach, but will lead to more transient > failures that users have to handle.- > > An alternate approach is to have every operation abort creating a .requested > file unless it has the latest timestamp.
Specifically, for any instant type, > whenever an operation is about to create a .requested plan on timeline, it > should take the table lock and assert that there are no other instants on > timeline (inflight or otherwise) that are greater than it. If that assertion > fails, then throw a retry-able conflict resolution exception. -- This message was sent by Atlassian Jira (v8.20.10#820010)
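The first proposal above — generate the timestamp and create the requested plan inside one table-level transaction — can be sketched as follows. All names are hypothetical (an in-memory `TreeSet` stands in for the timeline; this is not Hudi's scheduler):

```java
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of the HUDI-7507 proposal (hypothetical names, not Hudi code): hold the
// table lock across both timestamp generation and requested-plan creation, so no
// concurrent writer can slip a smaller timestamp onto the timeline afterwards.
public class LockedInstantScheduler {
    private final ReentrantLock tableLock = new ReentrantLock();
    private final TreeSet<Long> timeline = new TreeSet<>(); // all instants, completed or not

    public long scheduleWithLock(long proposedTs) {
        tableLock.lock(); // step 1: acquire the table lock
        try {
            // Step 2: look at the latest instant C and generate a timestamp strictly after it.
            long latest = timeline.isEmpty() ? 0L : timeline.last();
            long ts = Math.max(proposedTs, latest + 1);
            // Step 3: create the plan/requested file under the same lock.
            timeline.add(ts);
            return ts;
        } finally {
            tableLock.unlock(); // step 4: release the table lock
        }
    }

    public static void main(String[] args) {
        LockedInstantScheduler s = new LockedInstantScheduler();
        s.scheduleWithLock(10);              // first writer lands at 10
        System.out.println(s.scheduleWithLock(5)); // 11: the stale proposal 5 is bumped past 10
    }
}
```

The cited drawback is visible in the sketch: the lock is held for the whole of step 3, so an expensive planning step blocks every other writer.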
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7507: -- Fix Version/s: 1.0.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7507) ongoing concurrent writers with smaller timestamp can cause issues with table services
[ https://issues.apache.org/jira/browse/HUDI-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-7507: -- Fix Version/s: 0.15.0 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7503: - Description: Some external workflow schedulers can accidentally (or) misbehave and schedule duplicate executions of the same compaction plan. We need a way to guard against this inside Hudi (vs the user taking a lock externally). In such a world, 2 instances of the job concurrently call `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same compaction instant. This is since one writer might execute the instant and create an inflight, while the other writer sees the inflight and tries to roll it back before re-attempting to execute it (since it will assume said inflight was a previously failed compaction attempt). This logic should be updated such that only one writer will actually execute the compaction plan at a time (and the others will fail/abort). One approach is to use a transaction (base table lock) in conjunction with heartbeating, to ensure that the writer triggers a heartbeat before executing compaction, and any concurrent writers will use the heartbeat to check whether the compaction is currently being executed by another writer. Specifically, the compact API should execute the following steps # Get the instant to compact C (as usual) # Start a transaction # Check if C has an active heartbeat; if so, finish the transaction and throw an exception # Start a heartbeat for C (this will implicitly re-start the heartbeat if it has been started before by another job) # Finish the transaction # Run the existing compact API logic on C # If execution succeeds, clean up the heartbeat file. If it fails, do nothing (as the heartbeat will anyway be automatically expired later).
Note that this approach only holds the table lock temporarily, when checking/starting the heartbeat. Also, this flow can be applied to execution of clean plans and other table services. was: Currently it is not safe for 2+ writers to concurrently call `org.apache.hudi.client.BaseHoodieTableServiceClient#compact` on the same compaction instant. This is because one writer might execute the instant and create an inflight, while the other writer sees the inflight and tries to roll it back before re-attempting to execute it (since it will assume said inflight was a previously failed compaction attempt). This logic should be updated such that only one writer will actually execute the compaction plan at a time (and the others will fail/abort). One approach is to use a transaction (base table lock) in conjunction with heartbeating, to ensure that the writer triggers a heartbeat before executing compaction, and any concurrent writers will use the heartbeat to check whether the compaction is currently being executed by another writer. Specifically, the compact API should execute the following steps: # Get the instant to compact C (as usual) # Start a transaction # Check if C has an active heartbeat; if so, finish the transaction and throw an exception # Start a heartbeat for C (this will implicitly re-start the heartbeat if it has been started before by another job) # Finish the transaction # Run the existing compact API logic on C # If execution succeeds, clean up the heartbeat file. If it fails, do nothing (as the heartbeat will anyway be automatically expired later). 
Note that this approach only holds the table lock temporarily, when checking/starting the heartbeat. Also, this flow can be applied to execution of clean plans and other table services. > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service >Reporter: Krishen Bhan >Priority: Minor
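The numbered steps in the description above can be sketched roughly as below. All names here (`HeartbeatGuardedCompactor`, `tryCompact`) are hypothetical, an in-memory map stands in for heartbeat files, and a `ReentrantLock` stands in for the base table lock; the real flow in `BaseHoodieTableServiceClient` would differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch (not the actual Hudi API) of the proposed flow: hold a
// short transaction only to check/start the heartbeat, then execute the
// compaction plan outside the lock.
class HeartbeatGuardedCompactor {
    private final ReentrantLock tableLock = new ReentrantLock();          // stands in for the base table lock
    private final Map<String, Long> activeHeartbeats = new ConcurrentHashMap<>(); // stands in for heartbeat files
    private final long heartbeatExpiryMs;

    HeartbeatGuardedCompactor(long heartbeatExpiryMs) {
        this.heartbeatExpiryMs = heartbeatExpiryMs;
    }

    /** Returns true if this writer executed the plan, false if another writer holds it. */
    boolean tryCompact(String instantTime, Runnable executePlan) {
        tableLock.lock();                                  // step 2: start transaction
        try {
            Long last = activeHeartbeats.get(instantTime);
            if (last != null && System.currentTimeMillis() - last < heartbeatExpiryMs) {
                return false;                              // step 3: active heartbeat -> abort
            }
            activeHeartbeats.put(instantTime, System.currentTimeMillis()); // step 4: start heartbeat
        } finally {
            tableLock.unlock();                            // step 5: finish transaction
        }
        try {
            executePlan.run();                             // step 6: run the existing compact logic
            activeHeartbeats.remove(instantTime);          // step 7: clean up heartbeat on success
            return true;
        } catch (RuntimeException e) {
            // On failure do nothing: the heartbeat expires on its own later.
            return false;
        }
    }
}
```

Note the lock is released before step 6, so the (potentially long) compaction itself never holds the table lock.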
[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7503: - Fix Version/s: 0.15.0 1.0.0 > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service >Reporter: Krishen Bhan >Priority: Minor > Fix For: 0.15.0, 1.0.0
[jira] [Updated] (HUDI-7503) Concurrent executions of table service plan should not corrupt dataset
[ https://issues.apache.org/jira/browse/HUDI-7503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinoth Chandar updated HUDI-7503: - Summary: Concurrent executions of table service plan should not corrupt dataset (was: concurrent executions of table service plan should not corrupt dataset) > Concurrent executions of table service plan should not corrupt dataset > -- > > Key: HUDI-7503 > URL: https://issues.apache.org/jira/browse/HUDI-7503 > Project: Apache Hudi > Issue Type: Improvement > Components: compaction, table-service >Reporter: Krishen Bhan >Priority: Minor
Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]
hudi-bot commented on PR #10954: URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035796882 ## CI report: * 865526e2bb6d40e51fe7b72bb5313701efb6df19 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23095) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Flink streaming write to Hudi table using data stream API java.lang.NoClassDefFoundError: org.apache.hudi.configuration.FlinkOptions [hudi]
ankit0811 commented on issue #8366: URL: https://github.com/apache/hudi/issues/8366#issuecomment-2035796844 piggybacking on this issue @danny0405 I still see this exception thrown when running the listed example for flink version `1.15.2` and hudi version `0.14.0`
Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]
hudi-bot commented on PR #10954: URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035703493 ## CI report: * a20e9d4c236a04becc36724f22972c8eb925c15d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23080) * 865526e2bb6d40e51fe7b72bb5313701efb6df19 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23095)
Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]
hudi-bot commented on PR #10954: URL: https://github.com/apache/hudi/pull/10954#issuecomment-2035693521 ## CI report: * a20e9d4c236a04becc36724f22972c8eb925c15d Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23080) * 865526e2bb6d40e51fe7b72bb5313701efb6df19 UNKNOWN
[jira] [Commented] (HUDI-6787) Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and RealtimeCompactedRecordReader for Hive
[ https://issues.apache.org/jira/browse/HUDI-6787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833719#comment-17833719 ] Jonathan Vexler commented on HUDI-6787: --- Use [https://github.com/apache/hudi/pull/5786] as a guide for testing hive3 with the docker demo > Integrate FileGroupReader with HoodieMergeOnReadSnapshotReader and > RealtimeCompactedRecordReader for Hive > - > > Key: HUDI-6787 > URL: https://issues.apache.org/jira/browse/HUDI-6787 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Ethan Guo >Assignee: Jonathan Vexler >Priority: Blocker > Labels: pull-request-available > Fix For: 1.0.0
Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]
hudi-bot commented on PR #10960: URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035289254 ## CI report: * 5fb7cb53038a810807489ff17b52b8568b6925d5 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23094)
Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]
hudi-bot commented on PR #10960: URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035161250 ## CI report: * 5fb7cb53038a810807489ff17b52b8568b6925d5 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23094)
Re: [PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]
hudi-bot commented on PR #10960: URL: https://github.com/apache/hudi/pull/10960#issuecomment-2035142907 ## CI report: * 5fb7cb53038a810807489ff17b52b8568b6925d5 UNKNOWN
[jira] [Updated] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode
[ https://issues.apache.org/jira/browse/HUDI-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-7571: - Labels: pull-request-available (was: ) > Add api to get exception details in HoodieMetadataTableValidator with > ignoreFailed mode > --- > > Key: HUDI-7571 > URL: https://issues.apache.org/jira/browse/HUDI-7571 > Project: Apache Hudi > Issue Type: Bug >Reporter: Lokesh Jain >Assignee: Lokesh Jain >Priority: Major > Labels: pull-request-available > > When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure > and continues the validation. This jira aims to add api to get list of > exceptions and an api to check if validation exception was thrown.
[PR] [HUDI-7571] Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode [hudi]
lokeshj1703 opened a new pull request, #10960: URL: https://github.com/apache/hudi/pull/10960 ### Change Logs When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure and continues the validation. This jira aims to add api to get list of exceptions and an api to check if validation exception was thrown. ### Impact NA ### Risk level (write none, low medium or high below) low ### Documentation Update NA ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-7571) Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode
Lokesh Jain created HUDI-7571: - Summary: Add api to get exception details in HoodieMetadataTableValidator with ignoreFailed mode Key: HUDI-7571 URL: https://issues.apache.org/jira/browse/HUDI-7571 Project: Apache Hudi Issue Type: Bug Reporter: Lokesh Jain Assignee: Lokesh Jain When ignoreFailed is enabled, HoodieMetadataTableValidator ignores failure and continues the validation. This jira aims to add api to get list of exceptions and an api to check if validation exception was thrown.
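A rough sketch of the API shape HUDI-7571 asks for is below; the class and method names (`IgnoreFailedValidator`, `getThrowables`, `hasValidationFailure`) are illustrative assumptions, not the actual HoodieMetadataTableValidator code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: in ignoreFailed mode the validator records each
// failure instead of throwing, continues with the remaining validations,
// and exposes the collected exceptions afterwards.
class IgnoreFailedValidator {
    private final List<Throwable> throwables = new ArrayList<>();

    void runValidation(Runnable validation) {
        try {
            validation.run();
        } catch (RuntimeException t) {
            throwables.add(t);   // ignore the failure and continue
        }
    }

    /** API to get the list of exceptions seen so far. */
    List<Throwable> getThrowables() {
        return Collections.unmodifiableList(throwables);
    }

    /** API to check whether any validation exception was thrown. */
    boolean hasValidationFailure() {
        return !throwables.isEmpty();
    }
}
```

Callers can then fail the job (or just log) based on `hasValidationFailure()` after all validations finish, instead of aborting on the first error.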
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
hudi-bot commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2035012033 ## CI report: * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
Re: [PR] [HUDI-5432] Fixing rollback block handling in LogRecordReader [hudi]
nsivabalan commented on PR #7649: URL: https://github.com/apache/hudi/pull/7649#issuecomment-2034923879 this may not be valid anymore. we made rollbacks eager and ensure we rollback any failed writes in MDT before starting a new commit. https://github.com/apache/hudi/blob/bf723f56cd0d379f951a5a2d535502f326d1bc78/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java#L1177
Re: [PR] [HUDI-5432] Fixing rollback block handling in LogRecordReader [hudi]
nsivabalan closed pull request #7649: [HUDI-5432] Fixing rollback block handling in LogRecordReader URL: https://github.com/apache/hudi/pull/7649
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
hudi-bot commented on PR #8338: URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034903046 ## CI report: * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN * fbbfddc71d0aefd947dcb21bb412c12571f357d2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23092)
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
hudi-bot commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034893011 ## CI report: * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077) * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
Re: [PR] [HUDI-7559] [1/n] Fix RecordLevelIndexSupport::filterQueryWithRecordKey [hudi]
codope commented on code in PR #10947: URL: https://github.com/apache/hudi/pull/10947#discussion_r1549935529 ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala: ## @@ -180,9 +157,40 @@ class RecordLevelIndexSupport(spark: SparkSession, } /** - * Return true if metadata table is enabled and record index metadata partition is available. + * Returns the attribute and literal pair given the operands of a binary operator. The pair is returned only if one of + * the operand is an attribute and other is literal. In other cases it returns an empty Option. + * @param expression1 - Left operand of the binary operator + * @param expression2 - Right operand of the binary operator + * @return Attribute and literal pair */ - def isIndexAvailable: Boolean = { -metadataConfig.enabled && metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX) + private def getAttributeLiteralTuple(expression1: Expression, expression2: Expression): Option[(AttributeReference, Literal)] = { +expression1 match { + case attr: AttributeReference => expression2 match { +case literal: Literal => + Option.apply(attr, literal) +case _ => + Option.empty + } + case literal: Literal => expression2 match { +case attr: AttributeReference => + Option.apply(attr, literal) +case _ => + Option.empty + } + case _ => Option.empty +} + } + + /** + * Matches the configured simple record key with the input attribute name. + * @param attributeName The attribute name provided in the query + * @return true if input attribute name matches the configured simple record key + */ + private def attributeMatchesRecordKey(attributeName: String, recordKeyOpt: Option[String]): Boolean = { Review Comment: What was wrong with prev implementation which did call `getRecordKeyConfig` inside this method? Or is it just refactoring? 
## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala: ## @@ -180,9 +157,40 @@ class RecordLevelIndexSupport(spark: SparkSession, } /** - * Return true if metadata table is enabled and record index metadata partition is available. + * Returns the attribute and literal pair given the operands of a binary operator. The pair is returned only if one of + * the operand is an attribute and other is literal. In other cases it returns an empty Option. + * @param expression1 - Left operand of the binary operator + * @param expression2 - Right operand of the binary operator + * @return Attribute and literal pair */ - def isIndexAvailable: Boolean = { -metadataConfig.enabled && metaClient.getTableConfig.getMetadataPartitions.contains(HoodieTableMetadataUtil.PARTITION_NAME_RECORD_INDEX) + private def getAttributeLiteralTuple(expression1: Expression, expression2: Expression): Option[(AttributeReference, Literal)] = { Review Comment: can we unit test this method? ## hudi-spark-datasource/hudi-spark-common/src/test/scala/org/apache/hudi/TestRecordLevelIndexSupport.scala: ## @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +package org.apache.hudi + +import org.apache.hudi.common.model.HoodieRecord.HoodieMetadataField +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, EqualTo, FromUnixTime, Literal} +import org.apache.spark.sql.types.StringType +import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue} +import org.junit.jupiter.api.Test + +import java.util.TimeZone + +class TestRecordLevelIndexSupport { + @Test + def testFilterQueryWithRecordKey(): Unit = { Review Comment: Good that we have the test for equalTo. Can we also some for In and Not In? ## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/RecordLevelIndexSupport.scala: ## @@ -143,22 +104,38 @@ class RecordLevelIndexSupport(spark: SparkSession, } } + /** + * Return true if metadata table is enabled and record index metadata partition is
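The Scala `getAttributeLiteralTuple` helper under review pattern-matches both operand orders; a hedged Java analogue of the same idea, using stand-in `Expr` types rather than Spark's Catalyst classes, could look like this:

```java
import java.util.Optional;

// Illustrative Java analogue of the helper discussed in the review above:
// given the two operands of a binary comparison, return the (attribute,
// literal) pair only when one side is an attribute and the other a literal,
// in either order. Attr/Lit/Pair are stand-ins, not Spark Catalyst types.
class AttrLiteralPair {
    interface Expr {}
    static final class Attr implements Expr {
        final String name;
        Attr(String name) { this.name = name; }
    }
    static final class Lit implements Expr {
        final Object value;
        Lit(Object value) { this.value = value; }
    }
    static final class Pair {
        final Attr attr;
        final Lit lit;
        Pair(Attr attr, Lit lit) { this.attr = attr; this.lit = lit; }
    }

    // Mirrors getAttributeLiteralTuple: empty unless exactly one operand is
    // an attribute and the other a literal.
    static Optional<Pair> getAttributeLiteralTuple(Expr e1, Expr e2) {
        if (e1 instanceof Attr && e2 instanceof Lit) {
            return Optional.of(new Pair((Attr) e1, (Lit) e2));
        }
        if (e1 instanceof Lit && e2 instanceof Attr) {
            return Optional.of(new Pair((Attr) e2, (Lit) e1));
        }
        return Optional.empty();
    }
}
```

This order-insensitive extraction is what lets predicates like `key = 'k1'` and `'k1' = key` both be pushed to the record-level index.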
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
hudi-bot commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034872367 ## CI report: * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077) * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2034872494 ## CI report: * 66f7add237e807bc7ad7a870ee39f3c60762b728 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23086)
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
wombatu-kun commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034830732 @hudi-bot run azure
Re: [PR] [HUDI-7526] Fix constructors for bulkinsert sort partitioners to ensure we could use it as user defined partitioners [hudi]
wombatu-kun commented on code in PR #10942: URL: https://github.com/apache/hudi/pull/10942#discussion_r1549898777 ## hudi-client/hudi-java-client/src/main/java/org/apache/hudi/execution/bulkinsert/JavaGlobalSortPartitioner.java: ## @@ -31,12 +32,21 @@ * * @param HoodieRecordPayload type */ -public class JavaGlobalSortPartitioner -implements BulkInsertPartitioner>> { +public class JavaGlobalSortPartitioner implements BulkInsertPartitioner>> { + + public JavaGlobalSortPartitioner() { + } + + /** + * Constructor to create as UserDefinedBulkInsertPartitioner class via reflection + * @param config HoodieWriteConfig Review Comment: @danny0405 can you please assign this PR to @nsivabalan to summon him to this discussion?
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
hudi-bot commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034768935 ## CI report: * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077) * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23093)
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
hudi-bot commented on PR #8338: URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034755617 ## CI report: * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091) * fbbfddc71d0aefd947dcb21bb412c12571f357d2 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23092)
Re: [PR] [HUDI-6854] Change default payload type to HOODIE_AVRO_DEFAULT [hudi]
hudi-bot commented on PR #10949: URL: https://github.com/apache/hudi/pull/10949#issuecomment-2034744354 ## CI report: * c344e38bfcfea10fb1556a4d335af1b5b92da6ee Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23077) * a61ba0a9cd3f8a23975d5ab39385c2a60d8da788 UNKNOWN
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
hudi-bot commented on PR #8338: URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034733630 ## CI report: * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091) * fbbfddc71d0aefd947dcb21bb412c12571f357d2 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-203479 ## CI report: * 66f7add237e807bc7ad7a870ee39f3c60762b728 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23086) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [DO NOT MERGE][HUDI-7567] Add schema evolution to the filegroup reader [hudi]
hudi-bot commented on PR #10957: URL: https://github.com/apache/hudi/pull/10957#issuecomment-2034721873 ## CI report: * 66f7add237e807bc7ad7a870ee39f3c60762b728 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7565] Create spark file readers to read a single file instead of an entire partition [hudi]
jonvex commented on code in PR #10954: URL: https://github.com/apache/hudi/pull/10954#discussion_r1549758875 ## hudi-spark-datasource/hudi-spark2/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/Spark24HoodieParquetReader.scala: ## @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.sql.execution.datasources.parquet + +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.Path +import org.apache.hadoop.mapreduce.lib.input.FileSplit +import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl +import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType} +import org.apache.parquet.filter2.compat.FilterCompat +import org.apache.parquet.filter2.predicate.FilterApi +import org.apache.parquet.format.converter.ParquetMetadataConverter.SKIP_ROW_GROUPS +import org.apache.parquet.hadoop.{ParquetFileReader, ParquetInputFormat, ParquetRecordReader} +import org.apache.spark.TaskContext +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.avro.AvroDeserializer +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection +import org.apache.spark.sql.catalyst.expressions.{Cast, JoinedRow, UnsafeRow} +import org.apache.spark.sql.catalyst.util.DateTimeUtils +import org.apache.spark.sql.execution.datasources.{PartitionedFile, RecordReaderIterator} +import org.apache.spark.sql.internal.SQLConf +import org.apache.spark.sql.sources.Filter +import org.apache.spark.sql.types.{AtomicType, StructField, StructType} +import org.apache.spark.util.SerializableConfiguration + +import java.net.URI + +object Spark24HoodieParquetReader { + + /** + * Get properties needed to read a parquet file + * + * @param vectorized true if vectorized reading is not prohibited due to schema, reading mode, etc + * @param sqlConf the [[SQLConf]] used for the read + * @param options passed as a param to the file format + * @param hadoopConf some configs will be set for the hadoopConf + * @return map of properties needed for reading a parquet file + */ + def getPropsForReadingParquet(vectorized: Boolean, +sqlConf: SQLConf, +options: Map[String, String], +hadoopConf: Configuration): Map[String, String] = { +// set hadoopConf
+hadoopConf.set(ParquetInputFormat.READ_SUPPORT_CLASS, classOf[ParquetReadSupport].getName) +hadoopConf.set(SQLConf.SESSION_LOCAL_TIMEZONE.key, sqlConf.sessionLocalTimeZone) +hadoopConf.setBoolean(SQLConf.CASE_SENSITIVE.key, sqlConf.caseSensitiveAnalysis) +hadoopConf.setBoolean(SQLConf.PARQUET_BINARY_AS_STRING.key, sqlConf.isParquetBinaryAsString) +hadoopConf.setBoolean(SQLConf.PARQUET_INT96_AS_TIMESTAMP.key, sqlConf.isParquetINT96AsTimestamp) + +Map( + "enableVectorizedReader" -> vectorized.toString, + "enableParquetFilterPushDown" -> sqlConf.parquetFilterPushDown.toString, + "pushDownDate" -> sqlConf.parquetFilterPushDownDate.toString, + "pushDownTimestamp" -> sqlConf.parquetFilterPushDownTimestamp.toString, + "pushDownDecimal" -> sqlConf.parquetFilterPushDownDecimal.toString, + "pushDownInFilterThreshold" -> sqlConf.parquetFilterPushDownInFilterThreshold.toString, + "pushDownStringStartWith" -> sqlConf.parquetFilterPushDownStringStartWith.toString, + "isCaseSensitive" -> sqlConf.caseSensitiveAnalysis.toString, + "timestampConversion" -> sqlConf.isParquetINT96TimestampConversion.toString, + "enableOffHeapColumnVector" -> sqlConf.offHeapColumnVectorEnabled.toString, + "capacity" -> sqlConf.parquetVectorizedReaderBatchSize.toString, + "returningBatch" -> sqlConf.parquetVectorizedReaderEnabled.toString, + "enableRecordFilter" -> sqlConf.parquetRecordFilterEnabled.toString, + "timeZoneId" -> sqlConf.sessionLocalTimeZone +) + } + + /** + * Read an individual parquet file + * Code from ParquetFileFormat#buildReaderWithPartitionValues from Spark v2.4.8 adapted here + * + * @param file parquet file to read + * @param
Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]
hudi-bot commented on PR #10959: URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034585539 ## CI report: * 912797cefdd31067dde0c43e2b5c537d73d2b084 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23090) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7569: -- Status: In Progress (was: Open) > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record > key has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit updated HUDI-7569: -- Status: Patch Available (was: In Progress) > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record > key has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sagar Sumit closed HUDI-7569. - Resolution: Fixed > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record > key has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
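The eligibility check at the heart of HUDI-7569 can be sketched with a small, self-contained model. The names below (`Predicate`, `extractRecordKeys`) are illustrative stand-ins, not the actual Catalyst-expression handling in `RecordLevelIndexSupport`; the point is simply that an `EqualTo` or `In` filter may only drive RLI file pruning when its attribute is the record-key column:

```java
import java.util.List;
import java.util.Optional;

public class RliPruningSketch {
    // Minimal stand-in for an `attr = value` / `attr IN (values...)` predicate.
    public record Predicate(String attribute, List<String> values) {}

    // Returns the record keys to probe in the record-level index, or empty when
    // the predicate does not constrain the record-key column, in which case
    // pruning must be skipped and all files scanned.
    public static Optional<List<String>> extractRecordKeys(Predicate p, String recordKeyField) {
        if (p.attribute().equals(recordKeyField)) {
            return Optional.of(p.values());
        }
        // The buggy logic effectively pruned here too, e.g. for
        // `where driver in ('abc', 'xyz')` on a table keyed by `uuid`.
        return Optional.empty();
    }
}
```

With this guard, an `In` filter on `driver` against a table whose record key is `uuid` yields no keys, so no files are skipped and the query falls back to a full scan instead of returning wrong results.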
Re: [I] [SUPPORT] Slashes in partition columns [hudi]
eshu commented on issue #10754: URL: https://github.com/apache/hudi/issues/10754#issuecomment-2034546730 @ad1happy2go It does not work in my example. Did you try it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
(hudi) branch master updated: [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation (#10778)
This is an automated email from the ASF dual-hosted git repository. jonvex pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new bf723f56cd0 [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation (#10778) bf723f56cd0 is described below commit bf723f56cd0d379f951a5a2d535502f326d1bc78 Author: Jon Vexler AuthorDate: Wed Apr 3 08:50:12 2024 -0400 [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation (#10778) * make exceptions more specific * use hudi avro exception * Address review comments * fix unnecessary changes * add exception wrapping * style * address review comments * remove . from config * address review comments * fix merge * fix checkstyle * Update hudi-common/src/main/java/org/apache/hudi/exception/HoodieRecordCreationException.java Co-authored-by: Y Ethan Guo * Update hudi-common/src/main/java/org/apache/hudi/exception/HoodieAvroSchemaException.java Co-authored-by: Y Ethan Guo * add javadoc to exception wrapper - Co-authored-by: Jonathan Vexler <=> Co-authored-by: Y Ethan Guo --- .../org/apache/hudi/AvroConversionUtils.scala | 14 +-- .../scala/org/apache/hudi/HoodieSparkUtils.scala | 20 ++--- .../hudi/util/ExceptionWrappingIterator.scala | 44 +++ .../java/org/apache/hudi/avro/AvroSchemaUtils.java | 10 ++--- .../java/org/apache/hudi/avro/HoodieAvroUtils.java | 25 +++ .../hudi/exception/HoodieAvroSchemaException.java | 31 ++ .../exception/HoodieRecordCreationException.java | 32 ++ .../org/apache/hudi/HoodieSparkSqlWriter.scala | 14 --- .../utilities/config/HoodieStreamerConfig.java | 7 .../apache/hudi/utilities/sources/RowSource.java | 9 +++- .../utilities/streamer/HoodieStreamerUtils.java| 24 +++ .../utilities/streamer/SourceFormatAdapter.java| 9 +++- .../hudi/utilities/sources/TestAvroDFSSource.java | 3 +- .../hudi/utilities/sources/TestCsvDFSSource.java | 3 +- 
.../hudi/utilities/sources/TestJsonDFSSource.java | 49 +- .../utilities/sources/TestParquetDFSSource.java| 3 +- .../sources/AbstractDFSSourceTestBase.java | 7 +++- 17 files changed, 257 insertions(+), 47 deletions(-) diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala index 55877938f8c..95962d1ca44 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala +++ b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/AvroConversionUtils.scala @@ -23,6 +23,7 @@ import org.apache.avro.generic.GenericRecord import org.apache.avro.{JsonProperties, Schema} import org.apache.hudi.HoodieSparkUtils.sparkAdapter import org.apache.hudi.avro.AvroSchemaUtils +import org.apache.hudi.exception.SchemaCompatibilityException import org.apache.hudi.internal.schema.HoodieSchemaException import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.InternalRow @@ -58,9 +59,16 @@ object AvroConversionUtils { */ def createInternalRowToAvroConverter(rootCatalystType: StructType, rootAvroType: Schema, nullable: Boolean): InternalRow => GenericRecord = { val serializer = sparkAdapter.createAvroSerializer(rootCatalystType, rootAvroType, nullable) -row => serializer - .serialize(row) - .asInstanceOf[GenericRecord] +row => { + try { +serializer + .serialize(row) + .asInstanceOf[GenericRecord] + } catch { +case e: HoodieSchemaException => throw e +case e => throw new SchemaCompatibilityException("Failed to convert spark record into avro record", e) + } +} } /** diff --git a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala index 03d977f6fc9..6de5de8842e 100644 --- a/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala +++ 
b/hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala @@ -18,25 +18,25 @@ package org.apache.hudi +import org.apache.avro.Schema +import org.apache.avro.generic.GenericRecord +import org.apache.hadoop.fs.Path import org.apache.hudi.HoodieConversionUtils.toScalaOption import org.apache.hudi.avro.{AvroSchemaUtils, HoodieAvroUtils} import org.apache.hudi.client.utils.SparkRowSerDe import org.apache.hudi.common.model.HoodieRecord import org.apache.hudi.hadoop.fs.CachingPath - -import org.apache.avro.Schema -import
Re: [PR] [HUDI-7486] Classify schema exceptions when converting from avro to spark row representation [hudi]
jonvex merged PR #10778: URL: https://github.com/apache/hudi/pull/10778 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
hudi-bot commented on PR #8338: URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034415412 ## CI report: * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23091) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]
hudi-bot commented on PR #10959: URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034405993 ## CI report: * 912797cefdd31067dde0c43e2b5c537d73d2b084 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23090) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-5996] Verify the consistency of bucket num at job sta… [hudi]
hudi-bot commented on PR #8338: URL: https://github.com/apache/hudi/pull/8338#issuecomment-2034400310 ## CI report: * fccdb147c249b08d856819e028986d76603828e9 UNKNOWN * 8081f81d180126d9c407eac821dbfbd7f5ae28f2 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=16284) * 3bb52ed2a7193d4cae00f55339b00b17e7f1993b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]
hudi-bot commented on PR #10959: URL: https://github.com/apache/hudi/pull/10959#issuecomment-2034389641 ## CI report: * 912797cefdd31067dde0c43e2b5c537d73d2b084 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] - org.apache.hudi.exception.HoodieException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool [hudi]
ad1happy2go commented on issue #10361: URL: https://github.com/apache/hudi/issues/10361#issuecomment-2034384415 @limadiego Gentle ping here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Optimized code ComparableVersion [hudi]
ad1happy2go commented on issue #10933: URL: https://github.com/apache/hudi/issues/10933#issuecomment-2034376325 @balloon72 Did you get a chance to provide more details here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [I] [SUPPORT] Insert and Delete in a Single Operation. [hudi]
ad1happy2go commented on issue #10958: URL: https://github.com/apache/hudi/issues/10958#issuecomment-2034372822 @lucianondolenc I don't think that is possible: if we use the 'upsert' operation type, it won't allow duplicates and will maintain uniqueness, and if we use the 'insert' operation type, it won't honor the '_hoodie_is_deleted' field. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
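The semantics behind that reply can be modeled roughly as follows. This is a hedged sketch, not the actual Hudi write path: only the upsert path merges incoming records into the table by record key, which is also where a record flagged with `_hoodie_is_deleted` can drop an existing row; the insert path appends records without key-based merging, so the delete flag never takes effect.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WriteSemanticsSketch {
    public record Rec(String key, String value, boolean isDeleted) {}

    // upsert: key-based merge; a record flagged deleted removes the existing
    // row, and uniqueness on the record key is maintained.
    public static Map<String, Rec> upsert(Map<String, Rec> table, List<Rec> incoming) {
        Map<String, Rec> result = new HashMap<>(table);
        for (Rec r : incoming) {
            if (r.isDeleted()) {
                result.remove(r.key());
            } else {
                result.put(r.key(), r);
            }
        }
        return result;
    }

    // insert: append without merging; duplicates are possible and the
    // delete flag is ignored, so "deleted" rows land in the table as data.
    public static List<Rec> insert(List<Rec> table, List<Rec> incoming) {
        List<Rec> result = new ArrayList<>(table);
        result.addAll(incoming);
        return result;
    }
}
```

A single batch mixing new rows and delete-flagged rows therefore behaves as intended only under upsert; under insert, the flagged rows are simply appended.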
Re: [PR] [HUDI-7564] Fix HiveSyncConfig inconsistency [hudi]
voonhous commented on PR #10951: URL: https://github.com/apache/hudi/pull/10951#issuecomment-2034289199 @danny0405 PR to revert this change + add docs: https://github.com/apache/hudi/pull/10959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[PR] [HUDI-7564] Revert hive sync inconsistency and add docs for it [hudi]
voonhous opened a new pull request, #10959: URL: https://github.com/apache/hudi/pull/10959 ### Change Logs Reverting the hive-sync inconsistency as described in https://github.com/apache/hudi/pull/10951#issuecomment-2034230672. TLDR, this inconsistency was introduced to ensure that hive-sync's behaviour is in line with Spark's externalCatalog table schema sync, which is used in AlterTableHoodieCommand. Hive-sync is used to create the _ro and _rt tables of MOR. ### Impact None ### Risk level (write none, low medium or high below) None ### Documentation Update Updated the config description for `hoodie.datasource.hive_sync.support_timestamp` to document that this is an intended inconsistency. ### Contributor's checklist - [X] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-7564) Fix HiveSync configuration inconsistencies
[ https://issues.apache.org/jira/browse/HUDI-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833501#comment-17833501 ] voon commented on HUDI-7564: The reason for this discrepancy is due to Spark's external catalogue API, which syncs {{TIMESTAMP}} types as {{TIMESTAMP}} to hive. Given that Hudi has multiple entrypoints, it makes sense that Spark introduced this inconsistency. While I am not sure why hive-sync-tool defaulted the {{support_timestamp}} as {{false}}, I think it's best we just document this. > Fix HiveSync configuration inconsistencies > -- > > Key: HUDI-7564 > URL: https://issues.apache.org/jira/browse/HUDI-7564 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Assignee: voon >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > *hoodie.datasource.hive_sync.support_timestamp* is required to be *false* > such that *TIMESTAMP (MICROS)* columns will be synced onto HMS as *LONG* > types. > > While this is not visible to hive-console/spark-sql console with the > {_}show-create-database{_}/{_}describe-table{_} command, HMS will store the > timestamp type as: > > {code:java} > support_timestamp=false LONG > support_timestamp=true TIMESTAMP{code} > > By overriding this to {*}true{*}, Trino/Presto queries will fail with this > error as it is reliant on HMS information: > {code:java} > Caused by: io.prestosql.jdbc.$internal.client.FailureInfo$FailureException: > Expected field to be long, actual timestamp(9) (field 0) > at > io.trino.plugin.hive.GenericHiveRecordCursor.validateType(GenericHiveRecordCursor.java:569) > at > io.trino.plugin.hive.GenericHiveRecordCursor.getLong(GenericHiveRecordCursor.java:274) > at > io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:106) > at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120) > at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299) > at 
io.trino.operator.Driver.processInternal(Driver.java:395) > at io.trino.operator.Driver.lambda$process$8(Driver.java:298) > at io.trino.operator.Driver.tryWithLock(Driver.java:694) > at io.trino.operator.Driver.process(Driver.java:290) > at io.trino.operator.Driver.processForDuration(Driver.java:261) > at > io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:911) > at > io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:188) > at > io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:569) > at > io.trino.$gen.Trino_trino426_sql_hudi_di07_00120240326_074936_2.run(Unknown > Source) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > 2024-04-02 17:32:21 (UTC+8) INFO - Clear session property for connection. 
> 2024-04-02 17:32:21 (UTC+8) ERROR- Task Execution failed with > CommonException: Query failed (#20240402_093220_06724_cg4jg): Expected field > to be long, actual timestamp(9) (field 0) {code} > To demonstrate that the default support_timestamp config is not true via > spark-sql: > {code:java} > -- EXECUTE THESE QUERIES IN SPARK > -- Create a table > create table if not exists dev_hudi.timestamp_issue ( > int_col bigint, > `timestamp_col` TIMESTAMP > ) using hudi > tblproperties ( > type = 'mor', > primaryKey = 'int_col' > ); > -- Perform an insert to trigger hive sync to create _ro and _rt tables > insert into dev_hudi.timestamp_issue select > 1 as int_col, > to_timestamp('2023-01-01', '-MM-dd') as timestamp_col; > -- Execute a query to verify that data has been written > select * from dev_hudi.timestamp_issue_rt; > -- Set support_timestamp to it's supposed default value (false) > set hoodie.datasource.hive_sync.support_timestamp=false; > -- Perform an insert again (Will throw an error) > insert into dev_hudi.timestamp_issue select > 1 as int_col, > to_timestamp('2023-01-01', '-MM-dd') as timestamp_col;{code} > The last insert query will throw the error below, showing that > {*}support_timestamp{*}'s default value is {*}true{*}. > {code:java} > Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception > when hive syncing timestamp_issue > at > org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:190) > at > org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:58) > ... 64 more > Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert > field Type from TIMESTAMP to bigint
[jira] [Comment Edited] (HUDI-7564) Fix HiveSync configuration inconsistencies
[ https://issues.apache.org/jira/browse/HUDI-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17833501#comment-17833501 ] voon edited comment on HUDI-7564 at 4/3/24 11:10 AM: - The reason for this discrepancy is due to Spark's external catalogue API, which syncs {{TIMESTAMP}} types as {{TIMESTAMP}} to hive. Given that Hudi has multiple entrypoints, it make sense that Spark introduced this inconsistency. While I am not sure why hive-sync-tool defaulted the {{support_timestamp}} as {{{}false{}}}, I think it's best we just document this. was (Author: JIRAUSER294635): h1. The reason for this discrepancy is due to Spark's external catalogue API, which syncs {{TIMESTAMP}} types as {{TIMESTAMP}} to hive. Given that Hudi has multiple entrypoints, it make sense that Spark introduced this inconsistency. While I am not sure why hive-sync-tool defaulted the {{support_timestamp}} as {{{}false{}}}, I think it's best we just document this. > Fix HiveSync configuration inconsistencies > -- > > Key: HUDI-7564 > URL: https://issues.apache.org/jira/browse/HUDI-7564 > Project: Apache Hudi > Issue Type: Bug >Reporter: voon >Assignee: voon >Priority: Major > Labels: pull-request-available > Fix For: 0.15.0 > > > *hoodie.datasource.hive_sync.support_timestamp* is required to be *false* > such that *TIMESTAMP (MICROS)* columns will be synced onto HMS as *LONG* > types. 
> > While this is not visible to hive-console/spark-sql console with the > {_}show-create-database{_}/{_}describe-table{_} command, HMS will store the > timestamp type as: > > {code:java} > support_timestamp=false LONG > support_timestamp=true TIMESTAMP{code} > > By overriding this to {*}true{*}, Trino/Presto queries will fail with this > error as it is reliant on HMS information: > {code:java} > Caused by: io.prestosql.jdbc.$internal.client.FailureInfo$FailureException: > Expected field to be long, actual timestamp(9) (field 0) > at > io.trino.plugin.hive.GenericHiveRecordCursor.validateType(GenericHiveRecordCursor.java:569) > at > io.trino.plugin.hive.GenericHiveRecordCursor.getLong(GenericHiveRecordCursor.java:274) > at > io.trino.spi.connector.RecordPageSource.getNextPage(RecordPageSource.java:106) > at io.trino.plugin.hudi.HudiPageSource.getNextPage(HudiPageSource.java:120) > at io.trino.operator.TableScanOperator.getOutput(TableScanOperator.java:299) > at io.trino.operator.Driver.processInternal(Driver.java:395) > at io.trino.operator.Driver.lambda$process$8(Driver.java:298) > at io.trino.operator.Driver.tryWithLock(Driver.java:694) > at io.trino.operator.Driver.process(Driver.java:290) > at io.trino.operator.Driver.processForDuration(Driver.java:261) > at > io.trino.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:911) > at > io.trino.execution.executor.timesharing.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:188) > at > io.trino.execution.executor.timesharing.TimeSharingTaskExecutor$TaskRunner.run(TimeSharingTaskExecutor.java:569) > at > io.trino.$gen.Trino_trino426_sql_hudi_di07_00120240326_074936_2.run(Unknown > Source) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) > at java.base/java.lang.Thread.run(Thread.java:833) > 2024-04-02 17:32:21 (UTC+8) INFO - Clear session 
property for connection. > 2024-04-02 17:32:21 (UTC+8) ERROR- Task Execution failed with > CommonException: Query failed (#20240402_093220_06724_cg4jg): Expected field > to be long, actual timestamp(9) (field 0) {code} > To demonstrate that the default support_timestamp config is not true via > spark-sql: > {code:java} > -- EXECUTE THESE QUERIES IN SPARK > -- Create a table > create table if not exists dev_hudi.timestamp_issue ( > int_col bigint, > `timestamp_col` TIMESTAMP > ) using hudi > tblproperties ( > type = 'mor', > primaryKey = 'int_col' > ); > -- Perform an insert to trigger hive sync to create _ro and _rt tables > insert into dev_hudi.timestamp_issue select > 1 as int_col, > to_timestamp('2023-01-01', '-MM-dd') as timestamp_col; > -- Execute a query to verify that data has been written > select * from dev_hudi.timestamp_issue_rt; > -- Set support_timestamp to it's supposed default value (false) > set hoodie.datasource.hive_sync.support_timestamp=false; > -- Perform an insert again (Will throw an error) > insert into dev_hudi.timestamp_issue select > 1 as int_col, > to_timestamp('2023-01-01', '-MM-dd') as timestamp_col;{code} > The last insert query will throw the error below, showing that > {*}support_timestamp{*}'s default value is {*}true{*}. >