[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1228077953

## CI report:

* be94781340ba821d5de240c1a4eed249efa2e0db Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10756)
* 37785220f2d17a1a04d136521f10c3a0314fe448 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10970)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
hudi-bot commented on PR #6384: URL: https://github.com/apache/hudi/pull/6384#issuecomment-1228073942

## CI report:

* be94781340ba821d5de240c1a4eed249efa2e0db Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10756)
* 37785220f2d17a1a04d136521f10c3a0314fe448 UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228073114

## CI report:

* d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
* 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
* 858a47a5b106462a5089ecf77278196bc7c7a0a8 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10966)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] ThinkerLei commented on a diff in pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
ThinkerLei commented on code in PR #6384: URL: https://github.com/apache/hudi/pull/6384#discussion_r955653574

## hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:

@@ -64,19 +64,17 @@
 import java.util.function.Function;
 import java.util.function.Predicate;
 import java.util.regex.Matcher;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
 /**
  * Utility functions related to accessing the file storage.
  */
 public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
-  // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
-  private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
   private static final String LOG_FILE_PREFIX = ".";

Review Comment: OK, I re-updated the PR. Can you give some advice? Thanks.
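For context, the pattern being hoisted into `HoodieLogFile` decomposes log file names like `.b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1` into file id, base commit time, extension and version. The following is a minimal standalone sketch of reusing one compiled pattern; only the pattern string comes from the diff, while the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogFileNameParser {

  // Compiled once and shared, so hot paths such as addLogFile pay only the
  // matching cost, never the recompilation cost.
  static final Pattern LOG_FILE_PATTERN =
      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");

  // Returns {fileId, baseCommitTime, extension, version} for a log file name.
  static String[] parse(String fileName) {
    Matcher m = LOG_FILE_PATTERN.matcher(fileName);
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a log file name: " + fileName);
    }
    return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
  }

  public static void main(String[] args) {
    String[] parts = parse(".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1");
    System.out.println(String.join(" | ", parts));
  }
}
```

Note that a compiled `Pattern` is thread safe in Java (only `Matcher` instances are not), which is what makes sharing it as a public constant viable.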
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955645966

## rfc/rfc-51/rfc-51.md:

@@ -215,18 +245,31 @@
 Note:
 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work to get the change data. The following illustration explains the difference when this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute the changed data. The following illustrates the difference.

 ![](read_cdc_log_file.jpg)

 COW table

-Reading COW table in CDC query mode is equivalent to reading a simplified MOR table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO mode.

 MOR table

-According to the design of the writing part, only the cases where writing mor tables will write out the base file (which call the `HoodieMergeHandle` and it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log files and the corresponding base files (current and previous file slices)
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC data and the corresponding base files (current and previous file slices)

Review Comment: Do we need these two config options then? `hoodie.datasource.query.incremental.type=snapshot` and `hoodie.datasource.query.incremental.type=read_optimized` are very confusing from my side. Shouldn't it always read the fresh CDC data here? Why do we expose an RO view in CDC streaming read, and for what use case?
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955644523

## rfc/rfc-51/rfc-51.md:

@@ -148,20 +152,46 @@
 Under a partition directory, the `.log` file with `CDCBlock` above will keep the changing data we have to materialize.

-There is an option to control what data is written to `CDCBlock`, that is `hoodie.table.cdc.supplemental.logging`. See the description of this config above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in case of Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing.
+
+Write-on-compaction can always persist CDC info and achieve standardization of implementation logic across engines,
+however, some delays are added to the CDC query results. Based on the business requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.

-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the corresponding CDC data is also persisted.
+
+- For Spark
+  - inserts are written to base files: the CDC data `op=I` will be persisted
+  - updates/deletes that written to log files are compacted into base files: the CDC data `op=U|D` will be persisted
+- For Flink
+  - inserts/updates/deletes that written to log files are compacted into base files: the CDC data `op=I|U|D` will be persisted

Review Comment:

> inserts/updates/deletes that written to log files are compacted into base files: the CDC data `op=I|U|D` will be persisted

I don't think compaction-generated CDC logs make any sense in production: they lose data freshness for the CDC stream, and they rely on the compaction service, which itself is not a very robust piece of infrastructure.
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955643711

## rfc/rfc-51/rfc-51.md:

@@ -148,20 +152,46 @@
 Under a partition directory, the `.log` file with `CDCBlock` above will keep the changing data we have to materialize.

-There is an option to control what data is written to `CDCBlock`, that is `hoodie.table.cdc.supplemental.logging`. See the description of this config above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in case of Flink writer or Bucket

Review Comment: So what is the solution here?
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955642598

## rfc/rfc-51/rfc-51.md:

@@ -64,69 +65,72 @@
 We follow the debezium output format: four columns as shown below

 Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns.

-## Goals
+## Design Goals

 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;

-## Implementation
+## Configurations

-### CDC Architecture
+| key | default | description |
+|-|--|--|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. |

-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enable=true` and `hoodie.datasource.query.type=incremental`.

-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change any data. So we don't need to consider them in CDC scenario.
-
-### Modifiying code paths
+| key | default | description |
+|||--|
+| hoodie.table.cdc.enabled | `false` | set to `true` for CDC queries |
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | - | requried. |
+| hoodie.datasource.read.end.timestamp | - | optional. |

-![](points.jpg)
+### Logical File Types

-### Config Definitions
+We define 4 logical file types for the CDC scenario.

Review Comment: Can someone give an explanation here? What exactly is a logical file type? Naming it "action" seems more suitable here?
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955641659

## rfc/rfc-51/rfc-51.md:

@@ -64,71 +65,74 @@
 We follow the debezium output format: four columns as shown below

 Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns.

-## Goals
+## Design Goals

-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines

-## Implementation
+## Configurations

-### CDC Architecture
+| key | default | description |
+|-|--|--|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |

-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.datasource.query.incremental.format=cdc` and `hoodie.datasource.query.type=incremental`.

-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change any data. So we don't need to consider them in CDC scenario.
-
-### Modifiying code paths
+| key | default | description |
+|||--|
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for incremental query. |
+| hoodie.datasource.query.incremental.format | `latest_state` | `latest_state` (current incremental query behavior) returns the latest records' values. Set to `cdc` to return the full CDC results. |
+| hoodie.datasource.read.start.timestamp | - | requried. |
+| hoodie.datasource.read.end.timestamp | - | optional. |

Review Comment: Would `hoodie.datasource.read.start.timestamp` and `hoodie.datasource.read.end.timestamp` be needed by the fs view API, or only by the reader/writer? The `start.timestamp` should also have a default value: by default, consume from the latest commit. We should mark clearly what the default value is here if it is optional.
[GitHub] [hudi] hudi-bot commented on pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
hudi-bot commented on PR #6347: URL: https://github.com/apache/hudi/pull/6347#issuecomment-1228042484

## CI report:

* 386a9eb87a073a4c956fc5f5329701feeb012227 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10702)
* 473a8b74676e345ee91093a3fe9885e062ca Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10969)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955640553

## rfc/rfc-51/rfc-51.md:

@@ -64,71 +65,74 @@
 We follow the debezium output format: four columns as shown below

 Note: the illustration here ignores all the Hudi metadata columns like `_hoodie_commit_time` in `before` and `after` columns.

-## Goals
+## Design Goals

-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines

-## Implementation
+## Configurations

-### CDC Architecture
+| key | default | description |
+|-|--|--|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |

-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.datasource.query.incremental.format=cdc` and `hoodie.datasource.query.type=incremental`.

Review Comment: Can someone explain why we need both `hoodie.datasource.query.incremental.format=cdc` and `hoodie.datasource.query.type=incremental`? Shouldn't they both be the defaults, so there is no need to configure them explicitly?
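Pulling together the options from the tables quoted in these review threads, a CDC incremental read per this revision of the RFC would be configured roughly as below. This is an illustrative fragment only; the timestamp value is a placeholder, not from the RFC:

```properties
# Table-level: enable CDC persistence and choose how much change data to log
hoodie.table.cdc.enabled=true
hoodie.table.cdc.supplemental.logging.mode=DATA_BEFORE_AFTER

# Query-level: incremental query returning CDC-format rows
hoodie.datasource.query.type=incremental
hoodie.datasource.query.incremental.format=cdc
hoodie.datasource.read.start.timestamp=20220826000000
```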
[GitHub] [hudi] hudi-bot commented on pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
hudi-bot commented on PR #6347: URL: https://github.com/apache/hudi/pull/6347#issuecomment-1228039780

## CI report:

* 386a9eb87a073a4c956fc5f5329701feeb012227 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10702)
* 473a8b74676e345ee91093a3fe9885e062ca UNKNOWN

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support
danny0405 commented on code in PR #6256: URL: https://github.com/apache/hudi/pull/6256#discussion_r955638291

## rfc/rfc-51/rfc-51.md:

@@ -42,11 +43,11 @@ In cases where Hudi tables used as streaming sources, we want to be aware of all
 To implement this feature, we need to implement the logic on the write and read path to let Hudi figure out the changed data when read. In some cases, we need to write extra data to help optimize CDC queries.

-## Scenarios
+## Scenario Illustration

Review Comment:

> should produce separate CDC rows

I guess this is a must? How could you combine them in one row when the schemas do not match?
[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition
hudi-bot commented on PR #6438: URL: https://github.com/apache/hudi/pull/6438#issuecomment-1228037192

## CI report:

* 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10964)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …
hudi-bot commented on PR #6489: URL: https://github.com/apache/hudi/pull/6489#issuecomment-1228037279

## CI report:

* 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN
* 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN
* a0e2f520a7f422bd396b984c3cec2c5653a41743 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10965)

Bot commands: @hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] danny0405 commented on a diff in pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function
danny0405 commented on code in PR #6384: URL: https://github.com/apache/hudi/pull/6384#discussion_r955636239

## hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:

@@ -64,19 +64,17 @@
 import java.util.function.Function;
 import java.util.function.Predicate;
 import java.util.regex.Matcher;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
 /**
  * Utility functions related to accessing the file storage.
  */
 public class FSUtils {
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
-  // Log files are of this pattern - .b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
-  private static final Pattern LOG_FILE_PATTERN =
-      Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
   private static final String LOG_FILE_PREFIX = ".";

Review Comment: I don't think so; memory footprint is more critical for the fs view (risk of OOM) compared to CPU cost.
[GitHub] [hudi] danny0405 commented on pull request #6491: [HUDI-4714] HoodieFlinkWriteClient can't load callback config to Hood…
danny0405 commented on PR #6491: URL: https://github.com/apache/hudi/pull/6491#issuecomment-1228035391

Can you rebase on the latest master and force-push again?
[GitHub] [hudi] honeyaya commented on a diff in pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
honeyaya commented on code in PR #6347: URL: https://github.com/apache/hudi/pull/6347#discussion_r955632739

## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:

@@ -192,18 +194,31 @@
 public void addPartitionsToTable(String tableName, List partitionsToAdd)
     LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + tableName);
     try {
       StorageDescriptor sd = client.getTable(databaseName, tableName).getSd();
-      List partitionList = partitionsToAdd.stream().map(partition -> {
+      if (syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) <= 0) {
+        throw new HoodieHiveSyncException("batch-sync-num for sync hive table must be greater than 0, pls check your parameter");
+      }
+      List partitionList = new ArrayList<>();
+      int batchSyncPartitionNum = syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM);
+      for (int idx = 0; idx < partitionsToAdd.size(); idx++) {

Review Comment: This is a cleaner style than what I wrote before; done. Thanks.
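The batching that the loop above implements can be isolated into a small helper. The sketch below is illustrative only (`BatchUtil` and `chunk` are hypothetical names, not the PR's code); it shows splitting a partition list into metastore-sized chunks so that one oversized `add_partitions` RPC cannot time out:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchUtil {

  // Split items into consecutive batches of at most batchSize elements, so each
  // metastore call stays small enough to complete before the client timeout.
  static <T> List<List<T>> chunk(List<T> items, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batch size must be greater than 0");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int from = 0; from < items.size(); from += batchSize) {
      batches.add(items.subList(from, Math.min(from + batchSize, items.size())));
    }
    return batches;
  }
}
```

With a batch size of 1000 (the kind of value `hoodie.datasource.hive_sync.batch_num` style configs carry), 10,500 partitions would be synced in 11 calls instead of one.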
[GitHub] [hudi] honeyaya commented on a diff in pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false
honeyaya commented on code in PR #6347: URL: https://github.com/apache/hudi/pull/6347#discussion_r955632261

## hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:

@@ -192,18 +194,31 @@
 public void addPartitionsToTable(String tableName, List partitionsToAdd)
     LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + tableName);
     try {
       StorageDescriptor sd = client.getTable(databaseName, tableName).getSd();
-      List partitionList = partitionsToAdd.stream().map(partition -> {
+      if (syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) <= 0) {
+        throw new HoodieHiveSyncException("batch-sync-num for sync hive table must be greater than 0, pls check your parameter");
+      }

Review Comment: Thanks for your comment. ValidationUtils is a good suggestion; already done. But because the property HIVE_BATCH_SYNC_PARTITION_NUM is an optional param, if the user does not set it we will use the default value, so validating at the sync config level (HiveSyncConfig.java) might not fit.
[jira] [Updated] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView
[ https://issues.apache.org/jira/browse/HUDI-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Danny Chen updated HUDI-4721: - Fix Version/s: 0.12.1 > Fix thread safety w/ RemoteTableFileSystemView > --- > > Key: HUDI-4721 > URL: https://issues.apache.org/jira/browse/HUDI-4721 > Project: Apache Hudi > Issue Type: Test > Components: reader-core, writer-core >Reporter: sivabalan narayanan >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > After retry mechanism was added to RemoteTableFileSystemView, looks like the > code is not thread safe. > > [https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946] > > This might impact regular flows as well even if no retries are enabled. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] xiarixiaoyao commented on issue #6496: [SUPPORT] Hudi schema evolution, Null for oldest values
xiarixiaoyao commented on issue #6496: URL: https://github.com/apache/hudi/issues/6496#issuecomment-1228023484 @Armelabdelkbir Spark does not currently support default values; maybe https://github.com/apache/spark/pull/36672/files can help you. Thanks
[GitHub] [hudi] danny0405 opened a new pull request, #6507: [DO NOT MERGE] 0.12.0 release patch branch
danny0405 opened a new pull request, #6507: URL: https://github.com/apache/hudi/pull/6507 ### Change Logs _Describe context and summary for this change. Highlight if any code was copied._ ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] kk17 commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool
kk17 commented on PR #5920: URL: https://github.com/apache/hudi/pull/5920#issuecomment-1228016989 @nsivabalan @minihippo will try to add a test in the coming two weeks
[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer
veenaypatil commented on code in PR #6111: URL: https://github.com/apache/hudi/pull/6111#discussion_r955617721 ## rfc/rfc-57/rfc-57.md: ## @@ -0,0 +1,85 @@ + +# RFC-57: DeltaStreamer Protobuf Support + + + +## Proposers + +- @the-other-tim-brown + +## Approvers +- @bhasudha +- @vinothchandar + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-4399 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +Support consuming Protobuf messages from Kafka with the DeltaStreamer. + +## Background +Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require: +1. Parsing the data from Kafka into Protobuf Messages +2. Generating a schema from a Protobuf Message class +3. Converting from Protobuf to Avro + +## Implementation + +### Parsing Data from Kafka +Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message. + +Configuration options: +hoodie.deltastreamer.schemaprovider.proto.className - The class to use Review Comment: Yes, let's take it in next cut
[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.
hudi-bot commented on PR #6387: URL: https://github.com/apache/hudi/pull/6387#issuecomment-1228010409 ## CI report: * fb86adcdcf26b1565cccf6e89c30c6058477cd85 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10963) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…
hudi-bot commented on PR #6500: URL: https://github.com/apache/hudi/pull/6500#issuecomment-1228008168 ## CI report: * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952) * 2d75af2a075741142bbfd4b6f50e541661e55bdd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10968)
[GitHub] [hudi] hudi-bot commented on pull request #6486: [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type
hudi-bot commented on PR #6486: URL: https://github.com/apache/hudi/pull/6486#issuecomment-1228008105 ## CI report: * d6b7c487e76c46460a2fb0c9647aeea901d17995 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10921) * 9d687afca94b7bfcc592c69cfebd73eb846b3b70 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10967)
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228007439 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 64ddff55a9f3083be754e0951bf5f082fecca9e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10953) * 858a47a5b106462a5089ecf77278196bc7c7a0a8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10966)
[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…
hudi-bot commented on PR #6500: URL: https://github.com/apache/hudi/pull/6500#issuecomment-1228005628 ## CI report: * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952) * 2d75af2a075741142bbfd4b6f50e541661e55bdd UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query
hudi-bot commented on PR #6499: URL: https://github.com/apache/hudi/pull/6499#issuecomment-1228005604 ## CI report: * 91e047073b4ff4389bf1e3e4f5ce59342756ebd1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10951)
[GitHub] [hudi] hudi-bot commented on pull request #6486: [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type
hudi-bot commented on PR #6486: URL: https://github.com/apache/hudi/pull/6486#issuecomment-1228005564 ## CI report: * d6b7c487e76c46460a2fb0c9647aeea901d17995 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10921) * 9d687afca94b7bfcc592c69cfebd73eb846b3b70 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.
hudi-bot commented on PR #5629: URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228004890 ## CI report: * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN * 64ddff55a9f3083be754e0951bf5f082fecca9e5 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10953) * 858a47a5b106462a5089ecf77278196bc7c7a0a8 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1228000512 ## CI report: * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961)
[GitHub] [hudi] hudi-bot commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query
hudi-bot commented on PR #6499: URL: https://github.com/apache/hudi/pull/6499#issuecomment-1228000366 ## CI report: * 91e047073b4ff4389bf1e3e4f5ce59342756ebd1 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10951)
[GitHub] [hudi] YannByron commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query
YannByron commented on PR #6499: URL: https://github.com/apache/hudi/pull/6499#issuecomment-1227975465 @hudi-bot run azure
[GitHub] [hudi] wzx140 commented on a diff in pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…
wzx140 commented on code in PR #6500: URL: https://github.com/apache/hudi/pull/6500#discussion_r955588339 ## hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java: ## @@ -94,7 +94,11 @@ private HoodieInternalRow(UTF8String[] metaFields, @Override public int numFields() { -return sourceRow.numFields(); +if (sourceContainsMetaFields) { Review Comment: Add UT in TestHoodieInternalRow#testNumFields
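The field-counting fix shown in the diff fragment above can be sketched in isolation. This is a simplified stand-in, not the real `HoodieInternalRow` (which wraps Spark's `InternalRow` and a `UTF8String[]` of meta fields); only the counting logic is modeled, and the class and field names here are illustrative:

```java
// Simplified stand-in for HoodieInternalRow: a row that logically prepends
// Hudi meta columns (e.g. the _hoodie_* fields) to a wrapped source row.
public class MetaFieldRow {
  private final int metaFieldCount;           // number of prepended meta columns
  private final int sourceFieldCount;         // fields in the wrapped source row
  private final boolean sourceContainsMetaFields;

  public MetaFieldRow(int metaFieldCount, int sourceFieldCount, boolean sourceContainsMetaFields) {
    this.metaFieldCount = metaFieldCount;
    this.sourceFieldCount = sourceFieldCount;
    this.sourceContainsMetaFields = sourceContainsMetaFields;
  }

  // Before the fix, this always returned sourceFieldCount, which under-counts
  // whenever the source row does not itself include the meta columns.
  public int numFields() {
    return sourceContainsMetaFields
        ? sourceFieldCount
        : sourceFieldCount + metaFieldCount;
  }
}
```

With 5 meta columns and a 3-field source row lacking them, `numFields()` now reports 8 instead of 3, which is what a consumer iterating the combined row expects.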
[GitHub] [hudi] LinMingQiang commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
LinMingQiang commented on PR #6393: URL: https://github.com/apache/hudi/pull/6393#issuecomment-1227964420 I understand what you mean. I'll fix it here.
[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …
hudi-bot commented on PR #6489: URL: https://github.com/apache/hudi/pull/6489#issuecomment-1227964038 ## CI report: * 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN * 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN * b26294c07ac06186c66a10444e7677656be94037 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10934) * a0e2f520a7f422bd396b984c3cec2c5653a41743 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10965)
[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …
hudi-bot commented on PR #6489: URL: https://github.com/apache/hudi/pull/6489#issuecomment-1227958050 ## CI report: * 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN * 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN * b26294c07ac06186c66a10444e7677656be94037 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10934) * a0e2f520a7f422bd396b984c3cec2c5653a41743 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227947525 ## CI report: * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960)
[GitHub] [hudi] danny0405 commented on a diff in pull request #5884: [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles…
danny0405 commented on code in PR #5884: URL: https://github.com/apache/hudi/pull/5884#discussion_r955568822 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java: ## @@ -165,17 +179,9 @@ private T executeRequest(String requestPath, Map queryParame String url = builder.toString(); LOG.info("Sending request : (" + url + ")"); -Response response; -int timeout = this.timeoutSecs * 1000; // msec -switch (method) { - case GET: -response = Request.Get(url).connectTimeout(timeout).socketTimeout(timeout).execute(); -break; - case POST: - default: -response = Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute(); -break; -} +// Reset url and method, to avoid repeatedly instantiating objects. +urlCheckedFunc.setUrlAndMethod(url, method); Review Comment: > we should have been more careful here +10086, we should pay more attention to this core code path since there are many users now; core changes should be conservative.
[GitHub] [hudi] brskiran1 commented on issue #6304: Hudi MultiTable Deltastreamer not updating glue catalog when new column added on Source
brskiran1 commented on issue #6304: URL: https://github.com/apache/hudi/issues/6304#issuecomment-1227907870 @rmahindra123 please let me know if you have an update on this? I have tried with hoodie.schema.on.read.enable=true but still no change
[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.
hudi-bot commented on PR #6387: URL: https://github.com/apache/hudi/pull/6387#issuecomment-1227900507 ## CI report: * bdddf2706b8e0e362ad2777282ade733a45d8f03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10750) * fb86adcdcf26b1565cccf6e89c30c6058477cd85 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10963)
[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition
hudi-bot commented on PR #6438: URL: https://github.com/apache/hudi/pull/6438#issuecomment-1227900600 ## CI report: * fea65135a8035ef70929759594da64dc985a2d0a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10924) * 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10964)
[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition
hudi-bot commented on PR #6438: URL: https://github.com/apache/hudi/pull/6438#issuecomment-1227897904 ## CI report: * fea65135a8035ef70929759594da64dc985a2d0a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10924) * 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.
hudi-bot commented on PR #6387: URL: https://github.com/apache/hudi/pull/6387#issuecomment-1227897790 ## CI report: * bdddf2706b8e0e362ad2777282ade733a45d8f03 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10750) * fb86adcdcf26b1565cccf6e89c30c6058477cd85 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227895179 ## CI report: * 1cc9581196646dc677a0940c169d30407188b178 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958) * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961)
[GitHub] [hudi] nsivabalan commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime
nsivabalan commented on PR #6000: URL: https://github.com/apache/hudi/pull/6000#issuecomment-1227872985 Can you rebase with the latest master and address the minor comments from Danny? We can land it then.
[GitHub] [hudi] nsivabalan commented on issue #6341: [SUPPORT] Hudi delete not working via spark apis
nsivabalan commented on issue #6341: URL: https://github.com/apache/hudi/issues/6341#issuecomment-1227867862 Yes, likely the issue is (3) from Yann's comment above. If you set the operation to delete, you don't need to override the payload class; if you explicitly set the payload to EmptyPayload, you don't need to set the operation type to "delete". Also, can you confirm that your filtered df is actually not empty? Instead of writing to Hudi, did you do df.count to ensure there are valid records? Can you also post the contents of the .hoodie/*.commit or .hoodie/*.deltacommit file that got added to the .hoodie dir when you triggered the delete operation?
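For reference, the two mutually exclusive ways of issuing deletes that the comment above alludes to can be sketched as datasource write options. Treat this as a sketch: the operation key is the standard Hudi datasource option, but verify the exact payload class path against your Hudi version.

```
# Option A: use the delete operation (no payload override needed)
hoodie.datasource.write.operation=delete

# Option B: keep the default operation and override the payload class instead
# (do not combine with Option A; either one alone marks the records as deletes)
hoodie.datasource.write.payload.class=org.apache.hudi.common.model.EmptyHoodieRecordPayload
```

Setting both is redundant and, as the thread suggests, a common source of confusion when deletes appear to be silently ignored.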
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227863447 ## CI report: * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957) * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960)
[GitHub] [hudi] nsivabalan commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…
nsivabalan commented on PR #6393: URL: https://github.com/apache/hudi/pull/6393#issuecomment-1227856444 Hey folks, I reverted the original patch as it could lead to data issues: https://github.com/apache/hudi/pull/6501 You can put up the patch again with a proper fix around thread safety. I have added a link to where the potential issue could be in the PR description.
[jira] [Updated] (HUDI-4674) change the default value of inputFormat for the MOR table
[ https://issues.apache.org/jira/browse/HUDI-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4674: - Status: In Progress (was: Open) > change the default value of inputFormat for the MOR table > - > > Key: HUDI-4674 > URL: https://issues.apache.org/jira/browse/HUDI-4674 > Project: Apache Hudi > Issue Type: Improvement >Reporter: linfey.nie >Assignee: linfey.nie >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > When we build a MOR table, for example with Spark SQL, the default value of > inputFormat is HoodieParquetRealtimeInputFormat. But when we use Hive to sync > metadata and skip the _ro suffix for reads, the inputFormat of the original > table name should be HoodieParquetInputFormat, but currently it is not. I think we > should change the default value of inputFormat, just like the COW table.
[jira] [Updated] (HUDI-3861) 'path' in CatalogTable#properties failed to be updated when renaming table
[ https://issues.apache.org/jira/browse/HUDI-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3861: - Status: In Progress (was: Open) > 'path' in CatalogTable#properties failed to be updated when renaming table > -- > > Key: HUDI-3861 > URL: https://issues.apache.org/jira/browse/HUDI-3861 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jin Xing >Assignee: KnightChess >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > > Reproduce the issue as below > {code:java} > 1. Create a MOR table > create table mor_simple( > id int, > name string, > price double > ) > using hudi > options ( > type = 'cow', > primaryKey = 'id' > ) > 2. Renaming > alter table mor_simple rename to mor_simple0 > 3. Show create table mor_simple0 > Output as > CREATE TABLE hudi.mor_simple0 ( > `_hoodie_commit_time` STRING, > `_hoodie_commit_seqno` STRING, > `_hoodie_record_key` STRING, > `_hoodie_partition_path` STRING, > `_hoodie_file_name` STRING, > `id` INT, > `name` STRING, > `price` DOUBLE) > USING hudi > OPTIONS( > 'primaryKey' = 'id', > 'type' = 'cow') > TBLPROPERTIES( > 'path' = '/user/hive/warehous/hudi.db/mor_simple'){code} > As we can see, the 'path' property is > '/user/hive/warehous/hudi.db/mor_simple', rather than > '/user/hive/warehous/hudi.db/mor_simple0'. >
[jira] [Updated] (HUDI-3861) 'path' in CatalogTable#properties failed to be updated when renaming table
[ https://issues.apache.org/jira/browse/HUDI-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3861: - Status: Patch Available (was: In Progress) > 'path' in CatalogTable#properties failed to be updated when renaming table > -- > > Key: HUDI-3861 > URL: https://issues.apache.org/jira/browse/HUDI-3861 > Project: Apache Hudi > Issue Type: Bug >Reporter: Jin Xing >Assignee: KnightChess >Priority: Critical > Labels: pull-request-available > Fix For: 0.12.1 > > > Reproduce the issue as below > {code:java} > 1. Create a MOR table > create table mor_simple( > id int, > name string, > price double > ) > using hudi > options ( > type = 'cow', > primaryKey = 'id' > ) > 2. Renaming > alter table mor_simple rename to mor_simple0 > 3. Show create table mor_simple0 > Output as > CREATE TABLE hudi.mor_simple0 ( > `_hoodie_commit_time` STRING, > `_hoodie_commit_seqno` STRING, > `_hoodie_record_key` STRING, > `_hoodie_partition_path` STRING, > `_hoodie_file_name` STRING, > `id` INT, > `name` STRING, > `price` DOUBLE) > USING hudi > OPTIONS( > 'primaryKey' = 'id', > 'type' = 'cow') > TBLPROPERTIES( > 'path' = '/user/hive/warehous/hudi.db/mor_simple'){code} > As we can see, the 'path' property is > '/user/hive/warehous/hudi.db/mor_simple', rather than > '/user/hive/warehous/hudi.db/mor_simple0'. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
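The intended behavior this report asks for — keeping the 'path' table property in sync when the table is renamed — can be sketched as a small property rewrite. This is only an illustrative sketch of what a fix needs to do, not the actual Spark/Hudi catalog code; the function name `rename_table_path` is hypothetical:

```python
import posixpath

def rename_table_path(properties, new_table_name):
    """Rewrite the 'path' table property so its last path segment matches
    the renamed table, as HUDI-3861 expects. Returns a new dict; the
    input properties are left untouched."""
    props = dict(properties)
    old_path = props.get("path")
    if old_path:
        parent = posixpath.dirname(old_path.rstrip("/"))
        props["path"] = posixpath.join(parent, new_table_name)
    return props

# Reproduced from the report: after `alter table mor_simple rename to
# mor_simple0`, the 'path' property should point at the new location.
props = {"path": "/user/hive/warehous/hudi.db/mor_simple", "type": "cow"}
renamed = rename_table_path(props, "mor_simple0")
print(renamed["path"])  # /user/hive/warehous/hudi.db/mor_simple0
```

The sketch keeps the parent directory and swaps only the final segment, which matches the behavior the report expects from `show create table` after the rename.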
[jira] [Updated] (HUDI-4297) Test TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters* is flaky
[ https://issues.apache.org/jira/browse/HUDI-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4297: - Status: In Progress (was: Open) > Test > TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters* > is flaky > -- > > Key: HUDI-4297 > URL: https://issues.apache.org/jira/browse/HUDI-4297 > Project: Apache Hudi > Issue Type: Bug > Components: tests-ci >Reporter: Danny Chen >Assignee: Zhaojing Yu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9418/logs/36] > [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/10304/logs/16] > Both testUpsertsContinuousModeWithMultipleWritersForConflicts and > testUpsertsContinuousModeWithMultipleWritersWithoutConflicts are flaky. Fails > about 20% of the time. Increasing the timeout can only decrease the > probability of failure but that's not a fix. We need to look into the data > generator. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[hudi] branch master updated (e90872b396 -> 11f85d1efb)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git from e90872b396 [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494) add 11f85d1efb Revert "[HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles… (#5884)" (#6501) No new revisions were added by this update. Summary of changes: .../client/embedded/EmbeddedTimelineService.java | 5 -- .../common/table/view/FileSystemViewManager.java | 3 +- .../table/view/FileSystemViewStorageConfig.java| 76 -- .../view/RemoteHoodieTableFileSystemView.java | 67 +-- .../org/apache/hudi/common/util/RetryHelper.java | 46 + .../java/org/apache/hudi/util/StreamerUtil.java| 5 -- .../TestRemoteHoodieTableFileSystemView.java | 29 - 7 files changed, 36 insertions(+), 195 deletions(-)
[GitHub] [hudi] yihua merged pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"
yihua merged PR #6501: URL: https://github.com/apache/hudi/pull/6501 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer
[ https://issues.apache.org/jira/browse/HUDI-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-4696. Resolution: Fixed > Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer > > > Key: HUDI-4696 > URL: https://issues.apache.org/jira/browse/HUDI-4696 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Labels: pull-request-available > Fix For: 0.12.1 > > > https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0=746585d8-b50a-55c3-26c5-517d93af9934 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227828149 ## CI report: * 1cc9581196646dc677a0940c169d30407188b178 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958) * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227825299 ## CI report: * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957) * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227822500 ## CI report: * 1cc9581196646dc677a0940c169d30407188b178 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958) * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227822474 ## CI report: * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957) * 24c8b543afd26438898efff96c98c81130c9ca54 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227819012 ## CI report: * 1cc9581196646dc677a0940c169d30407188b178 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"
hudi-bot commented on PR #6501: URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227818974 ## CI report: * f07b0630b9654b1c9b10ff5efc0e5989625404da Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10955) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494)
This is an automated email from the ASF dual-hosted git repository. sivabalan pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new e90872b396 [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494) e90872b396 is described below commit e90872b396630318e6cc18d560f23e16c3595a29 Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com> AuthorDate: Thu Aug 25 16:58:35 2022 -0500 [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494) --- .../hudi/common/testutils/minicluster/HdfsTestService.java | 12 +--- .../TestHoodieCombineHiveInputFormat.java| 11 --- 2 files changed, 9 insertions(+), 14 deletions(-) diff --git a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java index eda8591749..ba584a4329 100644 --- a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java +++ b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java @@ -18,7 +18,6 @@ package org.apache.hudi.common.testutils.minicluster; -import org.apache.hudi.common.testutils.HoodieTestUtils; import org.apache.hudi.common.testutils.NetworkTestUtils; import org.apache.hudi.common.util.FileIOUtils; @@ -45,7 +44,7 @@ public class HdfsTestService { /** * Configuration settings. 
*/ - private Configuration hadoopConf; + private final Configuration hadoopConf; private final String workDir; /** @@ -54,6 +53,7 @@ public class HdfsTestService { private MiniDFSCluster miniDfsCluster; public HdfsTestService() throws IOException { +hadoopConf = new Configuration(); workDir = Files.createTempDirectory("temp").toAbsolutePath().toString(); } @@ -63,7 +63,6 @@ public class HdfsTestService { public MiniDFSCluster start(boolean format) throws IOException { Objects.requireNonNull(workDir, "The work dir must be set before starting cluster."); -hadoopConf = HoodieTestUtils.getDefaultHadoopConf(); // If clean, then remove the work dir so we can start fresh. String localDFSLocation = getDFSLocation(workDir); @@ -107,7 +106,6 @@ public class HdfsTestService { miniDfsCluster.shutdown(true, true); } miniDfsCluster = null; -hadoopConf = null; } /** @@ -123,9 +121,9 @@ public class HdfsTestService { /** * Configure the DFS Cluster before launching it. * - * @param config The already created Hadoop configuration we'll further configure for HDFS + * @param config The already created Hadoop configuration we'll further configure for HDFS * @param localDFSLocation The location on the local filesystem where cluster data is stored - * @param bindIP An IP address we want to force the datanode and namenode to bind to. + * @param bindIP An IP address we want to force the datanode and namenode to bind to. * @return The updated Configuration object. */ private static Configuration configureDFSCluster(Configuration config, String localDFSLocation, String bindIP, @@ -146,7 +144,7 @@ public class HdfsTestService { String user = System.getProperty("user.name"); config.set("hadoop.proxyuser." + user + ".groups", "*"); config.set("hadoop.proxyuser." 
+ user + ".hosts", "*"); -config.setBoolean("dfs.permissions",false); +config.setBoolean("dfs.permissions", false); return config; } diff --git a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java similarity index 98% rename from hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java rename to hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java index 0a14af2212..9b26a7915d 100644 --- a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java +++ b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java @@ -16,10 +16,8 @@ * limitations under the License. */ -package org.apache.hudi.hadoop.functional; +package org.apache.hudi.hadoop.hive; -import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants; -import org.apache.hadoop.hive.ql.io.IOContextMap; import org.apache.hudi.avro.HoodieAvroUtils; import org.apache.hudi.common.model.HoodieCommitMetadata; import org.apache.hudi.common.model.HoodieTableType; @@ -33,9 +31,6 @@ import org.apache.hudi.common.testutils.SchemaTestUtil; import org.apache.hudi.common.testutils.minicluster.MiniClusterUtil; import org.apache.hudi.common.util.CommitUtils; import
[GitHub] [hudi] nsivabalan merged pull request #6494: [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat
nsivabalan merged PR #6494: URL: https://github.com/apache/hudi/pull/6494 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3391) presto and hive beeline fails to read MOR table w/ 2 or more array fields
[ https://issues.apache.org/jira/browse/HUDI-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3391: - Reviewers: sivabalan narayanan > presto and hive beeline fails to read MOR table w/ 2 or more array fields > - > > Key: HUDI-3391 > URL: https://issues.apache.org/jira/browse/HUDI-3391 > Project: Apache Hudi > Issue Type: Bug > Components: dependencies, reader-core, trino-presto >Reporter: sivabalan narayanan >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.12.1 > > Original Estimate: 4h > Remaining Estimate: 4h > > We have an issue reported by a user > [here|https://github.com/apache/hudi/issues/2657]. Looks like w/ 0.10.0 or > later, spark datasource read works, but hive beeline does not work. Even > spark.sql (hive table) querying works as well. > Another related ticket: > [https://github.com/apache/hudi/issues/3834#issuecomment-997307677] > > Steps that I tried: > [https://gist.github.com/nsivabalan/fdb8794104181f93b9268380c7f7f079] > From beeline, you will encounter the below exception > {code:java} > Failed with exception > java.io.IOException:org.apache.hudi.org.apache.avro.SchemaParseException: > Can't redefine: array {code} > All linked tickets state that upgrading parquet to 1.11.0 or greater should > work. We need to try it out w/ latest master and go from there. > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4650) Commits Command: Include both active and archive timeline for a given range of instants
[ https://issues.apache.org/jira/browse/HUDI-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4650: - Reviewers: Raymond Xu (was: sivabalan narayanan) > Commits Command: Include both active and archive timeline for a given range > of instants > -- > > Key: HUDI-4650 > URL: https://issues.apache.org/jira/browse/HUDI-4650 > Project: Apache Hudi > Issue Type: Improvement > Components: cli >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4389) Make HoodieStreamingSink idempotent
[ https://issues.apache.org/jira/browse/HUDI-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4389: - Reviewers: sivabalan narayanan > Make HoodieStreamingSink idempotent > --- > > Key: HUDI-4389 > URL: https://issues.apache.org/jira/browse/HUDI-4389 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available, streaming > Fix For: 0.13.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4648) Add command to rename partition
[ https://issues.apache.org/jira/browse/HUDI-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4648: - Reviewers: Raymond Xu > Add command to rename partition > --- > > Key: HUDI-4648 > URL: https://issues.apache.org/jira/browse/HUDI-4648 > Project: Apache Hudi > Issue Type: Improvement > Components: cli >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Fix For: 0.12.1 > > > Based on https://github.com/apache/hudi/pull/6438#discussion_r949841206 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4633) Add command to trace partition through a range of commits
[ https://issues.apache.org/jira/browse/HUDI-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4633: - Reviewers: Raymond Xu (was: sivabalan narayanan) > Add command to trace partition through a range of commits > - > > Key: HUDI-4633 > URL: https://issues.apache.org/jira/browse/HUDI-4633 > Project: Apache Hudi > Issue Type: Improvement > Components: cli >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range
[ https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4528: - Reviewers: Raymond Xu (was: sivabalan narayanan) > Diff tool to compare metadata across snapshots in a given time range > > > Key: HUDI-4528 > URL: https://issues.apache.org/jira/browse/HUDI-4528 > Project: Apache Hudi > Issue Type: Task > Components: cli >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Blocker > Labels: pull-request-available > Fix For: 0.12.1 > > > A tool that diffs two snapshots at table and partition level and can give > info about what new file ids got created, deleted, updated and track other > changes that are captured in write stats. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index
hudi-bot commented on PR #6506: URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227784163 ## CI report: * 1cc9581196646dc677a0940c169d30407188b178 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Assigned] (HUDI-4485) Hudi cli got empty result for command show fsview all
[ https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-4485: Assignee: Yao Zhang > Hudi cli got empty result for command show fsview all > - > > Key: HUDI-4485 > URL: https://issues.apache.org/jira/browse/HUDI-4485 > Project: Apache Hudi > Issue Type: Bug > Components: cli >Affects Versions: 0.11.1 > Environment: Hudi version : 0.11.1 > Spark version : 3.1.1 > Hive version : 3.1.0 > Hadoop version : 3.1.1 >Reporter: Yao Zhang >Assignee: Yao Zhang >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > Attachments: spring-shell-1.2.0.RELEASE.jar > > > This issue is from: [[SUPPORT] Hudi cli got empty result for command show > fsview all · Issue #6177 · apache/hudi > (github.com)|https://github.com/apache/hudi/issues/6177] > *Describe the problem you faced* > Hudi cli got empty result after running command show fsview all. > ![image](https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png) > The type of table t1 is COW and I am sure that the parquet file is actually > generated inside data folder. Also, the parquet files are not damaged as the > data could be retrieved correctly by reading as Hudi table or directly > reading each parquet file (using Spark). > *To Reproduce* > Steps to reproduce the behavior: > 1. Enter Flink SQL client. > 2. Execute the SQL and check the data was written successfully.
> ```sql > CREATE TABLE t1( > uuid VARCHAR(20), > name VARCHAR(10), > age INT, > ts TIMESTAMP(3), > `partition` VARCHAR(20) > ) > PARTITIONED BY (`partition`) > WITH ( > 'connector' = 'hudi', > 'path' = 'hdfs:///path/to/table/', > 'table.type' = 'COPY_ON_WRITE' > ); > -- insert data using values > INSERT INTO t1 VALUES > ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'), > ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'), > ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'), > ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'), > ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'), > ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'), > ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'), > ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4'); > ``` > 3. Enter Hudi cli and execute `show fsview all` > *Expected behavior* > `show fsview all` in Hudi cli should return all file slices. > *Environment Description* > * Hudi version : 0.11.1 > * Spark version : 3.1.1 > * Hive version : 3.1.0 > * Hadoop version : 3.1.1 > * Storage (HDFS/S3/GCS..) : HDFS > * Running on Docker? (yes/no) : no > *Additional context* > No. > *Stacktrace* > N/A > > Temporary solution: > I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the > attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227779992 ## CI report: * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN * 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227780026 ## CI report: * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] parisni opened a new pull request, #6506: Allow hoodie read client to choose index
parisni opened a new pull request, #6506: URL: https://github.com/apache/hudi/pull/6506 ### Change Logs Currently the hudi read client uses BLOOM and this cannot be overridden. This allows using GLOBAL_BLOOM and provides fast lookup on the primary key (without partition keys) ``` HudiReadClient client = new HudiReadClient(context, path, spark.sqlContext(), GLOBAL_BLOOM); client.readROView(keyRdd, 200); ``` ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer
the-other-tim-brown commented on code in PR #6111: URL: https://github.com/apache/hudi/pull/6111#discussion_r955436628 ## rfc/rfc-57/rfc-57.md: ## @@ -0,0 +1,85 @@ + +# RFC-57: DeltaStreamer Protobuf Support + + + +## Proposers + +- @the-other-tim-brown + +## Approvers +- @bhasudha +- @vinothchandar + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-4399 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +Support consuming Protobuf messages from Kafka with the DeltaStreamer. + +## Background +Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require: +1. Parsing the data from Kafka into Protobuf Messages +2. Generating a schema from a Protobuf Message class +3. Converting from Protobuf to Avro + +## Implementation + +### Parsing Data from Kafka +Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message. + +Configuration options: +hoodie.deltastreamer.schemaprovider.proto.className - The class to use + +### ProtobufClassBasedSchemaProvider +This new SchemaProvider will allow the user to provide a Protobuf Message class and get an Avro Schema. In the proto world, there is no concept of a nullable field so people use wrapper types such as Int32Value and StringValue to represent a nullable field. The schema provider will also allow the user to treat these wrapper fields as nullable versions of the fields they are wrapping instead of treating them as a nested message. In practice, this means that the user can choose between representing a field `Int32Value my_int = 1;` as `my_int.value` or simply `my_int` when writing the data out to the file system. + + Handling of Unsigned Integers and Longs +Protobuf provides support for unsigned integers and longs while Avro does not. 
The schema provider will convert unsigned integers and longs to Avro long type in the schema definition. + + Schema Evolution +**Adding a Field:** +Protobuf has a default value for all fields and the translation from proto to avro schema will carry over this default value so there are no errors when adding a new field to the proto definition. +**Removing a Field:** +If a user removes a field in the Protobuf schema, the schema provider will not be able to add this field to the avro schema it generates. To avoid issues when writing data, users must use `hoodie.datasource.write.reconcile.schema=true` to properly reconcile the schemas if a field is removed from the proto definition. Users can avoid this situation by using `deprecated` field option in proto instead of removing the field from the schema. + +Configuration Options: +hoodie.deltastreamer.schemaprovider.proto.className - The class to use +hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - By default the wrapper classes will be treated like any other message and have a nested `value` field. When this is set to true, we do not have a nested `value` field and treat the field as nullable in the generated Schema + +### ProtoToAvroConverter Review Comment: I'll add this to the RFC but note that it likely won't be done in the first cut. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer
the-other-tim-brown commented on code in PR #6111: URL: https://github.com/apache/hudi/pull/6111#discussion_r955436277 ## rfc/rfc-57/rfc-57.md: ## @@ -0,0 +1,85 @@ + +# RFC-57: DeltaStreamer Protobuf Support + + + +## Proposers + +- @the-other-tim-brown + +## Approvers +- @bhasudha +- @vinothchandar + +## Status + +JIRA: https://issues.apache.org/jira/browse/HUDI-4399 + +> Please keep the status updated in `rfc/README.md`. + +## Abstract + +Support consuming Protobuf messages from Kafka with the DeltaStreamer. + +## Background +Hudi's DeltaStreamer currently supports consuming Avro and JSON data from Kafka but it does not support Protobuf. Adding support will require: +1. Parsing the data from Kafka into Protobuf Messages +2. Generating a schema from a Protobuf Message class +3. Converting from Protobuf to Avro + +## Implementation + +### Parsing Data from Kafka +Users will provide a classname for the Protobuf Message that is contained within a jar that is on the path. We will then implement a deserializer that parses the bytes from the kafka message into a protobuf Message. + +Configuration options: +hoodie.deltastreamer.schemaprovider.proto.className - The class to use Review Comment: Do you have experience using that confluent value deserializer? We can add that in as an option but I don't have experience with it so may need your help. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227775803 ## CI report: * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN * 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning
hudi-bot commented on PR #6505: URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227775849 ## CI report: * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 UNKNOWN
[GitHub] [hudi] yihua commented on issue #6056: [SUPPORT] Metadata table suddenly not cleaned / compacted anymore
yihua commented on issue #6056: URL: https://github.com/apache/hudi/issues/6056#issuecomment-1227773266 Here's the tracking Jira ticket: [HUDI-4688](https://issues.apache.org/jira/browse/HUDI-4688).
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4722: - Labels: pull-request-available (was: ) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Jagmeet bali > Priority: Minor > Labels: pull-request-available > > Added metrics for the following: > # Lock request latency > # Count of lock successes > # Count of failures to acquire the lock > # Duration of locks held, with support for re-entrancy > # Conflict resolution metrics: success vs. failure -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-4485) Hudi cli got empty result for command show fsview all
[ https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-4485: - Priority: Major (was: Minor) > Hudi cli got empty result for command show fsview all > - > > Key: HUDI-4485 > URL: https://issues.apache.org/jira/browse/HUDI-4485 > Project: Apache Hudi > Issue Type: Bug > Components: cli > Affects Versions: 0.11.1 > Environment: Hudi version : 0.11.1 > Spark version : 3.1.1 > Hive version : 3.1.0 > Hadoop version : 3.1.1 > Reporter: Yao Zhang > Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > Attachments: spring-shell-1.2.0.RELEASE.jar > > > This issue is from: [[SUPPORT] Hudi cli got empty result for command show fsview all · Issue #6177 · apache/hudi (github.com)|https://github.com/apache/hudi/issues/6177]
> *Describe the problem you faced*
> Hudi cli returned an empty result after running the command `show fsview all`.
> (screenshot: https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png)
> The type of table t1 is COW, and I am sure that the parquet files were actually generated inside the data folder. Also, the parquet files are not damaged, as the data could be retrieved correctly by reading it as a Hudi table or by directly reading each parquet file (using Spark).
> *To Reproduce*
> Steps to reproduce the behavior:
> 1. Enter the Flink SQL client.
> 2. Execute the SQL and check that the data was written successfully.
> ```sql
> CREATE TABLE t1(
>   uuid VARCHAR(20),
>   name VARCHAR(10),
>   age INT,
>   ts TIMESTAMP(3),
>   `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
>   'connector' = 'hudi',
>   'path' = 'hdfs:///path/to/table/',
>   'table.type' = 'COPY_ON_WRITE'
> );
> -- insert data using values
> INSERT INTO t1 VALUES
>   ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
>   ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
>   ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
>   ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
>   ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
>   ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
>   ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
>   ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter the Hudi cli and execute `show fsview all`.
> *Expected behavior*
> `show fsview all` in the Hudi cli should return all file slices.
> *Environment Description*
> * Hudi version : 0.11.1
> * Spark version : 3.1.1
> * Hive version : 3.1.0
> * Hadoop version : 3.1.1
> * Storage (HDFS/S3/GCS..) : HDFS
> * Running on Docker? (yes/no) : no
> *Additional context*
> No.
> *Stacktrace*
> N/A
>
> Temporary solution:
> I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.
[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi
hudi-bot commented on PR #6502: URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227771890 ## CI report: * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
[GitHub] [hudi] parisni opened a new pull request, #6505: AwsglueSync Turn already exist error into warning
parisni opened a new pull request, #6505: URL: https://github.com/apache/hudi/pull/6505 ### Change Logs This avoids the sync failing when a likely concurrent sync happens, or in other cases. Fixes #5960 ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] santoshraj123 opened a new issue, #6504: [SUPPORT]
santoshraj123 opened a new issue, #6504: URL: https://github.com/apache/hudi/issues/6504 Hello, we are facing an issue when running the spark-submit command after performing a DELETE operation on the Postgres database. The Spark command, schema and properties file are given below. The Spark command successfully generates the target Hudi tables after an INSERT of a row into the database, but it fails with a "rolled-back" HoodieException when run after the DELETE. We are using EMR version 6.7.0 with Hudi 0.11.0-amzn-0. The source of the data is a Postgres database. AWS DMS generates the parquet file from Postgres and lands the datasets into the S3 landing zone. We tried both COPY_ON_WRITE and MERGE_ON_READ, yet DELETEs fail. Environment information: -- Hudi version : 0.11.0-amzn-0 Spark version : version 3.2.1-amzn-0 Hive version : 3.1.3 Scala version : 2.12.15 Hadoop version : xxx Storage (HDFS/S3/GCS..) : S3 Running on Docker? (yes/no) : no DMS Engine Version: 3.4.7 **Spark command** - sudo spark-submit --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-utilities-bundle.jar \ --master yarn \ --deploy-mode client \ --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ --conf spark.sql.hive.convertMetastoreParquet=false \ --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer /usr/lib/hudi/hudi-utilities-bundle.jar \ --table-type MERGE_ON_READ \ --source-ordering-field order_id \ --props s3_url/hoodie-glue.properties \ --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \ --target-base-path s3_url/v_hudi_orders \ --target-table v_hudi_orders --payload-class org.apache.hudi.common.model.AWSDmsAvroPayload \ --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \ --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \ --hoodie-conf hoodie.deltastreamer.transformer.sql="SELECT Op, dms_received_ts, col1, col2, col3, CASE WHEN 
a.Op = 'D' THEN true ELSE false END as _hoodie_is_deleted FROM a" \ --op BULK_INSERT **Stacktrace**: --- 22/08/23 15:15:46 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 22/08/23 15:15:46 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" org.apache.hudi.exception.HoodieException: Commit 20220823151531894 failed and rolled-back ! at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:649) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:331) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:200) at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:198) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:549) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 22/08/23 15:15:46 INFO ShutdownHookManager: Shutdown hook called Properties file - hoodie.table.name=t_hudi_able hoodie.table.type=MERGE_ON_READ 
hoodie.deltastreamer.source.dfs.root=s3_URL hoodie.datasource.write.recordkey.field=col1 (pk) hoodie.datasource.write.partitionpath.field=col3 hoodie.datasource.write.precombine.field=ts (DMS generated) hoodie.datasource.hive_sync.enable=true hoodie.datasource.hive_sync.table=t_hudi_able hoodie.datasource.hive_sync.database=default hoodie.datasource.write.hive_style_partitioning=true hoodie.datasource.hive_sync.partition_fields=col3
[GitHub] [hudi] maduraitech opened a new issue, #6503: Hudi Merge Into with larger volume
maduraitech opened a new issue, #6503: URL: https://github.com/apache/hudi/issues/6503 Use case: We are trying to perform a merge into that updates partial columns, or else inserts new records, in a single command. Issue: Data is not being updated as expected; instead it tries to insert records that already exist, creating duplicates. Also, it only updates a few rows. When we retry the same merge into statement with the same data, it always inserts new rows, and specific rows keep getting updated on every run. **Environment Description: Hudi: 0.11.0 Spark: 2.4.8 Storage: GCS** More Details: When we tried a similar use case on small tables, it worked fine. **We do have the following additional options:** We added the Hudi write configs below while creating the table; we don't see much difference, but rather it is not even updating the column that was previously being updated for a few rows. Options ( hoodie.datasource.write.table.type='COPY_ON_WRITE', primaryKey = 'col1,col2 etc.', hoodie.datasource.write.hive_style_partitioning = false, hoodie.datasource.write.operation = 'upsert', hoodie.datasource.write.payload.class = 'org.apache.hudi.common.model.DefaultHoodieRecordPayload', hoodie.datasource.write.keygenerator.class = 'org.apache.hudi.keygen.ComplexKeyGenerator' ) In addition, we also tried combining all our keys into one (in case having too many key columns was the concern) and performing the merge. Even this scenario showed no difference in behaviour. **Please note:** The reason we don't want a precombine field at the table level is that it would have to be included in our updates, which we don't want as part of the use-case behavior. For the lower volume we tested, we didn't have a precombine field at the table level.
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagmeet bali updated HUDI-4722: --- Description: Added metrics for the following: # Lock request latency # Count of lock successes # Count of failures to acquire the lock # Duration of locks held, with support for re-entrancy # Conflict resolution metrics: success vs. failure > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Jagmeet bali > Priority: Minor > > Added metrics for the following: > # Lock request latency > # Count of lock successes > # Count of failures to acquire the lock > # Duration of locks held, with support for re-entrancy > # Conflict resolution metrics: success vs. failure
[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra
[ https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jagmeet bali updated HUDI-4722: --- Priority: Minor (was: Major) > Add support for metrics for locking infra > - > > Key: HUDI-4722 > URL: https://issues.apache.org/jira/browse/HUDI-4722 > Project: Apache Hudi > Issue Type: Improvement > Reporter: Jagmeet bali > Priority: Minor >
[jira] [Created] (HUDI-4722) Add support for metrics for locking infra
Jagmeet bali created HUDI-4722: -- Summary: Add support for metrics for locking infra Key: HUDI-4722 URL: https://issues.apache.org/jira/browse/HUDI-4722 Project: Apache Hudi Issue Type: Improvement Reporter: Jagmeet bali
[GitHub] [hudi] jsbali opened a new pull request, #6502: Added locking metrics for Hudi
jsbali opened a new pull request, #6502: URL: https://github.com/apache/hudi/pull/6502 ### Change Logs Added metrics for the following for the locking infra 1. Lock request latency 2. Count of lock successes 3. Count of failures to acquire the lock 4. Duration of locks held, with support for re-entrancy 5. Conflict resolution metrics: success vs. failure ### Impact _Describe any public API or user-facing feature change or any performance impact._ **Risk level: none | low | medium | high** _Choose one. If medium or high, explain what verification was done to mitigate the risks._ ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"
hudi-bot commented on PR #6501: URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227716705 ## CI report: * f07b0630b9654b1c9b10ff5efc0e5989625404da Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10955)
[jira] [Updated] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView
[ https://issues.apache.org/jira/browse/HUDI-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-4721: - Labels: pull-request-available (was: ) > Fix thread safety w/ RemoteTableFileSystemView > --- > > Key: HUDI-4721 > URL: https://issues.apache.org/jira/browse/HUDI-4721 > Project: Apache Hudi > Issue Type: Test > Components: reader-core, writer-core > Reporter: sivabalan narayanan > Priority: Major > Labels: pull-request-available > > After retry mechanism was added to RemoteTableFileSystemView, looks like the code is not thread safe. > > [https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946] > > This might impact regular flows as well even if no retries are enabled.
[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"
hudi-bot commented on PR #6501: URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227711584 ## CI report: * f07b0630b9654b1c9b10ff5efc0e5989625404da UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…
hudi-bot commented on PR #6500: URL: https://github.com/apache/hudi/pull/6500#issuecomment-1227706366 ## CI report: * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952)
[GitHub] [hudi] nsivabalan opened a new pull request, #6501: Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"
nsivabalan opened a new pull request, #6501: URL: https://github.com/apache/hudi/pull/6501 This reverts commit 660177bce1cd82975d7c25715497e0d2fbb2a95e. ### Change Logs Some [thread safety issues](https://github.com/apache/hudi/pull/5884/files#r955363946) were detected with this feature added. Reverting it for now. I will let the author put up a new patch with a proper fix. ### Impact Could result in wrong data being served. **Risk level: high** ### Contributor's checklist - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute) - [ ] Change Logs and Impact were stated clearly - [ ] Adequate tests were added if applicable - [ ] CI passed
[jira] [Created] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView
sivabalan narayanan created HUDI-4721: - Summary: Fix thread safety w/ RemoteTableFileSystemView Key: HUDI-4721 URL: https://issues.apache.org/jira/browse/HUDI-4721 Project: Apache Hudi Issue Type: Test Components: reader-core, writer-core Reporter: sivabalan narayanan After retry mechanism was added to RemoteTableFileSystemView, looks like the code is not thread safe. [https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946] This might impact regular flows as well even if no retries are enabled.
[GitHub] [hudi] yihua commented on a diff in pull request #5884: [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles…
yihua commented on code in PR #5884: URL: https://github.com/apache/hudi/pull/5884#discussion_r955364712 ## hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java: ##
@@ -165,17 +179,9 @@ private <T> T executeRequest(String requestPath, Map<String, String> queryParameters, ...
     String url = builder.toString();
     LOG.info("Sending request : (" + url + ")");
-    Response response;
-    int timeout = this.timeoutSecs * 1000; // msec
-    switch (method) {
-      case GET:
-        response = Request.Get(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-        break;
-      case POST:
-      default:
-        response = Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-        break;
-    }
+    // Reset url and method, to avoid repeatedly instantiating objects.
+    urlCheckedFunc.setUrlAndMethod(url, method);
+    Response response = retryHelper != null ? retryHelper.tryWith(urlCheckedFunc).start() : urlCheckedFunc.get();
Review Comment: @LinMingQiang @danny0405 Every request goes through this flow, and `urlCheckedFunc` should not be shared across requests. The logic here is incorrect and can cause serious correctness problems under concurrency. We need to revert this logic. I also suggest that we guard such changes in the hot path with a flag. cc @nsivabalan @rmahindra123
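The hazard the reviewer describes can be illustrated with a simplified stand-in: a single mutable "request function" whose `setUrlAndMethod(...)` is called before each request means one thread can overwrite another thread's URL before it executes. This sketch is not Hudi's actual code; the class names and the per-request fix are illustrative.

```java
import java.util.function.Supplier;

// Simplified illustration of the shared-mutable-state bug discussed above,
// and the thread-safe alternative: capture the URL immutably per request.
public class PerRequestFunctionSketch {

    // Unsafe shape: one shared instance, mutated before every call. Two
    // concurrent callers race on the `url` field, so a request can execute
    // with another request's URL.
    static class SharedUrlFunc implements Supplier<String> {
        private String url; // mutated by all threads; no synchronization
        void setUrl(String url) { this.url = url; }
        @Override public String get() { return "executed " + url; }
    }

    // Safe shape: the URL is fixed at construction time, so each request
    // gets its own immutable function and there is nothing to race on.
    static Supplier<String> perRequest(String url) {
        return () -> "executed " + url;
    }

    public static void main(String[] args) {
        Supplier<String> a = perRequest("http://host/a");
        Supplier<String> b = perRequest("http://host/b");
        // Creating b cannot disturb a, unlike the shared mutable version,
        // where a second setUrl(...) would clobber the first URL.
        System.out.println(a.get()); // executed http://host/a
        System.out.println(b.get()); // executed http://host/b
    }
}
```

The cost of a small per-request allocation is usually negligible next to an HTTP round trip, which is one reason the revert (and a per-request function, if the retry feature returns) is the safer design.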