[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-08-25 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1228077953

   
   ## CI report:
   
   * be94781340ba821d5de240c1a4eed249efa2e0db Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10756)
 
   * 37785220f2d17a1a04d136521f10c3a0314fe448 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10970)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-08-25 Thread GitBox


hudi-bot commented on PR #6384:
URL: https://github.com/apache/hudi/pull/6384#issuecomment-1228073942

   
   ## CI report:
   
   * be94781340ba821d5de240c1a4eed249efa2e0db Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10756)
 
   * 37785220f2d17a1a04d136521f10c3a0314fe448 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-08-25 Thread GitBox


hudi-bot commented on PR #5629:
URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228073114

   
   ## CI report:
   
   * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
   * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
   * 858a47a5b106462a5089ecf77278196bc7c7a0a8 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] ThinkerLei commented on a diff in pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-08-25 Thread GitBox


ThinkerLei commented on code in PR #6384:
URL: https://github.com/apache/hudi/pull/6384#discussion_r955653574


##
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##
@@ -64,19 +64,17 @@
 import java.util.function.Function;
 import java.util.function.Predicate;
 import java.util.regex.Matcher;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
 /**
  * Utility functions related to accessing the file storage.
  */
 public class FSUtils {
 
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
-  // Log files are of this pattern - 
.b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
-  private static final Pattern LOG_FILE_PATTERN =
-  
Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
   private static final String LOG_FILE_PREFIX = ".";

Review Comment:
   OK, I re-updated the PR. Can you give some advice? Thanks.
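For context, a minimal, hypothetical sketch of the kind of string-based parsing the PR title alludes to (the class and method names below are invented for illustration; the actual change simply reuses `HoodieLogFile.LOG_FILE_PATTERN` instead of keeping a second copy of the pattern in `FSUtils`):

```java
// Illustrative only, not the PR's implementation. Parses a log file name of the
// documented form ".<fileId>_<baseCommitTime>.log.<version>[_<writeToken>]"
// (e.g. ".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1")
// with plain string operations instead of running a regex match per call.
public final class LogFileNameSketch {

  public static int parseLogVersion(String fileName) {
    int lastDot = fileName.lastIndexOf('.');           // dot right before the version
    String tail = fileName.substring(lastDot + 1);
    // An optional "_<writeToken>" suffix may follow the version; strip it if present.
    int underscore = tail.indexOf('_');
    return Integer.parseInt(underscore < 0 ? tail : tail.substring(0, underscore));
  }

  public static String parseFileId(String fileName) {
    int underscore = fileName.indexOf('_');
    return fileName.substring(1, underscore);           // skip the leading '.'
  }

  public static void main(String[] args) {
    String name = ".b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1";
    System.out.println(parseFileId(name) + " / v" + parseLogVersion(name));
  }
}
```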






[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955645966


##
rfc/rfc-51/rfc-51.md:
##
@@ -215,18 +245,31 @@ Note:
 
 - Only instants that are active can be queried in a CDC scenario.
 - `CDCReader` manages all the things on CDC, and all the spark 
entrances(DataFrame, SQL, Streaming) call the functions in `CDCReader`.
-- If `hoodie.table.cdc.supplemental.logging` is false, we need to do more work 
to get the change data. The following illustration explains the difference when 
this config is true or false.
+- If `hoodie.table.cdc.supplemental.logging.mode=KEY_OP`, we need to compute 
the changed data. The following illustrates the difference.
 
 ![](read_cdc_log_file.jpg)
 
  COW table
 
-Reading COW table in CDC query mode is equivalent to reading a simplified MOR 
table that has no normal log files.
+Reading COW tables in CDC query mode is equivalent to reading MOR tables in RO 
mode.
 
  MOR table
 
-According to the design of the writing part, only the cases where writing mor 
tables will write out the base file (which call the `HoodieMergeHandle` and 
it's subclasses) will write out the cdc files.
-In other words, cdc files will be written out only for the index and file size 
reasons.
+According to the section "Persisting CDC in MOR", CDC data is available upon 
base files' generation.
+
+When users want to get fresher real-time CDC results:
+
+- users are to set `hoodie.datasource.query.incremental.type=snapshot`
+- the implementation logic is to compute the results in-flight by reading log 
files and the corresponding base files (
+  current and previous file slices).
+- this is equivalent to running incremental-query on MOR RT tables
+
+When users want to optimize compute-cost and are tolerant with latency of CDC 
results,
+
+- users are to set `hoodie.datasource.query.incremental.type=read_optimized`
+- the implementation logic is to extract the results by reading persisted CDC 
data and the corresponding base files (
+  current and previous file slices).

Review Comment:
   Do we need these two config options then?
   `hoodie.datasource.query.incremental.type=snapshot`
   `hoodie.datasource.query.incremental.type=read_optimized`
   
   Very confusing from my side: shouldn't it always be reading the freshest CDC data here? Why do we expose an RO view in CDC streaming read, and for what use case?






[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955644523


##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +152,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in 
case of Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing.
+
+Write-on-compaction can always persist CDC info and achieve standardization of 
implementation logic across engines,
+however, some delays are added to the CDC query results. Based on the business 
requirements, Log Compaction (RFC-48) or
+scheduling more frequent compaction can be used to minimize the latency.
 
-Spark DataSource example:
+The semantics we propose to establish are: when base files are written, the 
corresponding CDC data is also persisted.
+
+- For Spark
+  - inserts are written to base files: the CDC data `op=I` will be persisted
+  - updates/deletes that written to log files are compacted into base files: 
the CDC data `op=U|D` will be persisted
+- For Flink
+  - inserts/updates/deletes that written to log files are compacted into base 
files: the CDC data `op=I|U|D` will be
+persisted
+

Review Comment:
   >inserts/updates/deletes that written to log files are compacted into base 
files: the CDC data `op=I|U|D` will be
   persisted
   
   I don't think the compaction-generated CDC logs make any sense in production: they lose data freshness for the CDC stream and rely on the compaction service, which itself is not a very robust piece of infrastructure.






[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955643711


##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +152,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in 
case of Flink writer or Bucket

Review Comment:
   So what is the solution here?






[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955642598


##
rfc/rfc-51/rfc-51.md:
##
@@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enable=true` and 
`hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.table.cdc.enabled | `false` | set to `true` for CDC queries |
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | - | required. |
+| hoodie.datasource.read.end.timestamp | - | optional. |
 
-![](points.jpg)
+### Logical File Types
 
-### Config Definitions
+We define 4 logical file types for the CDC scenario.

Review Comment:
   Can someone give an explanation here? What exactly is a logical file type? Naming it an action seems more suitable here.



[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955641659


##
rfc/rfc-51/rfc-51.md:
##
@@ -64,71 +65,74 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set 
`hoodie.datasource.query.incremental.format=cdc` and 
`hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for incremental query. |
+| hoodie.datasource.query.incremental.format | `latest_state` | `latest_state` (current incremental query behavior) returns the latest records' values. Set to `cdc` to return the full CDC results. |
+| hoodie.datasource.read.start.timestamp | - | required. |
+| hoodie.datasource.read.end.timestamp | - | optional. |

Review Comment:
   `hoodie.datasource.read.start.timestamp`
   `hoodie.datasource.read.end.timestamp`
   
   Would these two options be needed by the fs view API, or only by the reader/writer? The `start.timestamp` should also have a default value and, by default, consume from the latest commit.
   
   We should state clearly what the default value is here if it is optional.






[GitHub] [hudi] hudi-bot commented on pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false

2022-08-25 Thread GitBox


hudi-bot commented on PR #6347:
URL: https://github.com/apache/hudi/pull/6347#issuecomment-1228042484

   
   ## CI report:
   
   * 386a9eb87a073a4c956fc5f5329701feeb012227 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10702)
 
   * 473a8b74676e345ee91093a3fe9885e062ca Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10969)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955640553


##
rfc/rfc-51/rfc-51.md:
##
@@ -64,71 +65,74 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|-----|---------|-------------|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging.mode | `KEY_OP` | A mode to indicate the level of changed data being persisted. At the minimum level, `KEY_OP` indicates changed records' keys and operations to be persisted. `DATA_BEFORE`: persist records' before-images in addition to `KEY_OP`. `DATA_BEFORE_AFTER`: persist records' after-images in addition to `DATA_BEFORE`. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set 
`hoodie.datasource.query.incremental.format=cdc` and 
`hoodie.datasource.query.type=incremental`.
 

Review Comment:
   `hoodie.datasource.query.incremental.format=cdc`
   `hoodie.datasource.query.type=incremental`
   
   Can someone explain why we need these two options? Shouldn't they both be the defaults, with no need to configure them explicitly?
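For readers trying to follow the question, a hedged sketch of the query the two options describe, using the option names exactly as they appear in the RFC text; the table path is hypothetical and this is not the PR's code:

```java
// `hoodie.datasource.query.type=incremental` selects the incremental code path, and
// `hoodie.datasource.query.incremental.format=cdc` switches its output from the
// "latest state of changed records" to full CDC rows, per the RFC's configuration table.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CdcQueryExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("cdc-query").getOrCreate();
    Dataset<Row> cdcRows = spark.read().format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.query.incremental.format", "cdc")
        .option("hoodie.datasource.read.start.timestamp", "20220825000000") // required per the table
        .load("/tmp/hudi_cdc_table");                                       // hypothetical path
    cdcRows.show(false);
  }
}
```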






[GitHub] [hudi] hudi-bot commented on pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false

2022-08-25 Thread GitBox


hudi-bot commented on PR #6347:
URL: https://github.com/apache/hudi/pull/6347#issuecomment-1228039780

   
   ## CI report:
   
   * 386a9eb87a073a4c956fc5f5329701feeb012227 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10702)
 
   * 473a8b74676e345ee91093a3fe9885e062ca UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r955638291


##
rfc/rfc-51/rfc-51.md:
##
@@ -42,11 +43,11 @@ In cases where Hudi tables used as streaming sources, we 
want to be aware of all
 
 To implement this feature, we need to implement the logic on the write and 
read path to let Hudi figure out the changed data when read. In some cases, we 
need to write extra data to help optimize CDC queries.
 
-## Scenarios
+## Scenario Illustration

Review Comment:
   > should produce separate CDC rows
   
   I guess this is a must? How could you combine them into one row when the schemas do not match?
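For illustration, a minimal sketch of the CDC row shape under discussion, assuming the four Debezium-style columns are `op`, `ts_ms`, `before`, and `after` (the payload types are simplified to strings here):

```java
// Hedged illustration of the schema mismatch being discussed: a CDC row carries the
// Debezium-style change columns, so it cannot be folded into an ordinary data row
// whose columns are the table's own fields.
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class CdcRowSchemaSketch {
  public static StructType cdcSchema() {
    return new StructType()
        .add("op", DataTypes.StringType)      // I / U / D
        .add("ts_ms", DataTypes.StringType)   // change timestamp
        .add("before", DataTypes.StringType)  // before-image, JSON-encoded in this sketch
        .add("after", DataTypes.StringType);  // after-image, JSON-encoded in this sketch
  }

  public static void main(String[] args) {
    System.out.println(cdcSchema().treeString());
  }
}
```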






[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition

2022-08-25 Thread GitBox


hudi-bot commented on PR #6438:
URL: https://github.com/apache/hudi/pull/6438#issuecomment-1228037192

   
   ## CI report:
   
   * 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10964)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-08-25 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1228037279

   
   ## CI report:
   
   * 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN
   * 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN
   * a0e2f520a7f422bd396b984c3cec2c5653a41743 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10965)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #6384: [HUDI-4613] Avoid the use of regex expressions when call hoodieFileGroup#addLogFile function

2022-08-25 Thread GitBox


danny0405 commented on code in PR #6384:
URL: https://github.com/apache/hudi/pull/6384#discussion_r955636239


##
hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java:
##
@@ -64,19 +64,17 @@
 import java.util.function.Function;
 import java.util.function.Predicate;
 import java.util.regex.Matcher;
-import java.util.regex.Pattern;
 import java.util.stream.Collectors;
 import java.util.stream.Stream;
 
+import static org.apache.hudi.common.model.HoodieLogFile.LOG_FILE_PATTERN;
+
 /**
  * Utility functions related to accessing the file storage.
  */
 public class FSUtils {
 
   private static final Logger LOG = LogManager.getLogger(FSUtils.class);
-  // Log files are of this pattern - 
.b5068208-e1a4-11e6-bf01-fe55135034f3_20170101134598.log.1
-  private static final Pattern LOG_FILE_PATTERN =
-  
Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
   private static final String LOG_FILE_PREFIX = ".";

Review Comment:
   I don't think so; memory footprint is more critical for the fs view (the risk of OOM) compared to CPU cost.






[GitHub] [hudi] danny0405 commented on pull request #6491: [HUDI-4714] HoodieFlinkWriteClient can't load callback config to Hood…

2022-08-25 Thread GitBox


danny0405 commented on PR #6491:
URL: https://github.com/apache/hudi/pull/6491#issuecomment-1228035391

   Can you rebase onto the latest master and force-push again?





[GitHub] [hudi] honeyaya commented on a diff in pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false

2022-08-25 Thread GitBox


honeyaya commented on code in PR #6347:
URL: https://github.com/apache/hudi/pull/6347#discussion_r955632739


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -192,18 +194,31 @@ public void addPartitionsToTable(String tableName, 
List partitionsToAdd)
 LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + 
tableName);
 try {
   StorageDescriptor sd = client.getTable(databaseName, tableName).getSd();
-  List partitionList = partitionsToAdd.stream().map(partition 
-> {
+  if (syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) <= 0) {
+throw new HoodieHiveSyncException("batch-sync-num for sync hive table 
must be greater than 0, pls check your parameter");
+  }
+  List partitionList = new ArrayList<>();
+  int batchSyncPartitionNum = 
syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM);
+  for (int idx = 0; idx < partitionsToAdd.size(); idx++) {

Review Comment:
   This is a much cleaner style of writing than what I did before; done. Thanks.
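A standalone sketch of the batching pattern being discussed, with the metastore call replaced by a placeholder (the real change lives in `HMSDDLExecutor#addPartitionsToTable` and is driven by `HIVE_BATCH_SYNC_PARTITION_NUM`):

```java
// Flush the accumulated partitions to the metastore every `batchSyncPartitionNum`
// entries instead of in one huge call; names and the flush step are simplified
// placeholders for the actual client.add_partitions(...) call.
import java.util.ArrayList;
import java.util.List;

public class BatchSyncSketch {
  static void addPartitionsInBatches(List<String> partitionsToAdd, int batchSyncPartitionNum) {
    List<String> batch = new ArrayList<>();
    for (int idx = 0; idx < partitionsToAdd.size(); idx++) {
      batch.add(partitionsToAdd.get(idx));
      boolean batchFull = (idx + 1) % batchSyncPartitionNum == 0;
      boolean lastElement = idx == partitionsToAdd.size() - 1;
      if (batchFull || lastElement) {
        flushToMetastore(batch);   // placeholder for the HMS add_partitions call
        batch.clear();
      }
    }
  }

  static void flushToMetastore(List<String> batch) {
    System.out.println("syncing " + batch.size() + " partitions");
  }

  public static void main(String[] args) {
    addPartitionsInBatches(List.of("dt=2022-08-23", "dt=2022-08-24", "dt=2022-08-25"), 2);
  }
}
```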






[GitHub] [hudi] honeyaya commented on a diff in pull request #6347: [HUDI-4582] Support batch synchronization of partition to hive metastore to avoid timeout with --sync-mode="hms" and use-jdbc=false

2022-08-25 Thread GitBox


honeyaya commented on code in PR #6347:
URL: https://github.com/apache/hudi/pull/6347#discussion_r955632261


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/ddl/HMSDDLExecutor.java:
##
@@ -192,18 +194,31 @@ public void addPartitionsToTable(String tableName, 
List partitionsToAdd)
 LOG.info("Adding partitions " + partitionsToAdd.size() + " to table " + 
tableName);
 try {
   StorageDescriptor sd = client.getTable(databaseName, tableName).getSd();
-  List partitionList = partitionsToAdd.stream().map(partition 
-> {
+  if (syncConfig.getIntOrDefault(HIVE_BATCH_SYNC_PARTITION_NUM) <= 0) {
+throw new HoodieHiveSyncException("batch-sync-num for sync hive table 
must be greater than 0, pls check your parameter");
+  }

Review Comment:
   Thanks for your comment.
   
   ValidationUtils is a good suggestion; already done.
   
   But because the property HIVE_BATCH_SYNC_PARTITION_NUM is an optional parameter, the default value is used when the user does not set it, so validating it at the sync config level (HiveSyncConfig.java) might not fit.
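A minimal, self-contained sketch of the suggested precondition; in Hudi this would go through `ValidationUtils.checkArgument`, but any equivalent guard illustrates the point, and the message text mirrors the one in the diff above:

```java
// Fail fast when the configured batch size is not positive, instead of looping with it.
public class BatchNumGuard {
  static void checkArgument(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  public static void main(String[] args) {
    int batchSyncPartitionNum = Integer.parseInt(args.length > 0 ? args[0] : "1000");
    checkArgument(batchSyncPartitionNum > 0,
        "batch-sync-num for sync hive table must be greater than 0, please check your parameter");
    System.out.println("batch size ok: " + batchSyncPartitionNum);
  }
}
```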
   






[jira] [Updated] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView

2022-08-25 Thread Danny Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Danny Chen updated HUDI-4721:
-
Fix Version/s: 0.12.1

> Fix thread safety w/ RemoteTableFileSystemView 
> ---
>
> Key: HUDI-4721
> URL: https://issues.apache.org/jira/browse/HUDI-4721
> Project: Apache Hudi
>  Issue Type: Test
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After the retry mechanism was added to RemoteTableFileSystemView, it looks like
> the code is not thread safe.
>  
> [https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946]
>  
> This might impact regular flows as well even if no retries are enabled. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] xiarixiaoyao commented on issue #6496: [SUPPORT] Hudi schema evolution, Null for oldest values

2022-08-25 Thread GitBox


xiarixiaoyao commented on issue #6496:
URL: https://github.com/apache/hudi/issues/6496#issuecomment-1228023484

   @Armelabdelkbir Spark does not support default values yet; maybe https://github.com/apache/spark/pull/36672/files can help you. Thanks.





[GitHub] [hudi] danny0405 opened a new pull request, #6507: [DO NOT MERGE] 0.12.0 release patch branch

2022-08-25 Thread GitBox


danny0405 opened a new pull request, #6507:
URL: https://github.com/apache/hudi/pull/6507

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





[GitHub] [hudi] kk17 commented on pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-08-25 Thread GitBox


kk17 commented on PR #5920:
URL: https://github.com/apache/hudi/pull/5920#issuecomment-1228016989

   @nsivabalan @minihippo Will try to add a test in the next two weeks.





[GitHub] [hudi] veenaypatil commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

2022-08-25 Thread GitBox


veenaypatil commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955617721


##
rfc/rfc-57/rfc-57.md:
##
@@ -0,0 +1,85 @@
+
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from 
Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained 
within a jar that is on the path. We will then implement a deserializer that 
parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Yes, let's take it in the next cut.
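For context, a hedged sketch of the reflection-based deserializer the RFC describes, assuming only that the class named by `hoodie.deltastreamer.schemaprovider.proto.className` is a generated protobuf `Message` with the standard static `parseFrom(byte[])` factory (illustrative, not the PR's code):

```java
// Look up the generated Message class by name and invoke its static parseFrom(byte[])
// on the raw Kafka record value.
import com.google.protobuf.Message;
import java.lang.reflect.Method;

public class ProtoClassNameDeserializer {
  private final Method parseFrom;

  public ProtoClassNameDeserializer(String className) throws ReflectiveOperationException {
    Class<?> messageClass = Class.forName(className);
    if (!Message.class.isAssignableFrom(messageClass)) {
      throw new IllegalArgumentException(className + " is not a protobuf Message");
    }
    // Every generated protobuf class exposes a static parseFrom(byte[]) factory.
    this.parseFrom = messageClass.getMethod("parseFrom", byte[].class);
  }

  public Message deserialize(byte[] kafkaValue) throws Exception {
    return (Message) parseFrom.invoke(null, (Object) kafkaValue);
  }
}
```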






[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.

2022-08-25 Thread GitBox


hudi-bot commented on PR #6387:
URL: https://github.com/apache/hudi/pull/6387#issuecomment-1228010409

   
   ## CI report:
   
   * fb86adcdcf26b1565cccf6e89c30c6058477cd85 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10963)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…

2022-08-25 Thread GitBox


hudi-bot commented on PR #6500:
URL: https://github.com/apache/hudi/pull/6500#issuecomment-1228008168

   
   ## CI report:
   
   * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952)
 
   * 2d75af2a075741142bbfd4b6f50e541661e55bdd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10968)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6486: [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type

2022-08-25 Thread GitBox


hudi-bot commented on PR #6486:
URL: https://github.com/apache/hudi/pull/6486#issuecomment-1228008105

   
   ## CI report:
   
   * d6b7c487e76c46460a2fb0c9647aeea901d17995 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10921)
 
   * 9d687afca94b7bfcc592c69cfebd73eb846b3b70 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10967)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-08-25 Thread GitBox


hudi-bot commented on PR #5629:
URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228007439

   
   ## CI report:
   
   * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
   * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
   * 64ddff55a9f3083be754e0951bf5f082fecca9e5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10953)
 
   * 858a47a5b106462a5089ecf77278196bc7c7a0a8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10966)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…

2022-08-25 Thread GitBox


hudi-bot commented on PR #6500:
URL: https://github.com/apache/hudi/pull/6500#issuecomment-1228005628

   
   ## CI report:
   
   * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952)
 
   * 2d75af2a075741142bbfd4b6f50e541661e55bdd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query

2022-08-25 Thread GitBox


hudi-bot commented on PR #6499:
URL: https://github.com/apache/hudi/pull/6499#issuecomment-1228005604

   
   ## CI report:
   
   * 91e047073b4ff4389bf1e3e4f5ce59342756ebd1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10951)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6486: [HUDI-4706] Fix InternalSchemaChangeApplier#applyAddChange error to add nest type

2022-08-25 Thread GitBox


hudi-bot commented on PR #6486:
URL: https://github.com/apache/hudi/pull/6486#issuecomment-1228005564

   
   ## CI report:
   
   * d6b7c487e76c46460a2fb0c9647aeea901d17995 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10921)
 
   * 9d687afca94b7bfcc592c69cfebd73eb846b3b70 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-08-25 Thread GitBox


hudi-bot commented on PR #5629:
URL: https://github.com/apache/hudi/pull/5629#issuecomment-1228004890

   
   ## CI report:
   
   * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
   * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
   * 64ddff55a9f3083be754e0951bf5f082fecca9e5 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10953)
 
   * 858a47a5b106462a5089ecf77278196bc7c7a0a8 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1228000512

   
   ## CI report:
   
   * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query

2022-08-25 Thread GitBox


hudi-bot commented on PR #6499:
URL: https://github.com/apache/hudi/pull/6499#issuecomment-1228000366

   
   ## CI report:
   
   * 91e047073b4ff4389bf1e3e4f5ce59342756ebd1 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10951)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] YannByron commented on pull request #6499: [HUDI-4703] use the historical schema to response time travel query

2022-08-25 Thread GitBox


YannByron commented on PR #6499:
URL: https://github.com/apache/hudi/pull/6499#issuecomment-1227975465

   @hudi-bot run azure





[GitHub] [hudi] wzx140 commented on a diff in pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…

2022-08-25 Thread GitBox


wzx140 commented on code in PR #6500:
URL: https://github.com/apache/hudi/pull/6500#discussion_r955588339


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/model/HoodieInternalRow.java:
##
@@ -94,7 +94,11 @@ private HoodieInternalRow(UTF8String[] metaFields,
 
   @Override
   public int numFields() {
-return sourceRow.numFields();
+if (sourceContainsMetaFields) {

Review Comment:
   Add UT in TestHoodieInternalRow#testNumFields
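A self-contained sketch of the arity logic the new unit test covers; the meta-field count and the helper below are illustrative stand-ins for `HoodieInternalRow#numFields`, not the actual class:

```java
// When the wrapped source row does not already carry the Hudi meta columns, numFields()
// has to report source fields plus the meta fields, otherwise callers see the wrong arity.
public class InternalRowArity {
  static final int META_FIELD_COUNT = 5; // _hoodie_commit_time, _hoodie_commit_seqno, ...

  static int numFields(int sourceNumFields, boolean sourceContainsMetaFields) {
    return sourceContainsMetaFields ? sourceNumFields : sourceNumFields + META_FIELD_COUNT;
  }

  public static void main(String[] args) {
    System.out.println(numFields(8, true));   // source already has meta columns -> 8
    System.out.println(numFields(3, false));  // meta columns prepended by Hudi -> 8
  }
}
```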






[GitHub] [hudi] LinMingQiang commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…

2022-08-25 Thread GitBox


LinMingQiang commented on PR #6393:
URL: https://github.com/apache/hudi/pull/6393#issuecomment-1227964420

   I understand what you mean. I'll fix it here.





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-08-25 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1227964038

   
   ## CI report:
   
   * 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN
   * 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN
   * b26294c07ac06186c66a10444e7677656be94037 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10934)
 
   * a0e2f520a7f422bd396b984c3cec2c5653a41743 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10965)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6489: [HUDI-4485] [cli] Bumped spring shell to 2.1.1. Updated the default …

2022-08-25 Thread GitBox


hudi-bot commented on PR #6489:
URL: https://github.com/apache/hudi/pull/6489#issuecomment-1227958050

   
   ## CI report:
   
   * 47680402da599615de30c13a1f22f79f3573ee30 UNKNOWN
   * 5613f14b3d5f1c8aaf8de1730e2f21b78a657150 UNKNOWN
   * b26294c07ac06186c66a10444e7677656be94037 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10934)
 
   * a0e2f520a7f422bd396b984c3cec2c5653a41743 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227947525

   
   ## CI report:
   
   * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





[GitHub] [hudi] danny0405 commented on a diff in pull request #5884: [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles…

2022-08-25 Thread GitBox


danny0405 commented on code in PR #5884:
URL: https://github.com/apache/hudi/pull/5884#discussion_r955568822


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java:
##
@@ -165,17 +179,9 @@ private  T executeRequest(String requestPath, 
Map queryParame
 
 String url = builder.toString();
 LOG.info("Sending request : (" + url + ")");
-Response response;
-int timeout = this.timeoutSecs * 1000; // msec
-switch (method) {
-  case GET:
-response = 
Request.Get(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-break;
-  case POST:
-  default:
-response = 
Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-break;
-}
+// Reset url and method, to avoid repeatedly instantiating objects.
+urlCheckedFunc.setUrlAndMethod(url, method);

Review Comment:
   >  we should have been more careful here
   
   +10086, we need to pay more attention to the core-path code since there are many users now; core changes should be conservative.
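To make the concern concrete, an illustrative sketch (not the actual Hudi classes) of why a shared mutable request holder is racy and why per-call construction avoids the problem:

```java
// If the file-system view keeps one mutable holder and every call does setUrl(...)
// before sending, two concurrent callers can trample each other's URL; building the
// request per call leaves nothing shared to interleave.
public class RetryRaceSketch {

  static class SharedHolder {              // racy: shared mutable state, no synchronization
    private String url;
    void setUrl(String url) { this.url = url; }
    String send() { return "request to " + url; }
  }

  static String sendPerCall(String url) {  // safe: nothing shared between callers
    return "request to " + url;
  }

  public static void main(String[] args) throws InterruptedException {
    SharedHolder shared = new SharedHolder();
    Runnable a = () -> { shared.setUrl("/a"); System.out.println("A got " + shared.send()); };
    Runnable b = () -> { shared.setUrl("/b"); System.out.println("B got " + shared.send()); };
    Thread t1 = new Thread(a);
    Thread t2 = new Thread(b);
    t1.start(); t2.start();
    t1.join(); t2.join();                  // with unlucky timing, A can print "/b" (the bug)
    System.out.println(sendPerCall("/a")); // per-call construction cannot interleave
  }
}
```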






[GitHub] [hudi] brskiran1 commented on issue #6304: Hudi MultiTable Deltastreamer not updating glue catalog when new column added on Source

2022-08-25 Thread GitBox


brskiran1 commented on issue #6304:
URL: https://github.com/apache/hudi/issues/6304#issuecomment-1227907870

   @rmahindra123 please let me know if you have an update on this. I have tried with hoodie.schema.on.read.enable=true but still see no change.





[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.

2022-08-25 Thread GitBox


hudi-bot commented on PR #6387:
URL: https://github.com/apache/hudi/pull/6387#issuecomment-1227900507

   
   ## CI report:
   
   * bdddf2706b8e0e362ad2777282ade733a45d8f03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10750)
 
   * fb86adcdcf26b1565cccf6e89c30c6058477cd85 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10963)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition

2022-08-25 Thread GitBox


hudi-bot commented on PR #6438:
URL: https://github.com/apache/hudi/pull/6438#issuecomment-1227900600

   
   ## CI report:
   
   * fea65135a8035ef70929759594da64dc985a2d0a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10924)
 
   * 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10964)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6438: [HUDI-4642] Adding support to hudi-cli to repair depcrated partition

2022-08-25 Thread GitBox


hudi-bot commented on PR #6438:
URL: https://github.com/apache/hudi/pull/6438#issuecomment-1227897904

   
   ## CI report:
   
   * fea65135a8035ef70929759594da64dc985a2d0a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10924)
 
   * 9cf4d2a70b355cdaa5463fc34ce72908cb5a8da3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6387: [HUDI-4615] Return checkpoint as null for empty data from events queue.

2022-08-25 Thread GitBox


hudi-bot commented on PR #6387:
URL: https://github.com/apache/hudi/pull/6387#issuecomment-1227897790

   
   ## CI report:
   
   * bdddf2706b8e0e362ad2777282ade733a45d8f03 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10750)
 
   * fb86adcdcf26b1565cccf6e89c30c6058477cd85 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227895179

   
   ## CI report:
   
   * 1cc9581196646dc677a0940c169d30407188b178 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958)
 
   * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6000: [HUDI-4340] fix not parsable text DateTimeParseException in HoodieInstantTimeGenerator.parseDateFromInstantTime

2022-08-25 Thread GitBox


nsivabalan commented on PR #6000:
URL: https://github.com/apache/hudi/pull/6000#issuecomment-1227872985

   Can you rebase with the latest master and address the minor comments from 
Danny? We can land it then.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6341: [SUPPORT] Hudi delete not working via spark apis

2022-08-25 Thread GitBox


nsivabalan commented on issue #6341:
URL: https://github.com/apache/hudi/issues/6341#issuecomment-1227867862

   Yes, the issue is likely (3) from Yann's comment above. If you set the 
operation to "delete", you don't need to override the payload class; if you 
explicitly set the payload class to EmptyPayload, then you don't need to set 
the operation type to "delete".
   
   Also, can you confirm that your filtered df is actually not empty? Instead 
of writing to Hudi, did you run df.count to ensure there are valid records?
   
   Can you also post the contents of the .hoodie/*.commit or .hoodie/*.deltacommit 
file that got added to the .hoodie dir when you triggered the delete operation?
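   For reference, a minimal sketch of the first option (operation set to 
"delete") using the Spark Java API is below; the table path, record key field, 
and precombine field are placeholders, not taken from this issue.
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class HudiDeleteSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-delete-sketch")
           .master("local[2]")
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .getOrCreate();
   
       // The records to delete: here, the existing table filtered down to the keys
       // that should be removed. Calling toDelete.count() is a quick sanity check
       // that the filtered frame is not empty before writing.
       Dataset<Row> toDelete = spark.read().format("hudi")
           .load("/tmp/hudi/example_table")
           .filter("id = 'some-key'");
   
       // Setting the write operation to "delete" is enough; no payload class override needed.
       toDelete.write().format("hudi")
           .option("hoodie.table.name", "example_table")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .option("hoodie.datasource.write.operation", "delete")
           .mode(SaveMode.Append)
           .save("/tmp/hudi/example_table");
   
       spark.stop();
     }
   }
   ```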
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227863447

   
   ## CI report:
   
   * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957)
 
   * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6393: [HUDI-4619] Fix The retry mechanism of remotehoodietablefilesystemvie…

2022-08-25 Thread GitBox


nsivabalan commented on PR #6393:
URL: https://github.com/apache/hudi/pull/6393#issuecomment-1227856444

   Hey folks, I reverted the original patch as it could lead to data issues: 
   https://github.com/apache/hudi/pull/6501
   You can put up the patch again with a proper fix around thread safety. I have 
added a link to where the potential issue could be in the PR description. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4674) change the default value of inputFormat for the MOR table

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4674:
-
Status: In Progress  (was: Open)

> change the default value of inputFormat for the MOR table
> -
>
> Key: HUDI-4674
> URL: https://issues.apache.org/jira/browse/HUDI-4674
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: linfey.nie
>Assignee: linfey.nie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> When we build a MOR table, for example with Spark SQL, the default value of 
> inputFormat is HoodieParquetRealtimeInputFormat. But when we use hive sync for 
> metadata and skip the _ro suffix for reads, the inputFormat of the table under 
> the original name should be HoodieParquetInputFormat, but currently it is not. 
> I think we should change the default value of inputFormat, just like for COW tables.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3861) 'path' in CatalogTable#properties failed to be updated when renaming table

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3861:
-
Status: In Progress  (was: Open)

> 'path' in CatalogTable#properties failed to be updated when renaming table
> --
>
> Key: HUDI-3861
> URL: https://issues.apache.org/jira/browse/HUDI-3861
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Assignee: KnightChess
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Reproduce the issue as below
> {code:java}
> 1. Create a MOR table 
> create table mor_simple(
>   id int,
>   name string,
>   price double
> )
> using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> )
> 2. Renaming
> alter table mor_simple rename to mor_simple0
> 3. Show create table mor_simple0
> Output as
> CREATE TABLE hudi.mor_simple0 (
>   `_hoodie_commit_time` STRING,
>   `_hoodie_commit_seqno` STRING,
>   `_hoodie_record_key` STRING,
>   `_hoodie_partition_path` STRING,
>   `_hoodie_file_name` STRING,
>   `id` INT,
>   `name` STRING,
>   `price` DOUBLE)
> USING hudi
> OPTIONS(
>   'primaryKey' = 'id',
>   'type' = 'cow')
> TBLPROPERTIES(
>   'path' = '/user/hive/warehous/hudi.db/mor_simple'){code}
> As we can see, the 'path' property is 
> '/user/hive/warehous/hudi.db/mor_simple', rather than 
> '/user/hive/warehous/hudi.db/mor_simple0'.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-3861) 'path' in CatalogTable#properties failed to be updated when renaming table

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3861:
-
Status: Patch Available  (was: In Progress)

> 'path' in CatalogTable#properties failed to be updated when renaming table
> --
>
> Key: HUDI-3861
> URL: https://issues.apache.org/jira/browse/HUDI-3861
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jin Xing
>Assignee: KnightChess
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> Reproduce the issue as below
> {code:java}
> 1. Create a MOR table 
> create table mor_simple(
>   id int,
>   name string,
>   price double
> )
> using hudi
> options (
>   type = 'cow',
>   primaryKey = 'id'
> )
> 2. Renaming
> alter table mor_simple rename to mor_simple0
> 3. Show create table mor_simple0
> Output as
> CREATE TABLE hudi.mor_simple0 (
>   `_hoodie_commit_time` STRING,
>   `_hoodie_commit_seqno` STRING,
>   `_hoodie_record_key` STRING,
>   `_hoodie_partition_path` STRING,
>   `_hoodie_file_name` STRING,
>   `id` INT,
>   `name` STRING,
>   `price` DOUBLE)
> USING hudi
> OPTIONS(
>   'primaryKey' = 'id',
>   'type' = 'cow')
> TBLPROPERTIES(
>   'path' = '/user/hive/warehous/hudi.db/mor_simple'){code}
> As we can see, the 'path' property is 
> '/user/hive/warehous/hudi.db/mor_simple', rather than 
> '/user/hive/warehous/hudi.db/mor_simple0'.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4297) Test TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters* is flaky

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4297:
-
Status: In Progress  (was: Open)

> Test 
> TestHoodieDeltaStreamerWithMultiWriter.testUpsertsContinuousModeWithMultipleWriters*
>  is flaky
> --
>
> Key: HUDI-4297
> URL: https://issues.apache.org/jira/browse/HUDI-4297
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Danny Chen
>Assignee: Zhaojing Yu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/9418/logs/36]
> [https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_apis/build/builds/10304/logs/16]
> Both testUpsertsContinuousModeWithMultipleWritersForConflicts and 
> testUpsertsContinuousModeWithMultipleWritersWithoutConflicts are flaky. They fail 
> about 20% of the time. Increasing the timeout can only decrease the 
> probability of failure but that's not a fix. We need to look into the data 
> generator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (e90872b396 -> 11f85d1efb)

2022-08-25 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from e90872b396 [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat 
(#6494)
 add 11f85d1efb Revert "[HUDI-3669] Add a remote request retry mechanism 
for 'Remotehoodietablefiles… (#5884)" (#6501)

No new revisions were added by this update.

Summary of changes:
 .../client/embedded/EmbeddedTimelineService.java   |  5 --
 .../common/table/view/FileSystemViewManager.java   |  3 +-
 .../table/view/FileSystemViewStorageConfig.java| 76 --
 .../view/RemoteHoodieTableFileSystemView.java  | 67 +--
 .../org/apache/hudi/common/util/RetryHelper.java   | 46 +
 .../java/org/apache/hudi/util/StreamerUtil.java|  5 --
 .../TestRemoteHoodieTableFileSystemView.java   | 29 -
 7 files changed, 36 insertions(+), 195 deletions(-)



[GitHub] [hudi] yihua merged pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"

2022-08-25 Thread GitBox


yihua merged PR #6501:
URL: https://github.com/apache/hudi/pull/6501


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4696) Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-4696.

Resolution: Fixed

> Flaky: TestHoodieCombineHiveInputFormat.setUpClass:86 » NullPointer

> 
>
> Key: HUDI-4696
> URL: https://issues.apache.org/jira/browse/HUDI-4696
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=10720=logs=dcedfe73-9485-5cc5-817a-73b61fc5dcb0=746585d8-b50a-55c3-26c5-517d93af9934



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227828149

   
   ## CI report:
   
   * 1cc9581196646dc677a0940c169d30407188b178 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958)
 
   * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10961)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227825299

   
   ## CI report:
   
   * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957)
 
   * 24c8b543afd26438898efff96c98c81130c9ca54 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10960)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227822500

   
   ## CI report:
   
   * 1cc9581196646dc677a0940c169d30407188b178 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958)
 
   * 48a24245761ca6a1b910aa2f39ba1fb2a596a048 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227822474

   
   ## CI report:
   
   * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957)
 
   * 24c8b543afd26438898efff96c98c81130c9ca54 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227819012

   
   ## CI report:
   
   * 1cc9581196646dc677a0940c169d30407188b178 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10958)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"

2022-08-25 Thread GitBox


hudi-bot commented on PR #6501:
URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227818974

   
   ## CI report:
   
   * f07b0630b9654b1c9b10ff5efc0e5989625404da Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10955)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494)

2022-08-25 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e90872b396 [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat 
(#6494)
e90872b396 is described below

commit e90872b396630318e6cc18d560f23e16c3595a29
Author: Shiyan Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Thu Aug 25 16:58:35 2022 -0500

[HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat (#6494)
---
 .../hudi/common/testutils/minicluster/HdfsTestService.java   | 12 +---
 .../TestHoodieCombineHiveInputFormat.java| 11 ---
 2 files changed, 9 insertions(+), 14 deletions(-)

diff --git 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
index eda8591749..ba584a4329 100644
--- 
a/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
+++ 
b/hudi-common/src/test/java/org/apache/hudi/common/testutils/minicluster/HdfsTestService.java
@@ -18,7 +18,6 @@
 
 package org.apache.hudi.common.testutils.minicluster;
 
-import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.common.testutils.NetworkTestUtils;
 import org.apache.hudi.common.util.FileIOUtils;
 
@@ -45,7 +44,7 @@ public class HdfsTestService {
   /**
* Configuration settings.
*/
-  private Configuration hadoopConf;
+  private final Configuration hadoopConf;
   private final String workDir;
 
   /**
@@ -54,6 +53,7 @@ public class HdfsTestService {
   private MiniDFSCluster miniDfsCluster;
 
   public HdfsTestService() throws IOException {
+hadoopConf = new Configuration();
 workDir = Files.createTempDirectory("temp").toAbsolutePath().toString();
   }
 
@@ -63,7 +63,6 @@ public class HdfsTestService {
 
   public MiniDFSCluster start(boolean format) throws IOException {
 Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
-hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
 
 // If clean, then remove the work dir so we can start fresh.
 String localDFSLocation = getDFSLocation(workDir);
@@ -107,7 +106,6 @@ public class HdfsTestService {
   miniDfsCluster.shutdown(true, true);
 }
 miniDfsCluster = null;
-hadoopConf = null;
   }
 
   /**
@@ -123,9 +121,9 @@ public class HdfsTestService {
   /**
* Configure the DFS Cluster before launching it.
*
-   * @param config The already created Hadoop configuration we'll further 
configure for HDFS
+   * @param config   The already created Hadoop configuration we'll 
further configure for HDFS
* @param localDFSLocation The location on the local filesystem where 
cluster data is stored
-   * @param bindIP An IP address we want to force the datanode and namenode to 
bind to.
+   * @param bindIP   An IP address we want to force the datanode and 
namenode to bind to.
* @return The updated Configuration object.
*/
   private static Configuration configureDFSCluster(Configuration config, 
String localDFSLocation, String bindIP,
@@ -146,7 +144,7 @@ public class HdfsTestService {
 String user = System.getProperty("user.name");
 config.set("hadoop.proxyuser." + user + ".groups", "*");
 config.set("hadoop.proxyuser." + user + ".hosts", "*");
-config.setBoolean("dfs.permissions",false);
+config.setBoolean("dfs.permissions", false);
 return config;
   }
 
diff --git 
a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java
 
b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
similarity index 98%
rename from 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java
rename to 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
index 0a14af2212..9b26a7915d 100644
--- 
a/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/functional/TestHoodieCombineHiveInputFormat.java
+++ 
b/hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/hive/TestHoodieCombineHiveInputFormat.java
@@ -16,10 +16,8 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.hadoop.functional;
+package org.apache.hudi.hadoop.hive;
 
-import org.apache.hadoop.hive.metastore.api.hive_metastoreConstants;
-import org.apache.hadoop.hive.ql.io.IOContextMap;
 import org.apache.hudi.avro.HoodieAvroUtils;
 import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.common.model.HoodieTableType;
@@ -33,9 +31,6 @@ import org.apache.hudi.common.testutils.SchemaTestUtil;
 import org.apache.hudi.common.testutils.minicluster.MiniClusterUtil;
 import org.apache.hudi.common.util.CommitUtils;
 import 

[GitHub] [hudi] nsivabalan merged pull request #6494: [HUDI-4696] Fix flaky TestHoodieCombineHiveInputFormat

2022-08-25 Thread GitBox


nsivabalan merged PR #6494:
URL: https://github.com/apache/hudi/pull/6494


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3391) presto and hive beeline fails to read MOR table w/ 2 or more array fields

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3391:
-
Reviewers: sivabalan narayanan

> presto and hive beeline fails to read MOR table w/ 2 or more array fields
> -
>
> Key: HUDI-3391
> URL: https://issues.apache.org/jira/browse/HUDI-3391
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: dependencies, reader-core, trino-presto
>Reporter: sivabalan narayanan
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.1
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We have an issue reported by user 
> [here|https://github.com/apache/hudi/issues/2657]. Looks like w/ 0.10.0 or 
> later, spark datasource read works, but hive beeline does not work. Even 
> spark.sql (hive table) querying works as well. 
> Another related ticket: 
> [https://github.com/apache/hudi/issues/3834#issuecomment-997307677]
>  
> Steps that I tried:
> [https://gist.github.com/nsivabalan/fdb8794104181f93b9268380c7f7f079]
> From beeline, you will encounter below exception
> {code:java}
> Failed with exception 
> java.io.IOException:org.apache.hudi.org.apache.avro.SchemaParseException: 
> Can't redefine: array {code}
> All linked tickets state that upgrading parquet to 1.11.0 or greater should 
> work. We need to try it out w/ latest master and go from there. 
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4650) Commits Command: Include both active and archive timeline for a given range of intants

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4650:
-
Reviewers: Raymond Xu  (was: sivabalan narayanan)

> Commits Command: Include both active and archive timeline for a given range 
> of intants
> --
>
> Key: HUDI-4650
> URL: https://issues.apache.org/jira/browse/HUDI-4650
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4389) Make HoodieStreamingSink idempotent

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4389:
-
Reviewers: sivabalan narayanan

> Make HoodieStreamingSink idempotent
> ---
>
> Key: HUDI-4389
> URL: https://issues.apache.org/jira/browse/HUDI-4389
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available, streaming
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4648) Add command to rename partition

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4648:
-
Reviewers: Raymond Xu

> Add command to rename partition
> ---
>
> Key: HUDI-4648
> URL: https://issues.apache.org/jira/browse/HUDI-4648
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
> Fix For: 0.12.1
>
>
> Based on https://github.com/apache/hudi/pull/6438#discussion_r949841206



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4633) Add command to trace partition through a range of commits

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4633:
-
Reviewers: Raymond Xu  (was: sivabalan narayanan)

> Add command to trace partition through a range of commits
> -
>
> Key: HUDI-4633
> URL: https://issues.apache.org/jira/browse/HUDI-4633
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cli
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4528) Diff tool to compare metadata across snapshots in a given time range

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4528:
-
Reviewers: Raymond Xu  (was: sivabalan narayanan)

> Diff tool to compare metadata across snapshots in a given time range
> 
>
> Key: HUDI-4528
> URL: https://issues.apache.org/jira/browse/HUDI-4528
> Project: Apache Hudi
>  Issue Type: Task
>  Components: cli
>Reporter: Sagar Sumit
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> A tool that diffs two snapshots at table and partition level and can give 
> info about what new file ids got created, deleted, updated and track other 
> changes that are captured in write stats. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


hudi-bot commented on PR #6506:
URL: https://github.com/apache/hudi/pull/6506#issuecomment-1227784163

   
   ## CI report:
   
   * 1cc9581196646dc677a0940c169d30407188b178 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-4485) Hudi cli got empty result for command show fsview all

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-4485:


Assignee: Yao Zhang

> Hudi cli got empty result for command show fsview all
> -
>
> Key: HUDI-4485
> URL: https://issues.apache.org/jira/browse/HUDI-4485
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.11.1
> Environment: Hudi version : 0.11.1
> Spark version : 3.1.1
> Hive version : 3.1.0
> Hadoop version : 3.1.1
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
> fsview all · Issue #6177 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/6177]
> *Describe the problem you faced*
> Hudi cli got empty result after running command show fsview all.
> [image|https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png]
> The type of table t1 is COW and I am sure that the parquet file is actually 
> generated inside data folder. Also, the parquet files are not damaged as the 
> data could be retrieved correctly by reading as Hudi table or directly 
> reading each parquet file(using Spark).
> *To Reproduce*
> Steps to reproduce the behavior:
> 1. Enter Flink SQL client.
> 2. Execute the SQL and check the data was written successfully.
> ```sql
> CREATE TABLE t1(
> uuid VARCHAR(20),
> name VARCHAR(10),
> age INT,
> ts TIMESTAMP(3),
> `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
> 'connector' = 'hudi',
> 'path' = 'hdfs:///path/to/table/',
> 'table.type' = 'COPY_ON_WRITE'
> );
> – insert data using values
> INSERT INTO t1 VALUES
> ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
> ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
> ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
> ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
> ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
> ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
> ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
> ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter Hudi cli and execute `show fsview all`
> *Expected behavior*
> `show fsview all` in Hudi cli should return all file slices.
> *Environment Description*
>  * Hudi version : 0.11.1
>  * Spark version : 3.1.1
>  * Hive version : 3.1.0
>  * Hadoop version : 3.1.1
>  * Storage (HDFS/S3/GCS..) : HDFS
>  * Running on Docker? (yes/no) : no
> *Additional context*
> No.
> *Stacktrace*
> N/A
>  
> Temporary solution:
> I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the 
> attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi

2022-08-25 Thread GitBox


hudi-bot commented on PR #6502:
URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227779992

   
   ## CI report:
   
   * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
   * 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10956)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227780026

   
   ## CI report:
   
   * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10957)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] parisni opened a new pull request, #6506: Allow hoodie read client to choose index

2022-08-25 Thread GitBox


parisni opened a new pull request, #6506:
URL: https://github.com/apache/hudi/pull/6506

   ### Change Logs
   
   Currently the Hudi read client uses BLOOM and this cannot be overridden. This 
change allows using GLOBAL_BLOOM instead,
   which provides fast lookups on the primary key (without needing partition keys).
   
   ```
   HudiReadClient client =
   new HudiReadClient(context, path, spark.sqlContext(), GLOBAL_BLOOM);
  client.readROView(keyRdd, 200);
   
   ```
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

2022-08-25 Thread GitBox


the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955436628


##
rfc/rfc-57/rfc-57.md:
##
@@ -0,0 +1,85 @@
+
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from 
Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained 
within a jar that is on the path. We will then implement a deserializer that 
parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+
+### ProtobufClassBasedSchemaProvider
+This new SchemaProvider will allow the user to provide a Protobuf Message 
class and get an Avro Schema. In the proto world, there is no concept of a 
nullable field so people use wrapper types such as Int32Value and StringValue 
to represent a nullable field. The schema provider will also allow the user to 
treat these wrapper fields as nullable versions of the fields they are wrapping 
instead of treating them as a nested message. In practice, this means that the 
user can choose between representing a field `Int32Value my_int = 1;` as 
`my_int.value` or simply `my_int` when writing the data out to the file system.
+
+ Handling of Unsigned Integers and Longs
+Protobuf provides support for unsigned integers and longs while Avro does not. 
The schema provider will convert unsigned integers and longs to Avro long type 
in the schema definition.
+
+ Schema Evolution
+**Adding a Field:**
+Protobuf has a default value for all fields and the translation from proto to 
avro schema will carry over this default value so there are no errors when 
adding a new field to the proto definition.
+**Removing a Field:**
+If a user removes a field in the Protobuf schema, the schema provider will not 
be able to add this field to the avro schema it generates. To avoid issues when 
writing data, users must use `hoodie.datasource.write.reconcile.schema=true` to 
properly reconcile the schemas if a field is removed from the proto definition. 
Users can avoid this situation by using `deprecated` field option in proto 
instead of removing the field from the schema.
+
+Configuration Options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use
+hoodie.deltastreamer.schemaprovider.proto.flattenWrappers (Default: false) - 
By default the wrapper classes will be treated like any other message and have 
a nested `value` field. When this is set to true, we do not have a nested 
`value` field and treat the field as nullable in the generated Schema
+
+### ProtoToAvroConverter

Review Comment:
   I'll add this to the RFC but note that it likely won't be done in the first 
cut.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] the-other-tim-brown commented on a diff in pull request #6111: [HUDI-4399][RFC-57] Protobuf support in DeltaStreamer

2022-08-25 Thread GitBox


the-other-tim-brown commented on code in PR #6111:
URL: https://github.com/apache/hudi/pull/6111#discussion_r955436277


##
rfc/rfc-57/rfc-57.md:
##
@@ -0,0 +1,85 @@
+
+# RFC-57: DeltaStreamer Protobuf Support
+
+
+
+## Proposers
+
+- @the-other-tim-brown
+
+## Approvers
+- @bhasudha
+- @vinothchandar
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-4399
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Support consuming Protobuf messages from Kafka with the DeltaStreamer.
+
+## Background
+Hudi's DeltaStreamer currently supports consuming Avro and JSON data from 
Kafka but it does not support Protobuf. Adding support will require:
+1. Parsing the data from Kafka into Protobuf Messages
+2. Generating a schema from a Protobuf Message class
+3. Converting from Protobuf to Avro
+
+## Implementation
+
+### Parsing Data from Kafka
+Users will provide a classname for the Protobuf Message that is contained 
within a jar that is on the path. We will then implement a deserializer that 
parses the bytes from the kafka message into a protobuf Message.
+
+Configuration options:
+hoodie.deltastreamer.schemaprovider.proto.className - The class to use

Review Comment:
   Do you have experience using that confluent value deserializer? We can add 
that in as an option but I don't have experience with it so may need your help.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi

2022-08-25 Thread GitBox


hudi-bot commented on PR #6502:
URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227775803

   
   ## CI report:
   
   * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
   * 8b1585464429a60d9eff4cfa2cb9f937b1ac6f0d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


hudi-bot commented on PR #6505:
URL: https://github.com/apache/hudi/pull/6505#issuecomment-1227775849

   
   ## CI report:
   
   * cd3d263bc18ea422b3ab124e109cdebcdfda26a3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] yihua commented on issue #6056: [SUPPORT] Metadata table suddenly not cleaned / compacted anymore

2022-08-25 Thread GitBox


yihua commented on issue #6056:
URL: https://github.com/apache/hudi/issues/6056#issuecomment-1227773266

   Here's the tracking Jira ticket: 
[HUDI-4688](https://issues.apache.org/jira/browse/HUDI-4688).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4722:
-
Labels: pull-request-available  (was: )

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Minor
>  Labels: pull-request-available
>
> Added metrics for following
>  # Lock request latency
>  # Count of Lock success
>  # Count of failed to acquire the lock
>  # Duration of locks held with support for re-entrancy
>  # Conflict resolution metrics. Succes vs Failure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4485) Hudi cli got empty result for command show fsview all

2022-08-25 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4485:
-
Priority: Major  (was: Minor)

> Hudi cli got empty result for command show fsview all
> -
>
> Key: HUDI-4485
> URL: https://issues.apache.org/jira/browse/HUDI-4485
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.11.1
> Environment: Hudi version : 0.11.1
> Spark version : 3.1.1
> Hive version : 3.1.0
> Hadoop version : 3.1.1
>Reporter: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
> Attachments: spring-shell-1.2.0.RELEASE.jar
>
>
> This issue is from: [[SUPPORT] Hudi cli got empty result for command show 
> fsview all · Issue #6177 · apache/hudi 
> (github.com)|https://github.com/apache/hudi/issues/6177]
> *Describe the problem you faced*
> Hudi cli got empty result after running command show fsview all.
> [image|https://user-images.githubusercontent.com/7007327/180346750-6a55f472-45ac-46cf-8185-3c4fc4c76434.png]
> The type of table t1 is COW and I am sure that the parquet file is actually 
> generated inside data folder. Also, the parquet files are not damaged as the 
> data could be retrieved correctly by reading as Hudi table or directly 
> reading each parquet file(using Spark).
> *To Reproduce*
> Steps to reproduce the behavior:
> 1. Enter Flink SQL client.
> 2. Execute the SQL and check the data was written successfully.
> ```sql
> CREATE TABLE t1(
> uuid VARCHAR(20),
> name VARCHAR(10),
> age INT,
> ts TIMESTAMP(3),
> `partition` VARCHAR(20)
> )
> PARTITIONED BY (`partition`)
> WITH (
> 'connector' = 'hudi',
> 'path' = 'hdfs:///path/to/table/',
> 'table.type' = 'COPY_ON_WRITE'
> );
> – insert data using values
> INSERT INTO t1 VALUES
> ('id1','Danny',23,TIMESTAMP '1970-01-01 00:00:01','par1'),
> ('id2','Stephen',33,TIMESTAMP '1970-01-01 00:00:02','par1'),
> ('id3','Julian',53,TIMESTAMP '1970-01-01 00:00:03','par2'),
> ('id4','Fabian',31,TIMESTAMP '1970-01-01 00:00:04','par2'),
> ('id5','Sophia',18,TIMESTAMP '1970-01-01 00:00:05','par3'),
> ('id6','Emma',20,TIMESTAMP '1970-01-01 00:00:06','par3'),
> ('id7','Bob',44,TIMESTAMP '1970-01-01 00:00:07','par4'),
> ('id8','Han',56,TIMESTAMP '1970-01-01 00:00:08','par4');
> ```
> 3. Enter Hudi cli and execute `show fsview all`
> *Expected behavior*
> `show fsview all` in Hudi cli should return all file slices.
> *Environment Description*
>  * Hudi version : 0.11.1
>  * Spark version : 3.1.1
>  * Hive version : 3.1.0
>  * Hadoop version : 3.1.1
>  * Storage (HDFS/S3/GCS..) : HDFS
>  * Running on Docker? (yes/no) : no
> *Additional context*
> No.
> *Stacktrace*
> N/A
>  
> Temporary solution:
> I modified and recompiled spring-shell 1.2.0.RELEASE. Please download the 
> attachment and replace the same file in ${HUDI_CLI_DIR}/target/lib/.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6502: HUDI-4722 Added locking metrics for Hudi

2022-08-25 Thread GitBox


hudi-bot commented on PR #6502:
URL: https://github.com/apache/hudi/pull/6502#issuecomment-1227771890

   
   ## CI report:
   
   * fbedf9a29c4c574ad4d69406416dbb057c080345 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] parisni opened a new pull request, #6505: AwsglueSync Turn already exist error into warning

2022-08-25 Thread GitBox


parisni opened a new pull request, #6505:
URL: https://github.com/apache/hudi/pull/6505

   ### Change Logs
   
   This avoids the sync failing when a concurrent sync (or a similar case) has 
already created the Glue object.
   Fixes #5960
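   Below is a minimal sketch of the pattern, not the actual hudi-aws code; it 
assumes the AWS SDK v2 Glue client, and the client, request, and exception class 
names are assumptions that may differ from what the sync client really uses.
   
   ```java
   import org.slf4j.Logger;
   import org.slf4j.LoggerFactory;
   import software.amazon.awssdk.services.glue.GlueClient;
   import software.amazon.awssdk.services.glue.model.AlreadyExistsException;
   import software.amazon.awssdk.services.glue.model.CreateTableRequest;
   
   public class GlueSyncSketch {
     private static final Logger LOG = LoggerFactory.getLogger(GlueSyncSketch.class);
   
     // Create the table in Glue, but tolerate a concurrent sync having created it first.
     static void createTableIfAbsent(GlueClient glue, CreateTableRequest request, String tableName) {
       try {
         glue.createTable(request);
       } catch (AlreadyExistsException e) {
         // Previously an "already exists" error failed the whole sync; logging a warning
         // instead is safe because the table being present is the desired end state anyway.
         LOG.warn("Glue table {} already exists, likely created by a concurrent sync; skipping", tableName);
       }
     }
   }
   ```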
   
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] santoshraj123 opened a new issue, #6504: [SUPPORT]

2022-08-25 Thread GitBox


santoshraj123 opened a new issue, #6504:
URL: https://github.com/apache/hudi/issues/6504

   
   Hello, we are facing an issue when running the spark-submit command after 
performing a DELETE operation on the Postgres database. The Spark command, 
schema and properties file are given below. The Spark command successfully 
generates the target Hudi tables after an INSERT of a row into the database, 
but after a DELETE the same command fails with a "rolled-back" HoodieException. 
We are using EMR version 6.7.0 with Hudi 0.11.0-amzn-0. The source of the data 
is a Postgres database: AWS DMS generates the parquet files from Postgres and 
lands the datasets into the S3 landing zone. We tried both COPY_ON_WRITE and 
MERGE_ON_READ, yet DELETEs fail. 
   
   Environment information:
   --
   Hudi version : 0.11.0-amzn-0
   Spark version : version 3.2.1-amzn-0
   Hive version :  3.1.3
   Scala version : 2.12.15
   Hadoop version : xxx
   Storage (HDFS/S3/GCS..) : S3
   Running on Docker? (yes/no) : no
   DMS Engine Version: 3.4.7
   
   **Spark command**
   -
   sudo spark-submit --jars 
/usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/hudi/hudi-utilities-bundle.jar
  \
   --master yarn  \
   --deploy-mode client  \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.sql.hive.convertMetastoreParquet=false   \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
/usr/lib/hudi/hudi-utilities-bundle.jar  \
   --table-type MERGE_ON_READ   \
   --source-ordering-field order_id\
   --props s3_url/hoodie-glue.properties   \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource   \
   --target-base-path s3_url/v_hudi_orders   \
   --target-table v_hudi_orders --payload-class 
org.apache.hudi.common.model.AWSDmsAvroPayload \
   --schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
   --transformer-class 
org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf hoodie.deltastreamer.transformer.sql="SELECT Op, 
dms_received_ts, col1, col2, col3, CASE WHEN a.Op = 'D' THEN true ELSE false  
END as _hoodie_is_deleted  FROM  a" \
   --op BULK_INSERT
   
   **Stacktrace**:
   ---
   22/08/23 15:15:46 INFO 
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
OutputCommitCoordinator stopped!
   22/08/23 15:15:46 INFO SparkContext: Successfully stopped SparkContext
   Exception in thread "main" org.apache.hudi.exception.HoodieException: Commit 
20220823151531894 failed and rolled-back !
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:649)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:331)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:200)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:97)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:198)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:549)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1000)
   at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
   at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1089)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1098)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   22/08/23 15:15:46 INFO ShutdownHookManager: Shutdown hook called
   
   Properties file
   -
   hoodie.table.name=t_hudi_able
   hoodie.table.type=MERGE_ON_READ
   hoodie.deltastreamer.source.dfs.root=s3_URL
   hoodie.datasource.write.recordkey.field=col1 (pk)
   hoodie.datasource.write.partitionpath.field=col3
   hoodie.datasource.write.precombine.field=ts (DMS generated)
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.table=t_hudi_able
   hoodie.datasource.hive_sync.database=default
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.hive_sync.partition_fields=col3
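   
   Our intent with the `_hoodie_is_deleted` flag in the transformer SQL above is 
the standard Hudi soft-delete convention: records whose `_hoodie_is_deleted` 
column is true are dropped on upsert. Below is a minimal sketch of the same 
idea with the plain DataFrame API; it is illustrative only (hypothetical paths, 
and not the DeltaStreamer path we actually run):
   
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DmsSoftDeleteSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("dms-soft-delete-sketch").getOrCreate();

    // DMS emits an 'Op' column (I/U/D); flag the deletes so Hudi drops them on upsert.
    Dataset<Row> dmsRows = spark.read().parquet("s3://landing-zone/orders/");
    Dataset<Row> tagged = dmsRows.withColumn("_hoodie_is_deleted", col("Op").equalTo("D"));

    tagged.write().format("hudi")
        .option("hoodie.table.name", "v_hudi_orders")
        .option("hoodie.datasource.write.recordkey.field", "order_id")
        .option("hoodie.datasource.write.partitionpath.field", "col3")
        .option("hoodie.datasource.write.precombine.field", "dms_received_ts")
        .option("hoodie.datasource.write.operation", "upsert")   // upsert honours the delete flag
        .mode(SaveMode.Append)
        .save("s3://target-bucket/v_hudi_orders");

    spark.stop();
  }
}
```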
   

[GitHub] [hudi] maduraitech opened a new issue, #6503: Hudi Merge Into with larger volume

2022-08-25 Thread GitBox


maduraitech opened a new issue, #6503:
URL: https://github.com/apache/hudi/issues/6503

   Use case: We are trying to perform MERGE INTO to update partial columns, or 
else insert new records, in a single command.
   
   Issue: Data is not updating as expected; instead Hudi tries to insert 
records that already exist, creating duplicates. 
   It also updates only a few rows.
   When we retry the same MERGE INTO statement with the same data, it always 
inserts new rows, and specific rows keep getting updated on every run.
   
   **Environment Description:
   Hudi:  0.11.0
   Spark: 2.4.8
   Storage: GCS**
   
   More Details: 
   When we tried a similar use case on small tables, it worked fine.
   **We do have the following additional options:**
   We added the Hudi write configs below while creating the table; we don't 
see much difference, but rather it no longer even updates the column that was 
previously updating for a few rows.

   Options (
   hoodie.datasource.write.table.type='COPY_ON_WRITE',
   primaryKey = 'col1,col2 etc.',
   hoodie.datasource.write.hive_style_partitioning = false,
   hoodie.datasource.write.operation = 'upsert',
   hoodie.datasource.write.payload.class = 
'org.apache.hudi.common.model.DefaultHoodieRecordPayload',
   hoodie.datasource.write.keygenerator.class = 
'org.apache.hudi.keygen.ComplexKeyGenerator'
   )
   We also tried combining all our key columns into one (in case having too 
many key columns was the concern) and performing the merge. Even this 
scenario made no difference in behaviour.
   
   **Please note:** the reason we do not want a precombine field at the table 
level is that it would force us to include it in every update, which we do 
not want as part of the use-case behavior. For the lower volumes we tested, 
we did not set a precombine field at the table level. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-08-25 Thread Jagmeet bali (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagmeet bali updated HUDI-4722:
---
Description: 
Added metrics for the following:
 # Lock request latency
 # Count of lock successes
 # Count of failures to acquire the lock
 # Duration of locks held, with support for re-entrancy
 # Conflict resolution metrics: success vs failure

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Minor
>
> Added metrics for the following:
>  # Lock request latency
>  # Count of lock successes
>  # Count of failures to acquire the lock
>  # Duration of locks held, with support for re-entrancy
>  # Conflict resolution metrics: success vs failure



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HUDI-4722) Add support for metrics for locking infra

2022-08-25 Thread Jagmeet bali (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jagmeet bali updated HUDI-4722:
---
Priority: Minor  (was: Major)

> Add support for metrics for locking infra
> -
>
> Key: HUDI-4722
> URL: https://issues.apache.org/jira/browse/HUDI-4722
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Jagmeet bali
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HUDI-4722) Add support for metrics for locking infra

2022-08-25 Thread Jagmeet bali (Jira)
Jagmeet bali created HUDI-4722:
--

 Summary: Add support for metrics for locking infra
 Key: HUDI-4722
 URL: https://issues.apache.org/jira/browse/HUDI-4722
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Jagmeet bali






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] jsbali opened a new pull request, #6502: Added locking metrics for Hudi

2022-08-25 Thread GitBox


jsbali opened a new pull request, #6502:
URL: https://github.com/apache/hudi/pull/6502

   ### Change Logs
   
   Added metrics for the following in the locking infra (a short sketch of the 
instrumentation idea follows the list):
   
   1. Lock request latency
   2. Count of lock successes
   3. Count of failures to acquire the lock
   4. Duration of locks held, with support for re-entrancy
   5. Conflict resolution metrics: success vs failure
   
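   Not the actual implementation in this PR, but a minimal sketch of how such 
metrics can be captured around a re-entrant lock (class and field names below 
are hypothetical):
   
```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only; not the classes added in this PR.
public class InstrumentedLock {
  private final ReentrantLock lock = new ReentrantLock();
  private final AtomicLong acquireSuccessCount = new AtomicLong();     // count of lock successes
  private final AtomicLong acquireFailureCount = new AtomicLong();     // count of failed acquisitions
  private final AtomicLong lastAcquireLatencyNanos = new AtomicLong(); // lock request latency
  private final AtomicLong lastHeldDurationNanos = new AtomicLong();   // duration the lock was held
  private volatile long heldSinceNanos;

  public boolean tryLock() {
    long start = System.nanoTime();
    boolean acquired = lock.tryLock();
    lastAcquireLatencyNanos.set(System.nanoTime() - start);
    if (acquired) {
      acquireSuccessCount.incrementAndGet();
      // Only start the held-duration clock on the outermost acquisition (re-entrancy support).
      if (lock.getHoldCount() == 1) {
        heldSinceNanos = System.nanoTime();
      }
    } else {
      acquireFailureCount.incrementAndGet();
    }
    return acquired;
  }

  public void unlock() {
    // Only record the held duration when releasing the outermost hold.
    if (lock.getHoldCount() == 1) {
      lastHeldDurationNanos.set(System.nanoTime() - heldSinceNanos);
    }
    lock.unlock();
  }
}
```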
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"

2022-08-25 Thread GitBox


hudi-bot commented on PR #6501:
URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227716705

   
   ## CI report:
   
   * f07b0630b9654b1c9b10ff5efc0e5989625404da Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10955)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView

2022-08-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4721:
-
Labels: pull-request-available  (was: )

> Fix thread safety w/ RemoteTableFileSystemView 
> ---
>
> Key: HUDI-4721
> URL: https://issues.apache.org/jira/browse/HUDI-4721
> Project: Apache Hudi
>  Issue Type: Test
>  Components: reader-core, writer-core
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> After the retry mechanism was added to RemoteTableFileSystemView, it looks 
> like the code is not thread safe. 
>  
> [https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946]
>  
> This might impact regular flows as well even if no retries are enabled. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6501: [HUDI-4721] Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"

2022-08-25 Thread GitBox


hudi-bot commented on PR #6501:
URL: https://github.com/apache/hudi/pull/6501#issuecomment-1227711584

   
   ## CI report:
   
   * f07b0630b9654b1c9b10ff5efc0e5989625404da UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6500: [HUDI-4720] Fix HoodieInternalRow return wrong num of fields when sou…

2022-08-25 Thread GitBox


hudi-bot commented on PR #6500:
URL: https://github.com/apache/hudi/pull/6500#issuecomment-1227706366

   
   ## CI report:
   
   * 5edcd57668db6ed3de47f484020d00600b3e8d81 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10952)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan opened a new pull request, #6501: Revert "[HUDI-3669] Add a remote request retry mechanism for 'RemoteHoodietablefiles… (#5884)"

2022-08-25 Thread GitBox


nsivabalan opened a new pull request, #6501:
URL: https://github.com/apache/hudi/pull/6501

   This reverts commit 660177bce1cd82975d7c25715497e0d2fbb2a95e.
   
   ### Change Logs
   
   Some [thread safety 
issues](https://github.com/apache/hudi/pull/5884/files#r955363946) were 
detected with this feature added. Reverting it for now. I will let the author 
put up a new patch with a proper fix. 
   
   ### Impact
   
   Could result in wrong data being served. 
   
   **Risk level: high**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4721) Fix thread safety w/ RemoteTableFileSystemView

2022-08-25 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-4721:
-

 Summary: Fix thread safety w/ RemoteTableFileSystemView 
 Key: HUDI-4721
 URL: https://issues.apache.org/jira/browse/HUDI-4721
 Project: Apache Hudi
  Issue Type: Test
  Components: reader-core, writer-core
Reporter: sivabalan narayanan


After the retry mechanism was added to RemoteTableFileSystemView, it looks 
like the code is not thread safe. 

 

[https://github.com/apache/hudi/pull/5884/files#diff-0d301525ef388eb460372ea300c827728c954fdda799adfce7040158ec8b1d84R183|https://github.com/apache/hudi/pull/5884/files#r955363946]

 

This might impact regular flows as well even if no retries are enabled. 

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] yihua commented on a diff in pull request #5884: [HUDI-3669] Add a remote request retry mechanism for 'Remotehoodietablefiles…

2022-08-25 Thread GitBox


yihua commented on code in PR #5884:
URL: https://github.com/apache/hudi/pull/5884#discussion_r955364712


##
hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java:
##
@@ -165,17 +179,9 @@ private  T executeRequest(String requestPath, 
Map queryParame
 
 String url = builder.toString();
 LOG.info("Sending request : (" + url + ")");
-Response response;
-int timeout = this.timeoutSecs * 1000; // msec
-switch (method) {
-  case GET:
-response = 
Request.Get(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-break;
-  case POST:
-  default:
-response = 
Request.Post(url).connectTimeout(timeout).socketTimeout(timeout).execute();
-break;
-}
+// Reset url and method, to avoid repeatedly instantiating objects.
+urlCheckedFunc.setUrlAndMethod(url, method);
+Response response =  retryHelper != null ? 
retryHelper.tryWith(urlCheckedFunc).start() : urlCheckedFunc.get();

Review Comment:
   @LinMingQiang @danny0405 Every request goes through this flow and the 
`urlCheckedFunc` should not be shared across requests.  The logic here is 
incorrect.  This can cause serious correctness problems under concurrency.
   
   We need to revert this logic.  I also suggest that we guard against such 
changes in the hot path with a flag.
   
   cc @nsivabalan @rmahindra123 
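   
   Not a concrete patch, but a minimal, self-contained sketch of the difference 
between the shared mutable holder flagged above and a per-request snapshot of 
the url/method state (class and method names are hypothetical):
   
```java
import java.util.function.Supplier;

public class PerRequestStateSketch {

  // Anti-pattern (what this review flags): one shared holder whose state is mutated per request,
  // so concurrent callers can overwrite each other's url before the request executes.
  static class SharedRequestHolder implements Supplier<String> {
    private volatile String url;
    void setUrl(String url) { this.url = url; }
    @Override public String get() { return "GET " + url; }
  }

  // Safer shape: capture the url per request; each call gets an immutable snapshot of its own state.
  static Supplier<String> perRequestCall(String url) {
    return () -> "GET " + url;
  }

  public static void main(String[] args) {
    SharedRequestHolder shared = new SharedRequestHolder();
    shared.setUrl("/fileslices?partition=a");   // another thread may set a different url here
    System.out.println(shared.get());

    Supplier<String> call = perRequestCall("/fileslices?partition=a");
    System.out.println(call.get());             // always sees the url it was created with
  }
}
```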



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


