[GitHub] [hudi] hudi-bot commented on pull request #6722: [HUDI-4326] Fix hive sync serde properties

2022-09-19 Thread GitBox


hudi-bot commented on PR #6722:
URL: https://github.com/apache/hudi/pull/6722#issuecomment-1251874266

   
   ## CI report:
   
   * e70039066ec51b7328e4f69cb2f266bdd1cc065e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11521)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS

2022-09-19 Thread GitBox


hudi-bot commented on PR #6665:
URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251874119

   
   ## CI report:
   
   * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509)
 
   * 2c9da2578674f51f1a68db91b3a1defe29d5cfcc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11520)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] codope commented on a diff in pull request #6722: [HUDI-4326] Fix hive sync serde properties

2022-09-19 Thread GitBox


codope commented on code in PR #6722:
URL: https://github.com/apache/hudi/pull/6722#discussion_r974911483


##
hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java:
##
@@ -303,6 +301,7 @@ public void testSyncCOWTableWithProperties(boolean useSchemaFromCommitMetadata,
     hiveDriver.run("SHOW CREATE TABLE " + dbTableName);
     hiveDriver.getResults(results);
     String ddl = String.join("\n", results);
+    assertTrue(ddl.contains(String.format("ROW FORMAT SERDE \n  '%s'", ParquetHiveSerDe.class.getName())));

Review Comment:
   Is `ROW FORMAT SERDE \n '%s'` a fixed format? The `getTable` API in the Hive client was introduced previously just for this testing purpose.
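   For example, a check through the metastore object could look roughly like this (a sketch only, reusing the `hiveClient`/`HiveTestUtil` helpers that other tests in this class already use):

```java
// Sketch: assert the serde via the Hive client's getTable API instead of parsing
// the SHOW CREATE TABLE output, so the test does not depend on DDL text layout.
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe;

// ... inside the test method:
assertEquals(
    ParquetHiveSerDe.class.getName(),
    hiveClient.getTable(HiveTestUtil.TABLE_NAME).getSd().getSerdeInfo().getSerializationLib(),
    "SerDe info not updated or does not match");
```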



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6722: [HUDI-4326] Fix hive sync serde properties

2022-09-19 Thread GitBox


hudi-bot commented on PR #6722:
URL: https://github.com/apache/hudi/pull/6722#issuecomment-1251871135

   
   ## CI report:
   
   * e70039066ec51b7328e4f69cb2f266bdd1cc065e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS

2022-09-19 Thread GitBox


hudi-bot commented on PR #6665:
URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251871016

   
   ## CI report:
   
   * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509)
 
   * 2c9da2578674f51f1a68db91b3a1defe29d5cfcc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251870554

   
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512)
 
   * e75f6d0031490025107040c1b0093c3c5720a67d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11519)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


hudi-bot commented on PR #6580:
URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251867741

   
   ## CI report:
   
   * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN
   * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN
   * 675221955f01a2a4fdc138af346fc78a2d11a41b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11513)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251867212

   
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512)
 
   * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan opened a new pull request, #6722: [HUDI-4326] Fix hive sync serde properties

2022-09-19 Thread GitBox


xushiyan opened a new pull request, #6722:
URL: https://github.com/apache/hudi/pull/6722

   ### Change Logs
   
   Improve the API and refactor the meta sync code for serde properties.
   
   ### Impact
   
   **Risk level: low**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-09-19 Thread GitBox


xushiyan commented on code in PR #5920:
URL: https://github.com/apache/hudi/pull/5920#discussion_r974895469


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java:
##
@@ -290,6 +290,9 @@ private boolean syncSchema(String tableName, boolean tableExists, boolean useRea
     // Sync the table properties if the schema has changed
     if (config.getString(HIVE_TABLE_PROPERTIES) != null || config.getBoolean(HIVE_SYNC_AS_DATA_SOURCE_TABLE)) {
       syncClient.updateTableProperties(tableName, tableProperties);
+      HoodieFileFormat baseFileFormat = HoodieFileFormat.valueOf(config.getStringOrDefault(META_SYNC_BASE_FILE_FORMAT).toUpperCase());
+      String serDeFormatClassName = HoodieInputFormatUtils.getSerDeClassName(baseFileFormat);
+      syncClient.updateTableSerDeInfo(tableName, serDeFormatClassName, serdeProperties);

Review Comment:
   We don't need to pass the serde class through the API; it's controlled by the base file format, which is taken from the sync config.
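   A rough sketch of what that could look like inside the sync client (the two-argument overload shown here is hypothetical; the derivation mirrors the lines in this diff):

```java
// Sketch only: the sync client already holds the sync config, so it can derive the
// serde class from the base file format itself and callers never pass it in.
public void updateTableSerDeInfo(String tableName, Map<String, String> serdeProperties) {
  HoodieFileFormat baseFileFormat =
      HoodieFileFormat.valueOf(config.getStringOrDefault(META_SYNC_BASE_FILE_FORMAT).toUpperCase());
  String serDeClassName = HoodieInputFormatUtils.getSerDeClassName(baseFileFormat);
  updateTableSerDeInfo(tableName, serDeClassName, serdeProperties); // existing three-argument method
}
```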



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [MINOR] fix indent to make build pass (#6721)

2022-09-19 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6b7157536d [MINOR] fix indent to make build pass (#6721)
6b7157536d is described below

commit 6b7157536d52ee1853a0ff362d750f2d473566c2
Author: Yann Byron 
AuthorDate: Tue Sep 20 13:08:42 2022 +0800

[MINOR] fix indent to make build pass (#6721)
---
 .../src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
 
b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
index 7b58f5c3f3..ba9d33a662 100644
--- 
a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
+++ 
b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
@@ -160,8 +160,8 @@ public class TestHiveSyncTool {
 assertTrue(hiveClient.tableExists(HiveTestUtil.TABLE_NAME),
 "Table " + HiveTestUtil.TABLE_NAME + " should exist after sync 
completes");
 assertEquals("org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
-  
hiveClient.getTable(HiveTestUtil.TABLE_NAME).getSd().getSerdeInfo().getSerializationLib(),
-  "SerDe info not updated or does not match");
+
hiveClient.getTable(HiveTestUtil.TABLE_NAME).getSd().getSerdeInfo().getSerializationLib(),
+"SerDe info not updated or does not match");
 assertEquals(hiveClient.getMetastoreSchema(HiveTestUtil.TABLE_NAME).size(),
 hiveClient.getStorageSchema().getColumns().size() + 1,
 "Hive Schema should match the table schema + partition field");



[GitHub] [hudi] xushiyan merged pull request #6721: [MINOR] fix indent to make build pass

2022-09-19 Thread GitBox


xushiyan merged PR #6721:
URL: https://github.com/apache/hudi/pull/6721


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6721: [MINOR] fix indent to make build pass

2022-09-19 Thread GitBox


hudi-bot commented on PR #6721:
URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251834580

   
   ## CI report:
   
   * d1f2d32e451793819be47b80d18ac96252165daa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11517)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


hudi-bot commented on PR #6697:
URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251834522

   
   ## CI report:
   
   * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
 
   * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515)
 
   * 539da709d04c12834b7105daa0fc3aa80f398ef2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11516)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251834066

   
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470)
 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512)
 
   * e75f6d0031490025107040c1b0093c3c5720a67d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Teng Huo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606881#comment-17606881
 ] 

Teng Huo commented on HUDI-4880:


Thanks Danny for replying.

Yeah, the new marker files are generated properly during writing. But the problem here is that the old markers are deleted if a task for the same compaction request instant failed previously; the commit action in the next task then doesn't know about the files left behind by the previous failed task, because all of the surviving marker files were generated by the second compaction attempt. As a result, the reconciling code can't work properly.

Here, I assume a marker file is generated whenever a new parquet file is generated. I haven't read the code for how these marker files are created, so please correct me if I'm wrong.

> Corrupted parquet file found in Hudi Flink MOR pipeline
> ---
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>
> h2. Env
> Hudi version : 0.11.1 (but I believe this issue still exist in the current 
> version)
> Flink version : 1.13
> Pipeline type: MOR, online compaction
> h2. TLDR
> Marker mechanism for cleaning corrupted parquet files is not effective now in 
> Flink MOR online compaction due to this PR: 
> [https://github.com/apache/hudi/pull/5611]
> h2. Issue description
> Recently, we suffered an issue which said there were corrupted parquet files 
> in Hudi table, so this Hudi table is not readable, or compaction task will 
> constantly fail.
> e.g. Spark application complained this parquet file is too small.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 
> (TID 156) (executor 6): java.lang.RuntimeException: 
> hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
>  is not a Parquet file (too small length: 0)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
> {code}
> h2. Root cause
> After trouble shooting, I believe we might find the root cause of this issue.
> At the beginning, this Flink MOR pipeline failed due to some reason, which 
> left a bunch of unfinished parquet files in this Hudi table. It is acceptable 
> for Hudi because we can clean them later with "Marker" in the method 
> "finalizeWrite". It will call a method named "reconcileAgainstMarkers". It 
> will find out these files which are in the marker folder, but not in the 
> commit metadata, mark them as corrupted files, then delete them.
> However, I found this part of code didn't work properly as expect, this 
> corrupted parquet file 
> "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was 
> not deleted in "20220919020324533.commit".
> Then, we found there is [a piece of 
> code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134]
>  deleting the marker folder at the beginning of every batch of compaction. 
> This causes the mechanism of deleting corrupt files to be a failure, since 
> all marker files created before the current batch were deleted.
> And we found HDFS audit logs showing this marker folder 
> "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a 
> single Flink application, which proved the current behavior of 
> "CompactionPlanOperator", it deletes marker folder every time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6721: [MINOR] fix indent to make build pass

2022-09-19 Thread GitBox


hudi-bot commented on PR #6721:
URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251828724

   
   ## CI report:
   
   * d1f2d32e451793819be47b80d18ac96252165daa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


hudi-bot commented on PR #6697:
URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251828487

   
   ## CI report:
   
   * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
 
   * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515)
 
   * 539da709d04c12834b7105daa0fc3aa80f398ef2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on pull request #6721: [MINOR] fix indent to make build pass

2022-09-19 Thread GitBox


YannByron commented on PR #6721:
URL: https://github.com/apache/hudi/pull/6721#issuecomment-1251822538

   @nsivabalan @xushiyan https://github.com/apache/hudi/pull/5920 introduces an indentation problem that causes the build/CI to fail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606876#comment-17606876
 ] 

Danny Chen commented on HUDI-4880:
--

Thanks for the nice analysis. The purpose of this PR is to clean the marker files on each compaction task start-up, but the compaction task then re-generates these markers while writing, so when the compaction is committed, the marker dir/files exist, right?

> Corrupted parquet file found in Hudi Flink MOR pipeline
> ---
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>
> h2. Env
> Hudi version : 0.11.1 (but I believe this issue still exist in the current 
> version)
> Flink version : 1.13
> Pipeline type: MOR, online compaction
> h2. TLDR
> Marker mechanism for cleaning corrupted parquet files is not effective now in 
> Flink MOR online compaction due to this PR: 
> [https://github.com/apache/hudi/pull/5611]
> h2. Issue description
> Recently, we suffered an issue which said there were corrupted parquet files 
> in Hudi table, so this Hudi table is not readable, or compaction task will 
> constantly fail.
> e.g. Spark application complained this parquet file is too small.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 
> (TID 156) (executor 6): java.lang.RuntimeException: 
> hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
>  is not a Parquet file (too small length: 0)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
> {code}
> h2. Root cause
> After trouble shooting, I believe we might find the root cause of this issue.
> At the beginning, this Flink MOR pipeline failed due to some reason, which 
> left a bunch of unfinished parquet files in this Hudi table. It is acceptable 
> for Hudi because we can clean them later with "Marker" in the method 
> "finalizeWrite". It will call a method named "reconcileAgainstMarkers". It 
> will find out these files which are in the marker folder, but not in the 
> commit metadata, mark them as corrupted files, then delete them.
> However, I found this part of code didn't work properly as expect, this 
> corrupted parquet file 
> "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was 
> not deleted in "20220919020324533.commit".
> Then, we found there is [a piece of 
> code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134]
>  deleting the marker folder at the beginning of every batch of compaction. 
> This causes the mechanism of deleting corrupt files to be a failure, since 
> all marker files created before the current batch were deleted.
> And we found HDFS audit logs showing this marker folder 
> "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a 
> single Flink application, which proved the current behavior of 
> "CompactionPlanOperator", it deletes marker folder every time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] YannByron opened a new pull request, #6721: [MINOR] fix indent to make build pass

2022-09-19 Thread GitBox


YannByron opened a new pull request, #6721:
URL: https://github.com/apache/hudi/pull/6721

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was 
copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance 
impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to 
mitigate the risks._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-19 Thread GitBox


wzx140 commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r974865936


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -505,7 +514,9 @@ public class HoodieWriteConfig extends HoodieConfig {
   private HoodieMetadataConfig metadataConfig;
   private HoodieMetastoreConfig metastoreConfig;
   private HoodieCommonConfig commonConfig;
+  private HoodieStorageConfig storageConfig;
   private EngineType engineType;
+  private HoodieRecordMerger recordMerger;

Review Comment:
   `getRecordMerger` will be called more than once to get the record type (SPARK, AVRO). Would holding a `recordMerger` instance be better? Or we could make it lazily loaded.
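   A minimal sketch of the lazy-loading variant (the `createRecordMerger()` helper below is hypothetical; the point is to resolve the merger once and reuse it):

```java
// Sketch only: cache the resolved merger so repeated getRecordMerger() calls
// (e.g. just to read the record type) do not re-resolve it.
private transient HoodieRecordMerger recordMerger;

public HoodieRecordMerger getRecordMerger() {
  if (recordMerger == null) {
    recordMerger = createRecordMerger(); // hypothetical: resolve from the configured merger impls/strategy
  }
  return recordMerger;
}
```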



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] TengHuo commented on pull request #5611: [HUDI-4108] Clean the marker files before starting new flink compaction

2022-09-19 Thread GitBox


TengHuo commented on PR #5611:
URL: https://github.com/apache/hudi/pull/5611#issuecomment-1251800694

   We found an issue which might be related to this PR; details in 
https://issues.apache.org/jira/browse/HUDI-4880
   
   May I ask if anyone can double-check whether it is the root cause? Really appreciated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Teng Huo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606864#comment-17606864
 ] 

Teng Huo commented on HUDI-4880:


Related with this issue: https://issues.apache.org/jira/browse/HUDI-4108

> Corrupted parquet file found in Hudi Flink MOR pipeline
> ---
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>
> h2. Env
> Hudi version : 0.11.1 (but I believe this issue still exist in the current 
> version)
> Flink version : 1.13
> Pipeline type: MOR, online compaction
> h2. TLDR
> Marker mechanism for cleaning corrupted parquet files is not effective now in 
> Flink MOR online compaction due to this PR: 
> [https://github.com/apache/hudi/pull/5611]
> h2. Issue description
> Recently, we suffered an issue which said there were corrupted parquet files 
> in Hudi table, so this Hudi table is not readable, or compaction task will 
> constantly fail.
> e.g. Spark application complained this parquet file is too small.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 
> (TID 156) (executor 6): java.lang.RuntimeException: 
> hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
>  is not a Parquet file (too small length: 0)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
> {code}
> h2. Root cause
> After trouble shooting, I believe we might find the root cause of this issue.
> At the beginning, this Flink MOR pipeline failed due to some reason, which 
> left a bunch of unfinished parquet files in this Hudi table. It is acceptable 
> for Hudi because we can clean them later with "Marker" in the method 
> "finalizeWrite". It will call a method named "reconcileAgainstMarkers". It 
> will find out these files which are in the marker folder, but not in the 
> commit metadata, mark them as corrupted files, then delete them.
> However, I found this part of code didn't work properly as expect, this 
> corrupted parquet file 
> "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was 
> not deleted in "20220919020324533.commit".
> Then, we found there is [a piece of 
> code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134]
>  deleting the marker folder at the beginning of every batch of compaction. 
> This causes the mechanism of deleting corrupt files to be a failure, since 
> all marker files created before the current batch were deleted.
> And we found HDFS audit logs showing this marker folder 
> "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a 
> single Flink application, which proved the current behavior of 
> "CompactionPlanOperator", it deletes marker folder every time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] IsisPolei opened a new issue, #6720: [SUPPORT]Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (

2022-09-19 Thread GitBox


IsisPolei opened a new issue, #6720:
URL: https://github.com/apache/hudi/issues/6720

   hudi:0.10.1
   spark:3.1.3_scala2.12
   
   Background story:
   I use SparkRDDWriteClient to write to Hudi. Both the app and the Spark standalone cluster run in Docker. When the app and the Spark cluster containers run on the same local machine, my app works well. But when I deploy the Spark cluster on a different machine, I get a series of connection problems.
   machineA (192.168.64.107): Spark driver (SparkRDDWriteClient app)
   machineB (192.168.64.121): Spark standalone cluster (master and worker running in two containers)
   Due to the Spark network connection mechanism, I have set the connection params below:
   
   spark.master.url: spark://192.168.64.121:7077
   spark.driver.bindAddress: 0.0.0.0
   spark.driver.host: 192.168.64.107
   spark.driver.port: 1
   
   The HoodieSparkContext initializes correctly and I can see the Spark job running in the Spark web UI. But when the code reaches sparkRDDWriteClient.upsert(), this exception occurs:
   
   Caused by: org.apache.hudi.exception.HoodieRemoteException: Connect to 
192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (Connection 
refused) at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.refresh(RemoteHoodieTableFileSystemView.java:420)
   at 
org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.sync(RemoteHoodieTableFileSystemView.java:484)
at 
org.apache.hudi.common.table.view.PriorityBasedFileSystemView.sync(PriorityBasedFileSystemView.java:257)
at 
org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:493)
at 
org.apache.hudi.client.SparkRDDWriteClient.getTableAndInitCtx(SparkRDDWriteClient.java:448)
at 
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:157)
   Caused by: org.apache.http.conn.HttpHostConnectException: Connect to 
192.168.64.107:34446 [/192.168.64.107] failed: Connection refused (Connection 
refused)
at 
org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:156)
at 
org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
at 
org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
at 
org.apache.http.client.fluent.Request.internalExecute(Request.java:173)
at org.apache.http.client.fluent.Request.execute(Request.java:177)
at org.apache.hudi.common.table.view
   
   It seems like these two containers can't connect to each other through 
hoodie.filesystem.view.remote.port. So I exposed this port on my app container, but it doesn't work. Please tell me what I did wrong.
   These are my docker-compose.yml files:
   app:
     app:
       image: xxx
       container_name: app
       ports:
         - "5008:5005"
         - "1:1"
         - "10001:10001"

   spark:
     version: '3'
     services:
       master:
         image: bitnami/spark:3.1
         container_name: master
         hostname: master
         environment:
           MASTER: spark://master:7077
         restart: always
         ports:
           - "7077:7077"
           - "9080:8080"
       worker:
         image: bitnami/spark:3.1
         container_name: worker
         restart: always
         environment:
           SPARK_WORKER_CORES: 5
           SPARK_WORKER_MEMORY: 2g
           SPARK_WORKER_PORT: 8881
         depends_on:
           - master
         links:
           - master
         ports:
           - "8081:8081"
         expose:
           - "8881"
   
   I hope I described the situation clearly; please help.
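   For reference, a minimal sketch of the kind of setting involved (assuming the property key mentioned above; the port value 26754 is only an example, and the same port would also have to be published on the app/driver container):

```java
import java.util.Properties;

public class TimelineServerPortSketch {
  public static void main(String[] args) {
    // Sketch only: pin the remote file-system-view port to a fixed value instead of
    // a random one, so it can be exposed from the driver container in docker-compose.
    Properties hudiProps = new Properties();
    hudiProps.setProperty("hoodie.filesystem.view.remote.port", "26754");
    System.out.println(hudiProps);
  }
}
```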


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-19 Thread GitBox


wzx140 commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r974857657


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecord.java:
##
@@ -291,59 +284,51 @@ public void checkState() {
 }
   }
 
-  
//
-
-  //
-  // NOTE: This method duplicates those ones of the HoodieRecordPayload and 
are placed here
-  //   for the duration of RFC-46 implementation, until migration off 
`HoodieRecordPayload`
-  //   is complete
-  //
-  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema 
readerSchema, Schema writerSchema) throws IOException;
+  /**
+   * Get column in record to support RDDCustomColumnsSortPartitioner
+   */
+  public abstract Object getRecordColumnValues(Schema recordSchema, String[] 
columns, boolean consistentLogicalTimestampEnabled);
 
-  public abstract HoodieRecord rewriteRecord(Schema recordSchema, Schema 
targetSchema, TypedProperties props) throws IOException;
+  /**
+   * Support bootstrap.
+   */
+  public abstract HoodieRecord mergeWith(HoodieRecord other, Schema 
targetSchema) throws IOException;

Review Comment:
   What this method does is stitch two records together. The logic of the bootstrap merge is fixed, so we should not let users customize the implementation of this method, right?
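   A rough sketch of what "stitching" means here (illustrative only, not the actual implementation): copy each field of the target schema from whichever side defines it.

```java
// Sketch: stitch a skeleton record (e.g. meta columns) and a data record into a
// single record of the target schema, field by field.
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

final class StitchSketch {
  static GenericRecord stitch(GenericRecord skeleton, GenericRecord data, Schema targetSchema) {
    GenericRecord stitched = new GenericData.Record(targetSchema);
    for (Schema.Field field : targetSchema.getFields()) {
      String name = field.name();
      if (skeleton.getSchema().getField(name) != null) {
        stitched.put(name, skeleton.get(name));
      } else if (data.getSchema().getField(name) != null) {
        stitched.put(name, data.get(name));
      }
    }
    return stitched;
  }
}
```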



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Teng Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Huo updated HUDI-4880:
---
Description: 
h2. Env

Hudi version : 0.11.1 (but I believe this issue still exists in the current version)
Flink version : 1.13
Pipeline type: MOR, online compaction
h2. TLDR

Marker mechanism for cleaning corrupted parquet files is not effective now in 
Flink MOR online compaction due to this PR: 
[https://github.com/apache/hudi/pull/5611]
h2. Issue description

Recently, we suffered an issue where corrupted parquet files were reported in a Hudi table, so the Hudi table is not readable, or the compaction task will constantly fail.

e.g. Spark application complained this parquet file is too small.
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in 
stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 
(TID 156) (executor 6): java.lang.RuntimeException: 
hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
 is not a Parquet file (too small length: 0)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
{code}
h2. Root cause

After troubleshooting, I believe we have found the root cause of this issue.

At the beginning, this Flink MOR pipeline failed for some reason, which left a bunch of unfinished parquet files in the Hudi table. That is acceptable for Hudi, because they can be cleaned later with the "Marker" mechanism in the method "finalizeWrite". It calls a method named "reconcileAgainstMarkers", which finds the files that are in the marker folder but not in the commit metadata, marks them as corrupted files, and then deletes them.

However, I found that this part of the code didn't work as expected: the corrupted parquet file "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was not deleted in "20220919020324533.commit".

Then we found [a piece of code|https://github.com/apache/hudi/blob/13eb892081fc4ddd5e1592ef8698831972012666/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java#L134] that deletes the marker folder at the beginning of every batch of compaction. This defeats the mechanism for deleting corrupt files, since all marker files created before the current batch are deleted.

And we found HDFS audit logs showing that this marker folder "hdfs://.../.hoodie/.temp/20220919020324533" was deleted multiple times in a single Flink application, which confirms the current behavior of "CompactionPlanOperator": it deletes the marker folder every time.

  was:Adding content
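An illustrative sketch (not Hudi's code) of the reconcile step described above: a file that has a marker but is missing from the commit metadata is treated as a corrupted leftover and deleted, so once the marker folder is wiped before the retry, the failed attempt's files can never be reconciled.
{code:java}
import java.util.HashSet;
import java.util.Set;

final class ReconcileAgainstMarkersSketch {
  // marker exists but file was not committed -> treat as corrupted and delete
  static Set<String> filesToDelete(Set<String> markedFiles, Set<String> committedFiles) {
    Set<String> corrupted = new HashSet<>(markedFiles);
    corrupted.removeAll(committedFiles);
    return corrupted;
  }

  public static void main(String[] args) {
    // Attempt 1 failed after writing a parquet file; attempt 2 wiped the marker folder,
    // so only attempt 2's marker survives and attempt 1's leftover file is never deleted.
    Set<String> markersAfterWipe = Set.of("fileid_attempt2.parquet");
    Set<String> committed = Set.of("fileid_attempt2.parquet");
    System.out.println(filesToDelete(markersAfterWipe, committed)); // prints []
  }
}
{code}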


> Corrupted parquet file found in Hudi Flink MOR pipeline
> ---
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>
> h2. Env
> Hudi version : 0.11.1 (but I believe this issue still exist in the current 
> version)
> Flink version : 1.13
> Pipeline type: MOR, online compaction
> h2. TLDR
> Marker mechanism for cleaning corrupted parquet files is not effective now in 
> Flink MOR online compaction due to this PR: 
> [https://github.com/apache/hudi/pull/5611]
> h2. Issue description
> Recently, we suffered an issue which said there were corrupted parquet files 
> in Hudi table, so this Hudi table is not readable, or compaction task will 
> constantly fail.
> e.g. Spark application complained this parquet file is too small.
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 66 in 
> stage 12.0 failed 4 times, most recent failure: Lost task 66.3 in stage 12.0 
> (TID 156) (executor 6): java.lang.RuntimeException: 
> hdfs://.../0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet
>  is not a Parquet file (too small length: 0)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:514)
> {code}
> h2. Root cause
> After trouble shooting, I believe we might find the root cause of this issue.
> At the beginning, this Flink MOR pipeline failed due to some reason, which 
> left a bunch of unfinished parquet files in this Hudi table. It is acceptable 
> for Hudi because we can clean them later with "Marker" in the method 
> "finalizeWrite". It will call a method named "reconcileAgainstMarkers". It 
> will find out these files which are in the marker folder, but not in the 
> commit metadata, mark them as corrupted files, then delete them.
> However, I found this part of code didn't work properly as expect, this 
> corrupted parquet file 
> "0012-5e09-42d0-bf4e-2823a1a4bc7b_3-32-29_20220919020324533.parquet" was 
> not deleted in 

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r974403023


##
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordCompatibilityInterface.java:
##
@@ -18,21 +18,28 @@
 
 package org.apache.hudi.common.model;
 
+import java.io.IOException;
+import java.util.Properties;
 import org.apache.avro.Schema;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.keygen.BaseKeyGenerator;
 
-import java.io.IOException;
-import java.io.Serializable;
-import java.util.Properties;
+public interface HoodieRecordCompatibilityInterface {
 
-/**
- * HoodieMerge defines how to merge two records. It is a stateless component.
- * It can implement the merging logic of HoodieRecord of different engines
- * and avoid the performance consumption caused by the 
serialization/deserialization of Avro payload.
- */
-public interface HoodieMerge extends Serializable {
-  
-  HoodieRecord preCombine(HoodieRecord older, HoodieRecord newer);
+  /**
+   * This method used to extract HoodieKey not through keyGenerator.
+   */
+  HoodieRecord wrapIntoHoodieRecordPayloadWithParams(

Review Comment:
    



##
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieMergedLogRecordScanner.java:
##
@@ -147,19 +141,19 @@ public static HoodieMergedLogRecordScanner.Builder 
newBuilder() {
   }
 
   @Override
-  protected void processNextRecord(HoodieRecord hoodieRecord) throws 
IOException {
+  protected  void processNextRecord(HoodieRecord hoodieRecord) throws 
IOException {
 String key = hoodieRecord.getRecordKey();
 if (records.containsKey(key)) {
   // Merge and store the merged record. The HoodieRecordPayload 
implementation is free to decide what should be
   // done when a DELETE (empty payload) is encountered before or after an 
insert/update.
 
-  HoodieRecord oldRecord = records.get(key);
-  HoodieRecordPayload oldValue = oldRecord.getData();
-  HoodieRecordPayload combinedValue = (HoodieRecordPayload) 
merge.preCombine(oldRecord, hoodieRecord).getData();
+  HoodieRecord oldRecord = records.get(key);
+  T oldValue = oldRecord.getData();
+  T combinedValue = ((HoodieRecord) recordMerger.merge(oldRecord, 
hoodieRecord, readerSchema, 
this.hoodieTableMetaClient.getTableConfig().getProps()).get()).getData();
   // If combinedValue is oldValue, no need rePut oldRecord
   if (combinedValue != oldValue) {
-HoodieOperation operation = hoodieRecord.getOperation();
-records.put(key, new HoodieAvroRecord<>(new HoodieKey(key, 
hoodieRecord.getPartitionPath()), combinedValue, operation));
+hoodieRecord.setData(combinedValue);

Review Comment:
   Why are we resetting the data instead of using new `HoodieRecord` returned 
by the Merger?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -126,11 +129,17 @@ public class HoodieWriteConfig extends HoodieConfig {
   .withDocumentation("Payload class used. Override this, if you like to 
roll your own merge logic, when upserting/inserting. "
   + "This will render any value set for PRECOMBINE_FIELD_OPT_VAL 
in-effective");
 
-  public static final ConfigProperty MERGE_CLASS_NAME = ConfigProperty
-  .key("hoodie.datasource.write.merge.class")
-  .defaultValue(HoodieAvroRecordMerge.class.getName())
-  .withDocumentation("Merge class provide stateless component interface 
for merging records, and support various HoodieRecord "
-  + "types, such as Spark records or Flink records.");
+  public static final ConfigProperty MERGER_IMPLS = ConfigProperty
+  .key("hoodie.datasource.write.merger.impls")
+  .defaultValue(HoodieAvroRecordMerger.class.getName())
+  .withDocumentation("List of HoodieMerger implementations constituting 
Hudi's merging strategy -- based on the engine used. "
+  + "These merger impls will filter by 
hoodie.datasource.write.merger.strategy "
+  + "Hudi will pick most efficient implementation to perform 
merging/combining of the records (during update, reading MOR table, etc)");
+
+  public static final ConfigProperty MERGER_STRATEGY = ConfigProperty
+  .key("hoodie.datasource.write.merger.strategy")
+  .defaultValue(StringUtils.DEFAULT_MERGER_STRATEGY_UUID)

Review Comment:
   Let's move this to HoodieMerger, rather than `StringUtils` (we can do it in 
a follow-up)



##
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##
@@ -156,11 +155,10 @@ public class HoodieTableConfig extends HoodieConfig {
   .withDocumentation("Payload class to use for performing compactions, i.e 
merge delta logs with current base file and then "
   + " produce a new base file.");
 
-  public static final ConfigProperty MERGE_CLASS_NAME = ConfigProperty
-  

[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


hudi-bot commented on PR #6580:
URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251791338

   
   ## CI report:
   
   * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN
   * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN
   * 11ba7cd991ca83773aae03b1fd7271364079be21 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11435)
 
   * 675221955f01a2a4fdc138af346fc78a2d11a41b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11513)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


hudi-bot commented on PR #6697:
URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251791430

   
   ## CI report:
   
   * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
 
   * 596d2554f2fff757e40c9f1fad4a02034123fa12 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11515)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251790938

   
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470)
 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11512)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Teng Huo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Teng Huo updated HUDI-4880:
---
Description: Adding content

> Corrupted parquet file found in Hudi Flink MOR pipeline
> ---
>
> Key: HUDI-4880
> URL: https://issues.apache.org/jira/browse/HUDI-4880
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Reporter: Teng Huo
>Assignee: Teng Huo
>Priority: Major
>
> Adding content



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] hudi-bot commented on pull request #6719: [HUDI-4718] Add Kerberos kinit command support for cli.

2022-09-19 Thread GitBox


hudi-bot commented on PR #6719:
URL: https://github.com/apache/hudi/pull/6719#issuecomment-1251788958

   
   ## CI report:
   
   * 4b88f9a2c614e5002bdb029a267a6c351386357b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11514)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


hudi-bot commented on PR #6697:
URL: https://github.com/apache/hudi/pull/6697#issuecomment-125170

   
   ## CI report:
   
   * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
 
   * 596d2554f2fff757e40c9f1fad4a02034123fa12 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


hudi-bot commented on PR #6580:
URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251788754

   
   ## CI report:
   
   * ff98ae0dda69ee611e4814fbae9c8ddc0a93a4f1 UNKNOWN
   * 99451dc89547f803eb6823b2baa620096e76459e UNKNOWN
   * 11ba7cd991ca83773aae03b1fd7271364079be21 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11435)
 
   * 675221955f01a2a4fdc138af346fc78a2d11a41b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark

2022-09-19 Thread GitBox


hudi-bot commented on PR #6516:
URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251788682

   
   ## CI report:
   
   * f5cc48018e7d4e542c3d0b2dc677f4b708f03f11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11502)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


hudi-bot commented on PR #6046:
URL: https://github.com/apache/hudi/pull/6046#issuecomment-1251788382

   
   ## CI report:
   
   * 5a16d35ec42bf86e5759ebb155cad40e83aba9f9 UNKNOWN
   * 20f64af242ac3e6df5d1555edf0766e7dcdd698a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11470)
 
   * 2ff0b70e69fcff7cd061a2512dc983ac92a3c87c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4878) Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS

2022-09-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4878:
-
Labels: pull-request-available  (was: )

> Fix incremental cleaning for clean based on LATEST_FILE_VERSIONS
> 
>
> Key: HUDI-4878
> URL: https://issues.apache.org/jira/browse/HUDI-4878
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: cleaning
>Reporter: sivabalan narayanan
>Assignee: nicolas paris
>Priority: Major
>  Labels: pull-request-available
>
> clean based on LATEST_FILE_VERSIONS can be improved further since incremental 
> clean is not enabled for it. Let's see if we can improve this. 
>  
> context from author:
>  
>  
> Currently incremental cleaning is run for both the KEEP_LATEST_COMMITS and 
> KEEP_LATEST_BY_HOURS
> policies. It is not run for KEEP_LATEST_FILE_VERSIONS.
> This can lead to files not being cleaned. This PR fixes the problem by enabling 
> incremental cleaning for KEEP_LATEST_FILE_VERSIONS only.
> Here is the scenario of the problem:
> Say we have 3 committed files in partition-A and we add a new commit in 
> partition-B, and we trigger cleaning for the first time (full partition scan):
>  {{partition-A/
> commit-0.parquet
> commit-1.parquet
> commit-2.parquet
> partition-B/
> commit-3.parquet}}
> In the case where, say, we have KEEP_LATEST_COMMITS with CLEANER_COMMITS_RETAINED=3, 
> the cleaner will remove commit-0.parquet to keep 3 commits.
> For the next cleaning, incremental cleaning will trigger and won't consider 
> partition-A/ until a new commit changes it. If no later commit changes 
> partition-A, then commit-1.parquet will stay forever, even though it should be 
> removed by the cleaner.
> Now, in the case of KEEP_LATEST_FILE_VERSIONS, the cleaner will only keep 
> commit-2.parquet. Then it makes sense that incremental cleaning won't 
> consider partition-A until it is changed, because there is only one commit left.
> This is why incremental cleaning should only be enabled with 
> KEEP_LATEST_FILE_VERSIONS.
> Hope this is clear enough.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] nsivabalan commented on pull request #6498: [HUDI-4878] Fix incremental cleaner use case

2022-09-19 Thread GitBox


nsivabalan commented on PR #6498:
URL: https://github.com/apache/hudi/pull/6498#issuecomment-1251786668

   @parisni : hey, hi. We have a code freeze coming up in a week's time for 
0.12.1. Just wanted to keep you informed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6719: [HUDI-4718] Add Kerberos kinit command support for cli.

2022-09-19 Thread GitBox


hudi-bot commented on PR #6719:
URL: https://github.com/apache/hudi/pull/6719#issuecomment-1251786271

   
   ## CI report:
   
   * 4b88f9a2c614e5002bdb029a267a6c351386357b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark

2022-09-19 Thread GitBox


hudi-bot commented on PR #6516:
URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251786002

   
   ## CI report:
   
   * f5cc48018e7d4e542c3d0b2dc677f4b708f03f11 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11502)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-19 Thread GitBox


nsivabalan commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1251785335

   Also, do you think you can write up a unit test to cover the scenario? We 
have some tests around corrupt blocks already 
[here](https://github.com/apache/hudi/blob/e03c0388d9ddea7e4d650b3f7a64dff41e180d50/hudi-common/src/test/java/org/apache/hudi/common/functional/TestHoodieLogFormat.java#L606).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #4015: [HUDI-2780] Fix the issue of Mor log skipping complete blocks when reading data

2022-09-19 Thread GitBox


nsivabalan commented on PR #4015:
URL: https://github.com/apache/hudi/pull/4015#issuecomment-1251784271

   @hj2016 : thanks for the elaborate explanation. The fix makes sense. The only 
comment I have is that within isBlockCorrupted() we return in 2 to 3 places in case 
of a corrupt block, so maybe we could avoid seeking inside isBlockCorrupted() in the 
corrupt case and reset the stream from the caller's side instead. 
   
   ```
   int blockSize = (int) inputStream.readLong();
   
   // ...
   boolean isBlockCorrupt = isBlockCorrupted(blockSize);
   if (isBlockCorrupt) {
     inputStream.seek(blockStartPos);
     return createCorruptBlock();
   }
   ```
   
   Within isBlockCorrupted(blockSize) we should not do any seek in case of a 
corrupt block. 
   
   Let me know what you think.
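   
   To make the suggestion concrete, here is a minimal sketch of the caller-side reset 
(illustrative method names only; this is not the actual HoodieLogFileReader code):
   
   ```
   private HoodieLogBlock readNextBlock() throws IOException {
     long blockStartPos = inputStream.getPos();      // remember where the block starts
     int blockSize = (int) inputStream.readLong();   // size header of the next block
     if (isBlockCorrupted(blockSize)) {              // pure check: no seek happens inside
       inputStream.seek(blockStartPos);              // caller rewinds exactly once
       return createCorruptBlock();                  // illustrative helper
     }
     // otherwise keep reading the healthy block
     return readValidBlock(blockSize);               // illustrative helper
   }
   ```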
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xushiyan commented on a diff in pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-09-19 Thread GitBox


xushiyan commented on code in PR #5920:
URL: https://github.com/apache/hudi/pull/5920#discussion_r974843935


##
hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveSyncClient.java:
##
@@ -316,4 +353,11 @@ public void updateTableComments(String tableName, 
List fromMetastor
 }
   }
 
+  Table getTable(String tableName) {
+try {
+  return client.getTable(databaseName, tableName);
+} catch (TException e) {
+  throw new HoodieHiveSyncException(String.format("Database: %s, Table: %s 
 does not exist", databaseName, tableName), e);
+}
+  }

Review Comment:
   Where do we use this method? Other methods from this class just call 
client.getTable(). We should not introduce random helpers that lower the code 
quality.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4880) Corrupted parquet file found in Hudi Flink MOR pipeline

2022-09-19 Thread Teng Huo (Jira)
Teng Huo created HUDI-4880:
--

 Summary: Corrupted parquet file found in Hudi Flink MOR pipeline
 Key: HUDI-4880
 URL: https://issues.apache.org/jira/browse/HUDI-4880
 Project: Apache Hudi
  Issue Type: Bug
  Components: compaction, flink
Reporter: Teng Huo
Assignee: Teng Huo






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-19 Thread GitBox


wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251779085

   @alexeykudinkin I have changed the function updateMetadataValues(Schema 
recordSchema, Properties props, MetadataValues metadataValues). 
truncateRecordKey will be put in HoodieRecordCompatibilityInterface.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4326) Hudi spark datasource error after migrate from 0.8 to 0.11

2022-09-19 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan closed HUDI-4326.
-
Resolution: Fixed

> Hudi spark datasource error after migrate from 0.8 to 0.11
> --
>
> Key: HUDI-4326
> URL: https://issues.apache.org/jira/browse/HUDI-4326
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: spark
>Reporter: Kyle Zhike Chen
>Assignee: Kyle Zhike Chen
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> After updating Hudi from 0.8 to 0.11, using {{spark.table(fullTableName)}} to 
> read a Hudi table is not working; the table has been synced to the Hive metastore 
> and Spark is connected to the metastore. The error is
> org.sparkproject.guava.util.concurrent.UncheckedExecutionException: 
> org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.
> ...
> Caused by: org.apache.hudi.exception.HoodieException: 'path' or 'Key: 
> 'hoodie.datasource.read.paths' , default: null description: Comma separated 
> list of file paths to read within a Hudi table. since version: version is not 
> defined deprecated after: version is not defined)' or both must be specified.
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:78)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
> at 
> org.apache.spark.sql.execution.datasources.FindDataSourceTable.$anonfun$readDataSourceTable$1(DataSourceStrategy.scala:261)
> at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
> at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
> at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> After changing the table to a Spark data source table, the table SerDeInfo 
> is missing. I created a pull request.
>  
> related GH issue:
> https://github.com/apache/hudi/issues/5861



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[hudi] branch master updated (b5dc4312a4 -> e03c0388d9)

2022-09-19 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


from b5dc4312a4 [HUDI-4877] Fix 
org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not 
work correct issue (#6717)
 add e03c0388d9 [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool 
(#5920)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/hive/HiveSyncTool.java|  3 ++
 .../org/apache/hudi/hive/HoodieHiveSyncClient.java | 44 ++
 .../org/apache/hudi/hive/TestHiveSyncTool.java |  3 ++
 .../hudi/sync/common/HoodieMetaSyncOperations.java |  6 +++
 4 files changed, 56 insertions(+)



[GitHub] [hudi] nsivabalan merged pull request #5920: [HUDI-4326] add updateTableSerDeInfo for HiveSyncTool

2022-09-19 Thread GitBox


nsivabalan merged PR #5920:
URL: https://github.com/apache/hudi/pull/5920


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


nsivabalan commented on PR #6580:
URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251769488

   @parisni : can you check out the CI failures from the last run 
https://dev.azure.com/apache-hudi-ci-org/apache-hudi-ci/_build/results?buildId=11435=results
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


nsivabalan commented on PR #6580:
URL: https://github.com/apache/hudi/pull/6580#issuecomment-1251769268

   Have pushed out a commit addressing the feedback. We should be good to land once 
CI succeeds.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] microbearz commented on pull request #6516: [HUDI-4729] Fix fq can not be queried in pending compaction when query ro table with spark

2022-09-19 Thread GitBox


microbearz commented on PR #6516:
URL: https://github.com/apache/hudi/pull/6516#issuecomment-1251768404

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on a diff in pull request #6580: [HUDI-4792] Batch clean files to delete

2022-09-19 Thread GitBox


nsivabalan commented on code in PR #6580:
URL: https://github.com/apache/hudi/pull/6580#discussion_r974834548


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -290,9 +296,10 @@ private Pair> 
getFilesToCleanKeepingLatestCommits(S
* @return A {@link Pair} whose left is boolean indicating whether partition 
itself needs to be deleted,
* and right is a list of {@link CleanFileInfo} about the files in 
the partition that needs to be deleted.
*/
-  private Pair> 
getFilesToCleanKeepingLatestCommits(String partitionPath, int commitsRetained, 
HoodieCleaningPolicy policy) {
+  private Map>> 
getFilesToCleanKeepingLatestCommits(List partitionPath, int 
commitsRetained, HoodieCleaningPolicy policy) {
 LOG.info("Cleaning " + partitionPath + ", retaining latest " + 
commitsRetained + " commits. ");
 List deletePaths = new ArrayList<>();
+Map>> map = new HashMap<>();

Review Comment:
   minor. `map` -> `cleanFileInfoPerPartitionMap`



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -290,9 +296,10 @@ private Pair> 
getFilesToCleanKeepingLatestCommits(S
* @return A {@link Pair} whose left is boolean indicating whether partition 
itself needs to be deleted,
* and right is a list of {@link CleanFileInfo} about the files in 
the partition that needs to be deleted.
*/
-  private Pair> 
getFilesToCleanKeepingLatestCommits(String partitionPath, int commitsRetained, 
HoodieCleaningPolicy policy) {
+  private Map>> 
getFilesToCleanKeepingLatestCommits(List partitionPath, int 
commitsRetained, HoodieCleaningPolicy policy) {

Review Comment:
   minor. Let's name the argument in the plural: `partitionPath` -> `partitionPaths`
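   
   A tiny illustration of how the renamed pieces could read together (generic types are 
assumed here since the quoted diff dropped them; this is not the actual CleanPlanner code):
   
   ```
   private Map<String, Pair<Boolean, List<CleanFileInfo>>> getFilesToCleanKeepingLatestCommits(
       List<String> partitionPaths, int commitsRetained, HoodieCleaningPolicy policy) {
     Map<String, Pair<Boolean, List<CleanFileInfo>>> cleanFileInfoPerPartitionMap = new HashMap<>();
     for (String partitionPath : partitionPaths) {
       // ... collect the CleanFileInfo entries to delete for this partition ...
       cleanFileInfoPerPartitionMap.put(partitionPath, Pair.of(false, new ArrayList<>()));
     }
     return cleanFileInfoPerPartitionMap;
   }
   ```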



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


YannByron commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974834574


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This class encapsulates all the cdc-writing functions.
+ */
+public class HoodieCDCLogger implements Closeable {
+
+  private final String commitTime;
+
+  private final String keyField;
+
+  private final Schema dataSchema;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;
+
+  private final Schema cdcSchema;
+
+  // the cdc data
+  private final Map cdcData;
+
+  public HoodieCDCLogger(
+  String commitTime,
+  HoodieWriteConfig config,
+  HoodieTableConfig tableConfig,
+  Schema schema,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes) {
+try {
+  this.commitTime = commitTime;
+  this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
+  this.populateMetaFields = config.populateMetaFields();
+  this.keyField = populateMetaFields ? 
HoodieRecord.RECORD_KEY_METADATA_FIELD
+  : tableConfig.getRecordKeyFieldProp();
+  this.cdcWriter = cdcWriter;
+
+  this.cdcEnabled = 
config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED);
+  this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse(
+  
config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE));
+
+  if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA;
+  } else if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE;
+  } else {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY;
+  }
+
+  this.cdcData = new ExternalSpillableMap<>(
+  maxInMemorySizeInBytes,
+  config.getSpillableMapBasePath(),
+  new DefaultSizeEstimator<>(),
+  new DefaultSizeEstimator<>(),
+  config.getCommonConfig().getSpillableDiskMapType(),
+  config.getCommonConfig().isBitCaskDiskMapCompressionEnabled()
+  );
+} catch (IOException e) {
+  throw new HoodieUpsertException("Failed to 

[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


YannByron commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974829925


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This class encapsulates all the cdc-writing functions.
+ */
+public class HoodieCDCLogger implements Closeable {
+
+  private final String commitTime;
+
+  private final String keyField;
+
+  private final Schema dataSchema;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;
+
+  private final Schema cdcSchema;
+
+  // the cdc data
+  private final Map cdcData;
+
+  public HoodieCDCLogger(
+  String commitTime,
+  HoodieWriteConfig config,
+  HoodieTableConfig tableConfig,
+  Schema schema,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes) {
+try {
+  this.commitTime = commitTime;
+  this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
+  this.populateMetaFields = config.populateMetaFields();
+  this.keyField = populateMetaFields ? 
HoodieRecord.RECORD_KEY_METADATA_FIELD
+  : tableConfig.getRecordKeyFieldProp();
+  this.cdcWriter = cdcWriter;
+
+  this.cdcEnabled = 
config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED);
+  this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse(
+  
config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE));
+
+  if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA;
+  } else if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE;
+  } else {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY;
+  }
+
+  this.cdcData = new ExternalSpillableMap<>(
+  maxInMemorySizeInBytes,
+  config.getSpillableMapBasePath(),
+  new DefaultSizeEstimator<>(),
+  new DefaultSizeEstimator<>(),
+  config.getCommonConfig().getSpillableDiskMapType(),
+  config.getCommonConfig().isBitCaskDiskMapCompressionEnabled()
+  );
+} catch (IOException e) {
+  throw new HoodieUpsertException("Failed to 

[jira] [Updated] (HUDI-4718) Hudi cli does not support Kerberized Hadoop cluster

2022-09-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4718:
-
Labels: pull-request-available  (was: )

> Hudi cli does not support Kerberized Hadoop cluster
> ---
>
> Key: HUDI-4718
> URL: https://issues.apache.org/jira/browse/HUDI-4718
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: cli
>Reporter: Yao Zhang
>Assignee: Yao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>
> The Hudi CLI connect command cannot read a table from a Kerberized Hadoop cluster, and 
> there is no way to perform Kerberos authentication. 
> I plan to add this feature.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] paul8263 opened a new pull request, #6719: [HUDI-4718] Add Kerberos kinit command support for cli.

2022-09-19 Thread GitBox


paul8263 opened a new pull request, #6719:
URL: https://github.com/apache/hudi/pull/6719

   ### Change Logs
   
   Added a Kerberos kinit command for the Hudi CLI. Now it supports connecting to a 
Kerberized Hadoop cluster.
   
   As it may be complicated to prepare a temporary Kerberized environment for 
unit tests, I have tested it with a local Kerberized Hadoop cluster.
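   
   For reviewers unfamiliar with keytab-based login, the core of such a command typically 
boils down to the standard Hadoop UserGroupInformation API, roughly as in the sketch below 
(the principal and keytab path are placeholders, and this is not necessarily how the PR 
wires it into the CLI):
   
   ```
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.security.UserGroupInformation;
   
   Configuration conf = new Configuration();
   conf.set("hadoop.security.authentication", "kerberos");
   UserGroupInformation.setConfiguration(conf);
   // placeholder principal and keytab path
   UserGroupInformation.loginUserFromKeytab("hudi@EXAMPLE.COM", "/etc/security/keytabs/hudi.keytab");
   ```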
   
   ### Impact
   
   It has no impact on performance.
   
   **Risk level: none**
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


YannByron commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974827018


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java:
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.cdc;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import org.apache.hudi.exception.HoodieException;
+
+public class HoodieCDCUtils {
+
+  public static final String CDC_LOGFILE_SUFFIX = "-cdc";
+
+  /* the `op` column represents how a record is changed. */
+  public static final String CDC_OPERATION_TYPE = "op";
+
+  /* the `ts` column represents when a record is changed. */
+  public static final String CDC_COMMIT_TIMESTAMP = "ts";
+
+  /* the pre-image before one record is changed */
+  public static final String CDC_BEFORE_IMAGE = "before";
+
+  /* the post-image after one record is changed */
+  public static final String CDC_AFTER_IMAGE = "after";
+
+  /* the key of the changed record */
+  public static final String CDC_RECORD_KEY = "record_key";
+
+  public static final String[] CDC_COLUMNS = new String[] {
+  CDC_OPERATION_TYPE,
+  CDC_COMMIT_TIMESTAMP,
+  CDC_BEFORE_IMAGE,
+  CDC_AFTER_IMAGE
+  };
+
+  /**
+   * This is the standard CDC output format.
+   * Also, this is the schema of cdc log file in the case 
`hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before_after'.
+   */
+  public static final String CDC_SCHEMA_STRING = 
"{\"type\":\"record\",\"name\":\"Record\","
+  + "\"fields\":["
+  + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"ts\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"before\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"after\",\"type\":[\"string\",\"null\"]}"
+  + "]}";
+
+  public static final Schema CDC_SCHEMA = new 
Schema.Parser().parse(CDC_SCHEMA_STRING);
+
+  /**
+   * The schema of cdc log file in the case 
`hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before'.
+   */
+  public static final String CDC_SCHEMA_OP_RECORDKEY_BEFORE_STRING = 
"{\"type\":\"record\",\"name\":\"Record\","
+  + "\"fields\":["
+  + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"record_key\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"before\",\"type\":[\"string\",\"null\"]}"
+  + "]}";
+
+  public static final Schema CDC_SCHEMA_OP_RECORDKEY_BEFORE =
+  new Schema.Parser().parse(CDC_SCHEMA_OP_RECORDKEY_BEFORE_STRING);
+
+  /**
+   * The schema of cdc log file in the case 
`hoodie.table.cdc.supplemental.logging.mode` is 'op_key'.
+   */
+  public static final String CDC_SCHEMA_OP_AND_RECORDKEY_STRING = 
"{\"type\":\"record\",\"name\":\"Record\","
+  + "\"fields\":["
+  + "{\"name\":\"op\",\"type\":[\"string\",\"null\"]},"
+  + "{\"name\":\"record_key\",\"type\":[\"string\",\"null\"]}"
+  + "]}";
+
+  public static final Schema CDC_SCHEMA_OP_AND_RECORDKEY =
+  new Schema.Parser().parse(CDC_SCHEMA_OP_AND_RECORDKEY_STRING);
+
+  public static final Schema 
schemaBySupplementalLoggingMode(HoodieCDCSupplementalLoggingMode 
supplementalLoggingMode) {
+switch (supplementalLoggingMode) {
+  case WITH_BEFORE_AFTER:
+return CDC_SCHEMA;
+  case WITH_BEFORE:
+return CDC_SCHEMA_OP_RECORDKEY_BEFORE;
+  case OP_KEY:
+return CDC_SCHEMA_OP_AND_RECORDKEY;
+  default:
+throw new HoodieException("not support this supplemental logging mode: 
" + supplementalLoggingMode);
+}
+  }
+
+  /**
+   * Build the cdc record which has all the cdc fields when 
`hoodie.table.cdc.supplemental.logging.mode` is 'cdc_data_before_after'.
+   */
+  public static GenericData.Record cdcRecord(
+  String op, String commitTime, GenericRecord before, GenericRecord after) 
{
+String beforeJsonStr = recordToJson(before);

Review Comment:
   If the before/after value is stored as a JSON string, we can load the cdc 
log file and return it without any serialize/deserialize. IMO, persisting cdc data 
is an operation that consumes more storage in 
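   
   For reference, the serialize cost being weighed here is essentially turning an Avro 
GenericRecord into a JSON string, which can be done with the standard Avro encoder API. 
A generic sketch (not necessarily what HoodieCDCUtils.recordToJson does):
   
   ```
   import java.io.ByteArrayOutputStream;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericDatumWriter;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.io.EncoderFactory;
   import org.apache.avro.io.JsonEncoder;
   
   static String toJson(GenericRecord record, Schema schema) throws Exception {
     ByteArrayOutputStream out = new ByteArrayOutputStream();
     JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
     new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
     encoder.flush();
     return out.toString("UTF-8");
   }
   ```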

[GitHub] [hudi] wzx140 commented on pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-19 Thread GitBox


wzx140 commented on PR #6132:
URL: https://github.com/apache/hudi/pull/6132#issuecomment-1251754051

   @alexeykudinkin Thank you for your suggestion. This will be fixed soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wzx140 commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-19 Thread GitBox


wzx140 commented on PR #5629:
URL: https://github.com/apache/hudi/pull/5629#issuecomment-1251753429

   @alexeykudinkin I have already rebased on master and added the config 
mergerStrategy with a UUID. You can do the final review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


YannByron commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974821737


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This class encapsulates all the cdc-writing functions.
+ */
+public class HoodieCDCLogger implements Closeable {
+
+  private final String commitTime;
+
+  private final String keyField;
+
+  private final Schema dataSchema;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;
+
+  private final Schema cdcSchema;
+
+  // the cdc data
+  private final Map cdcData;
+
+  public HoodieCDCLogger(
+  String commitTime,
+  HoodieWriteConfig config,
+  HoodieTableConfig tableConfig,
+  Schema schema,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes) {
+try {
+  this.commitTime = commitTime;
+  this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
+  this.populateMetaFields = config.populateMetaFields();
+  this.keyField = populateMetaFields ? 
HoodieRecord.RECORD_KEY_METADATA_FIELD
+  : tableConfig.getRecordKeyFieldProp();
+  this.cdcWriter = cdcWriter;
+
+  this.cdcEnabled = 
config.getBooleanOrDefault(HoodieTableConfig.CDC_ENABLED);
+  this.cdcSupplementalLoggingMode = HoodieCDCSupplementalLoggingMode.parse(
+  
config.getStringOrDefault(HoodieTableConfig.CDC_SUPPLEMENTAL_LOGGING_MODE));
+
+  if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE_AFTER))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA;
+  } else if 
(cdcSupplementalLoggingMode.equals(HoodieCDCSupplementalLoggingMode.WITH_BEFORE))
 {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_RECORDKEY_BEFORE;
+  } else {
+this.cdcSchema = HoodieCDCUtils.CDC_SCHEMA_OP_AND_RECORDKEY;
+  }
+
+  this.cdcData = new ExternalSpillableMap<>(

Review Comment:
   Thank you for pointing this out. I will think about it more deeply.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:

[GitHub] [hudi] ankitchandnani commented on issue #6404: [SUPPORT] Hudi Deltastreamer CSV ingestion issue

2022-09-19 Thread GitBox


ankitchandnani commented on issue #6404:
URL: https://github.com/apache/hudi/issues/6404#issuecomment-1251749105

   Hi @nsivabalan, could you please share the exact properties, config, and 
test data you tried out? I am having difficulty making this work for my use 
case. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] YannByron commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


YannByron commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974820909


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java:
##
@@ -0,0 +1,150 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.cdc;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import org.apache.hudi.exception.HoodieException;
+
+public class HoodieCDCUtils {
+
+  public static final String CDC_LOGFILE_SUFFIX = "-cdc";
+
+  /* the `op` column represents how a record is changed. */
+  public static final String CDC_OPERATION_TYPE = "op";
+
+  /* the `ts_ms` column represents when a record is changed. */
+  public static final String CDC_COMMIT_TIMESTAMP = "ts_ms";

Review Comment:
   see https://github.com/apache/hudi/pull/6476#discussion_r974703425
   So, we will choose either `ts` or `ts_ms`. @xushiyan @alexeykudinkin 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974818835


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala:
##
@@ -160,9 +160,6 @@ class BaseFileOnlyRelation(sqlContext: SQLContext,
 fileFormat = fileFormat,
 optParams)(sparkSession)
 } else {
-  val readPathsStr = optParams.get(DataSourceReadOptions.READ_PATHS.key)

Review Comment:
   I see, thanks for clearing that up for me!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] boneanxs commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


boneanxs commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974817505


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##
@@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging {
 partitioner.repartitionRecords(trimmedDF, 
config.getBulkInsertShuffleParallelism)
   }
 
+  /**
+   * Perform bulk insert for [[Dataset]], will not change timeline/index, 
return
+   * information about write files.
+   */
+  def bulkInsert(dataset: Dataset[Row],
+ instantTime: String,
+ table: HoodieTable[_ <: HoodieRecordPayload[_ <: 
HoodieRecordPayload[_ <: AnyRef]], _, _, _],
+ writeConfig: HoodieWriteConfig,
+ partitioner: BulkInsertPartitioner[Dataset[Row]],
+ parallelism: Int,
+ shouldPreserveHoodieMetadata: Boolean): 
HoodieData[WriteStatus] = {
+val repartitionedDataset = partitioner.repartitionRecords(dataset, 
parallelism)
+val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted
+val schema = dataset.schema
+val writeStatuses = 
repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => {
+  val taskContextSupplier: TaskContextSupplier = 
table.getTaskContextSupplier
+  val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+  val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+  val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+  val writer = new BulkInsertDataInternalWriterHelper(
+table,
+writeConfig,
+instantTime,
+taskPartitionId,
+taskId,
+taskEpochId,
+schema,
+writeConfig.populateMetaFields,
+arePartitionRecordsSorted,
+shouldPreserveHoodieMetadata)
+
+  try {
+iter.foreach(writer.write)
+  } catch {
+case t: Throwable =>
+  writer.abort()
+  throw t
+  } finally {
+writer.close()
+  }
+
+  writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
+}).collect()
+table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   `writeStatuses` is an `Array`, but the parallelize function needs a `List`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[hudi] branch master updated: [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue (#6717)

2022-09-19 Thread danny0405
This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new b5dc4312a4 [HUDI-4877] Fix 
org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not 
work correct issue (#6717)
b5dc4312a4 is described below

commit b5dc4312a4038e62374c6a12f601df2889cced56
Author: FocusComputing 
AuthorDate: Tue Sep 20 09:31:50 2022 +0800

[HUDI-4877] Fix 
org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not 
work correct issue (#6717)


Co-authored-by: xiaoxingstack 
---
 .../org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java
index ea6418696c..34728c6cf3 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/index/bucket/TestHoodieSimpleBucketIndex.java
@@ -139,7 +139,9 @@ public class TestHoodieSimpleBucketIndex extends 
HoodieClientTestHarness {
 .filter(r -> 
BucketIdentifier.bucketIdFromFileId(r.getCurrentLocation().getFileId())
 != getRecordBucketId(r)).findAny().isPresent());
 assertTrue(taggedRecordRDD.collectAsList().stream().filter(r -> 
r.getPartitionPath().equals("2015/01/31")
-&& !r.isCurrentLocationKnown()).count() == 1L);
+&& !r.isCurrentLocationKnown()).count() == 1L);
+assertTrue(taggedRecordRDD.collectAsList().stream().filter(r -> 
r.getPartitionPath().equals("2016/01/31")
+&& r.isCurrentLocationKnown()).count() == 3L);
   }
 
   private HoodieWriteConfig makeConfig() {



[GitHub] [hudi] danny0405 merged pull request #6717: [HUDI-4877] Fix org.apache.hudi.index.bucket.TestHoodieSimpleBucketIndex#testTagLocation not work correct issue

2022-09-19 Thread GitBox


danny0405 merged PR #6717:
URL: https://github.com/apache/hudi/pull/6717


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5875: [SUPPORT] Hoodie Delta streamer Job with Kafka Source fetching the same offset again and again Commiting the same offset again and again

2022-09-19 Thread GitBox


nsivabalan commented on issue #5875:
URL: https://github.com/apache/hudi/issues/5875#issuecomment-1251731380

   These are the latest checkpoints as per the logs shared: 
   
   ```
   22/06/14 18:27:12 INFO DeltaSync: Checkpoint to resume from : 
Option{val=dp.packet,0:23816747280,1:23766354517,2:23605513356,3:23434358328,4:23529418069,5:13014012706,6:13111477149,7:13326541492,8:13418171123,9:12818988564,10:13199646406,11:13018719613,12:13171335493,13:13202912834,14:13139217975,15:13177133210,16:1285564,17:13190680738,18:13384909168,19:13076367894,20:12931040506,21:13179604368,22:13103929138,23:13129130931,24:12962399070,25:13209169847,26:13261971379,27:13223540895,28:12800622145,29:13135630345,30:13286314622,31:12910064270,32:13012723305,33:12942700314,34:13300903762,35:13452813697,36:12774627787,37:13149143084,38:13397339159,39:12943639180,40:12850660061,41:13287830095,42:13416968091,43:13251840311,44:12975405300,45:13129020620,46:13319529463,47:13645113762,48:13171132949,49:13341802693,50:13160916594,51:12797360849,52:13231051973,53:13159710596,54:13462835274,55:13218075514,56:13228939350,57:13026346757,58:13197365542,59:12782050600,60:13274602048,61:1301
 
9911553,62:13093034410,63:12946710535,64:12735821947,65:13521932586,66:12885611345,67:12804964853,68:13190226613,69:13119906383,70:13037133163,71:13037077649,72:13249184425,73:13034553149,74:12596466583,75:13197572654,76:13068376212,77:13394048883,78:12949166912,79:12947874565,80:12766226593,81:12887001480,82:12961256747,83:12640403833,84:13209947935,85:12990821869,86:12967972824,87:13062323012,88:12801634102,89:13377026742,90:13075492590,91:12899740426,92:13105955253,93:12811456735,94:13018855871,95:12837481047,96:13143601548,97:12797197623,98:12990305191,99:13092561101,100:13133162523,101:12559759129,102:13091848951,103:12889825622,104:12749143212,105:13041769115,106:13023952197,107:13081277534,108:13043234272,109:13020451301,110:12607811366,111:13056149918,112:13283818745,113:12922522456,114:12828248592,115:12997400759,116:12837921515,117:13035132730,118:12979892771,119:13093824502}
   22/06/14 20:08:29 INFO DeltaSync: Checkpoint to resume from : 
Option{val=dp.packet,0:23821747280,1:23771354517,2:23610513356,3:23439358328,4:23534418069,5:13019012706,6:13116477149,7:13331541492,8:13423171123,9:12823988564,10:13204646406,11:13023719613,12:13176335493,13:13207912834,14:13144217975,15:13182133210,16:1286064,17:13195680738,18:13389909168,19:13081367894,20:12936040506,21:13184604368,22:13108929138,23:13134130931,24:12967399070,25:13214169847,26:13266971379,27:13228540895,28:12805622145,29:13140630345,30:13291314622,31:12915064270,32:13017723305,33:12947700314,34:13305903762,35:13457813697,36:12779627787,37:13154143084,38:13402339159,39:12948639180,40:12855660061,41:13292830095,42:13421968091,43:13256840311,44:12980405300,45:13134020620,46:13324529463,47:13650113762,48:13176132949,49:13346802693,50:13165916594,51:12802360849,52:13236051973,53:13164710596,54:13467835274,55:13223075514,56:13233939350,57:13031346757,58:13202365542,59:12787050600,60:13279602048,61:1302
 
4911553,62:13098034410,63:12951710535,64:12740821947,65:13526932586,66:12890611345,67:12809964853,68:13195226613,69:13124906383,70:13042133163,71:13042077649,72:13254184425,73:13039553149,74:12601466583,75:13202572654,76:13073376212,77:13399048883,78:12954166912,79:12952874565,80:12771226593,81:12892001480,82:12966256747,83:12645403833,84:13214947935,85:12995821869,86:12972972824,87:13067323012,88:12806634102,89:13382026742,90:13080492590,91:12904740426,92:13110955253,93:12816456735,94:13023855871,95:12842481047,96:13148601548,97:12802197623,98:12995305191,99:13097561101,100:13138162523,101:12564759129,102:13096848951,103:12894825622,104:12754143212,105:13046769115,106:13028952197,107:13086277534,108:13048234272,109:13025451301,110:12612811366,111:13061149918,112:13288818745,113:12927522456,114:12833248592,115:13002400759,116:12842921515,117:13040132730,118:12984892771,119:13098824502}
   22/06/14 20:53:50 INFO DeltaSync: Checkpoint to resume from : 
Option{val=dp.packet,0:23826747280,1:23776354517,2:23615513356,3:23444358328,4:23539418069,5:13024012706,6:13121477149,7:13336541492,8:13428171123,9:12828988564,10:13209646406,11:13028719613,12:13181335493,13:13212912834,14:13149217975,15:13187133210,16:1286564,17:13200680738,18:13394909168,19:13086367894,20:12941040506,21:13189604368,22:13113929138,23:13139130931,24:12972399070,25:13219169847,26:13271971379,27:13233540895,28:12810622145,29:13145630345,30:13296314622,31:12920064270,32:13022723305,33:12952700314,34:13310903762,35:13462813697,36:12784627787,37:13159143084,38:13407339159,39:12953639180,40:12860660061,41:13297830095,42:13426968091,43:13261840311,44:12985405300,45:13139020620,46:13329529463,47:13655113762,48:13181132949,49:13351802693,50:13170916594,51:12807360849,52:13241051973,53:13169710596,54:13472835274,55:13228075514,56:13238939350,57:13036346757,58:13207365542,59:12792050600,60:13284602048,61:1302
 

[GitHub] [hudi] nsivabalan commented on issue #6101: [SUPPORT] Hudi Delete Not working with EMR, AWS Glue & S3

2022-09-19 Thread GitBox


nsivabalan commented on issue #6101:
URL: https://github.com/apache/hudi/issues/6101#issuecomment-1251727223

   Hey, I tried to test delete_partitions for multiple partitions and could not reproduce the issue. Would you mind giving us a reproducible script? It would help us find the root cause (a skeleton like the one below could be a starting point).
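   
   A minimal skeleton, assuming Spark datasource writes; the path, table name and partition values are placeholders, and any record key or precombine options your table normally uses should be added as well:
   
   ```java
   // Sketch only: replace the path, table name and partition values with your own.
   // `spark` is assumed to be an existing SparkSession.
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   
   Dataset<Row> emptyDf = spark.emptyDataFrame();
   emptyDf.write()
       .format("hudi")
       .option("hoodie.datasource.write.operation", "delete_partition")
       .option("hoodie.datasource.write.partitions.to.delete", "2021/01/01,2021/01/02")
       .option("hoodie.table.name", "my_table")
       .mode("append")
       .save("s3://my-bucket/hudi/my_table");
   ```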
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on issue #6635: [SUPPORT] Failed to build hudi 0.12.0 with spark 3.2.2

2022-09-19 Thread GitBox


alexeykudinkin commented on issue #6635:
URL: https://github.com/apache/hudi/issues/6635#issuecomment-1251722860

   @xushiyan you can assign this one to me


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6349: [HUDI-4433] Hudi-CLI repair deduplicate not working with non-partitio…

2022-09-19 Thread GitBox


hudi-bot commented on PR #6349:
URL: https://github.com/apache/hudi/pull/6349#issuecomment-1251684276

   
   ## CI report:
   
   * f6fde8b6313b3b0e250c41858c77cc425325a6db Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11508)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #5777: [SUPPORT] Hudi table has duplicate data.

2022-09-19 Thread GitBox


nsivabalan commented on issue #5777:
URL: https://github.com/apache/hudi/issues/5777#issuecomment-1251682905

   @jiangjiguang: can you respond to the above requests? We cannot proceed without the additional logs and info requested. Whenever you get a chance, please respond.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6404: [SUPPORT] Hudi Deltastreamer CSV ingestion issue

2022-09-19 Thread GitBox


nsivabalan commented on issue #6404:
URL: https://github.com/apache/hudi/issues/6404#issuecomment-1251682385

   sure, thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #6716: [SUPPORT] Unable to archive if no non-table service actions are performed on the data table

2022-09-19 Thread GitBox


nsivabalan commented on issue #6716:
URL: https://github.com/apache/hudi/issues/6716#issuecomment-1251679200

   @yihua: Can you please take this up?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


hudi-bot commented on PR #6697:
URL: https://github.com/apache/hudi/pull/6697#issuecomment-1251678959

   
   ## CI report:
   
   * da97ef7f2f1d034d15851c3676b27f06294dc557 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11507)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #6665: [HUDI-4850] Incremental Ingestion from GCS

2022-09-19 Thread GitBox


hudi-bot commented on PR #6665:
URL: https://github.com/apache/hudi/pull/6665#issuecomment-1251678908

   
   ## CI report:
   
   * a068aefd47b77ceb65c0f7ca3857e438af2d2d2b Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11509)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6697: [HUDI-3478] Spark CDC Write

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6697:
URL: https://github.com/apache/hudi/pull/6697#discussion_r974740166


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java:
##
@@ -358,11 +357,10 @@ private Pair> 
getFilesToCleanKeepingLatestCommits(S
 deletePaths.add(new 
CleanFileInfo(hoodieDataFile.getBootstrapBaseFile().get().getPath(), true));
   }
 });
-if (hoodieTable.getMetaClient().getTableType() == 
HoodieTableType.MERGE_ON_READ) {
-  // If merge on read, then clean the log files for the commits as 
well
-  deletePaths.addAll(aSlice.getLogFiles().map(lf -> new 
CleanFileInfo(lf.getPath().toString(), false))
-  .collect(Collectors.toList()));
-}
+// since the cow tables may also write out the log files in cdc 
scenario, we need to clean the log files
+// for this commit no matter the table type is mor or cow.
+deletePaths.addAll(aSlice.getLogFiles().map(lf -> new 
CleanFileInfo(lf.getPath().toString(), false))

Review Comment:
   We should scope this: guard it by checking that CDC is enabled and only clean up CDC files (assuming we will have a separate naming scheme for these).
   
   Overly broad conditionals like this one (cleaning all log files) are a time-bomb.
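   
   A rough, self-contained sketch of that scoping; the `cdcEnabled` flag and the `-cdc.log` suffix below are assumptions standing in for whatever naming scheme CDC log files end up using, not the actual implementation:
   
   ```java
   import java.util.Collections;
   import java.util.List;
   import java.util.stream.Collectors;
   
   class CdcScopedCleaningSketch {
     private static final String CDC_LOG_SUFFIX = "-cdc.log"; // placeholder naming scheme
   
     static List<String> logFilesToClean(List<String> logFilePaths, boolean isMergeOnRead, boolean cdcEnabled) {
       if (isMergeOnRead) {
         return logFilePaths;                          // MOR: clean all log files of the slice, as today
       }
       if (cdcEnabled) {
         return logFilePaths.stream()                  // COW + CDC: clean only the CDC log files
             .filter(p -> p.endsWith(CDC_LOG_SUFFIX))
             .collect(Collectors.toList());
       }
       return Collections.emptyList();                 // plain COW: no log files to clean
     }
   }
   ```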



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieAvroPayload;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.cdc.HoodieCDCOperation;
+import org.apache.hudi.common.table.cdc.HoodieCDCSupplementalLoggingMode;
+import org.apache.hudi.common.table.cdc.HoodieCDCUtils;
+import org.apache.hudi.common.table.log.AppendResult;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.block.HoodieCDCDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieUpsertException;
+
+import java.io.Closeable;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This class encapsulates all the cdc-writing functions.
+ */
+public class HoodieCDCLogger implements Closeable {
+
+  private final String commitTime;
+
+  private final String keyField;
+
+  private final Schema dataSchema;
+
+  private final boolean populateMetaFields;
+
+  // writer for cdc data
+  private final HoodieLogFormat.Writer cdcWriter;
+
+  private final boolean cdcEnabled;
+
+  private final HoodieCDCSupplementalLoggingMode cdcSupplementalLoggingMode;
+
+  private final Schema cdcSchema;
+
+  // the cdc data
+  private final Map cdcData;
+
+  public HoodieCDCLogger(
+  String commitTime,
+  HoodieWriteConfig config,
+  HoodieTableConfig tableConfig,
+  Schema schema,
+  HoodieLogFormat.Writer cdcWriter,
+  long maxInMemorySizeInBytes) {
+try {
+  this.commitTime = commitTime;
+  this.dataSchema = HoodieAvroUtils.removeMetadataFields(schema);
+  this.populateMetaFields = config.populateMetaFields();
+  

[GitHub] [hudi] hudi-bot commented on pull request #6349: [HUDI-4433] Hudi-CLI repair deduplicate not working with non-partitio…

2022-09-19 Thread GitBox


hudi-bot commented on PR #6349:
URL: https://github.com/apache/hudi/pull/6349#issuecomment-1251630547

   
   ## CI report:
   
   * ed04e960bbc93c1d4d47c54f15f47191fef06fa3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11506)
 
   * f6fde8b6313b3b0e250c41858c77cc425325a6db Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11508)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6561: [HUDI-4760] Fixing repeated trigger of data file creations w/ clustering

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6561:
URL: https://github.com/apache/hudi/pull/6561#discussion_r974727375


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java:
##
@@ -249,7 +249,7 @@ protected HoodieWriteMetadata> 
executeClustering(HoodieC
 HoodieData statuses = updateIndex(writeStatusList, 
writeMetadata);
 
writeMetadata.setWriteStats(statuses.map(WriteStatus::getStat).collectAsList());
 
writeMetadata.setPartitionToReplaceFileIds(getPartitionToReplacedFileIds(clusteringPlan,
 writeMetadata));
-validateWriteResult(clusteringPlan, writeMetadata);
+// if we don't cache the write statuses above, validation will call 
isEmpty which might retrigger the execution again.

Review Comment:
   @nsivabalan was this addressed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r974723729


##
rfc/rfc-46/rfc-46.md:
##
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro 
conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
1. Accepting `HoodieRecord`
2. Will be engine-specific (so that they're able to handle internal record 
representation)
 3. `HoodieRealtimeRecordReader`s 
1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+* Get column in record to support RDDCustomColumnsSortPartitioner
+*/
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,
+   boolean consistentLogicalTimestampEnabled);
+
+   /**
+* Support bootstrap.
+*/
+   HoodieRecord mergeWith(HoodieRecord other, Schema targetSchema) throws 
IOException;
+
+   /**
+* Rewrite record into new schema(add meta columns)
+*/
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema 
targetSchema)
+   throws IOException;
+
+   /**
+* Support schema evolution.
+*/
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties 
props, Schema newSchema,
+   Map renameCols) throws IOException;
+
+   HoodieRecord updateValues(Schema recordSchema, Properties props,

Review Comment:
   @wzx140 we should split these up: 
   
- The only legitimate use-case for us to update fields is Hudi's metadata
- `HoodieHFileDataBlock` shouldn't be modifying the existing payload but should instead be _rewriting_ it w/o the field it wants to omit. We will tackle that separately, and for the sake of RFC-46 we can create a temporary method `truncateRecordKey` which will overwrite the record-key value for now (we will deprecate and remove this method after we address this)
   
   We should not leave a loophole where we allow a record to be modified, to make sure that nobody can start building against this API
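   
   To make the intent concrete, a rough sketch of the split; the signatures are illustrative assumptions, and only the `truncateRecordKey` name comes from the comment above:
   
   ```java
   import java.io.IOException;
   import java.util.Map;
   import java.util.Properties;
   import org.apache.avro.Schema;
   
   abstract class HoodieRecordApiSketch<T> {
     // The only sanctioned mutation: stamping Hudi meta fields onto the record.
     abstract HoodieRecordApiSketch<T> prependMetaFields(Schema recordSchema, Properties props,
         Map<String, String> metaFields) throws IOException;
   
     // Temporary escape hatch for HoodieHFileDataBlock; to be deprecated and removed once the
     // block rewrites records without the key field instead of mutating them in place.
     abstract HoodieRecordApiSketch<T> truncateRecordKey(Schema recordSchema, Properties props,
         String keyFieldName) throws IOException;
   }
   ```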



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r938178095


##
rfc/rfc-46/rfc-46.md:
##
@@ -156,13 +187,76 @@ Following major components will be refactored:
 3. `HoodieRealtimeRecordReader`s 
1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different types of records, we need to move some functionality from AvroUtils into HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+class HoodieRecord {
+
+   /**
+* Get column in record to support RDDCustomColumnsSortPartitioner
+*/
+   Object getRecordColumnValues(Schema recordSchema, String[] columns, boolean 
consistentLogicalTimestampEnabled);
+
+   /**
+* Support bootstrap.
+*/
+   HoodieRecord mergeWith(HoodieRecord other) throws IOException;
+
+   /**
+* Rewrite record into new schema(add meta columns)
+*/
+   HoodieRecord rewriteRecord(Schema recordSchema, Properties props, Schema 
targetSchema) throws IOException;
+
+   /**
+* Support schema evolution.
+*/
+   HoodieRecord rewriteRecordWithNewSchema(Schema recordSchema, Properties 
props, Schema newSchema, Map renameCols) throws IOException;
+
+   HoodieRecord addMetadataValues(Schema recordSchema, Properties props, 
Map metadataValues) throws IOException;
+
+   /**
+* Is deleted.
+*/
+   boolean isPresent(Schema recordSchema, Properties props) throws IOException;
+
+   /**
+* Is EmptyRecord. Generated by ExpressionPayload.
+*/
+   boolean shouldIgnore(Schema recordSchema, Properties props) throws 
IOException;

Review Comment:
   We should probably update the java-doc then to avoid ref to any particular 
implementation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6132: [RFC-46][HUDI-4414] Update the RFC-46 doc to fix comments feedback

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6132:
URL: https://github.com/apache/hudi/pull/6132#discussion_r974718156


##
rfc/rfc-46/rfc-46.md:
##
@@ -128,21 +173,88 @@ Following major components will be refactored:
 
 1. `HoodieWriteHandle`s will be  
1. Accepting `HoodieRecord` instead of raw Avro payload (avoiding Avro 
conversion)
-   2. Using Combining API engine to merge records (when necessary) 
+   2. Using Record Merge API to merge records (when necessary) 
3. Passes `HoodieRecord` as is to `FileWriter`
 2. `HoodieFileWriter`s will be 
1. Accepting `HoodieRecord`
2. Will be engine-specific (so that they're able to handle internal record 
representation)
 3. `HoodieRealtimeRecordReader`s 
1. API will be returning opaque `HoodieRecord` instead of raw Avro payload
 
+### Config for Record Merge
+The MERGE_CLASS_NAME config is engine-aware. If MERGE_CLASS_NAME is not specified, a default will be chosen according to your engine type.
+
+### Public Api in HoodieRecord
+Because we implement different types of records, we need to implement functionality similar to AvroUtils in HoodieRecord for the different data representations (Avro, InternalRow, RowData).
+Its public API will look like following:
+
+```java
+import java.util.Properties;
+
+class HoodieRecord {
+
+   /**
+* Get column in record to support RDDCustomColumnsSortPartitioner
+*/
+   Object getRecordColumnValues(Schema recordSchema, String[] columns,

Review Comment:
   @wzx140 understand where you're coming from.
   
   We should have already deprecated `getRecordColumnValues` as this method is 
heavily coupled to where it's used currently and unfortunately isn't generic 
enough to serve its purpose. In this particular case converting the values and 
concat-ing them as strings doesn't make sense for a generic utility -- whenever 
someone requests a list of column values they expect to get a list of values 
(as they are) as compared to receiving a string (!) of concatenated values.
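   
   A small self-contained illustration of the difference, with plain Java maps standing in for records (not the Hudi API):
   
   ```java
   import java.util.Arrays;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   class ColumnValuesSketch {
     // What callers of a generic utility expect: the requested values, as they are.
     static Object[] columnValues(Map<String, Object> record, String[] columns) {
       return Arrays.stream(columns).map(record::get).toArray();
     }
   
     // What the current helper effectively returns: one concatenated String, which
     // only makes sense for building a sort key and loses the value types.
     static String concatenatedColumnValues(Map<String, Object> record, String[] columns) {
       return Arrays.stream(columns)
           .map(c -> String.valueOf(record.get(c)))
           .collect(Collectors.joining());
     }
   }
   ```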



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4836) Remove "hbase-default.xml" colliding w/ "hbase-site.xml" in Hudi bundles

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-4836.
-
Resolution: Not A Problem

The user was able to resolve their original issue by overriding that setting in Cloudera Manager:
[https://github.com/apache/hudi/issues/6398#issuecomment-1250633915]

It seems to me that Cloudera isn't actually affecting "hbase-default.xml", but rather is writing out its own "hbase-site.xml" that overrides the one we're providing in our bundles.

> Remove "hbase-default.xml" colliding w/ "hbase-site.xml" in Hudi bundles
> 
>
> Key: HUDI-4836
> URL: https://issues.apache.org/jira/browse/HUDI-4836
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
> Fix For: 0.12.1
>
>
> This has been discovered in the process of testing 0.12:
> [https://github.com/apache/hudi/issues/6398#issuecomment-1244930086]
>  
> The original issue was meant to be addressed by 
> [https://github.com/apache/hudi/pull/6114], but it ultimately led to 
> "hbase-default.xml" (from "hbase-common") being bundled along w/ our 
> "hbase-site.xml", which is purposed to override the \{hbase.defaults.for.version.skip} 
> configuration.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] kazdy commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

2022-09-19 Thread GitBox


kazdy commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r974709033


##
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty RECONCILE_SCHEMA = ConfigProperty
   .key("hoodie.datasource.write.reconcile.schema")
-  .defaultValue(false)
+  .defaultValue(true)

Review Comment:
   @alexeykudinkin afaik the Schema Evolution config is there because it's an experimental feature that will soon become GA? Once it does, that config should be enabled by default or deprecated; will this logic hold then? I feel like the Hudi config surface is already very broad and therefore a bit hard to grasp, and users would appreciate one switch instead of a combination of two.
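   
   For reference, this is the switch being discussed as a user would set it on a write today (illustrative snippet; `df`, the table name and the path are placeholders):
   
   ```java
   // Assumes an existing Dataset<Row> `df`; only the reconcile flag is the point here.
   df.write()
     .format("hudi")
     .option("hoodie.datasource.write.reconcile.schema", "true") // the flag under discussion
     .option("hoodie.table.name", "my_table")                    // placeholder
     .mode("append")
     .save("/tmp/hudi/my_table");                                // placeholder
   ```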



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on issue #6398: [SUPPORT] Metadata table thows hbase exceptions

2022-09-19 Thread GitBox


alexeykudinkin commented on issue #6398:
URL: https://github.com/apache/hudi/issues/6398#issuecomment-1251581367

   @rbtrtr glad that you've sorted it out! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6476: [HUDI-3478] Support CDC for Spark in Hudi

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6476:
URL: https://github.com/apache/hudi/pull/6476#discussion_r974703425


##
hudi-common/src/main/java/org/apache/hudi/common/table/cdc/HoodieCDCUtils.java:
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.cdc;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.exception.HoodieException;
+
+public class HoodieCDCUtils {
+
+  /* the `op` column represents how a record is changed. */
+  public static final String CDC_OPERATION_TYPE = "op";
+
+  /* the `ts_ms` column represents when a record is changed. */
+  public static final String CDC_COMMIT_TIMESTAMP = "ts_ms";

Review Comment:
   Yeah, I realized later that it's also `ts_ms` in Debezium. Let's keep it as is for consistency then. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4875) NoSuchTableException is thrown while dropping temporary view after applied HoodieSparkSessionExtension in Spark 3.2

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4875:
--
Fix Version/s: 0.12.1

> NoSuchTableException is thrown while dropping temporary view after applied 
> HoodieSparkSessionExtension in Spark 3.2
> ---
>
> Key: HUDI-4875
> URL: https://issues.apache.org/jira/browse/HUDI-4875
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.11.1
> Environment: Spark 3.2.2
>Reporter: dohongdayi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.1
>
>
> NoSuchTableException is thrown while dropping temporary view after applied 
> HoodieSparkSessionExtension in Spark 3.2:
> {code:java}
> org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 
> 'test_view' not found in database 'default'
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:225)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableRawMetadata(SessionCatalog.scala:516)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:502)
>   at 
> org.apache.spark.sql.hudi.SparkAdapter.isHoodieTable(SparkAdapter.scala:160)
>   at 
> org.apache.spark.sql.hudi.SparkAdapter.isHoodieTable$(SparkAdapter.scala:159)
>   at 
> org.apache.spark.sql.adapter.BaseSpark3Adapter.isHoodieTable(BaseSpark3Adapter.scala:45)
>   at 
> org.apache.spark.sql.hudi.analysis.HoodiePostAnalysisRule.apply(HoodieAnalysis.scala:539)
>   at 
> org.apache.spark.sql.hudi.analysis.HoodiePostAnalysisRule.apply(HoodieAnalysis.scala:530)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:211)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
>   at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
>   at scala.collection.immutable.List.foldLeft(List.scala:91)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
>   at scala.collection.immutable.List.foreach(List.scala:431)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:222)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:218)
>   at 
> org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:167)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:218)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:182)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
>   at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:203)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:330)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:202)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:75)
>   at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
>   at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:183)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:788)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:183)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:75)
>   at 
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:73)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:629)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:788)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:620)
>   ... 51 elided {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6046: [HUDI-4363] Support Clustering row writer to improve performance

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6046:
URL: https://github.com/apache/hudi/pull/6046#discussion_r974679847


##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##
@@ -105,6 +108,52 @@ object HoodieDatasetBulkInsertHelper extends Logging {
 partitioner.repartitionRecords(trimmedDF, 
config.getBulkInsertShuffleParallelism)
   }
 
+  /**
+   * Perform bulk insert for [[Dataset]], will not change timeline/index, 
return
+   * information about write files.
+   */
+  def bulkInsert(dataset: Dataset[Row],
+ instantTime: String,
+ table: HoodieTable[_ <: HoodieRecordPayload[_ <: 
HoodieRecordPayload[_ <: AnyRef]], _, _, _],
+ writeConfig: HoodieWriteConfig,
+ partitioner: BulkInsertPartitioner[Dataset[Row]],
+ parallelism: Int,
+ shouldPreserveHoodieMetadata: Boolean): 
HoodieData[WriteStatus] = {
+val repartitionedDataset = partitioner.repartitionRecords(dataset, 
parallelism)
+val arePartitionRecordsSorted = partitioner.arePartitionRecordsSorted
+val schema = dataset.schema
+val writeStatuses = 
repartitionedDataset.queryExecution.toRdd.mapPartitions(iter => {
+  val taskContextSupplier: TaskContextSupplier = 
table.getTaskContextSupplier
+  val taskPartitionId = taskContextSupplier.getPartitionIdSupplier.get
+  val taskId = taskContextSupplier.getStageIdSupplier.get.toLong
+  val taskEpochId = taskContextSupplier.getAttemptIdSupplier.get
+  val writer = new BulkInsertDataInternalWriterHelper(
+table,
+writeConfig,
+instantTime,
+taskPartitionId,
+taskId,
+taskEpochId,
+schema,
+writeConfig.populateMetaFields,
+arePartitionRecordsSorted,
+shouldPreserveHoodieMetadata)
+
+  try {
+iter.foreach(writer.write)
+  } catch {
+case t: Throwable =>
+  writer.abort()
+  throw t
+  } finally {
+writer.close()
+  }
+
+  writer.getWriteStatuses.asScala.map(_.toWriteStatus).iterator
+}).collect()
+table.getContext.parallelize(writeStatuses.toList.asJava)

Review Comment:
   nit: no need for `toList`



##
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/SparkAdapterSupport.scala:
##
@@ -26,17 +26,6 @@ import org.apache.spark.sql.hudi.SparkAdapter
  */
 trait SparkAdapterSupport {
 
-  lazy val sparkAdapter: SparkAdapter = {

Review Comment:
   Instead of moving this to Java let's do the following:
   
   - Create a companion object `ScalaAdapterSupport` 
   - Move this conditional there
   - Keep this var (for compatibility) referencing the static one from the object



##
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java:
##
@@ -183,8 +183,17 @@ public boolean accept(Path path) {
 metaClientCache.put(baseDir.toString(), metaClient);
   }
 
-  fsView = 
FileSystemViewManager.createInMemoryFileSystemView(engineContext,
-  metaClient, 
HoodieInputFormatUtils.buildMetadataConfig(getConf()));
+  if (getConf().get("as.of.instant") != null) {

Review Comment:
   Good catch!



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDSpatialCurveSortPartitioner.java:
##
@@ -91,27 +81,4 @@ public JavaRDD> 
repartitionRecords(JavaRDD> reco
   return hoodieRecord;
 });
   }
-
-  private Dataset reorder(Dataset dataset, int numOutputGroups) {

Review Comment:
   Thanks for cleaning that up!



##
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieSparkMergeOnReadTableClustering.java:
##
@@ -60,20 +61,30 @@ class TestHoodieSparkMergeOnReadTableClustering extends 
SparkClientFunctionalTes
 
   private static Stream testClustering() {
 return Stream.of(
-Arguments.of(true, true, true),
-Arguments.of(true, true, false),
-Arguments.of(true, false, true),
-Arguments.of(true, false, false),
-Arguments.of(false, true, true),
-Arguments.of(false, true, false),
-Arguments.of(false, false, true),
-Arguments.of(false, false, false)
-);
+Arrays.asList(true, true, true),
+Arrays.asList(true, true, false),
+Arrays.asList(true, false, true),
+Arrays.asList(true, false, false),
+Arrays.asList(false, true, true),
+Arrays.asList(false, true, false),
+Arrays.asList(false, false, true),
+Arrays.asList(false, false, false))
+.flatMap(arguments -> {
+  ArrayList enableRowClusteringArgs = new ArrayList<>();
+  enableRowClusteringArgs.add(true);
+  enableRowClusteringArgs.addAll(arguments);
+  ArrayList disableRowClusteringArgs = new ArrayList<>();
+  disableRowClusteringArgs.add(false);

Review 

[GitHub] [hudi] hudi-bot commented on pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

2022-09-19 Thread GitBox


hudi-bot commented on PR #5629:
URL: https://github.com/apache/hudi/pull/5629#issuecomment-1251566082

   
   ## CI report:
   
   * d0f078159313f8b35a41b1d1e016583204811383 UNKNOWN
   * 8bd34a6bee3084bdc6029f3c0740cf06906acfd5 UNKNOWN
   * ef85e9b3eefd14a098a1bc5b277fd5989ef8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11504)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6256: [RFC-51][HUDI-3478] Update RFC: CDC support

2022-09-19 Thread GitBox


alexeykudinkin commented on code in PR #6256:
URL: https://github.com/apache/hudi/pull/6256#discussion_r971371531


##
rfc/rfc-51/rfc-51.md:
##
@@ -62,73 +63,79 @@ We follow the debezium output format: four columns as shown 
below
 - u: represent `update`; when `op` is `u`, neither `before` nor `after` is null;
 - d: represent `delete`; when `op` is `d`, `after` is always null;
 
-Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
+**Note**
 
-## Goals
+* In case of the same record having operations like insert -> delete -> 
insert, CDC data should be produced to reflect the exact behaviors.
+* The illustration above ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-1. Support row-level CDC records generation and persistence;
-2. Support both MOR and COW tables;
-3. Support all the write operations;
-4. Support Spark DataFrame/SQL/Streaming Query;
+## Design Goals
 
-## Implementation
+1. Support row-level CDC records generation and persistence
+2. Support both MOR and COW tables
+3. Support all the write operations
+4. Support incremental queries in CDC format across supported engines

Review Comment:
   Let's also explicitly call out that:
   
   - For CDC-enabled tables, the performance of non-CDC queries should not be affected



##
rfc/rfc-51/rfc-51.md:
##
@@ -64,69 +65,72 @@ We follow the debezium output format: four columns as shown 
below
 
 Note: the illustration here ignores all the Hudi metadata columns like 
`_hoodie_commit_time` in `before` and `after` columns.
 
-## Goals
+## Design Goals
 
 1. Support row-level CDC records generation and persistence;
 2. Support both MOR and COW tables;
 3. Support all the write operations;
 4. Support Spark DataFrame/SQL/Streaming Query;
 
-## Implementation
+## Configurations
 
-### CDC Architecture
+| key | default | description |
+|-|--|--|
+| hoodie.table.cdc.enabled | `false` | The master switch of the CDC features. If `true`, writers and readers will respect CDC configurations and behave accordingly. |
+| hoodie.table.cdc.supplemental.logging | `false` | If `true`, persist the required information about the changed data, including `before`. If `false`, only `op` and record keys will be persisted. |
+| hoodie.table.cdc.supplemental.logging.include_after | `false` | If `true`, persist `after` as well. |
 
-![](arch.jpg)
+To perform CDC queries, users need to set `hoodie.table.cdc.enable=true` and 
`hoodie.datasource.query.type=incremental`.
 
-Note: Table operations like `Compact`, `Clean`, `Index` do not write/change 
any data. So we don't need to consider them in CDC scenario.
- 
-### Modifiying code paths
+| key | default | description |
+|||--|
+| hoodie.table.cdc.enabled | `false` | set to `true` for CDC queries |
+| hoodie.datasource.query.type | `snapshot` | set to `incremental` for CDC queries |
+| hoodie.datasource.read.start.timestamp | - | required. |
+| hoodie.datasource.read.end.timestamp | - | optional. |
 
-![](points.jpg)
+### Logical File Types
 
-### Config Definitions
+We define 4 logical file types for the CDC scenario.

Review Comment:
   Agree w/ @xushiyan's proposal, let's simplify this -- having a table mapping an action on the data table to the action on the CDC log makes the message much clearer.
   



##
rfc/rfc-51/rfc-51.md:
##
@@ -148,20 +155,46 @@ hudi_cdc_table/
 
 Under a partition directory, the `.log` file with `CDCBlock` above will keep 
the changing data we have to materialize.
 
-There is an option to control what data is written to `CDCBlock`, that is 
`hoodie.table.cdc.supplemental.logging`. See the description of this config 
above.
+ Persisting CDC in MOR: Write-on-indexing vs Write-on-compaction
+
+2 design choices on when to persist CDC in MOR tables:
+
+Write-on-indexing allows CDC info to be persisted at the earliest, however, in 
case of Flink writer or Bucket
+indexing, `op` (I/U/D) data is not available at indexing.
+
+Write-on-compaction can always persist CDC info and achieve standardization of 
implementation 

[GitHub] [hudi] alexeykudinkin commented on issue #6354: [SUPPORT] Sparksql cow non-partition table execute 'merge into ' sqlstatment occure error after setting the tblproperties param "hoodie.dat

2022-09-19 Thread GitBox


alexeykudinkin commented on issue #6354:
URL: https://github.com/apache/hudi/issues/6354#issuecomment-1251523902

   Created https://issues.apache.org/jira/browse/HUDI-4879


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4879:
--
Description: 
As reported by the user:

[https://github.com/apache/hudi/issues/6354]

 

Currently, setting \{{hoodie.datasource.write.payload.class = 
'org.apache.hudi.common.model.DefaultHoodieRecordPayload' }}will result in the 
following exception:
{code:java}
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new record with old value in storage, for new record 
{HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, 
currentLocation='HoodieRecordLocation {instantTime=20220810095846644, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', 
newLocation='HoodieRecordLocation {instantTime=20220810101719437, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value 
{{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": 
"20220810095824514_0_0", "_hoodie_record_key": "id:1", 
"_hoodie_partition_path": "", "_hoodie_file_name": 
"60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", 
"id": 1, "name": "a0", "ts": 1000}} at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new 
record with old value in storage, for
new record {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, 
currentLocation='HoodieRecordLocation {instantTime=20220810095846644, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', 
newLocation='HoodieRecordLocation {instantTime=20220810101719437, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value 
{{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": 
"20220810095824514_0_0", "_hoodie_record_key": "id:1", 
"_hoodie_partition_path": "", "_hoodie_file_name": 

[jira] [Updated] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4879:
--
Description: 
As reported by the user:

[https://github.com/apache/hudi/issues/6354]

 

Currently, setting {{hoodie.datasource.write.payload.class = 
'org.apache.hudi.common.model.DefaultHoodieRecordPayload' }}will result in the 
following exception:
{code:java}
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0 at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:329)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.lambda$mapPartitionsAsRDD$a3ab3c4$1(BaseSparkCommitActionExecutor.java:244)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1(JavaRDDLike.scala:102)
at 
org.apache.spark.api.java.JavaRDDLike.$anonfun$mapPartitionsWithIndex$1$adapted(JavaRDDLike.scala:102)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:915)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:915)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at 
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1498)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1408)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1472)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1295)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hudi.exception.HoodieException: 
org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new record with old value in storage, for new record 
{HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, 
currentLocation='HoodieRecordLocation {instantTime=20220810095846644, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', 
newLocation='HoodieRecordLocation {instantTime=20220810101719437, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value 
{{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": 
"20220810095824514_0_0", "_hoodie_record_key": "id:1", 
"_hoodie_partition_path": "", "_hoodie_file_name": 
"60c04f95-ca5e-4f82-9558-40da29cc022e-0_0-937-24808_20220810095846644.parquet", 
"id": 1, "name": "a0", "ts": 1000}} at 
org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdateInternal(BaseSparkCommitActionExecutor.java:358)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpdate(BaseSparkCommitActionExecutor.java:349)
at 
org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.handleUpsertPartition(BaseSparkCommitActionExecutor.java:322)
... 28 more
Caused by: org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: 
org.apache.hudi.exception.HoodieUpsertException: Failed to combine/merge new 
record with old value in storage, for
new record {HoodieRecord{key=HoodieKey { recordKey=id:1 partitionPath=}, 
currentLocation='HoodieRecordLocation {instantTime=20220810095846644, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}', 
newLocation='HoodieRecordLocation {instantTime=20220810101719437, 
fileId=60c04f95-ca5e-4f82-9558-40da29cc022e-0}'}}, old value 
{{"_hoodie_commit_time": "20220810095824514", "_hoodie_commit_seqno": 
"20220810095824514_0_0", "_hoodie_record_key": "id:1", 
"_hoodie_partition_path": "", "_hoodie_file_name": 

[jira] [Created] (HUDI-4879) MERGE INTO fails when setting "hoodie.datasource.write.payload.class"

2022-09-19 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4879:
-

 Summary: MERGE INTO fails when setting 
"hoodie.datasource.write.payload.class"
 Key: HUDI-4879
 URL: https://issues.apache.org/jira/browse/HUDI-4879
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.12.1


As reported by the user:

[https://github.com/apache/hudi/issues/6354]

 

Currently, setting {{hoodie.datasource.write.payload.class = 
'org.apache.hudi.common.model.DefaultHoodieRecordPayload'}}

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-47) Revisit null checks in the Log Blocks, merge lazyreading with this null check #340

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-47.
---
Resolution: Fixed

> Revisit null checks in the Log Blocks, merge lazyreading with this null check 
> #340
> --
>
> Key: HUDI-47
> URL: https://issues.apache.org/jira/browse/HUDI-47
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: code-quality, storage-management
>Reporter: Vinoth Chandar
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: help-requested, newbie, starter
> Fix For: 0.11.0
>
>
> https://github.com/uber/hudi/issues/340



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-2788) Z-ordering Layout Optimization Strategy fails w/ Data Skipping enabled

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-2788.
-
Resolution: Fixed

> Z-ordering Layout Optimization Strategy fails w/ Data Skipping enabled
> --
>
> Key: HUDI-2788
> URL: https://issues.apache.org/jira/browse/HUDI-2788
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> During testing of Z-ordering in a test environment I've discovered the 
> following issues:
>  # Queries failing for tables w/ Clustering w/ Z-ordering Layout and 
> data-skipping enabled, being unable to read the `_SUCCESS` file (automatically 
> created by Spark)
>  # Some of the original query predicates are translated incorrectly into 
> Z-index table predicates (`!=`, `not like`, etc.); see the sketch after this list
>  # Join merging indexes across commits incorrectly always checks for null in 
> the first column (instead of the Nth column) when picking the result of the merge
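
To make the second point concrete, here is a small standalone sketch of 
min/max-based file pruning (not Hudi's actual code; {{ColumnFileStats}} and the 
helper functions are made-up names) showing why a negation predicate such as 
{{!=}} has to be translated conservatively rather than the same way as {{=}}:

{code:scala}
// Illustration only, not Hudi's implementation: per-file min/max stats and the
// pruning rules for `=` vs `!=` predicates.
final case class ColumnFileStats(fileName: String, minValue: Int, maxValue: Int)

object MinMaxPruning {
  // `c = v`: the file may contain matches only if v falls inside [min, max].
  def mayContainEquals(stats: ColumnFileStats, v: Int): Boolean =
    stats.minValue <= v && v <= stats.maxValue

  // `c != v`: the file can be skipped only when it provably contains nothing
  // but v, i.e. min == max == v. Translating `!=` the same way as `=` would
  // wrongly drop files that do contain other values.
  def mayContainNotEquals(stats: ColumnFileStats, v: Int): Boolean =
    !(stats.minValue == v && stats.maxValue == v)

  def main(args: Array[String]): Unit = {
    val files = Seq(
      ColumnFileStats("f1.parquet", minValue = 1, maxValue = 1),  // holds only the value 1
      ColumnFileStats("f2.parquet", minValue = 1, maxValue = 10)) // holds mixed values

    // For `c = 1` both files may contain matches and must be kept.
    println(files.filter(mayContainEquals(_, 1)).map(_.fileName))    // List(f1.parquet, f2.parquet)
    // For `c != 1` only f1 can be pruned; f2 must be kept.
    println(files.filter(mayContainNotEquals(_, 1)).map(_.fileName)) // List(f2.parquet)
  }
}
{code}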



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (HUDI-2814) Address issues w/ Z-order Layout Optimization

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-2814.
-
Resolution: Fixed

> Address issues w/ Z-order Layout Optimization
> -
>
> Key: HUDI-2814
> URL: https://issues.apache.org/jira/browse/HUDI-2814
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> During extensive testing the following issues have been discovered, which 
> we're planning to address in the upcoming PR:
>  * Data-skipping seq incorrectly handles cases when columns that are not 
> Z-sorted are present in the query (it simply ignores this fact, while it 
> should abandon pruning altogether[1])
>  * An exception within the file-pruning seq should not affect the overall 
> query (it should in the worst case fall back to a full scan; see the sketch 
> after this description)
>  * Merging seq prefers records from the old Z-index table, while it should 
> prefer those from the new one.
>  * After clustering columns change, Z-index should simply overwrite the index 
> (currently it actually does the opposite: it skips updating the index in 
> case the old and new tables diverge in schemas)
>  * Incorrect type conversions (for example, Decimal is converted to Double)
> Additionally we're planning to beef up the current Z-index implementation 
> test suite, making sure that all critical flows of the Z-indexing have 
> appropriate coverage.
> [1] Actually, with more advanced analysis we could still prune the search 
> space, but this requires substantial sophistication of the analysis 
> conducted, which is beyond our current focus
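
For the second bullet above, a standalone sketch (not Hudi code; 
{{candidateFiles}} and {{pruneWithIndex}} are hypothetical names) of what 
degrading to a full scan on a pruning failure could look like:

{code:scala}
// Illustration only: guard the index-based pruning so that any failure while
// evaluating the index degrades to a full scan instead of failing the query.
import scala.util.{Failure, Success, Try}

object SafePruning {
  def candidateFiles(allFiles: Seq[String],
                     pruneWithIndex: Seq[String] => Seq[String]): Seq[String] =
    Try(pruneWithIndex(allFiles)) match {
      case Success(pruned) => pruned
      case Failure(e) =>
        // Worst case: log the problem and scan every file.
        System.err.println(s"Data skipping failed, falling back to full scan: ${e.getMessage}")
        allFiles
    }

  def main(args: Array[String]): Unit = {
    val files = Seq("f1.parquet", "f2.parquet", "f3.parquet")
    // A pruner that blows up, e.g. because the index table is unreadable.
    val brokenPruner: Seq[String] => Seq[String] =
      _ => throw new IllegalStateException("index unreadable")
    println(candidateFiles(files, brokenPruner)) // prints all three files
  }
}
{code}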



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-2814) Address issues w/ Z-order Layout Optimization

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reopened HUDI-2814:
---

> Address issues w/ Z-order Layout Optimization
> -
>
> Key: HUDI-2814
> URL: https://issues.apache.org/jira/browse/HUDI-2814
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> During extensive testing the following issues have been discovered, which 
> we're planning to address in the upcoming PR:
>  * Data-skipping seq incorrectly handles cases when columns that are not 
> Z-sorted are present in the query (it simply ignores this fact, while it 
> should abandon pruning altogether[1])
>  * An exception within the file-pruning seq should not affect the overall 
> query (it should in the worst case fall back to a full scan)
>  * Merging seq prefers records from the old Z-index table, while it should 
> prefer those from the new one.
>  * After clustering columns change, Z-index should simply overwrite the index 
> (currently it actually does the opposite: it skips updating the index in 
> case the old and new tables diverge in schemas)
>  * Incorrect type conversions (for example, Decimal is converted to Double)
> Additionally we're planning to beef up the current Z-index implementation 
> test suite, making sure that all critical flows of the Z-indexing have 
> appropriate coverage.
> [1] Actually, with more advanced analysis we could still prune the search 
> space, but this requires substantial sophistication of the analysis 
> conducted, which is beyond our current focus



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HUDI-2788) Z-ordering Layout Optimization Strategy fails w/ Data Skipping enabled

2022-09-19 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reopened HUDI-2788:
---

> Z-ordering Layout Optimization Strategy fails w/ Data Skipping enabled
> --
>
> Key: HUDI-2788
> URL: https://issues.apache.org/jira/browse/HUDI-2788
> Project: Apache Hudi
>  Issue Type: Task
>  Components: index
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> During testing of Z-ordering in a test environment I've discovered the 
> following issues:
>  # Queries failing for tables w/ Clustering w/ Z-ordering Layout and 
> data-skipping enabled, being unable to read the `_SUCCESS` file (automatically 
> created by Spark)
>  # Some of the original query predicates are translated incorrectly into 
> Z-index table predicates (`!=`, `not like`, etc.)
>  # Join merging indexes across commits incorrectly always checks for null in 
> the first column (instead of the Nth column) when picking the result of the merge



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

