[GitHub] [hudi] Trevor-zhang commented on a change in pull request #2449: [HUDI-1528] hudi-sync-tools supports synchronization to remote hive

2021-01-20 Thread GitBox


Trevor-zhang commented on a change in pull request #2449:
URL: https://github.com/apache/hudi/pull/2449#discussion_r561662407



##
File path: 
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   (1) When synchronizing hudi data to a local hive, `hiveSyncConfig.hiveMetaStoreUri` can be left as null.
   (2) There is no priority distinction between `hiveSyncConfig` and `props`: `hiveSyncConfig` has no props attribute, so only this single attribute is populated from `props`.
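
   A minimal sketch of the behavior described in (1) and (2), using placeholder property keys rather than the actual DataSourceWriteOptions constants: the metastore URI is just one more attribute read from props, and it stays null when only a local HiveServer2 JDBC sync is wanted.

   ```java
   import java.util.Properties;

   // Illustrative sketch only; property keys and defaults are placeholders,
   // not the real DataSourceWriteOptions constants from the PR.
   public class HiveSyncConfigSketch {
     public static void main(String[] args) {
       Properties props = new Properties();
       // Uncomment to simulate syncing to a remote hive metastore:
       // props.setProperty("hive_sync.metastore.uris", "thrift://remote-host:9083");

       String jdbcUrl = props.getProperty("hive_sync.jdbcurl", "jdbc:hive2://localhost:10000");
       String hiveMetaStoreUri = props.getProperty("hive_sync.metastore.uris", null); // null => local hive

       System.out.println("jdbcUrl=" + jdbcUrl);
       System.out.println("hiveMetaStoreUri=" + hiveMetaStoreUri);
     }
   }
   ```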





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vrtrepp commented on issue #2461: All records are present in athena query result on glue crawled Hudi tables

2021-01-20 Thread GitBox


vrtrepp commented on issue #2461:
URL: https://github.com/apache/hudi/issues/2461#issuecomment-76438


   Hi Rubenssoto,
   That is how we are planning it, but it will involve writing a few more steps in 
the pipeline. However, our current architecture is based on running Glue crawlers, 
and removing the Glue crawlers would mean changing many pipelines again, at least 
a month's task.
   
   What I was curious about: is there any support for this that Hudi is going to 
add in the future?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] 01/02: [MINOR] Disable flaky tests

2021-01-20 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch release-0.7.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit bead4331ba17c67b31425722f9bc22d3c303979d
Author: Vinoth Chandar 
AuthorDate: Wed Jan 20 18:58:27 2021 -0800

[MINOR] Disable flaky tests
---
 .../java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java | 6 +-
 .../apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java | 2 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
index 027b2b8..8383255 100644
--- 
a/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
+++ 
b/hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java
@@ -407,10 +407,14 @@ public class TestHoodieBackedMetadata extends HoodieClientTestHarness {
   /**
    * Test sync of table operations.
    */
+  /*
   @ParameterizedTest
   @EnumSource(HoodieTableType.class)
   public void testSync(HoodieTableType tableType) throws Exception {
-    init(tableType);
+  */
+  @Test
+  public void testSync() throws Exception {
+    init(HoodieTableType.COPY_ON_WRITE);
     HoodieSparkEngineContext engineContext = new HoodieSparkEngineContext(jsc);
 
 String newCommitTime;
diff --git 
a/hudi-flink/src/test/java/org/apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java
 
b/hudi-flink/src/test/java/org/apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java
index af5b755..98066e9 100644
--- 
a/hudi-flink/src/test/java/org/apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java
+++ 
b/hudi-flink/src/test/java/org/apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java
@@ -32,6 +32,7 @@ import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
 import org.junit.jupiter.api.AfterEach;
 import org.junit.jupiter.api.Assertions;
 import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Disabled;
 import org.junit.jupiter.api.Test;
 
 import java.util.List;
@@ -57,6 +58,7 @@ public class TestJsonStringToHoodieRecordMapFunction extends HoodieFlinkClientTe
   }
 
   @Test
+  @Disabled
   public void testMapFunction() throws Exception {
 final String newCommitTime = "001";
 final int numRecords = 10;



[hudi] 02/02: [MINOR] Make a separate travis CI job for hudi-utilities

2021-01-20 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch release-0.7.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ab4319ddbc07dacf6f3e34c4e08d1156391ae63a
Author: Vinoth Chandar 
AuthorDate: Wed Jan 20 20:07:26 2021 -0800

[MINOR] Make a separate travis CI job for hudi-utilities
---
 .travis.yml | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index d36c0cb..532099c 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -20,10 +20,12 @@ jdk:
   - openjdk8
 jobs:
   include:
-- name: "Unit tests except hudi-spark-client"
-  env: MODE=unit MODULES='!hudi-client/hudi-spark-client' HUDI_QUIETER_LOGGING=1
 - name: "Unit tests for hudi-spark-client"
   env: MODE=unit MODULES=hudi-client/hudi-spark-client HUDI_QUIETER_LOGGING=1
+- name: "Unit tests for hudi-utilities"
+  env: MODE=unit MODULES=hudi-utilities HUDI_QUIETER_LOGGING=1
+- name: "All other unit tests"
+  env: MODE=unit MODULES='!hudi-utilities,!hudi-client/hudi-spark-client' HUDI_QUIETER_LOGGING=1
 - name: "Functional tests"
   env: MODE=functional HUDI_QUIETER_LOGGING=1
 - name: "Integration tests"



[hudi] branch release-0.7.0 updated (2c69f69 -> ab4319d)

2021-01-20 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch release-0.7.0
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 2c69f69  Create release branch for version 0.7.0.
 new bead433  [MINOR] Disable flaky tests
 new ab4319d  [MINOR] Make a separate travis CI job for hudi-utilities

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .travis.yml | 6 --
 .../java/org/apache/hudi/metadata/TestHoodieBackedMetadata.java | 6 +-
 .../apache/hudi/source/TestJsonStringToHoodieRecordMapFunction.java | 2 ++
 3 files changed, 11 insertions(+), 3 deletions(-)



svn commit: r45525 - /release/hudi/KEYS

2021-01-20 Thread vinoth
Author: vinoth
Date: Thu Jan 21 06:59:35 2021
New Revision: 45525

Log:
Adding Vinoth Chandar's key to hudi release keys

Modified:
release/hudi/KEYS

Modified: release/hudi/KEYS
==
--- release/hudi/KEYS (original)
+++ release/hudi/KEYS Thu Jan 21 06:59:35 2021
@@ -519,3 +519,83 @@ zRv8WUBPVlDHlvNhmBR+gElZQ8GAOp1kgZcMas03
 FSsYB4CPrA==
 =dm1b
 -----END PGP PUBLIC KEY BLOCK-----
+pub   rsa3072 2019-08-02 [SC]
+  0508FFAE2B15EEAC9077C7870CF177E7BD9D3924
+uid   [ unknown] Vinoth Chandar 
+sig 30CF177E7BD9D3924 2019-08-02  Vinoth Chandar 
+sub   rsa3072 2019-08-02 [E]
+sig  0CF177E7BD9D3924 2019-08-02  Vinoth Chandar 
+
+pub   rsa2048 2021-01-21 [SC]
+  7F2A3BEB922181B06ACB1AA45F7D09E581D2BCB6
+uid   [ultimate] Vinoth Chandar 
+sig 35F7D09E581D2BCB6 2021-01-21  Vinoth Chandar 
+sub   rsa2048 2021-01-21 [E]
+sig  5F7D09E581D2BCB6 2021-01-21  Vinoth Chandar 
+
+-----BEGIN PGP PUBLIC KEY BLOCK-----
+
+mQGNBF1EGRABDACuH0C46g/K5v4CgtdXnPyPPJPLCt8jqqdPSXN9AjU4/mIR62MP
+6Hvnh261Vw9AcYcOGLQrIsY7lL+vYRjyjjW0gsVIo6taap2k1OlCQBwA1Y81WQYw
+YPXoobA2uyU80xAupcWXBJoD+Uv6jMpfCT3dskbURjjvQ+lGMljdq+GVp5fe58OG
+FFIrP/6H+yugnpZdL99dLz4z3uarUHCEEsTRugZtvY9rGP8Vm1GPwCELgXgeB81d
+nVSLSUEeRQBwzRqyG02479sVq42W4fMqk8rytn8+F3ocONWNyPK+pR37oubPJWqp
+3ZcRqx5gqZE5qpBPkRGnywuivgtlUIpgBh0xoONzsmHQmR4QlQgf1ZscVD+93thD
+TSD5ynaCndXgr1ucMDzOJd0G62bv/UTgF1mGggycVfaoDJmsTHUoGNcVmeavuz/W
+IO0bnjznXPZXliWtWGlwnVv4nZ3hYj/fe5+6iF1ZLlU25WJPVpA9R1xZyrclO009
+TLkgiIpLV1RVGxsAEQEAAbQiVmlub3RoIENoYW5kYXIgPHZpbm90aEBhcGFjaGUu
+b3JnPokBzgQTAQoAOBYhBAUI/64rFe6skHfHhwzxd+e9nTkkBQJdRBkQAhsDBQsJ
+CAcCBhUKCQgLAgQWAgMBAh4BAheAAAoJEAzxd+e9nTkkVpIL/A/p1k4mnf0oZ/q3
+kRpiRciez9g8Od7xuntFGsUO5rlBLwqqNQEhI8qb2YdW19TBDO7k5uOzsFJIvIbT
+NtyqsxNnZ9Wxt0U2j2z8NXAgfE2X+dXWvWb6MZ8WAqaEcEDt6M5CP6+IUIcysoFn
+Vxxzug5fQrNxrDpWwzPD7ff03urS1eBP0oIVpmesW1WfEQqXBQFxg5CWoejUQH0U
+lzIJmGVBDVojlZhKk2r4uHjVYr3cktwSotSB89dqlT1eimYKmGliQCEFP3dQApGR
+A7iUdBAiZs5gwWR0dCpLzlMCmw4w5EQ9wkHHk4E5Ug6j8xQq0DYRJ3aSDr+GqgPb
+U3rnlJrA1X4t6gLKGb6GzyLO6NuSkjzGrkDrDUGI8w7jn8HbSQ2rfQFRPfiyx94h
+V0icvManUpgAnHtXSltTBPd9zE2DCPOcTaGbh5jT2e4gpDDUxPg1uFg8/0cOoOQR
+7oaL9i8kgs7S5ax099enz1xzxojI9IhJLdCh1wsvOifNXEYq8bkBjQRdRBkQAQwA
+yRuHEg3VUgRt4EXgfOFdP0EOEmr9CELtNfj+GI6J2tA6Cx1I5FJD1SG7KUMELJ1C
+eoY2wFTl2ckrvhHcboGCYBXGxdl3XJ3ycKa/oWIY2VUD1SuncfMXc8hxudQo6XO/
+9GFaJuljINBQmm5LD69ZYNjblw4uc6kp5thvQ0sOpBtJoym93mzP6KQytEpveWRi
+ZyF6csFJZ4VZVxlNGRxhZVdDDQuzBMVeeQ2M4XkwHs1zvCgmYcEGUnrN9/a7gxKx
+cXatwTY86o9uncrQZdpds9gACMwBiTmWmUV40W4CNTozAGBwgNDMknV50Kdonh2/
+wAiPaCH0iAXTTE0KA2mF6pZiKeI1AjyyHxa92JlhUkf7guGQCaYdD9NlyQpbv8Q6
+soipXOF9PrDiuy+K/bB3wBzXOebEJRGocuCQF/2R+WnIZfYbDdvgA9iJyhgP3JKX
+k2nMqfJxGc3FgdYdTvREYByC4wE7la/bcmeHvS+qT9eMlKGkRUOBsez+oAsY2WXn
+ABEBAAGJAbYEGAEKACAWIQQFCP+uKxXurJB3x4cM8XfnvZ05JAUCXUQZEAIbDAAK
+CRAM8XfnvZ05JAiODACTPTRtgVfPGxffgXSyVxK2LyHS66use6bC+JII6BPlhTwk
+cWDe+4Y7C8bxN3O5cHGdHmHYX7EX6PeymbX/62hRpBwgi/otBxDtAREXN1XSNAas
+SIIkJUsHarFFWrizGWZxyXDvh2DU/B/y9B0Dwhsoik1zxUXoylbeLPMldOo5hwli
+rvvUiNWviDTnKOkAn2iNNvpOYzIzisnxPdqx8DMWBUvj36055Px4YxKUKlaWd2ID
+JBF75Nu3cpt0FQ4Vq2nSRn1Pj6ah3lU98Wo0ROVljiZGmMIswoVLrDKkwFZJVMWJ
+CZClOdL+gL3i5lio0xi+AtJDH4nmbWDGrKcXtFKoXzA0UFTD208CtEcNYaKleuPm
+B4upCA6ESTgdzZB6hB/lmyQ4fKO+viBQfXflVeKFWAoj1lJ7ZFJBET/pgFF5mufk
+8znDuJnqpNAZsfaanurMVUE8HKut2VYlCR0B6KugYaRHrMoEp/Zgn3OybCf6tUTt
+t56Yr5dP3Bxd6Er6h6iZAQ0EYAklPwEIAMhEijSHBoRX/H6qDvyKE1+lMz5k0E94
+PljKplWPcB+QyTO0tmN90aihZNSoh9BJIS+irBfrL7NHWAeyyrlPgwZZlXm7AMgP
++P9P9dYsZA08xG2d9YYbnI3Ti9vD0HuF86ofwCn8O3TRKuuYsMkBIuUqDrcbZfMm
+5VLQjy8r2n/pNWMavbeWqO7KAWZQvf6eirkZo+Dpg+1Ya47xunQE2Em+aBXMAmbl
+F8PjNeFLQhUREa092DW0XortuS0J7W8PDLtMbN8trbiOp3NRB7+A33Jg0Okwz5Dk
+vDBJaykunmDFzrwVMPOVjnVtVS2vCtm+lRFN1DBJJtmi5xtVQgRiwisAEQEAAbQi
+Vmlub3RoIENoYW5kYXIgPHZpbm90aEBhcGFjaGUub3JnPokBTgQTAQgAOBYhBH8q
+O+uSIYGwassapF99CeWB0ry2BQJgCSU/AhsDBQsJCAcCBhUKCQgLAgQWAgMBAh4B
+AheAAAoJEF99CeWB0ry2z7YH/iz0j+8CcQ3zlhgF29pbvpT6VaqKqzx6Yte84eNP
+POyZZAfjb0A282onzyHFVCaWhxIA6O0VMpbicg8EXmi0srBr/7ezSDnZeYJQjPJD
+0M7RXGB9Mzq2Bw9FmOuqGYZDxaw/ks13a/rc7ST4hWmVW/6z1sekDzr7jYdrbszd
+LVJIiDr1bk4/J952DnyrFH+WrnzJ9uxi569SzV/j8MyXJRHL4WGzVpmB0X2jR1nD
+vhuoTSEuTjg/H823gZM5vOtlc/HMsa16KczqwmCLMggYCY5FozE4Z/FBmDqVGK4g
+WPgn4Talzv8ZYgAdFNXc9bYkySHWivhD0D/nDzeFwdKEcq25AQ0EYAklPwEIAO/N
++urtXh3wbCwRpxCUFFmooO9cloC0f1DFGuP6YVADO64wBXPt21BIwOaBSVEtpUNr
+DhBlzGUbQ4BAbC0ctcZ9tc51c0YjZjGPMkUaFaSNEKJpkw8MQglxCQkpDEJV8Z0Z
+5/LLo0GmrDY4T01Wh71ysoscQyeUM6wgIgVQ5hUcjZBlAra3aIdp3F9vp9KIbBD0
+xGejetBgqVfs4wUaMnFB/RR2lOlxjjMN4Qmk5FZLtwXU5lcEGYJGKpYVusFZ/bJq
+nbWEGsCLQE4AL9YOUSbXPOn42wCK7owk31+HO7ilSSYfI0DQsQrjhlzWZTKqKkJb
+m8BHAPZ+fiYYxFUn7QMAEQEAAYkBNgQYAQgAIBYhBH8qO+uSIYGwassapF99CeWB
+0ry2BQJgCSU/AhsMAAoJEF99CeWB0ry2OUEH/2h4ypr72h6WJHPzd58lIxizLP0c
+ECppJfvmGRuNapsZ+KCXiY2wjnM9/EopD5Nsr3E7YL5pQ5KG/vh+mFkipiES4y5X
+3LcL79seFIOoi3yrX2Kd+eRNV0GwbcGQhHf390mwOr3+mYPa/z8elvLA/vx3Jf5t

svn commit: r45524 - /dev/hudi/KEYS

2021-01-20 Thread vinoth
Author: vinoth
Date: Thu Jan 21 06:59:10 2021
New Revision: 45524

Log:
Updating Vinoth Chandar's key to hudi keys 

Modified:
dev/hudi/KEYS

Modified: dev/hudi/KEYS
==
--- dev/hudi/KEYS (original)
+++ dev/hudi/KEYS Thu Jan 21 06:59:10 2021
@@ -521,3 +521,83 @@ zRv8WUBPVlDHlvNhmBR+gElZQ8GAOp1kgZcMas03
 FSsYB4CPrA==
 =dm1b
 -----END PGP PUBLIC KEY BLOCK-----
+pub   rsa3072 2019-08-02 [SC]
+  0508FFAE2B15EEAC9077C7870CF177E7BD9D3924
+uid   [ unknown] Vinoth Chandar 
+sig 30CF177E7BD9D3924 2019-08-02  Vinoth Chandar 
+sub   rsa3072 2019-08-02 [E]
+sig  0CF177E7BD9D3924 2019-08-02  Vinoth Chandar 
+
+pub   rsa2048 2021-01-21 [SC]
+  7F2A3BEB922181B06ACB1AA45F7D09E581D2BCB6
+uid   [ultimate] Vinoth Chandar 
+sig 35F7D09E581D2BCB6 2021-01-21  Vinoth Chandar 
+sub   rsa2048 2021-01-21 [E]
+sig  5F7D09E581D2BCB6 2021-01-21  Vinoth Chandar 
+
+-----BEGIN PGP PUBLIC KEY BLOCK-----
+
+mQGNBF1EGRABDACuH0C46g/K5v4CgtdXnPyPPJPLCt8jqqdPSXN9AjU4/mIR62MP
+6Hvnh261Vw9AcYcOGLQrIsY7lL+vYRjyjjW0gsVIo6taap2k1OlCQBwA1Y81WQYw
+YPXoobA2uyU80xAupcWXBJoD+Uv6jMpfCT3dskbURjjvQ+lGMljdq+GVp5fe58OG
+FFIrP/6H+yugnpZdL99dLz4z3uarUHCEEsTRugZtvY9rGP8Vm1GPwCELgXgeB81d
+nVSLSUEeRQBwzRqyG02479sVq42W4fMqk8rytn8+F3ocONWNyPK+pR37oubPJWqp
+3ZcRqx5gqZE5qpBPkRGnywuivgtlUIpgBh0xoONzsmHQmR4QlQgf1ZscVD+93thD
+TSD5ynaCndXgr1ucMDzOJd0G62bv/UTgF1mGggycVfaoDJmsTHUoGNcVmeavuz/W
+IO0bnjznXPZXliWtWGlwnVv4nZ3hYj/fe5+6iF1ZLlU25WJPVpA9R1xZyrclO009
+TLkgiIpLV1RVGxsAEQEAAbQiVmlub3RoIENoYW5kYXIgPHZpbm90aEBhcGFjaGUu
+b3JnPokBzgQTAQoAOBYhBAUI/64rFe6skHfHhwzxd+e9nTkkBQJdRBkQAhsDBQsJ
+CAcCBhUKCQgLAgQWAgMBAh4BAheAAAoJEAzxd+e9nTkkVpIL/A/p1k4mnf0oZ/q3
+kRpiRciez9g8Od7xuntFGsUO5rlBLwqqNQEhI8qb2YdW19TBDO7k5uOzsFJIvIbT
+NtyqsxNnZ9Wxt0U2j2z8NXAgfE2X+dXWvWb6MZ8WAqaEcEDt6M5CP6+IUIcysoFn
+Vxxzug5fQrNxrDpWwzPD7ff03urS1eBP0oIVpmesW1WfEQqXBQFxg5CWoejUQH0U
+lzIJmGVBDVojlZhKk2r4uHjVYr3cktwSotSB89dqlT1eimYKmGliQCEFP3dQApGR
+A7iUdBAiZs5gwWR0dCpLzlMCmw4w5EQ9wkHHk4E5Ug6j8xQq0DYRJ3aSDr+GqgPb
+U3rnlJrA1X4t6gLKGb6GzyLO6NuSkjzGrkDrDUGI8w7jn8HbSQ2rfQFRPfiyx94h
+V0icvManUpgAnHtXSltTBPd9zE2DCPOcTaGbh5jT2e4gpDDUxPg1uFg8/0cOoOQR
+7oaL9i8kgs7S5ax099enz1xzxojI9IhJLdCh1wsvOifNXEYq8bkBjQRdRBkQAQwA
+yRuHEg3VUgRt4EXgfOFdP0EOEmr9CELtNfj+GI6J2tA6Cx1I5FJD1SG7KUMELJ1C
+eoY2wFTl2ckrvhHcboGCYBXGxdl3XJ3ycKa/oWIY2VUD1SuncfMXc8hxudQo6XO/
+9GFaJuljINBQmm5LD69ZYNjblw4uc6kp5thvQ0sOpBtJoym93mzP6KQytEpveWRi
+ZyF6csFJZ4VZVxlNGRxhZVdDDQuzBMVeeQ2M4XkwHs1zvCgmYcEGUnrN9/a7gxKx
+cXatwTY86o9uncrQZdpds9gACMwBiTmWmUV40W4CNTozAGBwgNDMknV50Kdonh2/
+wAiPaCH0iAXTTE0KA2mF6pZiKeI1AjyyHxa92JlhUkf7guGQCaYdD9NlyQpbv8Q6
+soipXOF9PrDiuy+K/bB3wBzXOebEJRGocuCQF/2R+WnIZfYbDdvgA9iJyhgP3JKX
+k2nMqfJxGc3FgdYdTvREYByC4wE7la/bcmeHvS+qT9eMlKGkRUOBsez+oAsY2WXn
+ABEBAAGJAbYEGAEKACAWIQQFCP+uKxXurJB3x4cM8XfnvZ05JAUCXUQZEAIbDAAK
+CRAM8XfnvZ05JAiODACTPTRtgVfPGxffgXSyVxK2LyHS66use6bC+JII6BPlhTwk
+cWDe+4Y7C8bxN3O5cHGdHmHYX7EX6PeymbX/62hRpBwgi/otBxDtAREXN1XSNAas
+SIIkJUsHarFFWrizGWZxyXDvh2DU/B/y9B0Dwhsoik1zxUXoylbeLPMldOo5hwli
+rvvUiNWviDTnKOkAn2iNNvpOYzIzisnxPdqx8DMWBUvj36055Px4YxKUKlaWd2ID
+JBF75Nu3cpt0FQ4Vq2nSRn1Pj6ah3lU98Wo0ROVljiZGmMIswoVLrDKkwFZJVMWJ
+CZClOdL+gL3i5lio0xi+AtJDH4nmbWDGrKcXtFKoXzA0UFTD208CtEcNYaKleuPm
+B4upCA6ESTgdzZB6hB/lmyQ4fKO+viBQfXflVeKFWAoj1lJ7ZFJBET/pgFF5mufk
+8znDuJnqpNAZsfaanurMVUE8HKut2VYlCR0B6KugYaRHrMoEp/Zgn3OybCf6tUTt
+t56Yr5dP3Bxd6Er6h6iZAQ0EYAklPwEIAMhEijSHBoRX/H6qDvyKE1+lMz5k0E94
+PljKplWPcB+QyTO0tmN90aihZNSoh9BJIS+irBfrL7NHWAeyyrlPgwZZlXm7AMgP
++P9P9dYsZA08xG2d9YYbnI3Ti9vD0HuF86ofwCn8O3TRKuuYsMkBIuUqDrcbZfMm
+5VLQjy8r2n/pNWMavbeWqO7KAWZQvf6eirkZo+Dpg+1Ya47xunQE2Em+aBXMAmbl
+F8PjNeFLQhUREa092DW0XortuS0J7W8PDLtMbN8trbiOp3NRB7+A33Jg0Okwz5Dk
+vDBJaykunmDFzrwVMPOVjnVtVS2vCtm+lRFN1DBJJtmi5xtVQgRiwisAEQEAAbQi
+Vmlub3RoIENoYW5kYXIgPHZpbm90aEBhcGFjaGUub3JnPokBTgQTAQgAOBYhBH8q
+O+uSIYGwassapF99CeWB0ry2BQJgCSU/AhsDBQsJCAcCBhUKCQgLAgQWAgMBAh4B
+AheAAAoJEF99CeWB0ry2z7YH/iz0j+8CcQ3zlhgF29pbvpT6VaqKqzx6Yte84eNP
+POyZZAfjb0A282onzyHFVCaWhxIA6O0VMpbicg8EXmi0srBr/7ezSDnZeYJQjPJD
+0M7RXGB9Mzq2Bw9FmOuqGYZDxaw/ks13a/rc7ST4hWmVW/6z1sekDzr7jYdrbszd
+LVJIiDr1bk4/J952DnyrFH+WrnzJ9uxi569SzV/j8MyXJRHL4WGzVpmB0X2jR1nD
+vhuoTSEuTjg/H823gZM5vOtlc/HMsa16KczqwmCLMggYCY5FozE4Z/FBmDqVGK4g
+WPgn4Talzv8ZYgAdFNXc9bYkySHWivhD0D/nDzeFwdKEcq25AQ0EYAklPwEIAO/N
++urtXh3wbCwRpxCUFFmooO9cloC0f1DFGuP6YVADO64wBXPt21BIwOaBSVEtpUNr
+DhBlzGUbQ4BAbC0ctcZ9tc51c0YjZjGPMkUaFaSNEKJpkw8MQglxCQkpDEJV8Z0Z
+5/LLo0GmrDY4T01Wh71ysoscQyeUM6wgIgVQ5hUcjZBlAra3aIdp3F9vp9KIbBD0
+xGejetBgqVfs4wUaMnFB/RR2lOlxjjMN4Qmk5FZLtwXU5lcEGYJGKpYVusFZ/bJq
+nbWEGsCLQE4AL9YOUSbXPOn42wCK7owk31+HO7ilSSYfI0DQsQrjhlzWZTKqKkJb
+m8BHAPZ+fiYYxFUn7QMAEQEAAYkBNgQYAQgAIBYhBH8qO+uSIYGwassapF99CeWB
+0ry2BQJgCSU/AhsMAAoJEF99CeWB0ry2OUEH/2h4ypr72h6WJHPzd58lIxizLP0c
+ECppJfvmGRuNapsZ+KCXiY2wjnM9/EopD5Nsr3E7YL5pQ5KG/vh+mFkipiES4y5X
+3LcL79seFIOoi3yrX2Kd+eRNV0GwbcGQhHf390mwOr3+mYPa/z8elvLA/vx3Jf5t
+1tBwisUXe9/3aC4i1XmEokSXfD+b3mGuYrAinAlxNq7HpB94FbFuQUzUioPy608g

[GitHub] [hudi] jessica0530 commented on issue #143: Tracking ticket for folks to be added to slack group

2021-01-20 Thread GitBox


jessica0530 commented on issue #143:
URL: https://github.com/apache/hudi/issues/143#issuecomment-764415027


   Please add me: wjxdtc10...@gmail.com. Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1424) Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1424:
-
Status: Open  (was: New)

> Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true
> -
>
> Key: HUDI-1424
> URL: https://issues.apache.org/jira/browse/HUDI-1424
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-11-30-21-17-29-247.png
>
>
> I write data to hudi in UPSERT mode; when ENABLE_ROW_WRITER_OPT_KEY=true is set, 
> the write type changes to BULK_INSERT. This can be very confusing to users.
> This is caused by the if-test condition in HoodieSparkSqlWriter#write, as follows:
> !image-2020-11-30-21-17-29-247.png|width=1343,height=242!
> The condition does not check the write type, which makes every write type fall 
> into the bulk-insert path.
>  
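A hedged illustration of the missing guard described above; the real check lives in the Scala HoodieSparkSqlWriter#write and may differ in detail, and the option key names below are assumptions for the example.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the condition discussed in this issue, not the actual Scala code.
public class RowWriterGuardSketch {
  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("hoodie.datasource.write.operation", "upsert");
    params.put("hoodie.datasource.write.row.writer.enable", "true");

    boolean rowWriterEnabled =
        Boolean.parseBoolean(params.getOrDefault("hoodie.datasource.write.row.writer.enable", "false"));
    String operation = params.getOrDefault("hoodie.datasource.write.operation", "upsert");

    // Buggy shape: any operation is routed to the bulk-insert row-writer path.
    boolean bulkInsertRowPathBuggy = rowWriterEnabled;

    // Fixed shape: the row-writer path only applies when the operation really is bulk_insert.
    boolean bulkInsertRowPathFixed = rowWriterEnabled && "bulk_insert".equals(operation);

    System.out.println("buggy=" + bulkInsertRowPathBuggy + ", fixed=" + bulkInsertRowPathFixed);
  }
}
{code}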



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1424) Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1424:
-
Fix Version/s: (was: 0.8.0)
   0.7.0

> Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true
> -
>
> Key: HUDI-1424
> URL: https://issues.apache.org/jira/browse/HUDI-1424
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-11-30-21-17-29-247.png
>
>
> I write data to hudi in UPSERT mode; when ENABLE_ROW_WRITER_OPT_KEY=true is set, 
> the write type changes to BULK_INSERT. This can be very confusing to users.
> This is caused by the if-test condition in HoodieSparkSqlWriter#write, as follows:
> !image-2020-11-30-21-17-29-247.png|width=1343,height=242!
> The condition does not check the write type, which makes every write type fall 
> into the bulk-insert path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1424) Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1424.

Resolution: Fixed

> Write Type changed to BULK_INSERT when set ENABLE_ROW_WRITER_OPT_KEY=true
> -
>
> Key: HUDI-1424
> URL: https://issues.apache.org/jira/browse/HUDI-1424
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-11-30-21-17-29-247.png
>
>
> I write data to hudi in UPSERT mode; when ENABLE_ROW_WRITER_OPT_KEY=true is set, 
> the write type changes to BULK_INSERT. This can be very confusing to users.
> This is caused by the if-test condition in HoodieSparkSqlWriter#write, as follows:
> !image-2020-11-30-21-17-29-247.png|width=1343,height=242!
> The condition does not check the write type, which makes every write type fall 
> into the bulk-insert path.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1427) Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1427:
-
Fix Version/s: (was: 0.8.0)
   0.7.0

> Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true
> ---
>
> Key: HUDI-1427
> URL: https://issues.apache.org/jira/browse/HUDI-1427
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> I have tested writing to hudi with both COW & MOR write types. When 
> HOODIE_AUTO_COMMIT_PROP is set to true, a FileAlreadyExistsException is thrown:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commitException in thread 
> "main" org.apache.hudi.exception.HoodieIOException: Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commit at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:474)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:350)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:325)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:144)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:181)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:101)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:413)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:210) 
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125) at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> {code}
> It seems that _../.hoodie/20201202104150.commit_ was committed twice, resulting 
> in this exception.
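A small standalone sketch of the double-commit failure mode the reporter suspects: like Hudi's createImmutableFileInPath, creating the same <instant>.commit file a second time fails. This only mirrors the symptom; the actual Hudi code paths (auto-commit finishing the instant, then the datasource writer committing again) may differ.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: mimics why a second commit of the same instant fails.
public class DoubleCommitSketch {
  public static void main(String[] args) throws IOException {
    Path commitFile = Files.createTempDirectory("hoodie-sketch").resolve("20201202104150.commit");

    Files.createFile(commitFile);   // first commit of the instant: succeeds
    try {
      Files.createFile(commitFile); // second commit of the same instant
    } catch (java.nio.file.FileAlreadyExistsException e) {
      System.out.println("second commit failed as expected: " + e.getMessage());
    }
  }
}
{code}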



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1427) Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1427.

Resolution: Fixed

> Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true
> ---
>
> Key: HUDI-1427
> URL: https://issues.apache.org/jira/browse/HUDI-1427
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> I have tested writing to hudi with both COW & MOR write types. When 
> HOODIE_AUTO_COMMIT_PROP is set to true, a FileAlreadyExistsException is thrown:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commitException in thread 
> "main" org.apache.hudi.exception.HoodieIOException: Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commit at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:474)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:350)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:325)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:144)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:181)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:101)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:413)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:210) 
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125) at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> {code}
> It seems that _../.hoodie/20201202104150.commit_ was committed twice, resulting 
> in this exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1427) Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1427:
-
Status: Open  (was: New)

> Throw a FileAlreadyExistsException when set HOODIE_AUTO_COMMIT_PROP to true
> ---
>
> Key: HUDI-1427
> URL: https://issues.apache.org/jira/browse/HUDI-1427
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> I have tested writing to hudi with both COW & MOR write types. When 
> HOODIE_AUTO_COMMIT_PROP is set to true, a FileAlreadyExistsException is thrown:
> {code:java}
> Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commitException in thread 
> "main" org.apache.hudi.exception.HoodieIOException: Failed to create file 
> /tmp/hudi/tbl_price_cow1/.hoodie/20201202104150.commit at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:474)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:350)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:325)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.saveAsComplete(HoodieActiveTimeline.java:144)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commitStats(AbstractHoodieWriteClient.java:181)
>  at 
> org.apache.hudi.client.SparkRDDWriteClient.commit(SparkRDDWriteClient.java:101)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:413)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:210) 
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:125) at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
> {code}
> It seems that _../.hoodie/20201202104150.commit_ was committed twice, resulting 
> in this exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on a change in pull request #2449: [HUDI-1528] hudi-sync-tools supports synchronization to remote hive

2021-01-20 Thread GitBox


danny0405 commented on a change in pull request #2449:
URL: https://github.com/apache/hudi/pull/2449#discussion_r561628258



##
File path: 
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   Looks weird; in which case can `hiveSyncConfig.hiveMetaStoreUri` be null, and 
what is the priority between `hiveSyncConfig` and the `props`?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-993) Use hoodie.delete.shuffle.parallelism for Delete API

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-993.
---
Resolution: Fixed

> Use hoodie.delete.shuffle.parallelism for Delete API
> 
>
> Key: HUDI-993
> URL: https://issues.apache.org/jira/browse/HUDI-993
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Dongwook Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> While HUDI-328 introduced Delete API, I noticed 
> [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
>  method doesn't allow any parallelism for RDD operation while 
> [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
>  for upsert uses parallelism on RDD.
> {{And "hoodie.delete.shuffle.parallelism" doesn't seem to be used.}}
>  
> I found certain cases, for example when the input RDD has low parallelism but the 
> target table has large files, where a Spark job's performance suffers from that 
> low parallelism. In such cases, upsert performance with "EmptyHoodieRecordPayload" 
> is faster than the delete API.
> Also, this is due to the fact that "hoodie.combine.before.upsert" is true by 
> default; when it is not enabled, the issue would be the same.
> So I wonder whether the input RDD should be repartitioned to 
> "hoodie.delete.shuffle.parallelism" when "hoodie.combine.before.delete" is false, 
> for better performance regardless of the "hoodie.combine.before.delete" setting.
>  
>  
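A self-contained sketch (plain Spark, not the actual DeleteHelper/WriteHelper code) of the difference being described: deduplicating with versus without an explicit shuffle parallelism such as hoodie.delete.shuffle.parallelism.

{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Illustrative only: shows how an explicit parallelism changes the shuffle used for dedup.
public class DeleteDedupSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("delete-dedup-sketch").setMaster("local[2]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      // Pretend these are HoodieKeys coming from a low-parallelism input RDD (1 partition).
      JavaRDD<String> keys = jsc.parallelize(Arrays.asList("key1", "key2", "key1", "key3"), 1);

      // Hypothetical knob standing in for hoodie.delete.shuffle.parallelism.
      int deleteShuffleParallelism = 4;

      // Without a parallelism argument, distinct() keeps the low input partitioning.
      int withoutParallelism = keys.distinct().getNumPartitions();

      // With an explicit parallelism, the dedup shuffle spreads across more partitions.
      int withParallelism = keys.distinct(deleteShuffleParallelism).getNumPartitions();

      System.out.println("partitions without parallelism: " + withoutParallelism);
      System.out.println("partitions with parallelism:    " + withParallelism);
    }
  }
}
{code}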



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-993) Use hoodie.delete.shuffle.parallelism for Delete API

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-993:

Status: Open  (was: New)

> Use hoodie.delete.shuffle.parallelism for Delete API
> 
>
> Key: HUDI-993
> URL: https://issues.apache.org/jira/browse/HUDI-993
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Dongwook Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> While HUDI-328 introduced Delete API, I noticed 
> [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
>  method doesn't allow any parallelism for RDD operation while 
> [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
>  for upsert uses parallelism on RDD.
> {{And "hoodie.delete.shuffle.parallelism" doesn't seem to be used.}}
>  
> I found certain cases, for example when the input RDD has low parallelism but the 
> target table has large files, where a Spark job's performance suffers from that 
> low parallelism. In such cases, upsert performance with "EmptyHoodieRecordPayload" 
> is faster than the delete API.
> Also, this is due to the fact that "hoodie.combine.before.upsert" is true by 
> default; when it is not enabled, the issue would be the same.
> So I wonder whether the input RDD should be repartitioned to 
> "hoodie.delete.shuffle.parallelism" when "hoodie.combine.before.delete" is false, 
> for better performance regardless of the "hoodie.combine.before.delete" setting.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-993) Use hoodie.delete.shuffle.parallelism for Delete API

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-993:

Fix Version/s: (was: 0.8.0)
   0.7.0

> Use hoodie.delete.shuffle.parallelism for Delete API
> 
>
> Key: HUDI-993
> URL: https://issues.apache.org/jira/browse/HUDI-993
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Dongwook Kwon
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> While HUDI-328 introduced Delete API, I noticed 
> [deduplicateKeys|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/DeleteHelper.java#L51-L57]
>  method doesn't allow any parallelism for RDD operation while 
> [deduplicateRecords|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/action/commit/WriteHelper.java#L104]
>  for upsert uses parallelism on RDD.
> {{And "hoodie.delete.shuffle.parallelism" doesn't seem to be used.}}
>  
> I found certain cases, for example when the input RDD has low parallelism but the 
> target table has large files, where a Spark job's performance suffers from that 
> low parallelism. In such cases, upsert performance with "EmptyHoodieRecordPayload" 
> is faster than the delete API.
> Also, this is due to the fact that "hoodie.combine.before.upsert" is true by 
> default; when it is not enabled, the issue would be the same.
> So I wonder whether the input RDD should be repartitioned to 
> "hoodie.delete.shuffle.parallelism" when "hoodie.combine.before.delete" is false, 
> for better performance regardless of the "hoodie.combine.before.delete" setting.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1284) preCombine all HoodieRecords and update all fields(which is not DefaultValue) according to orderingVal

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1284:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> preCombine all HoodieRecords and update all fields(which is not DefaultValue) 
> according to  orderingVal
> ---
>
> Key: HUDI-1284
> URL: https://issues.apache.org/jira/browse/HUDI-1284
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: karl wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> When more than one HoodieRecord has the same HoodieKey, this function combines 
> all fields (those that are not the DefaultValue) before attempting to 
> insert/upsert (if combining is turned on in HoodieClientConfig).
>  eg: 1)
>  Before:
>  id name age ts
>  1 Karl null 0.0
>  1 null 18 0.0
>  After:
>  id name age ts
>  1 Karl 18 0.0
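A rough sketch of the merge semantics in the example above, modeling "DefaultValue" as null and ignoring orderingVal tie-breaking; this is not the actual payload implementation.

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: for two records sharing a key, take each field from whichever
// record has a non-default (here: non-null) value.
public class FieldLevelPreCombineSketch {
  static Map<String, Object> combine(Map<String, Object> earlier, Map<String, Object> later) {
    Map<String, Object> merged = new HashMap<>(earlier);
    later.forEach((field, value) -> {
      if (value != null) {          // "DefaultValue" is modeled as null here
        merged.put(field, value);
      }
    });
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> r1 = new HashMap<>();
    r1.put("id", 1); r1.put("name", "Karl"); r1.put("age", null); r1.put("ts", 0.0);
    Map<String, Object> r2 = new HashMap<>();
    r2.put("id", 1); r2.put("name", null); r2.put("age", 18); r2.put("ts", 0.0);

    // Prints the merged record with name=Karl and age=18 (field order may vary).
    System.out.println(combine(r1, r2));
  }
}
{code}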



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1363) Provide Option to drop columns after they are used to generate partition or record keys

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1363:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> Provide Option to drop columns after they are used to generate partition or 
> record keys
> ---
>
> Key: HUDI-1363
> URL: https://issues.apache.org/jira/browse/HUDI-1363
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Context: https://github.com/apache/hudi/issues/2213



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1353) Incremental timeline support for pending clustering operations

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1353:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> Incremental timeline support for pending clustering operations
> --
>
> Key: HUDI-1353
> URL: https://issues.apache.org/jira/browse/HUDI-1353
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1280) Add tool to capture earliest or latest offsets in kafka topics

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1280:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> Add tool to capture earliest or latest offsets in kafka topics 
> ---
>
> Key: HUDI-1280
> URL: https://issues.apache.org/jira/browse/HUDI-1280
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.8.0
>
>
> For bootstrapping cases using spark.write(), we need to capture offsets from the 
> kafka topic and use them as the checkpoint for subsequent reads from Kafka topics.
>  
> [https://github.com/apache/hudi/issues/1985]
> We need to build this integration for a smooth transition to deltastreamer.
>  
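A hedged sketch of what such a tool could do with the plain Kafka consumer API: capture the earliest and latest offsets per partition so they can later be formatted into a deltastreamer checkpoint. The broker address, topic name, and checkpoint format below are placeholders, not Hudi APIs.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaOffsetCaptureSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());

    String topic = "impressions"; // placeholder topic

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
          .map(p -> new TopicPartition(topic, p.partition()))
          .collect(Collectors.toList());

      Map<TopicPartition, Long> earliest = consumer.beginningOffsets(partitions);
      Map<TopicPartition, Long> latest = consumer.endOffsets(partitions);

      // These offsets could then be formatted into whatever checkpoint string deltastreamer expects.
      earliest.forEach((tp, off) ->
          System.out.println(tp + " earliest=" + off + " latest=" + latest.get(tp)));
    }
  }
}
{code}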



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1264) incremental read support with replace

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1264:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> incremental read support with replace
> -
>
> Key: HUDI-1264
> URL: https://issues.apache.org/jira/browse/HUDI-1264
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> initial version, we could fail incremental reads if there is a REPLACE 
> instant. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1214) Need ability to set deltastreamer checkpoints when doing Spark datasource writes

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1214:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> Need ability to set deltastreamer checkpoints when doing Spark datasource 
> writes
> 
>
> Key: HUDI-1214
> URL: https://issues.apache.org/jira/browse/HUDI-1214
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.8.0
>
>
> Such support is needed for bootstrapping cases where users use a spark write to 
> do the initial bootstrap and then subsequently use deltastreamer.
> DeltaStreamer manages checkpoints inside hoodie commit files and expects 
> checkpoints in previously committed metadata. Users are expected to pass a 
> checkpoint or an initial checkpoint provider when performing the bootstrap 
> through deltastreamer. Such support is not present when doing the bootstrap 
> using the Spark Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1201) HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1201:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset 
> when commit files do not have checkpoint
> -
>
> Key: HUDI-1201
> URL: https://issues.apache.org/jira/browse/HUDI-1201
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Balaji Varadarajan
>Assignee: Trevorzhang
>Priority: Major
> Fix For: 0.8.0
>
>
> [https://github.com/apache/hudi/issues/1985]
>  
> It would be easier for the user to simply tell deltastreamer to read from the 
> earliest offset, instead of implementing -initial-checkpoint-provider or 
> passing raw kafka checkpoints, when the table was initially bootstrapped 
> through spark.write().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-945) Cleanup spillable map files eagerly as part of close

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-945:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Sreeram Ramji
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Currently, files used by the external spillable map are only deleted on exit. For 
> spark-streaming/deltastreamer continuous-mode cases, which run many iterations, 
> it is better to eagerly delete the files when the handles using them are closed.
> We need to eagerly delete the files in the following cases:
>  # HoodieMergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView
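A minimal sketch of the eager-cleanup idea (not the actual Hudi handle classes): delete the backing spill file in close() instead of relying on delete-on-exit, which only runs at JVM shutdown and so leaks files in long-running jobs.

{code:java}
import java.io.Closeable;
import java.io.File;
import java.io.IOException;

public class SpillableHandleSketch implements Closeable {
  private final File spillFile;

  public SpillableHandleSketch() throws IOException {
    this.spillFile = File.createTempFile("hudi-spillable-sketch", ".data");
    // Old behavior (for contrast): spillFile.deleteOnExit();
  }

  @Override
  public void close() throws IOException {
    // Eager cleanup: remove the backing file as soon as the handle is closed.
    if (spillFile.exists() && !spillFile.delete()) {
      throw new IOException("Could not delete spill file " + spillFile);
    }
  }

  public static void main(String[] args) throws IOException {
    try (SpillableHandleSketch handle = new SpillableHandleSketch()) {
      System.out.println("Spill file in use: " + handle.spillFile);
    } // spill file removed here instead of at JVM exit
  }
}
{code}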



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-956) Test COW : Presto Realtime Query with metadata bootstrap

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-956:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Test COW : Presto Realtime Query with metadata bootstrap
> 
>
> Key: HUDI-956
> URL: https://issues.apache.org/jira/browse/HUDI-956
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-837) Fix AvroKafkaSource to use the latest schema for reading

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-837:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Fix AvroKafkaSource to use the latest schema for reading
> 
>
> Key: HUDI-837
> URL: https://issues.apache.org/jira/browse/HUDI-837
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.8.0
>
>
> Currently we specify KafkaAvroDeserializer as the value for 
> value.deserializer in AvroKafkaSource. This implies that a published record is 
> read using the same schema with which it was written, even if the schema has 
> evolved in between. As a result, messages in an incoming batch can have 
> different schemas, which has to be handled at the time of actually writing the 
> records to parquet.
> This Jira aims at providing an option to read all the messages with the same 
> schema by implementing a new custom deserializer class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-954) Test COW : Presto Read Optimized Query with metadata bootstrap

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-954:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Test COW : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-954
> URL: https://issues.apache.org/jira/browse/HUDI-954
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-955) Test MOR : Presto Read Optimized Query with metadata bootstrap

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-955:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Test MOR : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-955
> URL: https://issues.apache.org/jira/browse/HUDI-955
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1120:
-
Fix Version/s: (was: 0.7.0)
   0.8.0

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1502) Restore on MOR table leaves metadata table out-of-sync from data table

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1502:
-
Status: Closed  (was: Patch Available)

> Restore on MOR table leaves metadata table out-of-sync from data table
> --
>
> Key: HUDI-1502
> URL: https://issues.apache.org/jira/browse/HUDI-1502
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2021-01-03-22-48-54-646.png
>
>
> Below is the stack trace from running `TestHoodieBackedMetadata#testSync` on 
> MOR tables. This seems like a more fundamental issue with deleting instant 
> files during restore.
> What happens is that a restore rolls back a delta commit that has not been 
> synced yet (20210103224054 in this example). That delta commit introduced a new 
> log file which has not been added to the metadata table, but the restore 
> effectively deletes 20210103224054.deltacommit.
> {code}
> Commit 20210103224042 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_4-2-6_20210103224041.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet,
>  25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_1-2-3_20210103224041.parquet,
>  4733dbda-7824-4411-a708-4b2d978f887b-0_4-9-22_20210103224042.parquet, 
> 532a6f9b-ca89-4b96-84b7-0e3b13068b4b-0_3-9-21_20210103224042.parquet, 
> 6842e596-46b3-4546-9faa-8a7f8c674a17-0_0-2-2_20210103224041.parquet, 
> 7f0635d7-126e-40b6-9677-7fd8a123d5b9-0_3-2-5_20210103224041.parquet, 
> d1906fdc-66ca-48a4-86b6-687c865d939d-0_2-9-20_20210103224042.parquet, 
> fd446460-a662-434a-a6ab-1cd498af94ca-0_2-2-4_20210103224041.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } 
>  Syncing [20210103224045__deltacommit__COMPLETED] to metadata table.
> Commit 20210103224045 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_0-31-52_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_2-31-54_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } >>> (after compaction) State at 
> 20210103224051 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after delete) State at 20210103224052 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after clean) State at 20210103224053 files 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after update) State at 20210103224054 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224051.log.1_1-160-262 
>

[jira] [Updated] (HUDI-909) Integrate hudi with flink engine

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-909:

Fix Version/s: (was: 0.7.0)

> Integrate hudi with flink engine
> 
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>
> Integrate hudi with flink engine



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-538) [UMBRELLA] Restructuring hudi client module for multi engine support

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-538:

Fix Version/s: (was: 0.7.0)

> [UMBRELLA] Restructuring hudi client module for multi engine support
> 
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework, which makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track that work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-651:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them, as in Hive QL. Something is amiss:
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> 

[jira] [Updated] (HUDI-1502) Restore on MOR table leaves metadata table out-of-sync from data table

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1502:
-
Status: Patch Available  (was: In Progress)

> Restore on MOR table leaves metadata table out-of-sync from data table
> --
>
> Key: HUDI-1502
> URL: https://issues.apache.org/jira/browse/HUDI-1502
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2021-01-03-22-48-54-646.png
>
>
> Below is the stack trace from running `TestHoodieBackedMetadata#testSync` on 
> MOR tables. This seems like a more fundamental issue with deleting instant 
> files during restore. 
> What happens is that the restore rolls back a delta commit that has not been 
> synced yet (20210103224054 in the example). That delta commit has introduced 
> a new log file, which has not been added to the metadata table, but the 
> restore effectively deletes the 20210103224054.deltacommit. 
> {code}
> Commit 20210103224042 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_4-2-6_20210103224041.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet,
>  25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_1-2-3_20210103224041.parquet,
>  4733dbda-7824-4411-a708-4b2d978f887b-0_4-9-22_20210103224042.parquet, 
> 532a6f9b-ca89-4b96-84b7-0e3b13068b4b-0_3-9-21_20210103224042.parquet, 
> 6842e596-46b3-4546-9faa-8a7f8c674a17-0_0-2-2_20210103224041.parquet, 
> 7f0635d7-126e-40b6-9677-7fd8a123d5b9-0_3-2-5_20210103224041.parquet, 
> d1906fdc-66ca-48a4-86b6-687c865d939d-0_2-9-20_20210103224042.parquet, 
> fd446460-a662-434a-a6ab-1cd498af94ca-0_2-2-4_20210103224041.parquet], 
> deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } 
>  Syncing [20210103224045__deltacommit__COMPLETED] to metadata table.
> Commit 20210103224045 added HoodieKey { recordKey=2016/03/15 
> partitionPath=files} HoodieMetadataPayload {key=2016/03/15, type=2, 
> creations=[6b8f2187-5505-40ae-845e-a71a2163d064-0_0-31-52_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/16 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/16, type=2, 
> creations=[25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=2015/03/17 partitionPath=files} 
> HoodieMetadataPayload {key=2015/03/17, type=2, 
> creations=[2ab899de-4745-43c5-9fa4-d09721d3aa91-0_2-31-54_20210103224045.parquet],
>  deletions=[], }
>   HoodieKey { recordKey=__all_partitions__ partitionPath=files} 
> HoodieMetadataPayload {key=__all_partitions__, type=1, creations=[2015/03/16, 
> 2015/03/17, 2016/03/15], deletions=[], } >>> (after compaction) State at 
> 20210103224051 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after delete) State at 20210103224052 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224042.log.1_0-100-148 
>028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_1-9-19_20210103224042.parquet 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_0-9-18_20210103224042.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after clean) State at 20210103224053 files 
>
> 028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_3-110-170_20210103224051.parquet 
>25c9a174-4c07-43a1-a1a2-40454a3f0310-0_1-31-53_20210103224045.parquet 
>  
> >>> (after update) State at 20210103224054 files 
>.028cc15e-85ef-4b6f-b6f1-a1aa01131dbc-0_20210103224051.log.1_1-160-262 
>

[jira] [Updated] (HUDI-304) Bring back spotless plugin

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-304:

Fix Version/s: (was: 0.7.0)
   0.8.0

> Bring back spotless plugin 
> ---
>
> Key: HUDI-304
> URL: https://issues.apache.org/jira/browse/HUDI-304
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup, Testing
>Reporter: Balaji Varadarajan
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The spotless plugin has been turned off because the eclipse style format it 
> was referencing was removed for compliance reasons. 
> We use the google-style eclipse format with some changes:
> (diff against the google style sheet: line 90 changed, and line 242 changes 
> value="100" to value="120")
>  
> The eclipse style sheet was originally obtained from 
> [https://github.com/google/styleguide], which carries a CC-BY 3.0 license 
> that is not compatible with source distribution (see 
> [https://www.apache.org/legal/resolved.html#cc-by]). 
>  
> We need to figure out a way to bring this back
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1423) Support delete in java client

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1423:
-
Fix Version/s: 0.7.0

> Support delete in java client
> -
>
> Key: HUDI-1423
> URL: https://issues.apache.org/jira/browse/HUDI-1423
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1493) Schema evolution does not allow INT to LONG

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1493:
-
Status: Open  (was: New)

> Schema evolution does not allow INT to LONG
> ---
>
> Key: HUDI-1493
> URL: https://issues.apache.org/jira/browse/HUDI-1493
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> [https://github.com/apache/hudi/issues/2063]
> As per the reported issue, INT to LONG schema evolution throws an error, even 
> though this should be a legally allowed evolution. 
>  
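> As a rough illustration of why this promotion should pass, Avro's own 
> compatibility check treats reading an int-written field with a long reader 
> schema as compatible. This is only a hedged sketch of that rule (plain Avro 
> API, not Hudi's actual validation path):
> {code:java}
> import org.apache.avro.{Schema, SchemaCompatibility}
> 
> object IntToLongPromotionCheck {
>   def main(args: Array[String]): Unit = {
>     // Writer wrote the field as int; the evolved reader schema declares it as long.
>     val writerSchema = new Schema.Parser().parse(
>       """{"type":"record","name":"r","fields":[{"name":"f","type":"int"}]}""")
>     val readerSchema = new Schema.Parser().parse(
>       """{"type":"record","name":"r","fields":[{"name":"f","type":"long"}]}""")
> 
>     // Avro resolution rules allow int -> long promotion, so this reports COMPATIBLE.
>     val result = SchemaCompatibility.checkReaderWriterCompatibility(readerSchema, writerSchema)
>     println(result.getType)
>   }
> }
> {code}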



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1493) Schema evolution does not allow INT to LONG

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1493.

Resolution: Fixed

> Schema evolution does not allow INT to LONG
> ---
>
> Key: HUDI-1493
> URL: https://issues.apache.org/jira/browse/HUDI-1493
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> [https://github.com/apache/hudi/issues/2063]
> As per the reported issue, INT to LONG schema evolution throws an error, even 
> though this should be a legally allowed evolution. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1147) Generate valid timestamp and partition for data generator

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1147:
-
Fix Version/s: 0.7.0

> Generate valid timestamp and partition for data generator
> -
>
> Key: HUDI-1147
> URL: https://issues.apache.org/jira/browse/HUDI-1147
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1147) Generate valid timestamp and partition for data generator

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1147.

Resolution: Fixed

> Generate valid timestamp and partition for data generator
> -
>
> Key: HUDI-1147
> URL: https://issues.apache.org/jira/browse/HUDI-1147
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1147) Generate valid timestamp and partition for data generator

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1147:
-
Status: Open  (was: New)

> Generate valid timestamp and partition for data generator
> -
>
> Key: HUDI-1147
> URL: https://issues.apache.org/jira/browse/HUDI-1147
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1493) Schema evolution does not allow INT to LONG

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1493:
-
Fix Version/s: 0.7.0

> Schema evolution does not allow INT to LONG
> ---
>
> Key: HUDI-1493
> URL: https://issues.apache.org/jira/browse/HUDI-1493
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> [https://github.com/apache/hudi/issues/2063]
> As per the reported issue, INT to LONG schema evolution throws an error, even 
> though this should be a legally allowed evolution. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1398) Align insert file size for reducing IO

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1398:
-
Status: Open  (was: New)

> Align insert file size for reducing IO
> --
>
> Key: HUDI-1398
> URL: https://issues.apache.org/jira/browse/HUDI-1398
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: steven zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, when there are remaining records, we insert totalUnassignedInserts 
> into new files
> and set the number of records per new bucket as follows:
> recordsPerBucket.add(totalUnassignedInserts / insertBuckets); 
> ([https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]
>  L 188)
> This just computes the average record count and may create new small files.
> For example:
> totalUnassignedInserts = 250
> insertRecordsPerBucket = 120
> so insertBuckets = 3 (e.g. file_a, file_b, file_c)
> then file_a = file_b = file_c = 83 
> All three files will be treated as small files by the next delta process.
> We can reduce IO by instead setting file_a = 120, file_b = 120, file_c = 10.
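> A hedged sketch of the two sizing strategies described above (illustrative 
> only, not the UpsertPartitioner code; the names are made up for the example):
> {code:java}
> object BucketSizingSketch {
>   // Current behaviour: spread the unassigned inserts evenly across all new buckets.
>   def averageSplit(total: Long, recordsPerBucket: Long): Seq[Long] = {
>     val buckets = math.ceil(total.toDouble / recordsPerBucket).toInt
>     Seq.fill(buckets)(total / buckets)            // 250, 120 -> List(83, 83, 83)
>   }
> 
>   // Suggested behaviour: fill buckets to capacity and put the remainder in the last one.
>   def fillToCapacity(total: Long, recordsPerBucket: Long): Seq[Long] = {
>     val full = (total / recordsPerBucket).toInt
>     val rest = total % recordsPerBucket
>     Seq.fill(full)(recordsPerBucket) ++ (if (rest > 0) Seq(rest) else Nil)
>     // 250, 120 -> List(120, 120, 10)
>   }
> 
>   def main(args: Array[String]): Unit = {
>     println(averageSplit(250, 120))
>     println(fillToCapacity(250, 120))
>   }
> }
> {code}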



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1398) Align insert file size for reducing IO

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1398:
-
Fix Version/s: 0.7.0

> Align insert file size for reducing IO
> --
>
> Key: HUDI-1398
> URL: https://issues.apache.org/jira/browse/HUDI-1398
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: steven zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, when there are remaining records, we insert totalUnassignedInserts 
> into new files
> and set the number of records per new bucket as follows:
> recordsPerBucket.add(totalUnassignedInserts / insertBuckets); 
> ([https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]
>  L 188)
> This just computes the average record count and may create new small files.
> For example:
> totalUnassignedInserts = 250
> insertRecordsPerBucket = 120
> so insertBuckets = 3 (e.g. file_a, file_b, file_c)
> then file_a = file_b = file_c = 83 
> All three files will be treated as small files by the next delta process.
> We can reduce IO by instead setting file_a = 120, file_b = 120, file_c = 10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1398) Align insert file size for reducing IO

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1398.

Resolution: Fixed

> Align insert file size for reducing IO
> --
>
> Key: HUDI-1398
> URL: https://issues.apache.org/jira/browse/HUDI-1398
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: steven zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, when there are remaining records, we insert totalUnassignedInserts 
> into new files
> and set the number of records per new bucket as follows:
> recordsPerBucket.add(totalUnassignedInserts / insertBuckets); 
> ([https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java]
>  L 188)
> This just computes the average record count and may create new small files.
> For example:
> totalUnassignedInserts = 250
> insertRecordsPerBucket = 120
> so insertBuckets = 3 (e.g. file_a, file_b, file_c)
> then file_a = file_b = file_c = 83 
> All three files will be treated as small files by the next delta process.
> We can reduce IO by instead setting file_a = 120, file_b = 120, file_c = 10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1331) Improving Hudi test suite framework to support proper validation and long running tests

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1331:
-
Status: Open  (was: New)

> Improving Hudi test suite framework to support proper validation and long 
> running tests
> ---
>
> Key: HUDI-1331
> URL: https://issues.apache.org/jira/browse/HUDI-1331
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Improve hudi test suite framework to support proper validation and long 
> running tests. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1331) Improving Hudi test suite framework to support proper validation and long running tests

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1331.

Resolution: Fixed

> Improving Hudi test suite framework to support proper validation and long 
> running tests
> ---
>
> Key: HUDI-1331
> URL: https://issues.apache.org/jira/browse/HUDI-1331
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Improve hudi test suite framework to support proper validation and long 
> running tests. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1487) after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1487.

Resolution: Fixed

> after HUDI-1376 merged unit test testCopyOnWriteStorage will failed random
> --
>
> Key: HUDI-1487
> URL: https://issues.apache.org/jira/browse/HUDI-1487
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
>  
>  
> TestCOWDataSource.testCopyOnWriteStorage fails randomly, because a new upsert 
> commit is added before the incremental read.
> // pull the latest commit
>  val hoodieIncViewDF2 = spark.read.format("org.apache.hudi")
>  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
> DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
>  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, commitInstantTime2)
>  .load(basePath)
> The new commit is:
> // Upsert based on the written table with Hudi metadata columns
>  val verificationRowKey = 
> snapshotDF1.limit(1).select("_row_key").first.getString(0)
> Since verificationRowKey is included in "uniqueKeyCnt", the test fails with: 
> "expected: <65> but was: <66>"
>  
>  
> [https://travis-ci.com/github/apache/hudi/jobs/463879606]
>  
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:173)
>  at 
> org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getLatestFileSlices(RemoteHoodieTableFileSystemView.java:275)
>  ... 30 more
>  [WARN ] 2020-12-22 12:32:40,788 
> org.apache.hudi.testutils.HoodieClientTestHarness - Closing file-system 
> instance used in previous test-run
>  [ERROR] Tests run: 7, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 
> 35.352 s <<< FAILURE! - in org.apache.hudi.functional.TestCOWDataSource
>  [ERROR] org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage 
> Time elapsed: 15.275 s <<< FAILURE!
>  org.opentest4j.AssertionFailedError: expected: <65> but was: <66>
>  at 
> org.apache.hudi.functional.TestCOWDataSource.testCopyOnWriteStorage(TestCOWDataSource.scala:160)
>  [INFO] Running org.apache.hudi.functional.TestDataSourceForBootstrap
>  [WARN ] 2020-12-22 12:32:43,641 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:32:47,818 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:32:50,921 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:32:56,169 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:32:56,793 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:32:57,388 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:33:05,191 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:33:10,221 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:33:17,985 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.
>  [WARN ] 2020-12-22 12:33:22,498 org.apache.hudi.DefaultSource - Loading Base 
> File Only View.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1331) Improving Hudi test suite framework to support proper validation and long running tests

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1331:
-
Fix Version/s: 0.7.0

> Improving Hudi test suite framework to support proper validation and long 
> running tests
> ---
>
> Key: HUDI-1331
> URL: https://issues.apache.org/jira/browse/HUDI-1331
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Improve hudi test suite framework to support proper validation and long 
> running tests. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1489) Not able to read after updating bootstrap table with written table

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1489.

Resolution: Fixed

> Not able to read after updating bootstrap table with written table
> --
>
> Key: HUDI-1489
> URL: https://issues.apache.org/jira/browse/HUDI-1489
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> After updating the Hudi table with rows read back from the written bootstrap 
> table, reading the latest bootstrap table fails.
> h3. Reproduction steps
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.hudi.config.HoodieBootstrapConfig
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> val bucketName = "wenningd-dev"
> val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
> val recordKeyName = "event_id"
> val partitionKeyName = "event_type"
> val precombineKeyName = "event_time"
> val verificationRecordKey = "4"
> val verificationColumn = "event_name"
> val originalVerificationValue = "event_d"
> val updatedVerificationValue = "event_test"
> val sourceTableLocation = "s3://wenningd-dev/hudi/test-data/source_table/"
> val tableType = HoodieTableType.COPY_ON_WRITE.name()
> val verificationSqlQuery = "select " + verificationColumn + " from " + 
> tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
> val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
> val loadTablePath = tablePath + "/*/*"
> // Create table and sync with hive
> val df = spark.emptyDataFrame
> val tableType = HoodieTableType.COPY_ON_WRITE.name
> df.write
>   .format("hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, 
> sourceTableLocation)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, 
> "org.apache.hudi.keygen.SimpleKeyGenerator")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
> recordKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   
> .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // Verify create with spark sql query
> val result0 = spark.sql(verificationSqlQuery)
> if (!(result0.count == 1) || 
> !result0.collect.mkString.contains(originalVerificationValue)) {
>   throw new TestFailureException("Create table verification failed!")
> }
> val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
> val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
> val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
> df5.write.format("org.apache.hudi")
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> partitionKeyName)
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Append)
>   .save(tablePath)
>   val result1 = spark.sql(verificationSqlQuery)
>   val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
> df6.show
> {code}
> df6.show would return:
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
>   at 
> 

[jira] [Updated] (HUDI-1489) Not able to read after updating bootstrap table with written table

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1489:
-
Fix Version/s: 0.7.0

> Not able to read after updating bootstrap table with written table
> --
>
> Key: HUDI-1489
> URL: https://issues.apache.org/jira/browse/HUDI-1489
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> After updating the Hudi table with rows read back from the written bootstrap 
> table, reading the latest bootstrap table fails.
> h3. Reproduction steps
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.hudi.config.HoodieBootstrapConfig
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> val bucketName = "wenningd-dev"
> val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
> val recordKeyName = "event_id"
> val partitionKeyName = "event_type"
> val precombineKeyName = "event_time"
> val verificationRecordKey = "4"
> val verificationColumn = "event_name"
> val originalVerificationValue = "event_d"
> val updatedVerificationValue = "event_test"
> val sourceTableLocation = "s3://wenningd-dev/hudi/test-data/source_table/"
> val tableType = HoodieTableType.COPY_ON_WRITE.name()
> val verificationSqlQuery = "select " + verificationColumn + " from " + 
> tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
> val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
> val loadTablePath = tablePath + "/*/*"
> // Create table and sync with hive
> val df = spark.emptyDataFrame
> val tableType = HoodieTableType.COPY_ON_WRITE.name
> df.write
>   .format("hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, 
> sourceTableLocation)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, 
> "org.apache.hudi.keygen.SimpleKeyGenerator")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
> recordKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   
> .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // Verify create with spark sql query
> val result0 = spark.sql(verificationSqlQuery)
> if (!(result0.count == 1) || 
> !result0.collect.mkString.contains(originalVerificationValue)) {
>   throw new TestFailureException("Create table verification failed!")
> }
> val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
> val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
> val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
> df5.write.format("org.apache.hudi")
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> partitionKeyName)
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Append)
>   .save(tablePath)
>   val result1 = spark.sql(verificationSqlQuery)
>   val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
> df6.show
> {code}
> df6.show would return:
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
>   at 
> 

[jira] [Updated] (HUDI-1489) Not able to read after updating bootstrap table with written table

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1489:
-
Status: Open  (was: New)

> Not able to read after updating bootstrap table with written table
> --
>
> Key: HUDI-1489
> URL: https://issues.apache.org/jira/browse/HUDI-1489
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> After updating the Hudi table with rows read back from the written bootstrap 
> table, reading the latest bootstrap table fails.
> h3. Reproduction steps
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.hudi.config.HoodieBootstrapConfig
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.SparkSession
> val bucketName = "wenningd-dev"
> val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
> val recordKeyName = "event_id"
> val partitionKeyName = "event_type"
> val precombineKeyName = "event_time"
> val verificationRecordKey = "4"
> val verificationColumn = "event_name"
> val originalVerificationValue = "event_d"
> val updatedVerificationValue = "event_test"
> val sourceTableLocation = "s3://wenningd-dev/hudi/test-data/source_table/"
> val tableType = HoodieTableType.COPY_ON_WRITE.name()
> val verificationSqlQuery = "select " + verificationColumn + " from " + 
> tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
> val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
> val loadTablePath = tablePath + "/*/*"
> // Create table and sync with hive
> val df = spark.emptyDataFrame
> val tableType = HoodieTableType.COPY_ON_WRITE.name
> df.write
>   .format("hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, 
> sourceTableLocation)
>   .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, 
> "org.apache.hudi.keygen.SimpleKeyGenerator")
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, 
> recordKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   
> .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // Verify create with spark sql query
> val result0 = spark.sql(verificationSqlQuery)
> if (!(result0.count == 1) || 
> !result0.collect.mkString.contains(originalVerificationValue)) {
>   throw new TestFailureException("Create table verification failed!")
> }
> val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
> val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
> val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
> df5.write.format("org.apache.hudi")
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> partitionKeyName)
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> partitionKeyName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Append)
>   .save(tablePath)
>   val result1 = spark.sql(verificationSqlQuery)
>   val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
> df6.show
> {code}
> df6.show would return:
> {code:java}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
>   at 
> 

[jira] [Updated] (HUDI-1075) Implement a simple merge clustering strategy

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1075:
-
Fix Version/s: 0.7.0

> Implement a simple merge clustering strategy 
> -
>
> Key: HUDI-1075
> URL: https://issues.apache.org/jira/browse/HUDI-1075
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Provide action to merge N small parquet files into M parquet files (M < N). 
> Avoid serializing and deserializing records and just copy parquet blocks when 
> possible.
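> A hedged sketch of the planning side only (greedily grouping small files up 
> to a target size); the parquet block-copy part is out of scope here and all 
> names are illustrative:
> {code:java}
> case class SmallFile(path: String, sizeBytes: Long)
> 
> // Pack N small files into merge groups whose combined size stays under targetBytes.
> def planMergeGroups(files: Seq[SmallFile], targetBytes: Long): Seq[Seq[SmallFile]] =
>   files.sortBy(_.sizeBytes).foldLeft(List(List.empty[SmallFile])) { (groups, f) =>
>     val current = groups.head
>     if (current.map(_.sizeBytes).sum + f.sizeBytes <= targetBytes) (f :: current) :: groups.tail
>     else List(f) :: groups
>   }.map(_.reverse).reverse.filter(_.nonEmpty)
> {code}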



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1419) Introduce base implementation of hudi-java-client

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1419.


> Introduce base implementation of hudi-java-client
> -
>
> Key: HUDI-1419
> URL: https://issues.apache.org/jira/browse/HUDI-1419
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1419) Introduce base implementation of hudi-java-client

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1419:
-
Fix Version/s: 0.7.0

> Introduce base implementation of hudi-java-client
> -
>
> Key: HUDI-1419
> URL: https://issues.apache.org/jira/browse/HUDI-1419
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1470) Hudi-test-suite - DFSHoodieDatasetInputReader.java - Use the latest writer schema, when reading the parquet files.

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1470.

Fix Version/s: 0.7.0
   Resolution: Fixed

> Hudi-test-suite - DFSHoodieDatasetInputReader.java -  Use the latest writer 
> schema,  when reading the parquet files.
> 
>
> Key: HUDI-1470
> URL: https://issues.apache.org/jira/browse/HUDI-1470
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Balajee Nagasubramaniam
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> In DFSHoodieDatasetInputReader, readParquetOrLogFiles(), when reading 
> existing parquet files from a dataset, use the writer schema (latest/evolved 
> schema) to mimic the Hudi writerCore behavior.
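> A minimal sketch of reading a parquet file with an explicitly supplied 
> (latest/evolved) Avro read schema, assuming plain parquet-avro APIs rather 
> than the test-suite reader itself:
> {code:java}
> import org.apache.avro.Schema
> import org.apache.avro.generic.GenericRecord
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.Path
> import org.apache.parquet.avro.{AvroParquetReader, AvroReadSupport}
> 
> def readWithLatestSchema(file: Path, latestWriterSchema: Schema, conf: Configuration): Unit = {
>   // Ask parquet-avro to project records onto the latest writer schema
>   // instead of each file's embedded schema.
>   AvroReadSupport.setAvroReadSchema(conf, latestWriterSchema)
>   val reader = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
>   try {
>     Iterator.continually(reader.read()).takeWhile(_ != null).foreach(println)
>   } finally reader.close()
> }
> {code}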



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1437.

Resolution: Fixed

> some description in spark ui  is not reality, Not good for performance 
> tracking
> ---
>
> Key: HUDI-1437
> URL: https://issues.apache.org/jira/browse/HUDI-1437
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-12-07-23-50-57-212.png
>
>
> Some Spark actions in Hudi do not set a meaningful description, which is not 
> good for performance tracking.
>  
> !image-2020-12-07-23-50-57-212.png|width=693,height=375!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1470) Hudi-test-suite - DFSHoodieDatasetInputReader.java - Use the latest writer schema, when reading the parquet files.

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1470:
-
Status: Open  (was: New)

> Hudi-test-suite - DFSHoodieDatasetInputReader.java -  Use the latest writer 
> schema,  when reading the parquet files.
> 
>
> Key: HUDI-1470
> URL: https://issues.apache.org/jira/browse/HUDI-1470
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Balajee Nagasubramaniam
>Priority: Major
>  Labels: pull-request-available
>
> In DFSHoodieDatasetInputReader, readParquetOrLogFiles(), when reading 
> existing parquet files from a dataset, use the writer schema (latest/evolved 
> schema) to mimic the Hudi writerCore behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1437:
-
Status: Open  (was: New)

> some description in spark ui  is not reality, Not good for performance 
> tracking
> ---
>
> Key: HUDI-1437
> URL: https://issues.apache.org/jira/browse/HUDI-1437
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-12-07-23-50-57-212.png
>
>
> Some Spark actions in Hudi do not set a meaningful description, which is not 
> good for performance tracking.
>  
> !image-2020-12-07-23-50-57-212.png|width=693,height=375!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1437) some description in spark ui is not reality, Not good for performance tracking

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1437:
-
Fix Version/s: 0.7.0

> some description in spark ui  is not reality, Not good for performance 
> tracking
> ---
>
> Key: HUDI-1437
> URL: https://issues.apache.org/jira/browse/HUDI-1437
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
> Attachments: image-2020-12-07-23-50-57-212.png
>
>
> Some Spark actions in Hudi do not set a meaningful description, which is not 
> good for performance tracking.
>  
> !image-2020-12-07-23-50-57-212.png|width=693,height=375!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1406) Add new DFS path sector implementation for listing date based partitions

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1406:
-
Status: Open  (was: New)

> Add new DFS path sector implementation for listing date based partitions
> 
>
> Key: HUDI-1406
> URL: https://issues.apache.org/jira/browse/HUDI-1406
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> The Deltastreamer DFS source lists files from the table path and determines 
> recently changed files based on modification time. For certain workloads where 
> only the latest partitions are affected, we might benefit from listing source 
> input only from recent partitions. This especially helps for data in S3 with 
> multiple partition fields, where listing is time consuming. 
>  
> To support this, I propose adding a DFS selector implementation based on date 
> partitions.
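> A hedged sketch of the idea (not the actual selector implementation; the 
> day-based layout and helper names are assumptions for illustration):
> {code:java}
> import java.time.LocalDate
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
> 
> // Build candidate date-partition paths for the last N days and list only those,
> // instead of recursively listing the whole source path.
> def recentPartitionFiles(basePath: String, lookbackDays: Int, conf: Configuration): Seq[FileStatus] = {
>   val fs = FileSystem.get(new Path(basePath).toUri, conf)
>   (0 to lookbackDays)
>     .map(d => new Path(basePath, LocalDate.now().minusDays(d.toLong).toString)) // e.g. base/2021-01-20
>     .filter(p => fs.exists(p))
>     .flatMap(p => fs.listStatus(p).toSeq)
> }
> {code}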



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1406) Add new DFS path sector implementation for listing date based partitions

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1406.

Resolution: Fixed

> Add new DFS path sector implementation for listing date based partitions
> 
>
> Key: HUDI-1406
> URL: https://issues.apache.org/jira/browse/HUDI-1406
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> The Deltastreamer DFS source lists files from the table path and determines 
> recently changed files based on modification time. For certain workloads where 
> only the latest partitions are affected, we might benefit from listing source 
> input only from recent partitions. This especially helps for data in S3 with 
> multiple partition fields, where listing is time consuming. 
>  
> To support this, I propose adding a DFS selector implementation based on date 
> partitions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1376:
-
Status: Open  (was: New)

> Drop Hudi metadata columns before Spark datasource writing 
> ---
>
> Key: HUDI-1376
> URL: https://issues.apache.org/jira/browse/HUDI-1376
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When updating a Hudi table through the Spark datasource, the schema of the 
> input dataframe is used as the schema stored in the commit files. Thus, when 
> upserting rows containing metadata columns, the commit file will store the 
> metadata columns schema, which is unnecessary for common cases and also 
> causes an issue for bootstrap tables.
> Since metadata columns are not used during the Spark datasource writing 
> process, we can drop those columns at the beginning.
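> A minimal sketch of that dropping step, assuming the standard _hoodie_* column 
> names (hedged illustration, not the datasource's actual code):
> {code:java}
> import org.apache.spark.sql.DataFrame
> 
> val hoodieMetaColumns = Seq(
>   "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
>   "_hoodie_partition_path", "_hoodie_file_name")
> 
> // drop() silently ignores columns that are not present, so this is safe for
> // plain input frames as well as frames read back from a Hudi table.
> def dropHoodieMetaColumns(df: DataFrame): DataFrame =
>   df.drop(hoodieMetaColumns: _*)
> {code}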



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1448) hudi dla sync skip rt create

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1448:
-
Status: Open  (was: New)

> hudi  dla sync skip rt create
> -
>
> Key: HUDI-1448
> URL: https://issues.apache.org/jira/browse/HUDI-1448
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1376:
-
Fix Version/s: 0.7.0

> Drop Hudi metadata columns before Spark datasource writing 
> ---
>
> Key: HUDI-1376
> URL: https://issues.apache.org/jira/browse/HUDI-1376
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When updating a Hudi table through the Spark datasource, the schema of the 
> input dataframe is used as the schema stored in the commit files. Thus, when 
> upserting rows containing metadata columns, the commit file will store the 
> metadata columns schema, which is unnecessary for common cases and also 
> causes an issue for bootstrap tables.
> Since metadata columns are not used during the Spark datasource writing 
> process, we can drop those columns at the beginning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1448) hudi dla sync skip rt create

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1448.

Resolution: Fixed

> hudi  dla sync skip rt create
> -
>
> Key: HUDI-1448
> URL: https://issues.apache.org/jira/browse/HUDI-1448
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1448) hudi dla sync skip rt create

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1448:
-
Fix Version/s: 0.7.0

> hudi  dla sync skip rt create
> -
>
> Key: HUDI-1448
> URL: https://issues.apache.org/jira/browse/HUDI-1448
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1376.

Resolution: Fixed

> Drop Hudi metadata columns before Spark datasource writing 
> ---
>
> Key: HUDI-1376
> URL: https://issues.apache.org/jira/browse/HUDI-1376
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When updating a Hudi table through the Spark datasource, the schema of the 
> input dataframe is used as the schema stored in the commit files. Thus, when 
> upserting rows containing metadata columns, the commit file will store the 
> metadata columns schema, which is unnecessary for common cases and also 
> causes an issue for bootstrap tables.
> Since metadata columns are not used during the Spark datasource writing 
> process, we can drop those columns at the beginning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1435) Marker File Reconciliation failing for Non-Partitioned datasets when duplicate marker files present

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1435:
-
Status: Closed  (was: Patch Available)

> Marker File Reconciliation failing for Non-Partitioned datasets when 
> duplicate marker files present
> ---
>
> Key: HUDI-1435
> URL: https://issues.apache.org/jira/browse/HUDI-1435
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> GH : https://github.com/apache/hudi/issues/2294



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Trevor-zhang commented on a change in pull request #2449: [HUDI-1528] hudi-sync-tools supports synchronization to remote hive

2021-01-20 Thread GitBox


Trevor-zhang commented on a change in pull request #2449:
URL: https://github.com/apache/hudi/pull/2449#discussion_r561621697



##
File path: 
hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -284,6 +284,9 @@ public static HiveSyncConfig 
buildHiveSyncConfig(TypedProperties props, String b
 props.getString(DataSourceWriteOptions.HIVE_PASS_OPT_KEY(), 
DataSourceWriteOptions.DEFAULT_HIVE_PASS_OPT_VAL());
 hiveSyncConfig.jdbcUrl =
 props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), 
DataSourceWriteOptions.DEFAULT_HIVE_URL_OPT_VAL());
+if (hiveSyncConfig.hiveMetaStoreUri != null) {
+  hiveSyncConfig.hiveMetaStoreUri = 
props.getString(DataSourceWriteOptions.HIVE_URL_OPT_KEY(), 
DataSourceWriteOptions.DEFAULT_HIVE_METASTORE_URI_OPT_VAL());
+}

Review comment:
   Because this is not a required option.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1428) Clean old fileslice is invalid

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1428:
-
Status: Open  (was: New)

>  Clean old fileslice is invalid 
> 
>
> Key: HUDI-1428
> URL: https://issues.apache.org/jira/browse/HUDI-1428
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.6.0, 0.7.0
>Reporter: steven zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Reproduce:
> Table type MERGE_ON_READ, with hoodie.cleaner.commits.retained=2.
> Insert into the table three times, into the same partition.
> This creates the following parquet files:
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-213-1740_20201201200149.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-176-1702_20201201195638.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-139-1664_20201201195219.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-103-1632_20201201193835.parquet
> 20201201200149.parquet is the newest.
> The old parquet files are not deleted when client.clean() is run.
> The reason is that CleanPlanner initializes commitTimeline with 
> hoodieTable.getCompletedCommitTimeline(), and the getCompletedCommitTimeline() 
> method does not include DELTA_COMMIT_ACTION.
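A minimal sketch of the direction of the fix, assuming the timeline accessors exposed by `HoodieTableMetaClient` (`getCommitTimeline()` vs `getCommitsTimeline()`); the exact call site inside `CleanPlanner` may differ:

```
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class CleanerTimelineSketch {
  // For MERGE_ON_READ tables, inserts land as delta commits, so the cleaner
  // must look at a timeline that also contains DELTA_COMMIT_ACTION.
  public static HoodieTimeline completedCommitsForCleaning(HoodieTableMetaClient metaClient) {
    // getCommitTimeline() covers only commit/replace actions and would miss
    // the delta commits produced by the inserts in the reproduction above.
    return metaClient.getCommitsTimeline().filterCompletedInstants();
  }
}
```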



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1428) Clean old fileslice is invalid

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1428.

Resolution: Fixed

>  Clean old fileslice is invalid 
> 
>
> Key: HUDI-1428
> URL: https://issues.apache.org/jira/browse/HUDI-1428
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Cleaner
>Affects Versions: 0.6.0, 0.7.0
>Reporter: steven zhang
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Reproduce:
> Table type MERGE_ON_READ, with hoodie.cleaner.commits.retained=2.
> Insert into the table three times, into the same partition.
> This creates the following parquet files:
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-213-1740_20201201200149.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-176-1702_20201201195638.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-139-1664_20201201195219.parquet
> a2c57b73-5ba9-4744-9027-640075a179ec-0_0-103-1632_20201201193835.parquet
> 20201201200149.parquet is the newest.
> The old parquet files are not deleted when client.clean() is run.
> The reason is that CleanPlanner initializes commitTimeline with 
> hoodieTable.getCompletedCommitTimeline(), and the getCompletedCommitTimeline() 
> method does not include DELTA_COMMIT_ACTION.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1412) Make HoodieWriteConfig support setting different default value according to engine type

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1412:
-
Fix Version/s: (was: 0.6.1)
   0.7.0

> Make HoodieWriteConfig support setting different default value according to 
> engine type
> ---
>
> Key: HUDI-1412
> URL: https://issues.apache.org/jira/browse/HUDI-1412
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, `HoodieIndexConfig` sets its default index type to bloom, which is 
> suitable for the Spark engine.
> But since Hudi now supports the Flink engine, the default values should be 
> set according to the engine the user runs on.
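Purely as an illustration of the idea (hypothetical types, not Hudi's actual implementation), the default could be resolved from the engine type when the config is built:

```
public class EngineAwareDefaults {
  // Hypothetical engine enum, for illustration only.
  public enum EngineType { SPARK, FLINK }

  // Pick an engine-appropriate default index type: BLOOM fits Spark's batch
  // style of writing, while an in-memory/state-backed index fits Flink
  // streaming writes better.
  public static String defaultIndexType(EngineType engineType) {
    switch (engineType) {
      case FLINK:
        return "INMEMORY";
      case SPARK:
      default:
        return "BLOOM";
    }
  }
}
```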



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1349) spark sql support overwrite use replace action with dynamic partitioning

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1349:
-
Fix Version/s: 0.7.0

> spark sql support overwrite use  replace action with dynamic partitioning
> -
>
> Key: HUDI-1349
> URL: https://issues.apache.org/jira/browse/HUDI-1349
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, Spark SQL overwrite just does this:
> } else if (mode == SaveMode.Overwrite && tableExists) {
>  log.warn(s"hoodie table at $tablePath already exists. Deleting existing data 
> & overwriting with new data.")
>  fs.delete(tablePath, true)
>  tableExists = false
> }
> Overwrite should instead use the replace action.
>  
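For comparison, a sketch of what a replace-based overwrite looks like from the datasource side, assuming the `insert_overwrite` operation introduced in 0.7.0; the record key, partition path and table name below are placeholders:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class InsertOverwriteSketch {
  // Overwrite the touched partitions with a replacecommit instead of deleting
  // the whole table path up front.
  public static void overwrite(Dataset<Row> df, String tablePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.operation", "insert_overwrite")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        .option("hoodie.table.name", "my_table")
        .mode(SaveMode.Append)
        .save(tablePath);
  }
}
```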



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1357) Add a check to ensure there is no data loss when writing to HUDI dataset

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1357.

Resolution: Fixed

> Add a check to ensure there is no data loss when writing to HUDI dataset
> 
>
> Key: HUDI-1357
> URL: https://issues.apache.org/jira/browse/HUDI-1357
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When updating a HUDI dataset with updates + deletes, records from existing 
> base files are read, merged with the updates and deletes, and finally written 
> to newer base files.
> It should hold that:
> count(records_in_older_base_file) = count(records_in_new_base_file) + num_deletes
> In our internal production deployment, we had an issue wherein, due to a 
> parquet bug in handling the schema, reading existing records returned null 
> data. This led to many records not being written out from the older parquet 
> file into the newer parquet file.
> This check ensures that such issues do not lead to data loss by triggering 
> an exception when the expected record counts do not match. The check is off 
> by default and controlled through a HoodieWriteConfig parameter.
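A minimal sketch of the invariant being enforced, using hypothetical count variables (the real check sits in the merge/write path and is gated by a HoodieWriteConfig flag):

```
public class MergeRecordCountCheck {
  // Every record in the old base file must either be rewritten into the new
  // base file or be accounted for as an explicit delete; anything less means
  // records were silently dropped during the merge.
  public static void validate(long recordsInOldFile, long recordsWritten, long recordsDeleted) {
    if (recordsWritten + recordsDeleted < recordsInOldFile) {
      throw new IllegalStateException(String.format(
          "Possible data loss: old file had %d records, but only %d were written and %d deleted",
          recordsInOldFile, recordsWritten, recordsDeleted));
    }
  }
}
```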



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1196) Record being placed in incorrect partition during upsert on COW/MOR global indexed tables

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1196.

Resolution: Fixed

> Record being placed in incorrect partition during upsert on COW/MOR global 
> indexed tables
> -
>
> Key: HUDI-1196
> URL: https://issues.apache.org/jira/browse/HUDI-1196
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ryan Pifer
>Assignee: Ryan Pifer
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When upserting a record into a global index table (global bloom and hbase), 
> where a single batch has multiple versions of the record in different 
> partitions, the record is deduplicated correctly but placed in the incorrect 
> partition. This happened even with "hoodie.bloom.update.partition.path=true" set.
>  
> Batch with multiple versions of a record in different partitions:
> ```
> scala> val inputDF = spark.read.format("parquet").load(inputDataPath).show()
> +--------+---------+----------------+-------------+-------------+
> |     wbn|    cs_ss|     action_date|           ad|   ad_updated|
> +--------+---------+----------------+-------------+-------------+
> |12345678|InTransit|1596716921000601|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000602|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000603|2020-08-06-13|2020-08-06-13|
> +--------+---------+----------------+-------------+-------------+
> ```
>  
> Values when querying _rt and _ro tables:
> ```
> scala> spark.sql("select * from gb_update_partition_1_ro").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817220935|  20200817220935_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>   
> scala> spark.sql("select * from gb_update_partition_1_rt").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817221924|  20200817221924_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>  ```
>  
> We can see that the record displays the most current version of the data, 
> except that the partition values come from the older version.
>  
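For context, a sketch of the writer options that set up this scenario, assuming a global bloom index with partition-path updates enabled; the field and table names are taken from the example data and the option keys are the standard string names:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class GlobalIndexUpsertSketch {
  // Upsert with a global bloom index that is allowed to move a record to the
  // partition path of its latest version.
  public static void upsert(Dataset<Row> df, String tablePath) {
    df.write().format("hudi")
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.recordkey.field", "wbn")
        .option("hoodie.datasource.write.precombine.field", "action_date")
        .option("hoodie.datasource.write.partitionpath.field", "ad")
        .option("hoodie.table.name", "gb_update_partition_1")
        .option("hoodie.index.type", "GLOBAL_BLOOM")
        .option("hoodie.bloom.index.update.partition.path", "true")
        .mode(SaveMode.Append)
        .save(tablePath);
  }
}
```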



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1357) Add a check to ensure there is no data loss when writing to HUDI dataset

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1357:
-
Fix Version/s: 0.7.0

> Add a check to ensure there is no data loss when writing to HUDI dataset
> 
>
> Key: HUDI-1357
> URL: https://issues.apache.org/jira/browse/HUDI-1357
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When updating a HUDI dataset with updates + deletes, records from existing 
> base files are read, merged with the updates and deletes, and finally written 
> to newer base files.
> It should hold that:
> count(records_in_older_base_file) = count(records_in_new_base_file) + num_deletes
> In our internal production deployment, we had an issue wherein, due to a 
> parquet bug in handling the schema, reading existing records returned null 
> data. This led to many records not being written out from the older parquet 
> file into the newer parquet file.
> This check ensures that such issues do not lead to data loss by triggering 
> an exception when the expected record counts do not match. The check is off 
> by default and controlled through a HoodieWriteConfig parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1196) Record being placed in incorrect partition during upsert on COW/MOR global indexed tables

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1196:
-
Fix Version/s: 0.7.0

> Record being placed in incorrect partition during upsert on COW/MOR global 
> indexed tables
> -
>
> Key: HUDI-1196
> URL: https://issues.apache.org/jira/browse/HUDI-1196
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ryan Pifer
>Assignee: Ryan Pifer
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> When upserting a record into a global index table (global bloom and hbase), 
> where a single batch has multiple versions of the record in different 
> partitions, the record is deduplicated correctly but placed in the incorrect 
> partition. This happened even with "hoodie.bloom.update.partition.path=true" set.
>  
> Batch with multiple versions of a record in different partitions:
> ```
> scala> val inputDF = spark.read.format("parquet").load(inputDataPath).show()
> +--------+---------+----------------+-------------+-------------+
> |     wbn|    cs_ss|     action_date|           ad|   ad_updated|
> +--------+---------+----------------+-------------+-------------+
> |12345678|InTransit|1596716921000601|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000602|2020-08-06-12|2020-08-06-12|
> |12345678|  Pending|1596716921000603|2020-08-06-13|2020-08-06-13|
> +--------+---------+----------------+-------------+-------------+
> ```
>  
> Values when querying _rt and _ro tables:
> ```
> scala> spark.sql("select * from gb_update_partition_1_ro").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817220935|  20200817220935_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>   
> scala> spark.sql("select * from gb_update_partition_1_rt").show()
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|     wbn|  cs_ss|     action_date|   ad_updated|           ad|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
> |     20200817221924|  20200817221924_0_1|          12345678|         2020-08-06-12|4dddb6e8-87c4-4bd...|12345678|Pending|1596716921000603|2020-08-06-13|2020-08-06-12|
> +-------------------+--------------------+------------------+----------------------+--------------------+--------+-------+----------------+-------------+-------------+
>  ```
>  
> We can see that the record displays the most current version of the data, 
> except that the partition values come from the older version.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1373) Add Support for OpenJ9 JVM

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1373.


> Add Support for OpenJ9 JVM
> --
>
> Key: HUDI-1373
> URL: https://issues.apache.org/jira/browse/HUDI-1373
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Guy Khazma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently the OpenJ9 JVM is not supported, because the code that estimates 
> object sizes inside the JVM does not work on OpenJ9.
> (see here - 
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java)
> This issue was also mentioned in HUDI-234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1373) Add Support for OpenJ9 JVM

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1373.
--
Resolution: Fixed

> Add Support for OpenJ9 JVM
> --
>
> Key: HUDI-1373
> URL: https://issues.apache.org/jira/browse/HUDI-1373
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Guy Khazma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently the OpenJ9 JVM is not supported, because the code that estimates 
> object sizes inside the JVM does not work on OpenJ9.
> (see here - 
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java)
> This issue was also mentioned in HUDI-234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1373) Add Support for OpenJ9 JVM

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1373:
-
Status: Open  (was: New)

> Add Support for OpenJ9 JVM
> --
>
> Key: HUDI-1373
> URL: https://issues.apache.org/jira/browse/HUDI-1373
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Guy Khazma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently the OpenJ9 JVM is not supported, because the code that estimates 
> object sizes inside the JVM does not work on OpenJ9.
> (see here - 
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java)
> This issue was also mentioned in HUDI-234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1373) Add Support for OpenJ9 JVM

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1373:
-
Fix Version/s: 0.7.0

> Add Support for OpenJ9 JVM
> --
>
> Key: HUDI-1373
> URL: https://issues.apache.org/jira/browse/HUDI-1373
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Guy Khazma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently the OpenJ9 JVM is not supported, because the code that estimates 
> object sizes inside the JVM does not work on OpenJ9.
> (see here - 
> https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java)
> This issue was also mentioned in HUDI-234



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1396) Bootstrap job via Hudi datasource hangs at the end of the spark-submit job

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1396:
-
Status: Open  (was: New)

> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
> --
>
> Key: HUDI-1396
> URL: https://issues.apache.org/jira/browse/HUDI-1396
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
>  
> This issue is similar to https://issues.apache.org/jira/browse/HUDI-1230. 
> Basically, {{HoodieWriteClient}} at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  will not be closed and as a result, the corresponding timeline server will 
> not stop at the end. Therefore the job hangs and never exits.
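The general fix pattern, sketched against the 0.7.0 Java client API (`SparkRDDWriteClient`); the actual change is in HoodieSparkSqlWriter, but the essence is simply closing the client, which also shuts down the embedded timeline server:

```
import org.apache.hudi.client.SparkRDDWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieRecordPayload;
import org.apache.spark.api.java.JavaRDD;

public class CloseClientSketch {
  // Run one write and always close the client afterwards; close() also stops
  // the embedded timeline service, so the spark-submit process can exit.
  public static <T extends HoodieRecordPayload> void writeAndClose(
      SparkRDDWriteClient<T> client, JavaRDD<HoodieRecord<T>> records) {
    try {
      String instantTime = client.startCommit();
      JavaRDD<WriteStatus> statuses = client.bulkInsert(records, instantTime);
      statuses.count(); // materialize the write
    } finally {
      client.close();
    }
  }
}
```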



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1396) Bootstrap job via Hudi datasource hangs at the end of the spark-submit job

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1396:
-
Fix Version/s: 0.7.0

> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
> --
>
> Key: HUDI-1396
> URL: https://issues.apache.org/jira/browse/HUDI-1396
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
>  
> This issue is similar to https://issues.apache.org/jira/browse/HUDI-1230. 
> Basically, {{HoodieWriteClient}} at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  will not be closed and as a result, the corresponding timeline server will 
> not stop at the end. Therefore the job hangs and never exits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1396) Bootstrap job via Hudi datasource hangs at the end of the spark-submit job

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1396.
--
Resolution: Fixed

> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
> --
>
> Key: HUDI-1396
> URL: https://issues.apache.org/jira/browse/HUDI-1396
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
>  
> This issue is similar to https://issues.apache.org/jira/browse/HUDI-1230. 
> Basically, {{HoodieWriteClient}} at 
> [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255]
>  will not be closed and as a result, the corresponding timeline server will 
> not stop at the end. Therefore the job hangs and never exits.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1364) Introduce HoodieJavaEngineContext to hudi-java-client

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1364:
-
Fix Version/s: 0.7.0

> Introduce HoodieJavaEngineContext to hudi-java-client
> -
>
> Key: HUDI-1364
> URL: https://issues.apache.org/jira/browse/HUDI-1364
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1393) Add compaction action in archive command

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1393:
-
Fix Version/s: 0.7.0

> Add compaction action in archive command
> 
>
> Key: HUDI-1393
> URL: https://issues.apache.org/jira/browse/HUDI-1393
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: CLI
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> The `show archived commits` command cannot recognize the compaction action; 
> add that case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1358) Memory Leak in HoodieLogFormatWriter

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1358:
-
Fix Version/s: 0.7.0

> Memory Leak in HoodieLogFormatWriter
> 
>
> Key: HUDI-1358
> URL: https://issues.apache.org/jira/browse/HUDI-1358
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> https://github.com/apache/hudi/issues/2215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1358) Memory Leak in HoodieLogFormatWriter

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1358.

Resolution: Fixed

> Memory Leak in HoodieLogFormatWriter
> 
>
> Key: HUDI-1358
> URL: https://issues.apache.org/jira/browse/HUDI-1358
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> https://github.com/apache/hudi/issues/2215



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1338) Adding Delete support to test suite framework

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1338:
-
Fix Version/s: 0.7.0

> Adding Delete support to test suite framework
> -
>
> Key: HUDI-1338
> URL: https://issues.apache.org/jira/browse/HUDI-1338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Add delete support to test suite framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-892) RealtimeParquetInputFormat should skip adding projection columns if there are no log files

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-892:

Fix Version/s: 0.7.0

> RealtimeParquetInputFormat should skip adding projection columns if there are 
> no log files
> --
>
> Key: HUDI-892
> URL: https://issues.apache.org/jira/browse/HUDI-892
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration, Performance
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-912) Refactor and relocate KeyGenerator to support more engines

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-912:

Fix Version/s: 0.7.0

> Refactor and relocate KeyGenerator to support more engines
> --
>
> Key: HUDI-912
> URL: https://issues.apache.org/jira/browse/HUDI-912
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> Currently, `KeyGenerator`s are implemented in the `hudi-spark` module, so they 
> can only be used by the Spark engine.
> Since `KeyGenerator` is a core tool for Hudi, it should be 
> engine-independent.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1351:
-
Status: Closed  (was: Patch Available)

> Improvements required to hudi-test-suite for scalable and repeated testing
> --
>
> Key: HUDI-1351
> URL: https://issues.apache.org/jira/browse/HUDI-1351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There are some shortcomings of the hudi-test-suite which would be good to fix:
> 1. When doing repeated testing with the same DAG, the input and output 
> directories need to be manually cleaned. This is cumbersome for repeated 
> testing.
> 2. When running a long test, the input data generated by older DAG nodes is 
> not deleted and leads to high file count on the HDFS cluster. The older files 
> can be deleted once the data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is 
> less than spark's default parallelism, a number of empty avro files are 
> created. This also leads to scalability issues on the HDFS cluster. Creating 
> a large number of small AVRO files is slower and less scalable than creating 
> a single AVRO file.
> 4. When generating data to be inserted, we cannot control which partition the 
> data will be generated for or add a new partition. Hence we need a 
> start_offset parameter to control the partition offset.
> 5. BUG: Does not generate the correct number of insert partitions, because the 
> partition number is chosen as a random long. 
> 6. BUG: Integer division used within Math.ceil in a couple of places is not 
> correct and leads to a 0 value. Math.ceil(5/10) == 0 and not 1 (as intended) 
> because 5 and 10 are integers.
>  
> 1. When generating input data, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1338) Adding Delete support to test suite framework

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1338.

Resolution: Fixed

> Adding Delete support to test suite framework
> -
>
> Key: HUDI-1338
> URL: https://issues.apache.org/jira/browse/HUDI-1338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Add delete support to test suite framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1338) Adding Delete support to test suite framework

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1338:
-
Status: Open  (was: New)

> Adding Delete support to test suite framework
> -
>
> Key: HUDI-1338
> URL: https://issues.apache.org/jira/browse/HUDI-1338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
>
> Add delete support to test suite framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1326) Publishing metrics from hudi-test-suite after each job

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-1326.

Resolution: Fixed

> Publishing metrics from hudi-test-suite after each job
> --
>
> Key: HUDI-1326
> URL: https://issues.apache.org/jira/browse/HUDI-1326
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> HUDI metrics are published at two stages:
> 1. When an action completes (e.g. HoodieMetrics.updateCommitMetrics)
> 2. When the JVM shuts down (Metrics.java addShutdownHook)
> hudi-test-suite includes multiple jobs, each of which is an action on the 
> table. All these jobs run in a single JVM. Hence, currently some metrics 
> are not published until the entire hudi-test-suite run is over.
> Enhancement:
> 1. Allow metrics to be published after each job
> 2. Flush metrics so redundant metrics (from the last job) are not published 
> again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1351:
-
Status: Patch Available  (was: In Progress)

> Improvements required to hudi-test-suite for scalable and repeated testing
> --
>
> Key: HUDI-1351
> URL: https://issues.apache.org/jira/browse/HUDI-1351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>
> There are some shortcomings of the hudi-test-suite which would be good to fix:
> 1. When doing repeated testing with the same DAG, the input and output 
> directories need to be manually cleaned. This is cumbersome for repeated 
> testing.
> 2. When running a long test, the input data generated by older DAG nodes is 
> not deleted and leads to high file count on the HDFS cluster. The older files 
> can be deleted once the data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is 
> less than spark's default parallelism, a number of empty avro files are 
> created. This also leads to scalability issues on the HDFS cluster. Creating 
> a large number of small AVRO files is slower and less scalable than creating 
> a single AVRO file.
> 4. When generating data to be inserted, we cannot control which partition the 
> data will be generated for or add a new partition. Hence we need a 
> start_offset parameter to control the partition offset.
> 5. BUG: Does not generate the correct number of insert partitions, because the 
> partition number is chosen as a random long. 
> 6. BUG: Integer division used within Math.ceil in a couple of places is not 
> correct and leads to a 0 value. Math.ceil(5/10) == 0 and not 1 (as intended) 
> because 5 and 10 are integers.
>  
> 1. When generating input data, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1351) Improvements required to hudi-test-suite for scalable and repeated testing

2021-01-20 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1351:
-
Fix Version/s: 0.7.0

> Improvements required to hudi-test-suite for scalable and repeated testing
> --
>
> Key: HUDI-1351
> URL: https://issues.apache.org/jira/browse/HUDI-1351
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.7.0
>
>
> There are some shortcomings of the hudi-test-suite which would be good to fix:
> 1. When doing repeated testing with the same DAG, the input and output 
> directories need to be manually cleaned. This is cumbersome for repeated 
> testing.
> 2. When running a long test, the input data generated by older DAG nodes is 
> not deleted and leads to high file count on the HDFS cluster. The older files 
> can be deleted once the data has been ingested.
> 3. When generating input data, if the number of insert/update partitions is 
> less than spark's default parallelism, a number of empty avro files are 
> created. This also leads to scalability issues on the HDFS cluster. Creating 
> a large number of small AVRO files is slower and less scalable than creating 
> a single AVRO file.
> 4. When generating data to be inserted, we cannot control which partition the 
> data will be generated for or add a new partition. Hence we need a 
> start_offset parameter to control the partition offset.
> 5. BUG: Does not generate the correct number of insert partitions, because the 
> partition number is chosen as a random long. 
> 6. BUG: Integer division used within Math.ceil in a couple of places is not 
> correct and leads to a 0 value. Math.ceil(5/10) == 0 and not 1 (as intended) 
> because 5 and 10 are integers.
>  
> 1. When generating input data, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


  1   2   3   >