[jira] [Closed] (HUDI-1775) Add option for compaction parallelism

2021-04-08 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-1775.
--
Resolution: Done

6786581c4842e47e1a8a8e942f54003dc151c7c6

> Add option for compaction parallelism
> -------------------------------------
>
> Key: HUDI-1775
> URL: https://issues.apache.org/jira/browse/HUDI-1775
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1775) Add option for compaction parallelism

2021-04-08 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-1775:
---
Fix Version/s: 0.9.0

> Add option for compaction parallelism
> -------------------------------------
>
> Key: HUDI-1775
> URL: https://issues.apache.org/jira/browse/HUDI-1775
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1775) Add option for compaction parallelism

2021-04-08 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned HUDI-1775:
--

Assignee: Danny Chen

> Add option for compaction parallelism
> -------------------------------------
>
> Key: HUDI-1775
> URL: https://issues.apache.org/jira/browse/HUDI-1775
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Flink Integration
>Reporter: Danny Chen
>Assignee: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1775] Add option for compaction parallelism (#2785)

2021-04-08 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6786581  [HUDI-1775] Add option for compaction parallelism (#2785)
6786581 is described below

commit 6786581c4842e47e1a8a8e942f54003dc151c7c6
Author: Danny Chan 
AuthorDate: Fri Apr 9 13:46:19 2021 +0800

[HUDI-1775] Add option for compaction parallelism (#2785)
---
 .../apache/hudi/configuration/FlinkOptions.java| 25 +++--
 .../sink/partitioner/BucketAssignFunction.java | 32 --
 .../org/apache/hudi/table/HoodieTableSink.java |  1 +
 .../java/org/apache/hudi/util/StreamerUtil.java|  4 +++
 .../org/apache/hudi/sink/TestWriteCopyOnWrite.java |  4 +++
 .../apache/hudi/table/HoodieDataSourceITCase.java  |  3 +-
 6 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java b/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
index b66be2b..b120714 100644
--- a/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
+++ b/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
@@ -72,6 +72,15 @@ public class FlinkOptions {
           + " column value is null/empty string");
 
   // ------------------------------------------------------------------------
+  //  Index Options
+  // ------------------------------------------------------------------------
+  public static final ConfigOption<Boolean> INDEX_BOOTSTRAP_ENABLED = ConfigOptions
+      .key("index.bootstrap.enabled")
+      .booleanType()
+      .defaultValue(false)
+      .withDescription("Whether to bootstrap the index state from existing hoodie table, default false");
+
+  // ------------------------------------------------------------------------
   //  Read Options
   // ------------------------------------------------------------------------
   public static final ConfigOption<Integer> READ_TASKS = ConfigOptions
@@ -255,8 +264,14 @@ private FlinkOptions() {
   public static final ConfigOption<Double> WRITE_BATCH_SIZE = ConfigOptions
       .key("write.batch.size.MB")
       .doubleType()
-      .defaultValue(2D) // 2MB
-      .withDescription("Batch buffer size in MB to flush data into the underneath filesystem");
+      .defaultValue(64D) // 64MB
+      .withDescription("Batch buffer size in MB to flush data into the underneath filesystem, default 64MB");
+
+  public static final ConfigOption<Integer> WRITE_LOG_BLOCK_SIZE = ConfigOptions
+      .key("write.log_block.size.MB")
+      .intType()
+      .defaultValue(128)
+      .withDescription("Max log block size in MB for log file, default 128MB");
 
   // ------------------------------------------------------------------------
   //  Compaction Options
@@ -268,6 +283,12 @@ private FlinkOptions() {
       .defaultValue(true) // default true for MOR write
       .withDescription("Async Compaction, enabled by default for MOR");
 
+  public static final ConfigOption<Integer> COMPACTION_TASKS = ConfigOptions
+      .key("compaction.tasks")
+      .intType()
+      .defaultValue(10) // default WRITE_TASKS * COMPACTION_DELTA_COMMITS * 0.5 (assumes two commits generate one bucket)
+      .withDescription("Parallelism of tasks that do actual compaction, default is 10");
+
   public static final String NUM_COMMITS = "num_commits";
   public static final String TIME_ELAPSED = "time_elapsed";
   public static final String NUM_AND_TIME = "num_and_time";
diff --git a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java
index 9c23259..7e017cc 100644
--- a/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java
+++ b/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java
@@ -31,7 +31,6 @@ import org.apache.hudi.common.util.ParquetUtils;
 import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.configuration.FlinkOptions;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.index.HoodieIndexUtils;
 import org.apache.hudi.table.HoodieTable;
 import org.apache.hudi.table.action.commit.BucketInfo;
@@ -103,6 +102,8 @@ public class BucketAssignFunction<K, I, O extends HoodieRecord<?>>
 
   private final boolean isChangingRecords;
 
+  private final boolean bootstrapIndex;
+
   /**
    * State to book-keep which partition is loaded into the index state {@code indexState}.
    */
@@ -112,6 +113,7 @@ public class BucketAssignFunction<K, I, O extends HoodieRecord<?>>
     this.conf = conf;
     this.isChangingRecords = WriteOperationType.isChangingRecords(
         WriteOperationType.fromValue(conf.getString(FlinkOptions.OPERATION)));
+    this.bootstrapIndex = conf.getBoolean(FlinkOptions.INDEX_BOOTSTRAP_ENABLED);
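
For anyone trying these knobs out, here is a minimal sketch of setting the new options programmatically (values are illustrative; assuming the then-current defaults of write.tasks = 4 and compaction.delta_commits = 5, the compaction.tasks default works out to 4 x 5 x 0.5 = 10, matching the comment in the diff):

```java
import org.apache.flink.configuration.Configuration;

import org.apache.hudi.configuration.FlinkOptions;

public class CompactionOptionsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Option constants come from the FlinkOptions diff above; values are illustrative.
    conf.setBoolean(FlinkOptions.INDEX_BOOTSTRAP_ENABLED, true); // bootstrap index state from an existing table
    conf.setDouble(FlinkOptions.WRITE_BATCH_SIZE, 64D);          // flush buffer size in MB
    conf.setInteger(FlinkOptions.COMPACTION_TASKS, 20);          // parallelism of the compaction operator
  }
}
```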
   

[GitHub] [hudi] yanghua merged pull request #2785: [HUDI-1775] Add option for compaction parallelism

2021-04-08 Thread GitBox


yanghua merged pull request #2785:
URL: https://github.com/apache/hudi/pull/2785


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2787: [SUPPORT] Error upserting bucketType UPDATE for partition

2021-04-08 Thread GitBox


nsivabalan commented on issue #2787:
URL: https://github.com/apache/hudi/issues/2787#issuecomment-816416482


   Can you give us the configs you used? Is it failing at the very beginning, or after a few batches of writes?
   This looks like the root cause:
   ```
   Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read 
value at 0 in block -1 in file 
s3://30-ze-datalake-raw/freshdesk/tickets/3416b1b1-dc82-4256-8d4c-a2a62146e41c-0_0-299-7594_20210408071759.parquet
   ```
   Also, can you give us the schema of the dataset in use?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1743:
--
Status: Patch Available  (was: In Progress)

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available
>
> The current SQLQuery-based transformer is limited in functionality: you can't
> pass multiple Spark SQL statements separated by semicolons, which is
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes as input a Spark
> SQL file containing multiple Spark SQL statements and applies the
> transformation to the delta streamer payload.
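
For intuition, a minimal sketch of what such a file-based transformer could look like, assuming semicolon-separated statements where the last statement produces the output (class and method names here are illustrative, not the merged implementation):

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch: run every semicolon-separated statement from a SQL
// file against a temp view of the incoming payload, return the last result.
public class SqlFileTransformerSketch {
  public Dataset<Row> apply(SparkSession spark, Dataset<Row> rowDataset, String sqlFilePath) throws Exception {
    rowDataset.createOrReplaceTempView("source");  // expose the payload to SQL
    String sqlText = new String(Files.readAllBytes(Paths.get(sqlFilePath)));
    Dataset<Row> transformed = rowDataset;
    for (String stmt : sqlText.split(";")) {
      if (!stmt.trim().isEmpty()) {
        transformed = spark.sql(stmt);  // the last statement's result becomes the output
      }
    }
    return transformed;
  }
}
{code}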



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1743) Add support for Spark SQL File based transformer for deltastreamer

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1743:
--
Status: In Progress  (was: Open)

> Add support for Spark SQL File based transformer for deltastreamer
> --
>
> Key: HUDI-1743
> URL: https://issues.apache.org/jira/browse/HUDI-1743
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: features, pull-request-available
>
> The current SQLQuery-based transformer is limited in functionality: you can't
> pass multiple Spark SQL statements separated by semicolons, which is
> necessary if your transformation is complex.
>  
> The ask is to add a new SQLFileBasedTransformer which takes as input a Spark
> SQL file containing multiple Spark SQL statements and applies the
> transformation to the delta streamer payload.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-08 Thread Vinoth Govindarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317655#comment-17317655
 ] 

Vinoth Govindarajan commented on HUDI-1762:
---

PR merged.

> Hive Sync is not working with Hive Style Partitioning
> -----------------------------------------------------
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable the hive
> sync, it doesn't work, because the sync assumes the partition values are
> separated by slashes.
>  
> When the hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1762) Hive Sync is not working with Hive Style Partitioning

2021-04-08 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-1762:
--
Status: Closed  (was: Patch Available)

> Hive Sync is not working with Hive Style Partitioning
> -----------------------------------------------------
>
> Key: HUDI-1762
> URL: https://issues.apache.org/jira/browse/HUDI-1762
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: hive, pull-request-available
>
> When you create a Hudi table with hive style partitioning and enable the hive
> sync, it doesn't work, because the sync assumes the partition values are
> separated by slashes.
>  
> When the hive style partitioning is enabled for the target table like this:
> {code:java}
> hoodie.datasource.write.partitionpath.field=datestr
> hoodie.datasource.write.hive_style_partitioning=true
> {code}
> This is the error it throws:
> {code:java}
> 21/04/01 23:10:33 ERROR deltastreamer.HoodieDeltaStreamer: Got error running 
> delta sync once. Shutting down
> org.apache.hudi.exception.HoodieException: Got runtime exception when hive 
> syncing delta_streamer_test
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:122)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:560)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:475)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:282)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:170)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:168)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:470)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:690)
> Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync 
> partitions for table fact_scheduled_trip__1pc_trip_uuid
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:229)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:166)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:108)
>   ... 12 more
> Caused by: java.lang.IllegalArgumentException: Partition path 
> datestr=2021-03-28 is not in the form yyyy/mm/dd 
>   at 
> org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor.extractPartitionValuesInPath(SlashEncodedDayPartitionValueExtractor.java:55)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.getPartitionEvents(HoodieHiveClient.java:220)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:221)
>   ... 14 more
> {code}
> To fix this issue we need to create a new partition extractor class and 
> assign that class name as the hive sync partition extractor.
> After you define the new partition extractor class, you can configure it like 
> this:
> {code:java}
> hoodie.datasource.hive_sync.enable=true
> hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.HiveStylePartitionValueExtractor
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch master updated: [HUDI-1762] Added HiveStylePartitionExtractor to support Hive style partitions (#2769)

2021-04-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 08e82c4  [HUDI-1762] Added HiveStylePartitionExtractor to support Hive 
style partitions (#2769)
08e82c4 is described below

commit 08e82c469c456fbafc66c8e232dcd070f05fadee
Author: Vinoth Govindarajan 
AuthorDate: Thu Apr 8 22:00:11 2021 -0700

[HUDI-1762] Added HiveStylePartitionExtractor to support Hive style 
partitions (#2769)
---
 .../hive/HiveStylePartitionValueExtractor.java}| 32 --
 .../hudi/hive/TestPartitionValueExtractor.java | 13 -
 2 files changed, 30 insertions(+), 15 deletions(-)

diff --git a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveStylePartitionValueExtractor.java
similarity index 50%
copy from hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java
copy to hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveStylePartitionValueExtractor.java
index a248e49..4bb20f5 100644
--- a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java
+++ b/hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveStylePartitionValueExtractor.java
@@ -18,21 +18,25 @@
 
 package org.apache.hudi.hive;
 
-import org.junit.jupiter.api.Test;
-import java.util.ArrayList;
+import java.util.Collections;
 import java.util.List;
 
-import static org.junit.jupiter.api.Assertions.assertEquals;
-import static org.junit.jupiter.api.Assertions.assertThrows;
+/**
+ * Extractor for Hive Style Partitioned tables, when the partition folders are key value pairs.
+ *
+ * This implementation extracts the partition value of yyyy-mm-dd from the path of type datestr=yyyy-mm-dd.
+ */
+public class HiveStylePartitionValueExtractor implements PartitionValueExtractor {
+  private static final long serialVersionUID = 1L;
 
-public class TestPartitionValueExtractor {
-  @Test
-  public void testHourPartition() {
-    SlashEncodedHourPartitionValueExtractor hourPartition = new SlashEncodedHourPartitionValueExtractor();
-    List<String> list = new ArrayList<>();
-    list.add("2020-12-20-01");
-    assertEquals(hourPartition.extractPartitionValuesInPath("2020/12/20/01"), list);
-    assertThrows(IllegalArgumentException.class, () -> hourPartition.extractPartitionValuesInPath("2020/12/20"));
-    assertEquals(hourPartition.extractPartitionValuesInPath("update_time=2020/12/20/01"), list);
+  @Override
+  public List<String> extractPartitionValuesInPath(String partitionPath) {
+    // partition path is expected to be in this format partition_key=partition_value.
+    String[] splits = partitionPath.split("=");
+    if (splits.length != 2) {
+      throw new IllegalArgumentException(
+          "Partition path " + partitionPath + " is not in the form partition_key=partition_value.");
+    }
+    return Collections.singletonList(splits[1]);
   }
-}
\ No newline at end of file
+}
diff --git a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java
index a248e49..ba5a544 100644
--- a/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java
+++ b/hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestPartitionValueExtractor.java
@@ -35,4 +35,15 @@ public class TestPartitionValueExtractor {
     assertThrows(IllegalArgumentException.class, () -> hourPartition.extractPartitionValuesInPath("2020/12/20"));
     assertEquals(hourPartition.extractPartitionValuesInPath("update_time=2020/12/20/01"), list);
   }
-}
\ No newline at end of file
+
+  @Test
+  public void testHiveStylePartition() {
+    HiveStylePartitionValueExtractor hiveStylePartition = new HiveStylePartitionValueExtractor();
+    List<String> list = new ArrayList<>();
+    list.add("2021-04-02");
+    assertEquals(hiveStylePartition.extractPartitionValuesInPath("datestr=2021-04-02"), list);
+    assertThrows(
+        IllegalArgumentException.class,
+        () -> hiveStylePartition.extractPartitionValuesInPath("2021/04/02"));
+  }
+}
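
For readers wiring this up, a small usage sketch distilled from the test above (the wrapper class is illustrative; the extractor comes from this commit):

```java
import java.util.List;

import org.apache.hudi.hive.HiveStylePartitionValueExtractor;

public class ExtractorUsage {
  public static void main(String[] args) {
    HiveStylePartitionValueExtractor extractor = new HiveStylePartitionValueExtractor();
    // A hive-style path yields just the value part:
    List<String> values = extractor.extractPartitionValuesInPath("datestr=2021-04-02");
    System.out.println(values); // [2021-04-02]
    // A slash-encoded path like "2021/04/02" throws IllegalArgumentException.
  }
}
```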


[GitHub] [hudi] nsivabalan merged pull request #2769: [HUDI-1762] Added HiveStylePartitionExtractor to support Hive style partitions

2021-04-08 Thread GitBox


nsivabalan merged pull request #2769:
URL: https://github.com/apache/hudi/pull/2769


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2769: [HUDI-1762] Added HiveStylePartitionExtractor to support Hive style partitions

2021-04-08 Thread GitBox


nsivabalan commented on a change in pull request #2769:
URL: https://github.com/apache/hudi/pull/2769#discussion_r610346604



##
File path: hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveStylePartitionValueExtractor.java
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hive;
+
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Extractor for Hive Style Partitioned tables, when the partition folders are key value pairs.
+ *
+ * This implementation extracts the partition value of yyyy-mm-dd from the path of type datestr=yyyy-mm-dd.
+ */
+public class HiveStylePartitionValueExtractor implements PartitionValueExtractor {
+  private static final long serialVersionUID = 1L;
+
+  @Override
+  public List<String> extractPartitionValuesInPath(String partitionPath) {
+    // partition path is expected to be in this format partition_key=partition_value.
+    String[] splits = partitionPath.split("=");

Review comment:
   awesome, sounds good.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vingov commented on a change in pull request #2769: [HUDI-1762] Added HiveStylePartitionExtractor to support Hive style partitions

2021-04-08 Thread GitBox


vingov commented on a change in pull request #2769:
URL: https://github.com/apache/hudi/pull/2769#discussion_r610344542



##
File path: hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveStylePartitionValueExtractor.java
##
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hive;
+
+import java.util.Collections;
+import java.util.List;
+
+/**
+ * Extractor for Hive Style Partitioned tables, when the partition folders are key value pairs.
+ *
+ * This implementation extracts the partition value of yyyy-mm-dd from the path of type datestr=yyyy-mm-dd.
+ */
+public class HiveStylePartitionValueExtractor implements PartitionValueExtractor {
+  private static final long serialVersionUID = 1L;
+
+  @Override
+  public List<String> extractPartitionValuesInPath(String partitionPath) {
+    // partition path is expected to be in this format partition_key=partition_value.
+    String[] splits = partitionPath.split("=");

Review comment:
   @nsivabalan - Good question!
   I have recreated the same scenario with `=` on both partition key and 
partition value, but both Hive and Spark encode the `=` to `%3D` on both the 
partition keys and values.
   
   Look at these folders:
   `/tmp.db/ds_test/date%3Dstr=2021%3D04%3D09`
   `/tmp.db/ds_test/date%3Dstr=2021-04-08`
   
   Based on this test, we can see that the code won't break; let me know if I have missed anything.
   
   
[Here](https://github.com/apache/hudi/blob/3a926aacf6552fc06005db4a7880a233db904330/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/keygen/KeyGenUtils.java#L143)
 is the Hudi code to create hive style partitions.
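
   For a quick sanity check of that claim, splitting one of the encoded paths above on `=` (an illustrative snippet, not code from the PR):

```java
public class EncodedSplitCheck {
  public static void main(String[] args) {
    // '=' inside the key or value arrives encoded as %3D, so the single raw '=' is the separator.
    String partitionPath = "date%3Dstr=2021-04-08";
    String[] splits = partitionPath.split("=");
    System.out.println(splits.length); // 2
    System.out.println(splits[1]);     // 2021-04-08
  }
}
```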
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2021-04-08 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new a21c5a4  Travis CI build asf-site
a21c5a4 is described below

commit a21c5a40c396eec28a819698c8a9efd165fb6ada
Author: CI 
AuthorDate: Fri Apr 9 04:24:48 2021 +

Travis CI build asf-site
---
 content/releases.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/releases.html b/content/releases.html
index 165a482..0386562 100644
--- a/content/releases.html
+++ b/content/releases.html
@@ -248,7 +248,7 @@
   
 
 
-<a href="https://github.com/apache/hudi/releases/tag/release-0.8.0">Release 0.8.0</a> (<a href="/docs/0.8.0-quick-start-guide.html">docs</a>)
+<a href="https://github.com/apache/hudi/releases/tag/release-0.8.0">Release 0.8.0</a> (<a href="/docs/0.8.0-spark_quick-start-guide.html">docs</a>)
 
 Download Information
 


[GitHub] [hudi] yanghua commented on pull request #2785: [HUDI-1775] Add option for compaction parallelism

2021-04-08 Thread GitBox


yanghua commented on pull request #2785:
URL: https://github.com/apache/hudi/pull/2785#issuecomment-816369965


   Will merge after Travis turns green.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2740: [HUDI-1055] Remove hardcoded parquet in tests

2021-04-08 Thread GitBox


yanghua commented on a change in pull request #2740:
URL: https://github.com/apache/hudi/pull/2740#discussion_r610313295



##
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/SparkHelpers.scala
##
@@ -40,7 +40,7 @@ import scala.collection.mutable._
 object SparkHelpers {
   @throws[Exception]
  def skipKeysAndWriteNewFile(instantTime: String, fs: FileSystem, sourceFile: Path, destinationFile: Path, keysToSkip: Set[String]) {
-    val sourceRecords = ParquetUtils.readAvroRecords(fs.getConf, sourceFile)
+    val sourceRecords = new ParquetUtils().readAvroRecords(fs.getConf, sourceFile)

Review comment:
   +1 it would be better to use the uniform style.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on pull request #2786: [HUDI-1782] Add more options for HUDI Flink

2021-04-08 Thread GitBox


danny0405 commented on pull request #2786:
URL: https://github.com/apache/hudi/pull/2786#issuecomment-816368238


   > Should we change the 0.8.0 doc as well? It will be merged soon. #2792
   
   I think it is not necessary? People will always see the master document.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1782) Add more options for HUDI Flink

2021-04-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1782:
-
Labels: pull-request-available  (was: )

> Add more options for HUDI Flink
> -------------------------------
>
> Key: HUDI-1782
> URL: https://issues.apache.org/jira/browse/HUDI-1782
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Flink Integration
>Reporter: Danny Chen
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on pull request #2786: [HUDI-1782] Add more options for HUDI Flink

2021-04-08 Thread GitBox


danny0405 commented on pull request #2786:
URL: https://github.com/apache/hudi/pull/2786#issuecomment-816368080


   > > > @danny0405 Can you please correct the title of the PR?
   > > 
   > > 
   > > What title should i use ?
   > 
   > file a jira ticket and add the jira id?
   
   Added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1782) Add more options for HUDI Flink

2021-04-08 Thread Danny Chen (Jira)
Danny Chen created HUDI-1782:


 Summary: Add more options for HUDI Flink
 Key: HUDI-1782
 URL: https://issues.apache.org/jira/browse/HUDI-1782
 Project: Apache Hudi
  Issue Type: Task
  Components: Flink Integration
Reporter: Danny Chen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1781) Flink streaming reader throws ClassCastException when reading from empty table path

2021-04-08 Thread Danny Chen (Jira)
Danny Chen created HUDI-1781:


 Summary: Flink streaming reader throws ClassCastException when 
reading from empty table path
 Key: HUDI-1781
 URL: https://issues.apache.org/jira/browse/HUDI-1781
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Reporter: Danny Chen
 Fix For: 0.9.0


When there is no data under the table path, the expected input format is 
{{CollectionInputFormat}} with an empty dataset; we should not force-cast it into 
{{MergeOnReadInputFormat}}.
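
For illustration, a defensive check along these lines would avoid the blind cast (a sketch of the fix described above, not necessarily the committed patch; the MergeOnReadInputFormat class lives in the hudi-flink module):

{code:java}
import org.apache.flink.api.common.io.InputFormat;
import org.apache.hudi.table.format.mor.MergeOnReadInputFormat;

public class InputFormatGuard {
  // Returns true only when the source can really be read as a MOR stream;
  // an empty table path yields a CollectionInputFormat over an empty dataset,
  // which must not be cast to MergeOnReadInputFormat.
  public static boolean isMergeOnRead(InputFormat<?, ?> inputFormat) {
    return inputFormat instanceof MergeOnReadInputFormat;
  }
}
{code}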



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: Fix 0.8.0 release doc link (#2795)

2021-04-08 Thread garyli
This is an automated email from the ASF dual-hosted git repository.

garyli pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 0ae8969  Fix 0.8.0 release doc link (#2795)
0ae8969 is described below

commit 0ae8969ee7d52bee9f065f154ec196979f08cddf
Author: Gary Li 
AuthorDate: Thu Apr 8 19:57:43 2021 -0700

Fix 0.8.0 release doc link (#2795)
---
 docs/_pages/releases.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_pages/releases.md b/docs/_pages/releases.md
index c507fc8..1356e56 100644
--- a/docs/_pages/releases.md
+++ b/docs/_pages/releases.md
@@ -5,7 +5,7 @@ layout: releases
 toc: true
 last_modified_at: 2020-05-28T08:40:00-07:00
 ---
-# [Release 0.8.0](https://github.com/apache/hudi/releases/tag/release-0.8.0) 
([docs](/docs/0.8.0-quick-start-guide.html))
+# [Release 0.8.0](https://github.com/apache/hudi/releases/tag/release-0.8.0) 
([docs](/docs/0.8.0-spark_quick-start-guide.html))
 
 ## Download Information
 * Source Release : [Apache Hudi 0.8.0 Source 
Release](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.8.0/hudi-0.8.0.src.tgz.sha512))


[GitHub] [hudi] garyli1019 merged pull request #2795: [DOCS]Fix 0.8.0 release doc link

2021-04-08 Thread GitBox


garyli1019 merged pull request #2795:
URL: https://github.com/apache/hudi/pull/2795


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] li36909 commented on issue #2544: [SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

2021-04-08 Thread GitBox


li36909 commented on issue #2544:
URL: https://github.com/apache/hudi/issues/2544#issuecomment-816365255


   @cdmikechen thank you for your explanation. I use hudi 0.7 + spark 2.4.5 + hive 3.1, and didn't test with hive 2.*. If possible, please fix this issue for hive 3 as well. Thank you very much!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table

2021-04-08 Thread GitBox


pengzhiwei2018 commented on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-816360889


   Hi @nsivabalan @umehrot2, could you review this PR again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


zherenyu831 commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610237185



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java
##
@@ -72,9 +76,14 @@ public static HoodieArchivedMetaEntry createMetaWrapper(HoodieInstant hoodieInstant
       HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata
           .fromBytes(metaClient.getActiveTimeline().getInstantDetails(hoodieInstant).get(), HoodieReplaceCommitMetadata.class);
       archivedMetaWrapper.setHoodieReplaceCommitMetadata(ReplaceArchivalHelper.convertReplaceCommitMetadata(replaceCommitMetadata));
+    } else if (hoodieInstant.isInflight()) {
+      // inflight replacecommit files have the same meta data body as HoodieCommitMetadata

Review comment:
   @satishkotha 
   Same with deltacommit.requested: it is also empty, but when using `HoodieCommitMetadata.fromBytes()`, a new empty instance will be created. It would be better to have a test case, but it should be fine.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #2786: Add more options for HUDI Flink

2021-04-08 Thread GitBox


yanghua commented on pull request #2786:
URL: https://github.com/apache/hudi/pull/2786#issuecomment-816355918


   > > @danny0405 Can you please correct the title of the PR?
   > 
   > What title should i use ?
   
   file a jira ticket and add the jira id?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 opened a new pull request #2795: [DOCS]Fix 0.8.0 release doc link

2021-04-08 Thread GitBox


garyli1019 opened a new pull request #2795:
URL: https://github.com/apache/hudi/pull/2795


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2785: [HUDI-1775] Add option for compaction parallelism

2021-04-08 Thread GitBox


yanghua commented on a change in pull request #2785:
URL: https://github.com/apache/hudi/pull/2785#discussion_r610278459



##
File path: hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java
##
@@ -159,8 +164,11 @@ public void processElement(I value, Context ctx, Collector<O> out) throws Exception
     final BucketInfo bucketInfo;
     final HoodieRecordLocation location;
 
-    if (!partitionLoadState.contains(hoodieKey.getPartitionPath())) {
+    if (bootstrapIndex && !partitionLoadState.contains(hoodieKey.getPartitionPath())) {
       // If the partition records are never loaded, load the records first.
+
+      // The dataset may be huge, thus the processing would block for long,

Review comment:
   I did not question the necessity of this comment, just whether it has a more suitable position. When I see this comment, the method is already executing. And your comment contains "disabled by default", but this method does not accept a flag. Personally, I think a better position is before the above if statement, where the execution switch is.

##
File path: hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
##
@@ -255,8 +264,14 @@ private FlinkOptions() {
   public static final ConfigOption<Double> WRITE_BATCH_SIZE = ConfigOptions
       .key("write.batch.size.MB")
       .doubleType()
-      .defaultValue(2D) // 2MB
-      .withDescription("Batch buffer size in MB to flush data into the underneath filesystem");
+      .defaultValue(64D) // 64MB
+      .withDescription("Batch buffer size in MB to flush data into the underneath filesystem, default 64MB");
+
+  public static final ConfigOption<Integer> WRITE_LOG_BLOCK_SIZE = ConfigOptions
+      .key("write.log_block.size")

Review comment:
   IMO, between verbosity and following a uniform style, the latter is more important: it gives users a consistent experience.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ztcheck edited a comment on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-04-08 Thread GitBox


ztcheck edited a comment on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-816336235


   I updated my issue description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssdong commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


ssdong commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610259030



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java
##
@@ -72,9 +76,14 @@ public static HoodieArchivedMetaEntry createMetaWrapper(HoodieInstant hoodieInstant
       HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata
           .fromBytes(metaClient.getActiveTimeline().getInstantDetails(hoodieInstant).get(), HoodieReplaceCommitMetadata.class);
       archivedMetaWrapper.setHoodieReplaceCommitMetadata(ReplaceArchivalHelper.convertReplaceCommitMetadata(replaceCommitMetadata));
+    } else if (hoodieInstant.isInflight()) {
+      // inflight replacecommit files have the same meta data body as HoodieCommitMetadata

Review comment:
   Also, the current unit tests set meta for `requested` commit files, which doesn't strictly reflect the actual real-world behavior (the requested files are empty). I believe down the road we will have to standardize the requested file structure and investigate why it's empty. There is a Jira ticket filed for that; we have self-assigned it and will work on it next. :) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ztcheck edited a comment on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-04-08 Thread GitBox


ztcheck edited a comment on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-816335004


   @n3nash I'm not sure which jar is missing, so I added all of the jars under ${HIVE_HOME}/lib/*.jar to the classpath. I use this start command in the script `run_sync_tool.sh`, and it works:
   `HIVE_EXTERNAL_JAR=${HIVE_HOME}/lib/*.jar`
   
   `java -cp $HIVE_EXTERNAL_JAR:$HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:${HIVE_HOME}/conf org.apache.hudi.hive.HiveSyncTool "$@"`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ztcheck edited a comment on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-04-08 Thread GitBox


ztcheck edited a comment on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-816335004


   @n3nash I'm not sure which jar is missing, so i add all of the jars under 
${HIVE_HOME}/lib/*.jar to the classpath . And i use this start command in the 
script `run_sync_too.sh`,it works .
`HIVE_EXTERNAL_JAR=${HIVE_HOME}/lib/*.jar
   
   java -cp 
$HIVE_EXTERNAL_JAR:$HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:${HIVE_HOME}/conf
 org.apache.hudi.hive.HiveSyncTool "$@"`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ztcheck edited a comment on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-04-08 Thread GitBox


ztcheck edited a comment on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-816335004


   @n3nash I'm not sure which jar is missing, so i add all of the jars under 
${HIVE_HOME}/lib/*.jar to the classpath . And i use this start command in the 
script `run_sync_too.sh`,it works .
`HIVE_EXTERNAL_JAR=${HIVE_HOME}/lib/*.jar
   java -cp 
$HIVE_EXTERNAL_JAR:$HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:${HIVE_HOME}/conf
 org.apache.hudi.hive.HiveSyncTool "$@"`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ztcheck commented on issue #2680: [SUPPORT]Hive sync error by using run_sync_tool.sh

2021-04-08 Thread GitBox


ztcheck commented on issue #2680:
URL: https://github.com/apache/hudi/issues/2680#issuecomment-816335004


   @n3nash I'm not sure which jar is missing, so I added all of the jars under 
${HIVE_HOME}/lib to the classpath. And I use this start command; it works:
   `HIVE_EXTERNAL_JAR=${HIVE_HOME}/lib/*.jar`
   `java -cp $HIVE_EXTERNAL_JAR:$HUDI_HIVE_UBER_JAR:${HADOOP_HIVE_JARS}:${HADOOP_CONF_DIR}:${HIVE_HOME}/conf org.apache.hudi.hive.HiveSyncTool "$@"`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ssdong commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


ssdong commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610256593



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java
##
@@ -72,9 +76,14 @@ public static HoodieArchivedMetaEntry createMetaWrapper(HoodieInstant hoodieInst
      HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata
          .fromBytes(metaClient.getActiveTimeline().getInstantDetails(hoodieInstant).get(), HoodieReplaceCommitMetadata.class);
      archivedMetaWrapper.setHoodieReplaceCommitMetadata(ReplaceArchivalHelper.convertReplaceCommitMetadata(replaceCommitMetadata));
+    } else if (hoodieInstant.isInflight()) {
+      // inflight replacecommit files have the same meta data body as HoodieCommitMetadata

Review comment:
   Hey @satishkotha The `inflight` was _originally_ empty in the unit tests 
before I _changed_ it to have a meta body in it. I wasn't seeing the whole 
picture that `inflight` could be both empty and non-empty, nor were the original 
unit tests covering both cases. We are working on a little refactoring to have 
the tests cover both ends. :) 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


satishkotha commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610253248



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java
##
@@ -72,9 +76,14 @@ public static HoodieArchivedMetaEntry createMetaWrapper(HoodieInstant hoodieInst
      HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata
          .fromBytes(metaClient.getActiveTimeline().getInstantDetails(hoodieInstant).get(), HoodieReplaceCommitMetadata.class);
      archivedMetaWrapper.setHoodieReplaceCommitMetadata(ReplaceArchivalHelper.convertReplaceCommitMetadata(replaceCommitMetadata));
+    } else if (hoodieInstant.isInflight()) {
+      // inflight replacecommit files have the same meta data body as HoodieCommitMetadata

Review comment:
   ok, sounds good. I think most of the unit test framework is already set up. 
Do you mind adding that small test with an empty inflight file? A rough sketch 
follows.
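   A rough sketch of what that small test might look like (hedged: the 
`metaClient` setup is elided, and the signatures are assumed from the diff 
above; this is not the project's actual test code):

```scala
import org.apache.hudi.client.utils.MetadataConversionUtils
import org.apache.hudi.common.table.HoodieTableMetaClient
import org.apache.hudi.common.table.timeline.{HoodieInstant, HoodieTimeline}

// Given a timeline containing an *empty* inflight replacecommit, conversion
// should take the isInflight() branch and not throw.
def verifyEmptyInflightArchival(metaClient: HoodieTableMetaClient): Unit = {
  val inflight = new HoodieInstant(HoodieInstant.State.INFLIGHT,
    HoodieTimeline.REPLACE_COMMIT_ACTION, "20210408000000")
  val wrapper = MetadataConversionUtils.createMetaWrapper(inflight, metaClient)
  assert(wrapper != null)
}
```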




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


satishkotha commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610252511



##
File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -245,7 +245,7 @@ public final void reset() {
   bootstrapIndex = null;
 
   // Initialize with new Hoodie timeline.
-  init(metaClient, getTimeline());
+  init(metaClient, metaClient.reloadActiveTimeline());

Review comment:
   @bvaradar mentioned this is by design. Balaji, could you please help 
resolve this?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zherenyu831 commented on a change in pull request #2784: [HUDI-1740] Fix insert-overwrite API archival

2021-04-08 Thread GitBox


zherenyu831 commented on a change in pull request #2784:
URL: https://github.com/apache/hudi/pull/2784#discussion_r610237185



##
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java
##
@@ -72,9 +76,14 @@ public static HoodieArchivedMetaEntry createMetaWrapper(HoodieInstant hoodieInst
      HoodieReplaceCommitMetadata replaceCommitMetadata = HoodieReplaceCommitMetadata
          .fromBytes(metaClient.getActiveTimeline().getInstantDetails(hoodieInstant).get(), HoodieReplaceCommitMetadata.class);
      archivedMetaWrapper.setHoodieReplaceCommitMetadata(ReplaceArchivalHelper.convertReplaceCommitMetadata(replaceCommitMetadata));
+    } else if (hoodieInstant.isInflight()) {
+      // inflight replacecommit files have the same meta data body as HoodieCommitMetadata

Review comment:
   @satishkotha 
   Same as with deltacommit.requested: it is also empty, but when using 
`HoodieCommitMetadata.fromBytes()`, a new empty instance will be created. It 
would be better to have a test case, but it should be fine; a sketch follows.
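   For reference, a one-liner illustrating the behavior described above (a 
hedged sketch; the empty byte array stands in for an empty requested/inflight 
file):

```scala
import org.apache.hudi.common.model.HoodieCommitMetadata

// Deserializing an empty commit file yields a fresh, empty HoodieCommitMetadata
// rather than throwing, which is why the empty-file case should be fine.
val metadata = HoodieCommitMetadata.fromBytes(Array.empty[Byte], classOf[HoodieCommitMetadata])
```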




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] njalan edited a comment on issue #2791: [SUPPORT]Failed to enable hoodie.metadata.enable

2021-04-08 Thread GitBox


njalan edited a comment on issue #2791:
URL: https://github.com/apache/hudi/issues/2791#issuecomment-816318167


   After the job ran for half an hour I got this error. Once I removed all the 
files from the table directory, I got the error again within an hour (1 minute 
per micro batch).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen edited a comment on issue #2544: [SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

2021-04-08 Thread GitBox


cdmikechen edited a comment on issue #2544:
URL: https://github.com/apache/hudi/issues/2544#issuecomment-816314674


   @li36909  
   As far as I know, `TimestampWritableV2` is a Hive 3 class; we mainly use the 
Hive 2 lib in Hudi. And your class is based on a `MOR` table, while my change is 
based on a `COW` table and is compatible with `MOR` tables.
   I will see how I can incorporate your code into my branch. Meanwhile, we may 
need to consider whether we need to treat Hive 3 and Hive 2 as two subprojects. 
That would make subsequent changes for different Hive versions easier in the 
future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] njalan commented on issue #2791: [SUPPORT]Failed to enable hoodie.metadata.enable

2021-04-08 Thread GitBox


njalan commented on issue #2791:
URL: https://github.com/apache/hudi/issues/2791#issuecomment-816318167


   After the job ran for half an hour I got this error. Once I removed all the 
files from the table directory, I got the error again within an hour.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen edited a comment on issue #2544: [SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

2021-04-08 Thread GitBox


cdmikechen edited a comment on issue #2544:
URL: https://github.com/apache/hudi/issues/2544#issuecomment-816314674


   @li36909  
   As far as I know, `TimestampWritableV2` is a Hive 3 class; we mainly use the 
Hive 2 lib in Hudi. And your class is based on a `MOR` table, while my change is 
based on a `COW` table.
   I will see how I can incorporate your code into my branch. Meanwhile, we may 
need to consider whether we need to treat Hive 3 and Hive 2 as two subprojects. 
That would make subsequent changes for different Hive versions easier in the 
future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen commented on issue #2544: [SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

2021-04-08 Thread GitBox


cdmikechen commented on issue #2544:
URL: https://github.com/apache/hudi/issues/2544#issuecomment-816314674


   @li36909  
   As far as I know, `TimestampWritableV2` is a Hive 3 class; we mainly use the 
Hive 2 lib in Hudi. And your class is based on a `MOR` table, while my change is 
based on a `COW` table.
   I will see how I can incorporate your code into my branch. Meanwhile, we may 
need to consider whether we need to treat Hive 3 and Hive 2 as two subprojects. 
That would make subsequent changes for different Hive versions easier in the 
future.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash opened a new pull request #2794: [MINOR] Fix concurrency docs

2021-04-08 Thread GitBox


n3nash opened a new pull request #2794:
URL: https://github.com/apache/hudi/pull/2794


   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

2021-04-08 Thread GitBox


nsivabalan commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-816306475


   @kimberlyamandalu : do you have a support ticket for your question? Let's not 
pollute this issue; we can create a new one for your use case and discuss over 
there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

2021-04-08 Thread GitBox


nsivabalan commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-816306202


   @codejoyan : sorry, somehow this slipped from my radar. 
   May I know what scale of data you are dealing with? I see your parallelism is 
very low (2). Can you try with 100 or more and see how it goes? 
   
   Among the 3 methods you have quoted, 2 of them are index related and the 3rd 
is the actual write operation. 
   
   The best way to decide a partitioning strategy is to see what your queries 
usually filter on. If it's date based, then you definitely need date in your 
partitioning strategy, which you already have. And if adding region would cut 
down most of the data to be looked up, sure. I assume this would also blow up 
your number of partitions in general, since it's no. of dates * no. of regions. 
   
   wrt record keys and bloom: 
   You can try the regular bloom index ("BLOOM"). With this, there are a few 
config knobs to play with; with the simple bloom index, we don't have a lot of 
config knobs. 
   Within a single batch of writes, do the records have some ordering to them, 
or are they just random? From your response I guess it's random. So you can 
turn off range pruning, since it may not help much: set 
   https://hudi.apache.org/docs/configurations.html#bloomIndexPruneByRanges to 
false (the default value is true). A sketch follows.
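   A minimal Scala sketch of the knobs mentioned above, assuming the standard 
Spark datasource write path; `df`, the table name, record key, and path are 
placeholders, and the config keys are the documented ones:

```scala
import org.apache.spark.sql.SaveMode

// Hedged sketch: upsert with the regular BLOOM index, range pruning off
// (random record keys), and a higher shuffle parallelism than 2.
df.write.format("hudi").
  option("hoodie.table.name", "my_table").                   // illustrative
  option("hoodie.datasource.write.recordkey.field", "uuid"). // illustrative
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.index.type", "BLOOM").
  option("hoodie.bloom.index.prune.by.ranges", "false").     // default is true
  option("hoodie.upsert.shuffle.parallelism", "100").
  mode(SaveMode.Append).
  save("/tmp/hudi/my_table")                                 // illustrative
```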
   
   @n3nash : do you have any pointers here. 
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan merged pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


nsivabalan merged pull request #2792:
URL: https://github.com/apache/hudi/pull/2792


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] TeRS-K commented on pull request #2793: [HUDI-57] Support ORC Storage

2021-04-08 Thread GitBox


TeRS-K commented on pull request #2793:
URL: https://github.com/apache/hudi/pull/2793#issuecomment-816163413


   The build is currently failing with error `ERROR: toomanyrequests: You have 
reached your pull rate limit. You may increase the limit by authenticating and 
upgrading: https://www.docker.com/increase-rate-limit`, it doesn't seem to be 
related to my change. How can I trigger a rebuild?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto removed a comment on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


rubenssoto removed a comment on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815986157


   Hello Guys, 
   
   Is this a bug in Hudi 0.8.0?
   
   I migrated all my workloads to Hudi yesterday using Hudi 0.8.0 and I'm having 
problems with timestamps.
   
   https://user-images.githubusercontent.com/36298331/114066893-95b08780-9872-11eb-87e0-6ea1fc1080a9.png
   
   The first one was written using Hudi 0.8.0 final,
   the second one with Hudi 0.8.0 RC1,
   and the third one is my regular parquet source.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1780) Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of Archival

2021-04-08 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1780:
--
Labels: sev:critical  (was: )

> Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of 
> Archival
> ---
>
> Key: HUDI-1780
> URL: https://issues.apache.org/jira/browse/HUDI-1780
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jagmeet Bali
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: sev:critical
>
> The refactoring of HoodieMetadataArchiveLog to MetadataConversionUtils has 
> changed the behaviour of how replacecommit.* files will be handled.
> Before, the logic was in 
> [TimelineArchiveLog|https://github.com/apache/hudi/blob/c4a66324cdd3e289e0bf18bdd150b95ee6e4c66c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L402]
>  but now it has been changed to 
> [MetadataConversionUtils|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java#L70].
> The issue with the code is that replacecommit.* files can be generated from 
> different places, not from clustering alone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1780) Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of Archival

2021-04-08 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal updated HUDI-1780:
--
Priority: Major  (was: Minor)

> Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of 
> Archival
> ---
>
> Key: HUDI-1780
> URL: https://issues.apache.org/jira/browse/HUDI-1780
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jagmeet Bali
>Assignee: Nishith Agarwal
>Priority: Major
>
> The refactoring of HoodieMetadataArchiveLog to MetadataConversionUtils has 
> changed the behaviour of how replacecommit.* files will be handled.
> Before, the logic was in 
> [TimelineArchiveLog|https://github.com/apache/hudi/blob/c4a66324cdd3e289e0bf18bdd150b95ee6e4c66c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L402]
>  but now it has been changed to 
> [MetadataConversionUtils|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java#L70].
> The issue with the code is that replacecommit.* files can be generated from 
> different places, not from clustering alone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1780) Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of Archival

2021-04-08 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1780:
-

Assignee: Nishith Agarwal

> Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of 
> Archival
> ---
>
> Key: HUDI-1780
> URL: https://issues.apache.org/jira/browse/HUDI-1780
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jagmeet Bali
>Assignee: Nishith Agarwal
>Priority: Minor
>
> The refactoring of HoodieMetadataArchiveLog to MetadataConversionUtils has 
> changed the behaviour of how replacecommit.* files will be handled.
> Before, the logic was in 
> [TimelineArchiveLog|https://github.com/apache/hudi/blob/c4a66324cdd3e289e0bf18bdd150b95ee6e4c66c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L402]
>  but now it has been changed to 
> [MetadataConversionUtils|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java#L70].
> The issue with the code is that replacecommit.* files can be generated from 
> different places, not from clustering alone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1780) Refactoring of parts of HoodieMetadataArchiveLog have changed behaviour of Archival

2021-04-08 Thread Jagmeet Bali (Jira)
Jagmeet Bali created HUDI-1780:
--

 Summary: Refactoring of parts of HoodieMetadataArchiveLog have 
changed behaviour of Archival
 Key: HUDI-1780
 URL: https://issues.apache.org/jira/browse/HUDI-1780
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Jagmeet Bali


The refactoring of HoodieMetadataArchiveLog to MetadataConversionUtils has 
changed the behaviour of how replacecommit.* files will be handled.

Before, the logic was in 
[TimelineArchiveLog|https://github.com/apache/hudi/blob/c4a66324cdd3e289e0bf18bdd150b95ee6e4c66c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTimelineArchiveLog.java#L402]
 but now it has been changed to 
[MetadataConversionUtils|https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/MetadataConversionUtils.java#L70].

The issue with the code is that replacecommit.* files can be generated from 
different places, not from clustering alone.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] n3nash commented on pull request #2793: [HUDI-57] Support ORC Storage

2021-04-08 Thread GitBox


n3nash commented on pull request #2793:
URL: https://github.com/apache/hudi/pull/2793#issuecomment-816037894


   @prashantwason Can you review this ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] TeRS-K opened a new pull request #2793: [HUDI-57] Support ORC Storage

2021-04-08 Thread GitBox


TeRS-K opened a new pull request #2793:
URL: https://github.com/apache/hudi/pull/2793


   ## What is the purpose of the pull request
   
   This pull request adds support for ORC storage in Hudi.
   
   ## Brief change log
   
   In two separate commits:
   - Implemented HoodieOrcWriter
 - Added HoodieOrcConfigs
 - Added AvroOrcUtils that writes Avro record **to** VectorizedRowBatch
  - Used the orc-core:no-hive module (`no-hive` is needed because spark-sql 
uses the no-hive version of ORC, which makes Spark integration easier)
   - Implemented HoodieOrcReader
 - Read Avro records **from** VectorizedRowBatch
 - Implemented OrcReaderIterator
 - Implemented ORC utility functions 
   
   ## Verify this pull request
   
   - Added unit tests for 
 - reader/writer creation
 - AvroOrcUtils
   - (local) Wrote a small tool that reads from ORC/Parquet files and writes to 
ORC/Parquet files, verified that the records in the input/output files are 
identical using spark.read.orc/spark.read.parquet.
   - (local) Changed HoodieTableConfig.DEFAULT_BASE_FILE_FORMAT to force the 
tests to run with ORC as the base format. Some changes need to be made, but I'm 
leaving them out of this PR to get some initial feedback on the reader/writer 
implementation first.
 For all tests to pass with ORC as the base file format:
 - Understand schema evolution in ORC (ref TestUpdateSchemaEvolution)
 - Add ORC support for places that have hardcoded ParquetReader or 
sqlContext.read().parquet() 
 - Add ORC support for bootstrap op
 - Hive engine integration with ORC (implement HoodieOrcInputFormat)
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codejoyan commented on issue #2592: [SUPPORT] Does latest versions of Hudi (0.7.0, 0.6.0) work with Spark 2.3.0 when reading orc files?

2021-04-08 Thread GitBox


codejoyan commented on issue #2592:
URL: https://github.com/apache/hudi/issues/2592#issuecomment-816012992


   Please let me know if there are any suggestions to try out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] stackfun commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-04-08 Thread GitBox


stackfun commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-816013345


   I'm using GCP dataproc 1.4, the gcs connector version is 1.9.17. The 
versions of all the libraries can be found here: 
https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codejoyan commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

2021-04-08 Thread GitBox


codejoyan commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-816012609


   @nsivabalan, any inputs would be very helpful.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kimberlyamandalu commented on issue #2620: [SUPPORT] Performance Tuning: Slow stages (Building Workload Profile & Getting Small files from partitions) during Hudi Writes

2021-04-08 Thread GitBox


kimberlyamandalu commented on issue #2620:
URL: https://github.com/apache/hudi/issues/2620#issuecomment-815987341


   I have a similar issue where bloom index performance is very slow for upserts 
into a Hudi MOR table.
   Does anyone know whether, when Hudi performs an upsert, it only looks up the 
index for the related partitions or looks up against the entire data set? I 
have partitions of year and month from 1998 to 2020. My upserts are mostly to 
recent partitions (95%). I also notice a lot of calls to build the fs view for 
older partitions I know should not have any upserts:
   
   `AbstractTableFileSystemView: Building file system view for partition 
(message_year=2002/message_month=9)`
   
   
![image](https://user-images.githubusercontent.com/25435575/114066282-90027400-9869-11eb-8828-f9615f828d7e.png)
   
   
   Obtain key ranges for file slices (range pruning=on)
   collect at HoodieSparkEngineContext.java:73+details
   org.apache.spark.api.java.AbstractJavaRDDLike.collect(JavaRDDLike.scala:45)
   
org.apache.hudi.client.common.HoodieSparkEngineContext.map(HoodieSparkEngineContext.java:73)
   
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.loadInvolvedFiles(SparkHoodieBloomIndex.java:176)
   
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.lookupIndex(SparkHoodieBloomIndex.java:119)
   
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:84)
   
org.apache.hudi.index.bloom.SparkHoodieBloomIndex.tagLocation(SparkHoodieBloomIndex.java:60)
   
org.apache.hudi.table.action.commit.AbstractWriteHelper.tag(AbstractWriteHelper.java:69)
   
org.apache.hudi.table.action.commit.AbstractWriteHelper.write(AbstractWriteHelper.java:51)
   
org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitActionExecutor.execute(SparkUpsertDeltaCommitActionExecutor.java:46)
   
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:82)
   
org.apache.hudi.table.HoodieSparkMergeOnReadTable.upsert(HoodieSparkMergeOnReadTable.java:74)
   
org.apache.hudi.client.SparkRDDWriteClient.upsert(SparkRDDWriteClient.java:146)
   org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:214)
   org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:181)
   org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
   
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto edited a comment on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


rubenssoto edited a comment on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815986157


   Hello Guys, 
   
   Is this a bug in Hudi 0.8.0?
   
   I migrated all my workloads to Hudi yesterday using Hudi 0.8.0 and I'm having 
problems with timestamps.
   
   https://user-images.githubusercontent.com/36298331/114066893-95b08780-9872-11eb-87e0-6ea1fc1080a9.png
   
   The first one was written using Hudi 0.8.0 final,
   the second one with Hudi 0.8.0 RC1,
   and the third one is my regular parquet source.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


rubenssoto commented on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815986157


   Hello Guys, 
   
   Is this a bug in Hudi 0.8.0?
   
   I migrated all my workloads to Hudi yesterday using Hudi 0.8.0 and I'm having 
problems with timestamps.
   
   https://user-images.githubusercontent.com/36298331/114066893-95b08780-9872-11eb-87e0-6ea1fc1080a9.png
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2791: [SUPPORT]Failed to enable hoodie.metadata.enable

2021-04-08 Thread GitBox


n3nash commented on issue #2791:
URL: https://github.com/apache/hudi/issues/2791#issuecomment-815973743


   @prashantwason Can you take a look at this ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


nsivabalan commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609883525



##
File path: docs/_docs/0.8.0/1_1_spark_quick_start_guide.md
##
@@ -0,0 +1,530 @@
+---
+version: 0.8.0
+title: "Quick-Start Guide"
+permalink: /docs/0.8.0-spark_quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using spark-shell. 
Using Spark datasources, we will walk through 
+code snippets that allows you to insert and update a Hudi table of default 
table type: 
+[Copy on Write](/docs/0.8.0-concepts.html#copy-on-write-table). 
+After each write operation we will also show how to read the data both 
snapshot and incrementally.
+# Scala example
+
+## Setup
+
+Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+From the extracted directory run spark-shell with Hudi as:
+
+```scala
+// spark-shell
+spark-shell \

Review comment:
   might as well add spark2 and scala 2.11 as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


n3nash commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609890178



##
File path: docs/_docs/1_1_spark_quick_start_guide.md
##
@@ -13,13 +13,17 @@ After each write operation we will also show how to read 
the data both snapshot
 
 ## Setup
 
-Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+Hudi works with Spark-2.4.4+ & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 

Review comment:
   Let's put 2.4.3 since that's the right one




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


nsivabalan commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609884054



##
File path: docs/_docs/1_1_spark_quick_start_guide.md
##
@@ -13,13 +13,17 @@ After each write operation we will also show how to read 
the data both snapshot
 
 ## Setup
 
-Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+Hudi works with Spark-2.4.4+ & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 

Review comment:
   @n3nash : should we say 2.4.3, or is 2.4.4 fine? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


nsivabalan commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609883525



##
File path: docs/_docs/0.8.0/1_1_spark_quick_start_guide.md
##
@@ -0,0 +1,530 @@
+---
+version: 0.8.0
+title: "Quick-Start Guide"
+permalink: /docs/0.8.0-spark_quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using spark-shell. 
Using Spark datasources, we will walk through 
+code snippets that allows you to insert and update a Hudi table of default 
table type: 
+[Copy on Write](/docs/0.8.0-concepts.html#copy-on-write-table). 
+After each write operation we will also show how to read the data both 
snapshot and incrementally.
+# Scala example
+
+## Setup
+
+Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+From the extracted directory run spark-shell with Hudi as:
+
+```scala
+// spark-shell
+spark-shell \

Review comment:
   might as well add spark_11 as well. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


garyli1019 commented on pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#issuecomment-815963047


   > Did we change to version 0.8.0 in all places wherever applicable? I see 
quick start was not updated.
   
   This version change was generated by an automated tool. The quick start was 
probably missed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


garyli1019 commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609875712



##
File path: docs/_docs/0.8.0/1_1_spark_quick_start_guide.md
##
@@ -0,0 +1,530 @@
+---
+version: 0.8.0
+title: "Quick-Start Guide"
+permalink: /docs/0.8.0-spark_quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using spark-shell. 
Using Spark datasources, we will walk through 
+code snippets that allows you to insert and update a Hudi table of default 
table type: 
+[Copy on Write](/docs/0.8.0-concepts.html#copy-on-write-table). 
+After each write operation we will also show how to read the data both 
snapshot and incrementally.
+# Scala example
+
+## Setup
+
+Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+From the extracted directory run spark-shell with Hudi as:
+
+```scala
+// spark-shell
+spark-shell \

Review comment:
   good catch




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


nsivabalan commented on a change in pull request #2792:
URL: https://github.com/apache/hudi/pull/2792#discussion_r609858054



##
File path: docs/_docs/0.8.0/1_1_spark_quick_start_guide.md
##
@@ -0,0 +1,530 @@
+---
+version: 0.8.0
+title: "Quick-Start Guide"
+permalink: /docs/0.8.0-spark_quick-start-guide.html
+toc: true
+last_modified_at: 2019-12-30T15:59:57-04:00
+---
+
+This guide provides a quick peek at Hudi's capabilities using spark-shell. 
Using Spark datasources, we will walk through 
+code snippets that allows you to insert and update a Hudi table of default 
table type: 
+[Copy on Write](/docs/0.8.0-concepts.html#copy-on-write-table). 
+After each write operation we will also show how to read the data both 
snapshot and incrementally.
+# Scala example
+
+## Setup
+
+Hudi works with Spark-2.x & Spark 3.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for setting up spark. 
+From the extracted directory run spark-shell with Hudi as:
+
+```scala
+// spark-shell
+spark-shell \

Review comment:
   call out that this is the spark2 bundle. If someone wants to try it out with 
spark3, it's hudi-spark3-bundle_2.12:0.7.0; a sketch of such a call-out follows.
   Oh, btw, you need to fix the version number here to 0.8.0.
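   A hedged sketch of what that call-out could look like, assuming the 0.8.0 
bundles follow the same artifact naming as 0.7.0 (coordinates are illustrative, 
not copied from the merged doc):

```scala
// spark-shell with the Spark 2 (Scala 2.11) bundle
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.8.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

// spark-shell with the Spark 3 (Scala 2.12) bundle
spark-shell \
  --packages org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0,org.apache.spark:spark-avro_2.12:3.0.1 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```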




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #2786: Add more options for HUDI Flink

2021-04-08 Thread GitBox


garyli1019 commented on pull request #2786:
URL: https://github.com/apache/hudi/pull/2786#issuecomment-815943296


   Should we change the 0.8.0 doc as well? It will be merged soon. #2792 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] TeRS-K commented on a change in pull request #2740: [HUDI-1055] Remove hardcoded parquet in tests

2021-04-08 Thread GitBox


TeRS-K commented on a change in pull request #2740:
URL: https://github.com/apache/hudi/pull/2740#discussion_r609825554



##
File path: hudi-cli/src/main/scala/org/apache/hudi/cli/SparkHelpers.scala
##
@@ -40,7 +40,7 @@ import scala.collection.mutable._
 object SparkHelpers {
   @throws[Exception]
   def skipKeysAndWriteNewFile(instantTime: String, fs: FileSystem, sourceFile: 
Path, destinationFile: Path, keysToSkip: Set[String]) {
-val sourceRecords = ParquetUtils.readAvroRecords(fs.getConf, sourceFile)
+val sourceRecords = new ParquetUtils().readAvroRecords(fs.getConf, 
sourceFile)

Review comment:
   Yes that makes sense. How about the ParquetUtils instance in 
`HoodieSparkBootstrapSchemaProvider::getBootstrapSourceSchema` and 
`HoodieParquetReader`, should I replace those with `DataFileUtils.getInstance` 
to get the instance as well?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 opened a new pull request #2792: [DOCS] Add docs for release 0.8.0

2021-04-08 Thread GitBox


garyli1019 opened a new pull request #2792:
URL: https://github.com/apache/hudi/pull/2792


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan merged pull request #2772: [MINOR] Update doap with 0.8.0 release

2021-04-08 Thread GitBox


nsivabalan merged pull request #2772:
URL: https://github.com/apache/hudi/pull/2772


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (5b3608f -> cf3d2e2)

2021-04-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 5b3608f  [HUDI-1778] Add setter to CompactionPlanEvent and 
CompactionCommitEvent to have better SE/DE performance for Flink (#2789)
 add cf3d2e2  [MINOR] Update doap with 0.8.0 release (#2772)

No new revisions were added by this update.

Summary of changes:
 doap_HUDI.rdf | 5 +
 1 file changed, 5 insertions(+)


[GitHub] [hudi] nsivabalan edited a comment on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


nsivabalan edited a comment on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815898964


   @li36909 : IIUC, this patch is not about failing a bootstrap or upsert w/ 
timestamp; we are adding support for timestamp columns by upgrading the parquet 
version. If so, please fix the title of the patch. 
   Also, it looks like we are upgrading the parquet version in this patch. 
@vinothchandar @n3nash @bvaradar : thoughts? Are there any considerations 
required on this end? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


nsivabalan commented on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815898964


   IIUC, this patch is not about failing a bootstrap or upsert w/ timestamp; we 
are adding support for timestamp columns by upgrading the parquet version. If 
so, please fix the title of the patch. 
   Also, it looks like we are upgrading the parquet version in this patch. 
@vinothchandar @n3nash @bvaradar : thoughts? Are there any considerations 
required on this end? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2452: [HUDI-1531] Introduce HoodiePartitionCleaner to delete specific partition

2021-04-08 Thread GitBox


nsivabalan commented on pull request #2452:
URL: https://github.com/apache/hudi/pull/2452#issuecomment-815878916


   sounds good. yeah, cleaning strategy would be great. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2601: [HUDI-1602] update parquet version from 1.10.1 to 1.11.1

2021-04-08 Thread GitBox


nsivabalan commented on pull request #2601:
URL: https://github.com/apache/hudi/pull/2601#issuecomment-815870944


   CC @li36909 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


nsivabalan commented on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815868898


   @li36909 : can you fix the links in the description? guess it's cut off. 
   ```
   parquet.avro.readInt96AsFixed=true, please check https://github 
https://github/.com/apache/parquet-mr/pull/831/files)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] njalan opened a new issue #2791: [SUPPORT]Error when enable hoodie.metadata.enable

2021-04-08 Thread GitBox


njalan opened a new issue #2791:
URL: https://github.com/apache/hudi/issues/2791


   I am facing a performance issue caused by slow S3 file listing, so I am 
trying to enable the Hudi metadata table to improve performance.
   
   **Environment Description**
   
   Hudi version : 0.7
   
   Spark version : 3.0.1
   
   Hive version : 3.1.2
   
   Hadoop version : 3.2.1
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : no
   

   
   **Additional context**
   
   Below is my hudi configuration:
   
   df.write.format("org.apache.hudi")
   .options(getQuickstartWriteConfigs)
   .option(HIVE_URL_OPT_KEY, hive_jbdc_url)
   .option(HIVE_USER_OPT_KEY, hive_user)
   .option(HIVE_PASS_OPT_KEY, "")
   .option(HIVE_DATABASE_OPT_KEY, dataBase)
   .option(HIVE_TABLE_OPT_KEY, tableName)
   .option(HIVE_SYNC_ENABLED_OPT_KEY, true)
   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "")
   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "")
   .option(HoodieWriteConfig.TABLE_NAME, tableName)
   .option("hoodie.upsert.shuffle.parallelism", "8")
   .option("hoodie.insert.shuffle.parallelism", "8")
   .option("hoodie.cleaner.commits.retained",2)
   .option("hoodie.keep.min.commits",3)
   .option("hoodie.keep.max.commits",4)
   .option("hoodie.metadata.enable",true)
   .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
KEY_GENERATOR_Non_Partition)
   .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY, 
classOf[DefaultHoodieRecordPayload].getName)
   .option(HoodiePayloadProps.DEFAULT_PAYLOAD_ORDERING_FIELD_VAL, 
combineKey)
   .option(PRECOMBINE_FIELD_OPT_KEY, combineKey)
   .option(RECORDKEY_FIELD_OPT_KEY, key)
   .option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
EXTRACTOR_CLASS_Non_Partition)
   .option("hoodie.datasource.hive_sync.support_timestamp", true)
   .mode(mode)
   .save(basePath)
   

   
   
   **Stacktrace**
   
   Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
old record into new file for key  from old file 
xx_2/.hoodie/metadata/files/6bd6d5c7-712f-4580-b895-79c471bf6dab-0_0-5-5_20210408215411.hfile
 to new file 
xx_2/.hoodie/metadata/files/6bd6d5c7-712f-4580-b895-79c471bf6dab-0_0-44-95_20210408221019001.hfile
 with writerSchema {
 "type" : "record",
 "name" : "HoodieMetadataRecord",
 "namespace" : "org.apache.hudi.avro.model",
 "doc" : "A record saved within the Metadata Table",
 "fields" : [ {
   "name" : "_hoodie_commit_time",
   "type" : [ "null", "string" ],
   "doc" : "",
   "default" : null
 }, {
   "name" : "_hoodie_commit_seqno",
   "type" : [ "null", "string" ],
   "doc" : "",
   "default" : null
 }, {
   "name" : "_hoodie_record_key",
   "type" : [ "null", "string" ],
   "doc" : "",
   "default" : null
 }, {
   "name" : "_hoodie_partition_path",
   "type" : [ "null", "string" ],
   "doc" : "",
   "default" : null
 }, {
   "name" : "_hoodie_file_name",
   "type" : [ "null", "string" ],
   "doc" : "",
   "default" : null
 }, {
   "name" : "key",
   "type" : {
 "type" : "string",
 "avro.java.string" : "String"
   }
 }, {
   "name" : "type",
   "type" : "int",
   "doc" : "Type of the metadata record"
 }, {
   "name" : "filesystemMetadata",
   "type" : [ "null", {
 "type" : "map",
 "values" : {
   "type" : "record",
   "name" : "HoodieMetadataFileInfo",
   "fields" : [ {
 "name" : "size",
 "type" : "long",
 "doc" : "Size of the file"
   }, {
 "name" : "isDeleted",
 "type" : "boolean",
 "doc" : "True if this file has been deleted"
   } ]
 },
 "avro.java.string" : "String"
   } ],
   "doc" : "Contains information about partitions and files within the 
dataset"
 } ]
   }
at 
org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:256)
at 
org.apache.hudi.io.HoodieSortedMergeHandle.write(HoodieSortedMergeHandle.java:101)
at 
org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:122)
at 
org.apache.hudi.table.action.commit.AbstractMergeHelper$UpdateHandler.consumeOneRecord(AbstractMergeHelper.java:112)
at 
org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
   Caused by: java.lang.IllegalArgumentException: key length must be > 0
at 

[GitHub] [hudi] vburenin commented on issue #2692: [SUPPORT] Corrupt Blocks in Google Cloud Storage

2021-04-08 Thread GitBox


vburenin commented on issue #2692:
URL: https://github.com/apache/hudi/issues/2692#issuecomment-815855516


   @n3nash I am not sure the issue is still relevant. It was happening with a 
Hudi 0.5.0 snapshot. The symptoms were duplicate records, or records that were 
not upserted, which showed up as the version field being, say, 9 while the 
current version was 15. Apparently compaction is to blame: we had a few 
intermittent failures where compaction didn't finish, and when it eventually 
finished things got borked... but it was extremely hard to catch unless we were 
doing direct data validation, which is not that simple to do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] li36909 commented on pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


li36909 commented on pull request #2790:
URL: https://github.com/apache/hudi/pull/2790#issuecomment-815854810


   cc @nsivabalan could you help to take a look, thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1779:
-
Labels: pull-request-available  (was: )

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: lrz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet 
> file which contains a timestamp column, it fails because of these issues:
> 1) At bootstrap, if the origin parquet file was written by a Spark 
> application, Spark by default saves the timestamp as INT96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi 
> cannot read the INT96 type yet. (This can be solved by upgrading parquet 
> to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see 
> https://github.com/apache/parquet-mr/pull/831/files.)
> 2) After bootstrap, upsert fails because we use the hoodie schema to read 
> the origin parquet file. The schemas do not match: the hoodie schema 
> treats the timestamp as long, while the origin file stores it as INT96.
> 3) After bootstrap, a partial update of a parquet file fails because we 
> copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] li36909 opened a new pull request #2790: [HUDI-1779] Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread GitBox


li36909 opened a new pull request #2790:
URL: https://github.com/apache/hudi/pull/2790


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Currently, when Hudi bootstraps a parquet file, or upserts into a parquet 
file which contains a timestamp column, it fails because of these issues:
   
   1) At bootstrap, if the origin parquet file was written by a Spark 
application, Spark by default saves the timestamp as INT96 (see 
spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot 
read the INT96 type yet. (This can be solved by upgrading parquet to 1.12.0 
and setting parquet.avro.readInt96AsFixed=true; see 
https://github.com/apache/parquet-mr/pull/831/files.)
   
   2) After bootstrap, upsert fails because we use the hoodie schema to read 
the origin parquet file. The schemas do not match: the hoodie schema treats 
the timestamp as long, while the origin file stores it as INT96.
   
   3) After bootstrap, a partial update of a parquet file fails because we 
copy the old record and save it with the hoodie schema (we miss a 
convertFixedToLong operation like Spark does; a sketch of this conversion 
follows below).
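   
   A minimal sketch of that fixed-to-long step, assuming Spark's INT96 layout 
(8-byte little-endian nanos-of-day followed by a 4-byte little-endian Julian 
day); the class and method names are illustrative, not the actual patch:
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   public class Int96ToMicros {
     // Julian day number of the Unix epoch (1970-01-01).
     private static final long JULIAN_DAY_OF_EPOCH = 2440588L;
     private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;
   
     // Converts a 12-byte INT96 value (as surfaced by
     // parquet.avro.readInt96AsFixed=true) to microseconds since the epoch,
     // i.e. the long representation the hoodie schema expects.
     static long int96ToEpochMicros(byte[] int96) {
       ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
       long nanosOfDay = buf.getLong();                        // first 8 bytes
       long julianDay = Integer.toUnsignedLong(buf.getInt());  // last 4 bytes
       return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1_000L;
     }
   
     public static void main(String[] args) {
       // Julian day 2440588 with zero nanos is exactly the epoch -> 0 micros.
       ByteBuffer buf = ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN);
       buf.putLong(0L).putInt(2440588);
       System.out.println(int96ToEpochMicros(buf.array())); // prints 0
     }
   }
   ```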
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   Added a new UT, and also verified with the existing UTs.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] li36909 commented on issue #2544: [SUPPORT]failed to read timestamp column in version 0.7.0 even when HIVE_SUPPORT_TIMESTAMP is enabled

2021-04-08 Thread GitBox


li36909 commented on issue #2544:
URL: https://github.com/apache/hudi/issues/2544#issuecomment-815849870


   @nsivabalan @cdmikechen 
   I fixed it and the test passes with a simple change like this, in 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieRealtimeRecordReaderUtils.java:
   
   case LONG:
   + if (schema.getLogicalType().getName().equals("timestamp-micros") && supportTimestamp) {
   +   Timestamp timestamp = new Timestamp();
   +   timestamp.setTimeInMillis((Long) value / 1000);
   +   return new TimestampWritableV2(timestamp);
   + }
   
   Here the long value is converted to a timestamp and it works. The 
supportTimestamp config is also passed to every reader and writer.
   Did I miss anything? Thank you.
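   
   One caveat: dividing the micros value by 1000 drops sub-millisecond 
precision. A minimal sketch that keeps the microseconds, assuming plain 
java.sql.Timestamp rather than the Hive Timestamp type used above:
   
   ```java
   import java.sql.Timestamp;
   
   public class TimestampMicros {
     // Converts an Avro timestamp-micros long to java.sql.Timestamp without
     // dropping the sub-millisecond part that a bare "micros / 1000" discards.
     static Timestamp fromMicros(long micros) {
       long millis = Math.floorDiv(micros, 1_000L);
       int nanosOfSecond = (int) (Math.floorMod(micros, 1_000_000L) * 1_000L);
       Timestamp ts = new Timestamp(millis);
       ts.setNanos(nanosOfSecond);
       return ts;
     }
   
     public static void main(String[] args) {
       System.out.println(fromMicros(1_617_868_800_123_456L)); // fractional part .123456
     }
   }
   ```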


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread lrz (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lrz updated HUDI-1779:
--
Attachment: upsertFail.png

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: lrz
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet 
> file which contains a timestamp column, it fails because of these issues:
> 1) At bootstrap, if the origin parquet file was written by a Spark 
> application, Spark by default saves the timestamp as INT96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi 
> cannot read the INT96 type yet. (This can be solved by upgrading parquet 
> to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see 
> https://github.com/apache/parquet-mr/pull/831/files.)
> 2) After bootstrap, upsert fails because we use the hoodie schema to read 
> the origin parquet file. The schemas do not match: the hoodie schema 
> treats the timestamp as long, while the origin file stores it as INT96.
> 3) After bootstrap, a partial update of a parquet file fails because we 
> copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread lrz (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lrz updated HUDI-1779:
--
Attachment: unsupportInt96.png

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: lrz
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet 
> file which contains a timestamp column, it fails because of these issues:
> 1) At bootstrap, if the origin parquet file was written by a Spark 
> application, Spark by default saves the timestamp as INT96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi 
> cannot read the INT96 type yet. (This can be solved by upgrading parquet 
> to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see 
> https://github.com/apache/parquet-mr/pull/831/files.)
> 2) After bootstrap, upsert fails because we use the hoodie schema to read 
> the origin parquet file. The schemas do not match: the hoodie schema 
> treats the timestamp as long, while the origin file stores it as INT96.
> 3) After bootstrap, a partial update of a parquet file fails because we 
> copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread lrz (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lrz updated HUDI-1779:
--
Attachment: upsertFail2.png

> Fail to bootstrap/upsert a table which contains timestamp column
> 
>
> Key: HUDI-1779
> URL: https://issues.apache.org/jira/browse/HUDI-1779
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: lrz
>Priority: Major
> Fix For: 0.9.0
>
> Attachments: unsupportInt96.png, upsertFail.png, upsertFail2.png
>
>
> Currently, when Hudi bootstraps a parquet file, or upserts into a parquet 
> file which contains a timestamp column, it fails because of these issues:
> 1) At bootstrap, if the origin parquet file was written by a Spark 
> application, Spark by default saves the timestamp as INT96 (see 
> spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi 
> cannot read the INT96 type yet. (This can be solved by upgrading parquet 
> to 1.12.0 and setting parquet.avro.readInt96AsFixed=true; see 
> https://github.com/apache/parquet-mr/pull/831/files.)
> 2) After bootstrap, upsert fails because we use the hoodie schema to read 
> the origin parquet file. The schemas do not match: the hoodie schema 
> treats the timestamp as long, while the origin file stores it as INT96.
> 3) After bootstrap, a partial update of a parquet file fails because we 
> copy the old record and save it with the hoodie schema (we miss a 
> convertFixedToLong operation like Spark does).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1779) Fail to bootstrap/upsert a table which contains timestamp column

2021-04-08 Thread lrz (Jira)
lrz created HUDI-1779:
-

 Summary: Fail to bootstrap/upsert a table which contains timestamp 
column
 Key: HUDI-1779
 URL: https://issues.apache.org/jira/browse/HUDI-1779
 Project: Apache Hudi
  Issue Type: Bug
Reporter: lrz
 Fix For: 0.9.0


Currently, when Hudi bootstraps a parquet file, or upserts into a parquet file 
which contains a timestamp column, it fails because of these issues:

1) At bootstrap, if the origin parquet file was written by a Spark 
application, Spark by default saves the timestamp as INT96 (see 
spark.sql.parquet.int96AsTimestamp), and bootstrap fails because Hudi cannot 
read the INT96 type yet. (This can be solved by upgrading parquet to 1.12.0 
and setting parquet.avro.readInt96AsFixed=true, as sketched below; see 
https://github.com/apache/parquet-mr/pull/831/files.)

2) After bootstrap, upsert fails because we use the hoodie schema to read the 
origin parquet file. The schemas do not match: the hoodie schema treats the 
timestamp as long, while the origin file stores it as INT96.

3) After bootstrap, a partial update of a parquet file fails because we copy 
the old record and save it with the hoodie schema (we miss a 
convertFixedToLong operation like Spark does).
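
For illustration, a minimal sketch of reading such a file with parquet-avro 
1.12.0 and the flag above (the file path is hypothetical):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

public class ReadInt96AsFixed {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Surface INT96 columns as Avro fixed(12) instead of failing the read.
    conf.setBoolean("parquet.avro.readInt96AsFixed", true);
    try (ParquetReader<GenericRecord> reader =
        AvroParquetReader.<GenericRecord>builder(new Path("/tmp/example.parquet"))
            .withConf(conf)
            .build()) {
      for (GenericRecord record = reader.read(); record != null; record = reader.read()) {
        System.out.println(record);
      }
    }
  }
}
```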



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-08 Thread GitBox


codecov-io edited a comment on pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#issuecomment-812815750


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=h1) Report
   > Merging [#2761](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=desc) (e862a1a) into [master](https://codecov.io/gh/apache/hudi/commit/e970e1f48302aec3af7eeca009a2c793757cd501?el=desc) (e970e1f) will **decrease** coverage by `2.45%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2761/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2761      +/-   ##
   ============================================
   - Coverage     52.32%   49.86%   -2.46%     
   - Complexity     3689     3702      +13     
   ============================================
     Files           483      500      +17     
     Lines         23095    24364    +1269     
     Branches       2460     2757     +297     
   ============================================
   + Hits          12084    12150      +66     
   - Misses         9942    11142    +1200     
   - Partials       1069     1072       +3     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (+3.28%)` | `0.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.69% <ø> (-0.15%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `56.36% <ø> (-0.36%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `42.31% <0.00%> (-29.03%)` | `0.00 <0.00> (ø)` | |
   | hudisync | `45.47% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.72% <ø> (+0.03%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...c/main/scala/org/apache/hudi/HoodieFileIndex.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vc2NhbGEvb3JnL2FwYWNoZS9odWRpL0hvb2RpZUZpbGVJbmRleC5zY2FsYQ==)
 | `78.06% <0.00%> (-1.03%)` | `24.00 <0.00> (ø)` | |
   | 
[...scala/io/hudi/sql/HudiSpark3SessionExtension.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9pby9odWRpL3NxbC9IdWRpU3BhcmszU2Vzc2lvbkV4dGVuc2lvbi5zY2FsYQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...scala/org/apache/hudi/execution/HudiSQLUtils.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvZXhlY3V0aW9uL0h1ZGlTUUxVdGlscy5zY2FsYQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...che/spark/sql/catalyst/analysis/HudiAnalysis.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jYXRhbHlzdC9hbmFseXNpcy9IdWRpQW5hbHlzaXMuc2NhbGE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rk/sql/catalyst/analysis/HudiOperationsCheck.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jYXRhbHlzdC9hbmFseXNpcy9IdWRpT3BlcmF0aW9uc0NoZWNrLnNjYWxh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...k/sql/catalyst/analysis/PostprocessHudiTable.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jYXRhbHlzdC9hbmFseXNpcy9Qb3N0cHJvY2Vzc0h1ZGlUYWJsZS5zY2FsYQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...spark/sql/catalyst/analysis/ProcessHudiMerge.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jYXRhbHlzdC9hbmFseXNpcy9Qcm9jZXNzSHVkaU1lcmdlLnNjYWxh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/spark/sql/catalyst/merge/Interface.scala](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3BhcmszLWV4dGVuc2lvbnNfMi4xMi9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL3NwYXJrL3NxbC9jYXRhbHlzdC9tZXJnZS9JbnRlcmZhY2Uuc2NhbGE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2761: [HUDI-1676] Support SQL with spark3

2021-04-08 Thread GitBox


codecov-io edited a comment on pull request #2761:
URL: https://github.com/apache/hudi/pull/2761#issuecomment-812815750


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=h1) Report
   > Merging [#2761](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=desc) (e862a1a) into [master](https://codecov.io/gh/apache/hudi/commit/e970e1f48302aec3af7eeca009a2c793757cd501?el=desc) (e970e1f) will **decrease** coverage by `1.37%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2761/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2761      +/-   ##
   ============================================
   - Coverage     52.32%   50.94%   -1.38%     
   + Complexity     3689     3275     -414     
   ============================================
     Files           483      423      -60     
     Lines         23095    19745    -3350     
     Branches       2460     2057     -403     
   ============================================
   - Hits          12084    10060    -2024     
   + Misses         9942     8844    -1098     
   + Partials       1069      841     -228     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `40.29% <ø> (+3.28%)` | `0.00 <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | `0.00 <ø> (ø)` | |
   | hudicommon | `50.69% <ø> (-0.15%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `56.36% <ø> (-0.36%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.72% <ø> (+0.03%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2761?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/sink/compact/CompactionPlanEvent.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL2NvbXBhY3QvQ29tcGFjdGlvblBsYW5FdmVudC5qYXZh)
 | `50.00% <0.00%> (-50.00%)` | `3.00% <0.00%> (ø%)` | |
   | 
[...pache/hudi/sink/compact/CompactionCommitEvent.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL2NvbXBhY3QvQ29tcGFjdGlvbkNvbW1pdEV2ZW50LmphdmE=)
 | `43.75% <0.00%> (-43.75%)` | `3.00% <0.00%> (ø%)` | |
   | 
[.../org/apache/hudi/sink/utils/NonThrownExecutor.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL3V0aWxzL05vblRocm93bkV4ZWN1dG9yLmphdmE=)
 | `66.66% <0.00%> (-11.12%)` | `4.00% <0.00%> (-1.00%)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `15.00% <0.00%> (-1.00%)` | |
   | 
[...e/hudi/sink/transform/RowDataToHoodieFunction.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL3RyYW5zZm9ybS9Sb3dEYXRhVG9Ib29kaWVGdW5jdGlvbi5qYXZh)
 | `83.33% <0.00%> (-7.58%)` | `5.00% <0.00%> (-1.00%)` | |
   | 
[...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh)
 | `68.54% <0.00%> (-5.78%)` | `28.00% <0.00%> (-5.00%)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `41.66% <0.00%> (-2.78%)` | `7.00% <0.00%> (+1.00%)` | :arrow_down: |
   | 
[...java/org/apache/hudi/sink/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlRnVuY3Rpb24uamF2YQ==)
 | `80.50% <0.00%> (-2.57%)` | `22.00% <0.00%> (-1.00%)` | |
   | 
[...i/common/table/timeline/TimelineMetadataUtils.java](https://codecov.io/gh/apache/hudi/pull/2761/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL1RpbWVsaW5lTWV0YWRhdGFVdGlscy5qYXZh)
 | `70.17% <0.00%> (-2.56%)` | `17.00% <0.00%> (ø%)` | |
   | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2283: [HUDI-1415] Read Hoodie Table As Spark DataSource Table

2021-04-08 Thread GitBox


codecov-io edited a comment on pull request #2283:
URL: https://github.com/apache/hudi/pull/2283#issuecomment-734137301






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



