[jira] [Updated] (HUDI-335) Improvements to DiskBasedMap

2020-01-16 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-335:
---
Fix Version/s: (was: 0.5.2)
   0.5.1

> Improvements to DiskBasedMap
> 
>
> Key: HUDI-335
> URL: https://issues.apache.org/jira/browse/HUDI-335
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balajee Nagasubramaniam
>Priority: Major
>  Labels: Hoodie, pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 
> 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>  Time Spent: 40m
>  Remaining Estimate: 503h 20m
>
> DiskBasedMap is used by ExternalSpillableMap for writing (K, V) pairs to a file
> while keeping only (K, fileMetadata) in memory, to reduce the memory footprint
> of the records.
> This change improves the performance of record get/read operations from disk
> by using a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement,
> spilling/writing 1 million records (record size ~350 bytes) to the file took
> about 104 seconds. After the improvement, the same operation completes in under
> 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records
> (size ~350 bytes) from the spill file took about 23 seconds. After the
> improvement, the same operation completes in under 4 seconds.
> {{Without read/write performance improvements:
> RecordsHandled: 1     totalTestTime: 3145     writeTime: 1176     readTime: 255
> RecordsHandled: 5     totalTestTime: 5775     writeTime: 4187     readTime: 1175
> RecordsHandled: 10    totalTestTime: 10570    writeTime: 7718     readTime: 2203
> RecordsHandled: 50    totalTestTime: 59723    writeTime: 45618    readTime: 11093
> RecordsHandled: 100   totalTestTime: 120022   writeTime: 87918    readTime: 22355
> RecordsHandled: 200   totalTestTime: 258627   writeTime: 187185   readTime: 56431}}
> {{With write improvement:
> RecordsHandled: 1     totalTestTime: 2013     writeTime: 700      readTime: 503
> RecordsHandled: 5     totalTestTime: 2525     writeTime: 390      readTime: 1247
> RecordsHandled: 10    totalTestTime: 3583     writeTime: 464      readTime: 2352
> RecordsHandled: 50    totalTestTime: 22934    writeTime: 3731     readTime: 15778
> RecordsHandled: 100   totalTestTime: 42415    writeTime: 4816     readTime: 30332
> RecordsHandled: 200   totalTestTime: 74158    writeTime: 10192    readTime: 53195}}
> {{With read improvements:
> RecordsHandled: 1     totalTestTime: 2473     writeTime: 1562     readTime: 87
> RecordsHandled: 5     totalTestTime: 6169     writeTime: 5151     readTime: 438
> RecordsHandled: 10    totalTestTime: 9967     writeTime: 8636     readTime: 252
> RecordsHandled: 50    totalTestTime: 50889    writeTime: 46766    readTime: 1014
> RecordsHandled: 100   totalTestTime: 114482   writeTime: 104353   readTime: 3776
> RecordsHandled: 200   totalTestTime: 239251   writeTime: 219041   readTime: 8127}}
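
A minimal, self-contained sketch of the buffering idea (illustrative only, not
Hudi's actual DiskBasedMap code; record count, record size, and buffer size are
assumptions mirroring the POC numbers above):
{code:java}
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SpillBufferingDemo {
  private static final int RECORDS = 1_000_000;
  private static final byte[] PAYLOAD = new byte[350]; // ~350-byte record, as in the POC

  public static void main(String[] args) throws IOException {
    File spill = File.createTempFile("spill", ".data");
    spill.deleteOnExit();

    // Write path: BufferedOutputStream batches many small record writes into
    // few large syscalls; writing to FileOutputStream directly issues one per record.
    long t0 = System.currentTimeMillis();
    try (OutputStream out = new BufferedOutputStream(new FileOutputStream(spill), 64 * 1024)) {
      for (int i = 0; i < RECORDS; i++) {
        out.write(PAYLOAD);
      }
    }
    System.out.println("buffered write ms: " + (System.currentTimeMillis() - t0));

    // Read path: BufferedInputStream serves most record reads from its in-memory
    // buffer instead of going to disk for each record.
    long t1 = System.currentTimeMillis();
    try (InputStream in = new BufferedInputStream(new FileInputStream(spill), 64 * 1024)) {
      byte[] record = new byte[PAYLOAD.length];
      while (in.read(record) != -1) {
        // consume one record
      }
    }
    System.out.println("buffered read ms: " + (System.currentTimeMillis() - t1));
  }
}
{code}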



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-526) inline compact not work

2020-01-16 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-526.

Fix Version/s: 0.5.1
   Resolution: Fixed

Fixed via master: c1f8acab344fa632f1cce6268d2fc765c45e8b22

> inline compact not work
> ---
>
> Key: HUDI-526
> URL: https://issues.apache.org/jira/browse/HUDI-526
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Compaction
>Reporter: liujianhui
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> hoodie.compact.inline is set to true
> hoodie.index.type is set to INMEMORY
>  
> Compaction does not occur after the delta commit:
> {code}
> 20/01/13 16:43:43 INFO HoodieMergeOnReadTable: Checking if compaction needs 
> to be run on file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieMergeOnReadTable: Compacting merge on read table 
> file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO FileSystemViewManager: Creating InMemory based view 
> for basePath file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
> from file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO FSUtils: Hadoop Configuration: fs.defaultFS: 
> [file:///], Config:[Configuration: core-default.xml, core-site.xml, 
> mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
> hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml], FileSystem: 
> [org.apache.hadoop.fs.LocalFileSystem@6a24b9e2]
> 20/01/13 16:43:43 INFO HoodieTableConfig: Loading table properties from 
> file:/tmp/hudi_cow_table_read/.hoodie/hoodie.properties
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Finished Loading Table of type 
> MERGE_ON_READ(version=org.apache.hudi.common.model.TimelineLayoutVersion@20) 
> from file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Loading Active commit timeline 
> for file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieActiveTimeline: Loaded instants 
> [[20200109181330__deltacommit__COMPLETED], 
> [2020011017__deltacommit__COMPLETED], 
> [20200110171526__deltacommit__COMPLETED], 
> [20200113105844__deltacommit__COMPLETED], 
> [20200113145851__deltacommit__COMPLETED], 
> [20200113155502__deltacommit__COMPLETED], 
> [20200113164342__deltacommit__COMPLETED]]
> 20/01/13 16:43:43 INFO HoodieRealtimeTableCompactor: Compacting 
> file:///tmp/hudi_cow_table_read with commit 20200113164343
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
> from file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO FSUtils: Hadoop Configuration: fs.defaultFS: 
> [file:///], Config:[Configuration: core-default.xml, core-site.xml, 
> mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, 
> hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml], FileSystem: 
> [org.apache.hadoop.fs.LocalFileSystem@6a24b9e2]
> 20/01/13 16:43:43 INFO HoodieTableConfig: Loading table properties from 
> file:/tmp/hudi_cow_table_read/.hoodie/hoodie.properties
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Finished Loading Table of type 
> MERGE_ON_READ(version=org.apache.hudi.common.model.TimelineLayoutVersion@20) 
> from file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieTableMetaClient: Loading Active commit timeline 
> for file:///tmp/hudi_cow_table_read
> 20/01/13 16:43:43 INFO HoodieActiveTimeline: Loaded instants 
> [[20200109181330__deltacommit__COMPLETED], 
> [2020011017__deltacommit__COMPLETED], 
> [20200110171526__deltacommit__COMPLETED], 
> [20200113105844__deltacommit__COMPLETED], 
> [20200113145851__deltacommit__COMPLETED], 
> [20200113155502__deltacommit__COMPLETED], 
> [20200113164342__deltacommit__COMPLETED]]
> {code} 
> No compaction instant is recorded under the .hoodie path.
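
For reference, a minimal sketch of a writer configured this way through the
Spark datasource (option keys per the 0.5.x configs; the dataset variable and
the table path are assumptions matching the logs above):
{code:java}
// assumes an existing Dataset<Row> df and the table path from the logs
df.write().format("org.apache.hudi")
    .option("hoodie.table.name", "hudi_cow_table_read")
    .option("hoodie.datasource.write.storage.type", "MERGE_ON_READ")
    .option("hoodie.index.type", "INMEMORY")
    .option("hoodie.compact.inline", "true")
    // inline compaction should trigger once this many delta commits accumulate
    .option("hoodie.compact.inline.max.delta.commits", "1")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save("file:///tmp/hudi_cow_table_read");
{code}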



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on issue #1095: [HUDI-210] Implement prometheus metrics reporter

2020-01-16 Thread GitBox
yanghua commented on issue #1095: [HUDI-210] Implement prometheus metrics 
reporter
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-575052312
 
 
   @XuQianJin-Stars What's the status of this PR?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-539) No FileSystem for scheme: abfss

2020-01-16 Thread Sam Somuah (Jira)
Sam Somuah created HUDI-539:
---

 Summary: No FileSystem for scheme: abfss
 Key: HUDI-539
 URL: https://issues.apache.org/jira/browse/HUDI-539
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Common Core
Affects Versions: 0.5.1
 Environment: Spark version : 2.4.4
Hadoop version : 2.7.3
Databricks Runtime: 6.1
Reporter: Sam Somuah


Hi,
 I'm trying to use Hudi to write to one of the Azure storage container file 
systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file schemes. 
The issue I'm facing is that {{HoodieROTablePathFilter}} tries to get a 
FileSystem for a file path while passing in a blank Hadoop configuration. This 
manifests as {{java.io.IOException: No FileSystem for scheme: abfss}} because it 
doesn't pick up any of the configuration from the environment.

The problematic line is

[https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]

 

Stacktrace
java.io.IOException: No FileSystem for scheme: abfss
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at 
org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
at 
org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349)
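
A minimal sketch of the failure mode and a possible fix (illustrative; the
configured-Configuration helper below is a hypothetical stand-in, not a real
Hudi API):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AbfssSchemeDemo {
  public static void main(String[] args) throws Exception {
    Path p = new Path("abfss://container@account.dfs.core.windows.net/table/2020/01/16");

    // Failure mode: a blank Configuration carries no fs.abfss settings, so this
    // throws java.io.IOException: No FileSystem for scheme: abfss
    FileSystem broken = p.getFileSystem(new Configuration());

    // Possible fix: resolve against the configuration the environment actually
    // provides (e.g. the one handed to the path filter by Spark/Hadoop).
    Configuration envConf = environmentHadoopConf(); // hypothetical helper
    FileSystem working = p.getFileSystem(envConf);
  }

  private static Configuration environmentHadoopConf() {
    // stand-in for the environment-provided, fully populated Configuration
    return new Configuration(true);
  }
}
{code}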



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367492682
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.hive.MultiPartKeysValueExtractor;
+import org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer;
+import org.apache.hudi.utilities.deltastreamer.TableExecutionObject;
+import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
+import org.apache.hudi.utilities.sources.JsonKafkaSource;
+import org.apache.hudi.utilities.sources.TestDataSource;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.streaming.kafka.KafkaTestUtils;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+public class TestHoodieMultiTableDeltaStreamer extends UtilitiesTestBase {
+
+  private static final String PROPS_FILENAME_TEST_SOURCE = 
"test-source1.properties";
+  private static volatile Logger log = 
LogManager.getLogger(TestHoodieMultiTableDeltaStreamer.class);
+  private static KafkaTestUtils testUtils;
+
+  @BeforeClass
+  public static void initClass() throws Exception {
 
 Review comment:
   Will extend TestHoodieDeltaStreamer directly here. In the process I fixed a 
bug as well. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

2020-01-16 Thread GitBox
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update 
partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367508976
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -431,6 +431,10 @@ public StorageLevel getBloomIndexInputStorageLevel() {
 return 
StorageLevel.fromString(props.getProperty(HoodieIndexConfig.BLOOM_INDEX_INPUT_STORAGE_LEVEL));
   }
 
+  public boolean getBloomIndexShouldUpdatePartitionPath() {
 
 Review comment:
   @nsivabalan I think for a simple getter that fetches the property, the 
javadoc is accessible from the property itself. I've added the docs there. So 
can I skip it here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1229: [HUDI-535] Ensure Compaction Plan is always written in .aux folder to avoid 0.5.0/0.5.1 reader-writer compatibility

2020-01-16 Thread GitBox
vinothchandar commented on a change in pull request #1229: [HUDI-535] Ensure 
Compaction Plan is always written in .aux folder to avoid 0.5.0/0.5.1 
reader-writer compatibility issues
URL: https://github.com/apache/incubator-hudi/pull/1229#discussion_r367471314
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
 ##
 @@ -279,18 +280,28 @@ private void deleteInstantFile(HoodieInstant instant) {
 return readDataFromPath(detailPath);
   }
 
+  public Option<byte[]> readCleanerInfoAsBytes(HoodieInstant instant) {
+// Cleaner metadata are always stored only in timeline .hoodie
+return readDataFromPath(new Path(metaClient.getMetaPath(), 
instant.getFileName()));
+  }
+
   //-
   //  BEGIN - COMPACTION RELATED META-DATA MANAGEMENT.
   //-
 
-  public Option<byte[]> readPlanAsBytes(HoodieInstant instant) {
-Path detailPath = null;
-if (metaClient.getTimelineLayoutVersion().isNullVersion()) {
-  detailPath = new Path(metaClient.getMetaAuxiliaryPath(), 
instant.getFileName());
-} else {
-  detailPath = new Path(metaClient.getMetaPath(), instant.getFileName());
+  public Option<byte[]> readCompactionPlanAsBytes(HoodieInstant instant) {
+try {
+  // This is going to be the common case in future when 0.5.1 is deployed.
 
 Review comment:
   but then in the meantime, every reader will be doing two RPCs for this 
method, right? I am thinking about flipping the order: read from aux first, 
then fall back to the meta path...
   
   This way, when we switch to writing in the meta path, only the older readers 
will incur this additional RPC... Does it make sense? 
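   
   A rough sketch of the flipped order (illustrative; reuses `readDataFromPath` 
and `metaClient` from the diff above, and the exception type caught is an 
assumption):
   ```java
   public Option<byte[]> readCompactionPlanAsBytes(HoodieInstant instant) {
     try {
       // common case once all writers mirror the plan into .aux: a single RPC
       return readDataFromPath(new Path(metaClient.getMetaAuxiliaryPath(), instant.getFileName()));
     } catch (HoodieIOException e) {
       // fallback: plan written only under the timeline meta path (.hoodie)
       return readDataFromPath(new Path(metaClient.getMetaPath(), instant.getFileName()));
     }
   }
   ```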


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

2020-01-16 Thread GitBox
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update 
partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367490392
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -114,14 +117,23 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
 keyLocationPairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), 
new Tuple2<>(p._2, p._1)));
 
 // Here as the recordRDD might have more data than rowKeyRDD (some 
rowKeys' fileId is null), so we do left outer join.
-return 
incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record
 -> {
+return 
incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().flatMap(record
 -> {
   final HoodieRecord hoodieRecord = record._1;
   final Optional<Tuple2<HoodieRecordLocation, HoodieKey>> recordLocationHoodieKeyPair = record._2;
   if (recordLocationHoodieKeyPair.isPresent()) {
 // Record key matched to file
-return getTaggedRecord(new 
HoodieRecord<>(recordLocationHoodieKeyPair.get()._2, hoodieRecord.getData()), 
Option.ofNullable(recordLocationHoodieKeyPair.get()._1));
+if (config.getBloomIndexShouldUpdatePartitionPath()) {
+  HoodieRecord emptyRecord = new 
HoodieRecord(recordLocationHoodieKeyPair.get()._2,
 
 Review comment:
   Exactly, @nsivabalan. I understood the logic but missed the part about 
checking for the partition path update.
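   
   For readers of the thread, a sketch of the logic under discussion (a 
fragment reusing names from the diff above; the config getter and the payload 
class reflect this review's state and may differ from what was finally merged):
   ```java
   HoodieKey existingKey = recordLocationHoodieKeyPair.get()._2;
   if (config.getBloomIndexShouldUpdatePartitionPath()
       && !existingKey.getPartitionPath().equals(hoodieRecord.getPartitionPath())) {
     // Partition path changed: emit a delete (empty payload) tagged to the old
     // location plus the incoming record as an insert into the new partition --
     // hence the switch from map() to flatMap() above.
     HoodieRecord deleteRecord = new HoodieRecord(existingKey, new EmptyHoodieRecordPayload());
     return Arrays.asList(
         getTaggedRecord(deleteRecord, Option.ofNullable(recordLocationHoodieKeyPair.get()._1)),
         hoodieRecord).iterator();
   }
   ```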


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update partition path with GLOBAL_BLOOM

2020-01-16 Thread GitBox
xushiyan commented on a change in pull request #1187: [HUDI-499] Allow update 
partition path with GLOBAL_BLOOM
URL: https://github.com/apache/incubator-hudi/pull/1187#discussion_r367508270
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
 ##
 @@ -77,6 +77,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = 
"hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = 
"MEMORY_AND_DISK_SER";
 
+  public static final String BLOOM_INDEX_SHOULD_UPDATE_PARTITION_PATH = 
"hoodie.bloom.index.should.update.partition.path";
 
 Review comment:
   Renamed and added docs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ssomuah commented on issue #1228: No FileSystem for scheme: abfss

2020-01-16 Thread GitBox
ssomuah commented on issue #1228: No FileSystem for scheme: abfss
URL: https://github.com/apache/incubator-hudi/issues/1228#issuecomment-575163833
 
 
   The scheme I'm trying to use is supported (abfss). The problem is that a 
blank Hadoop Configuration is passed in `HoodieROTablePathFilter`, so it never 
picks up any settings from my Hadoop environment. Is JIRA the correct place to 
raise issues, rather than GitHub?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
vinothchandar commented on a change in pull request #1212: [HUDI-509] Renaming 
code in sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367472578
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/SyncableFileSystemView.java
 ##
 @@ -18,12 +18,15 @@
 
 package org.apache.hudi.common.table;
 
+import org.apache.hudi.common.table.TableFileSystemView.BaseFileOnlyView;
+import org.apache.hudi.common.table.TableFileSystemView.SliceView;
+
 /**
- * A consolidated file-system view interface exposing both realtime and 
read-optimized views along with
+ * A consolidated file-system view interface exposing both full and basefile 
only views along with
 
 Review comment:
   okay .. `complete` sounds better


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367491239
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/util/DFSTablePropertiesConfiguration.java
 ##
 @@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.util;
+
+import org.apache.hudi.common.model.TableConfig;
+
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.io.InputStreamReader;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Used for parsing custom files having TableConfig objects.
+ */
+public class DFSTablePropertiesConfiguration {
 
 Review comment:
   I did not get this completely, so let me try to put forward my 
understanding. Basically, you do not want to have a DFSTablePropertiesConfiguration 
class; rather, we can have a separate folder for every table that we want to 
ingest, and every such folder will have the source/sink properties needed for 
ingesting that particular table. In essence, this means I need to redefine the 
way configs are maintained: every table config should be similar to how we have 
configs in DFSPropertiesConfiguration (key-value pairs). 
   
   Please let me know if I understood it correctly. @bvaradar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
vinothchandar commented on issue #1212: [HUDI-509] Renaming code in sync with 
cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#issuecomment-575197796
 
 
   @n3nash Addressed comments.. Responded to the renaming issue. Let me know 
what you think and we can hopefully merge after CI passes? 
   
   I will follow up with site doc changes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1236: [MINOR] Fix missing @Override annotation on BufferedRandomAccessFile method

2020-01-16 Thread GitBox
lamber-ken opened a new pull request #1236: [MINOR] Fix missing @Override 
annotation on BufferedRandomAccessFile method
URL: https://github.com/apache/incubator-hudi/pull/1236
 
 
   ## What is the purpose of the pull request
   
   Fix missing @Override annotation on the BufferedRandomAccessFile method
   
   ## Brief change log
   
 - *Fix missing @Override annotation on the BufferedRandomAccessFile method*
   
   ## Verify this pull request
   
   This pull request is code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12

2020-01-16 Thread GitBox
zhedoubushishi commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12
URL: https://github.com/apache/incubator-hudi/pull/1226#issuecomment-575296997
 
 
   > @leesf : FYI : This diff will have implications on how we release packages. 
We will be changing the names of the packages hudi-spark, hudi-spark-bundle and 
hudi-utilities-bundle to include the Scala version. As part of building jars when 
releasing, you would have to run "mvn clean install xxx" twice: once without any 
additional settings to build the Scala 2.11 versions, and then run
   > 
   > dev/change-scala-version 2.12
   > mvn -Pscala-2.12 clean install
   > 
   > for 2.12
   > 
   > cc @vinothchandar
   
   Also, there is another way to do this: I can rename the artifact id, e.g. 
```hudi-spark``` to ```hudi-spark_${scala.binary.version}```, and thus avoid 
using ```dev/change-scala-version 2.12```. I am not sure why Spark did not 
choose this approach.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (b39458b -> 8a3a503)

2020-01-16 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from b39458b  [MINOR] Make constant fields final in HoodieTestDataGenerator 
(#1234)
 add 8a3a503  [MINOR] Fix missing @Override annotation on 
BufferedRandomAccessFile method (#1236)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/common/util/BufferedRandomAccessFile.java  | 10 ++
 1 file changed, 10 insertions(+)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1236: [MINOR] Fix missing @Override annotation on BufferedRandomAccessFile method

2020-01-16 Thread GitBox
vinothchandar merged pull request #1236: [MINOR] Fix missing @Override 
annotation on BufferedRandomAccessFile method
URL: https://github.com/apache/incubator-hudi/pull/1236
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-540) Incorrect archive directory path in show archived commits cli

2020-01-16 Thread Venkatesh (Jira)
Venkatesh created HUDI-540:
--

 Summary: Incorrect archive directory path in show archived commits 
cli
 Key: HUDI-540
 URL: https://issues.apache.org/jira/browse/HUDI-540
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: CLI
 Environment: EMR, S3 
Reporter: Venkatesh


The archive path is specified as {{new Path(basePath + 
"/.hoodie/.commits_.archive*")}}, but it should be {{new Path(basePath + 
"/.hoodie/archived/.commits_.archive*")}}.

We are using S3 to store the Hudi dataset, if that matters.

 

[https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L143]

 

[https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L66]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-01-16 Thread Javier Vega (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Javier Vega updated HUDI-528:
-
Description: 
When trying to create an incremental view of a dataset, an exception is thrown 
when the latest commit in the time range is empty. In order to determine the 
schema of the dataset, Hudi will grab the [latest commit file, parse it, and 
grab the first metadata file 
path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
 If the latest commit was empty though, the field which is used to determine 
file paths (partitionToWriteStats) will be empty causing the following 
exception:

 

 
{code:java}
java.util.NoSuchElementException
  at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
  at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
  at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

{code}

  was:
When trying to create an incremental view of a dataset, an exception is thrown 
when the latest commit in the time range is empty. In order to determine the 
schema of the dataset, Hudi will grab the [latest commit file, parse it, and 
grab the first metadata file 
path|[https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80]].
 If the latest commit was empty though, the field which is used to determine 
file paths (partitionToWriteStats) will be empty causing the following 
exception:

 

 
{code:java}
java.util.NoSuchElementException
  at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
  at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
  at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)

{code}


> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Priority: Minor
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}
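
A possible guard, sketched for illustration (not the committed fix; assumes the
commit list is ordered newest first and uses Hudi's commit-metadata accessors):
{code:java}
// Skip commits with no write stats when choosing a commit to infer the schema from.
HoodieCommitMetadata schemaCommit = null;
for (HoodieInstant instant : commitsToPull) { // assumed newest first
  HoodieCommitMetadata metadata = HoodieCommitMetadata.fromBytes(
      commitTimeline.getInstantDetails(instant).get());
  if (!metadata.getPartitionToWriteStats().isEmpty()) {
    schemaCommit = metadata; // first non-empty commit carries usable file paths
    break;
  }
}
{code}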



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-575461848
 
 
   @bvaradar I have asked for clarification on 2-3 comments. Otherwise, give me 
some time to fix the failing test cases. Things broke after rebasing with 
master.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017739#comment-17017739
 ] 

Vinoth Chandar commented on HUDI-538:
-

I think the first step here is to pull the Spark-facing stuff into a hudi-spark 
module and the common code into hudi-writer-common.

Next, I can take up some tasks around cleaning up the classes in 
hudi-writer-common (I have enough context, since I wrote a bunch of this).

 

Let's create sub-tasks for that? 

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-510) Update site documentation in sync with cWiki

2020-01-16 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-510:
--

Assignee: Bhavani Sudha  (was: Vinoth Chandar)

> Update site documentation in sync with cWiki
> 
>
> Key: HUDI-510
> URL: https://issues.apache.org/jira/browse/HUDI-510
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread Zijie Lu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017674#comment-17017674
 ] 

Zijie Lu commented on HUDI-538:
---

[~yanghua], I am also interested in this issue and want to contribute to 
integrating Hudi with Flink. Please @ me if anything is needed.

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-536) Update release notes to include KeyGenerator package changes

2020-01-16 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-536:
---
Status: Open  (was: New)

> Update release notes to include KeyGenerator package changes
> 
>
> Key: HUDI-536
> URL: https://issues.apache.org/jira/browse/HUDI-536
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Brandon Scheller
>Priority: Major
> Fix For: 0.5.1
>
>
> The change introduced here:
>  [https://github.com/apache/incubator-hudi/pull/1194]
> refactors Hudi key generators into their own package.
> We need to make this a backwards-compatible change or update the release 
> notes to address it.
> Specifically:
> org.apache.hudi.ComplexKeyGenerator -> 
> org.apache.hudi.keygen.ComplexKeyGenerator
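
For anyone hitting this on upgrade, the rename surfaces in configs that name the
key generator class, e.g. the datasource writer option (a fragment; option key
per the 0.5.x datasource):
{code:java}
// before (0.5.0 and earlier):
.option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.ComplexKeyGenerator")
// from 0.5.1 onward:
.option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
{code}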



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2020-01-16 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-29:
--
Fix Version/s: (was: 0.5.1)
   0.6.0

> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017682#comment-17017682
 ] 

vinoyang commented on HUDI-538:
---

[~alfredlu] (y)

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12

2020-01-16 Thread GitBox
bvaradar commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12
URL: https://github.com/apache/incubator-hudi/pull/1226#issuecomment-575469290
 
 
   > Also, there is another way to do this: I can rename the artifact id, 
e.g. `hudi-spark` to `hudi-spark_${scala.binary.version}`, and thus avoid using 
`dev/change-scala-version 2.12`. I am not sure why Spark did not choose this 
approach.
   
   @zhedoubushishi : If the artifact id renaming to 
hudi-spark_${scala.binary.version} works, please go for it and make the change. 
I will take a look at this PR tomorrow morning PST again.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] melin opened a new issue #1240: future support for multi-client concurrent write?

2020-01-16 Thread GitBox
melin opened a new issue #1240: future support for multi-client concurrent 
write?
URL: https://github.com/apache/incubator-hudi/issues/1240
 
 
   future support for multi-client concurrent write?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12

2020-01-16 Thread GitBox
zhedoubushishi commented on issue #1226: [HUDI-238] Make Hudi support Scala 2.12
URL: https://github.com/apache/incubator-hudi/pull/1226#issuecomment-575513618
 
 
   > > Also, there is another way to do this: I can rename the artifact id, 
e.g. `hudi-spark` to `hudi-spark_${scala.binary.version}`, and thus avoid using 
`dev/change-scala-version 2.12`. I am not sure why Spark did not choose this 
approach.
   > 
   > @zhedoubushishi : If the artifact id renaming to 
hudi-spark_${scala.binary.version} works, please go for it and make the change. 
I will take a look at this PR tomorrow morning PST again.
   
   Sorry, I just found that ```hudi-spark_${scala.binary.version}``` might not 
work. See the dependency tree of hudi-cli (hudi-cli depends on hudi-utilities, 
and hudi-utilities depends on hudi-spark and other Spark libraries):
   ```
   $ mvn dependency:tree -Pscala-2.12 -pl hudi-cli
   
   [INFO] --< org.apache.hudi:hudi-cli 
>--
   [INFO] Building hudi-cli 0.5.1-SNAPSHOT
   [INFO] [ jar 
]-
   [INFO] 
   [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-cli ---
   [INFO] org.apache.hudi:hudi-cli:jar:0.5.1-SNAPSHOT
   [INFO] +- org.scala-lang:scala-library:jar:2.12.10:compile
   [INFO] +- org.apache.hudi:hudi-hive:jar:0.5.1-SNAPSHOT:compile
   [INFO] |  +- org.apache.hudi:hudi-hadoop-mr:jar:0.5.1-SNAPSHOT:compile
   [INFO] |  \- com.beust:jcommander:jar:1.72:compile
   ...
   [INFO] +- org.apache.hudi:hudi-utilities_2.12:jar:0.5.1-SNAPSHOT:compile
   [INFO] |  +- org.apache.hudi:hudi-spark_2.11:jar:0.5.1-SNAPSHOT:compile
   [INFO] |  +- 
com.fasterxml.jackson.module:jackson-module-scala_2.11:jar:2.6.7.1:compile
   [INFO] |  |  +- org.scala-lang:scala-reflect:jar:2.11.8:compile
   [INFO] |  |  \- 
com.fasterxml.jackson.module:jackson-module-paranamer:jar:2.7.9:compile
   [INFO] |  +- org.apache.spark:spark-streaming_2.11:jar:2.4.4:compile
   [INFO] |  |  +- org.apache.spark:spark-core_2.11:jar:2.4.4:compile
   [INFO] |  |  |  +- com.twitter:chill_2.11:jar:0.9.3:compile
   [INFO] |  |  |  +- org.apache.spark:spark-launcher_2.11:jar:2.4.4:compile
   [INFO] |  |  |  +- org.apache.spark:spark-kvstore_2.11:jar:2.4.4:compile
   [INFO] |  |  |  +- 
org.apache.spark:spark-network-common_2.11:jar:2.4.4:compile
   [INFO] |  |  |  +- 
org.apache.spark:spark-network-shuffle_2.11:jar:2.4.4:compile
   [INFO] |  |  |  +- org.apache.spark:spark-unsafe_2.11:jar:2.4.4:compile
   [INFO] |  |  |  \- org.json4s:json4s-jackson_2.11:jar:3.5.3:compile
   [INFO] |  |  | \- org.json4s:json4s-core_2.11:jar:3.5.3:compile
   [INFO] |  |  |+- org.json4s:json4s-ast_2.11:jar:3.5.3:compile
   [INFO] |  |  |+- org.json4s:json4s-scalap_2.11:jar:3.5.3:compile
   [INFO] |  |  |\- 
org.scala-lang.modules:scala-xml_2.11:jar:1.0.6:compile
   [INFO] |  |  \- org.apache.spark:spark-tags_2.11:jar:2.4.4:compile
   [INFO] |  +- 
org.apache.spark:spark-streaming-kafka-0-10_2.11:jar:2.4.4:compile
   [INFO] |  |  \- org.apache.kafka:kafka-clients:jar:2.0.0:compile
   [INFO] |  +- 
org.apache.spark:spark-streaming-kafka-0-10_2.11:jar:tests:2.4.4:compile
   [INFO] |  +- org.antlr:stringtemplate:jar:4.0.2:compile
   [INFO] |  |  \- org.antlr:antlr-runtime:jar:3.3:compile
   [INFO] |  +- com.twitter:bijection-avro_2.11:jar:0.9.3:compile
   [INFO] |  |  +- com.twitter:bijection-core_2.11:jar:0.9.3:compile
   [INFO] |  |  \- org.scoverage:scalac-scoverage-runtime_2.11:jar:1.3.0:compile
   [INFO] |  +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
   [INFO] |  +- io.confluent:common-config:jar:3.0.0:compile
   [INFO] |  +- io.confluent:common-utils:jar:3.0.0:compile
   [INFO] |  |  \- com.101tec:zkclient:jar:0.5:compile
   [INFO] |  +- io.confluent:kafka-schema-registry-client:jar:3.0.0:compile
   [INFO] |  \- org.apache.httpcomponents:httpcore:jar:4.3.2:compile
   ```
   
   Although the scala-2.12 profile overrides 
```hudi-utilities_${scala.binary.version}``` to ```hudi-utilities_2.12```, when 
it comes to the dependencies of hudi-utilities_2.12, it seems Maven only uses 
the default ```scala.binary.version``` inside 
```hudi-utilities_${scala.binary.version}```, which is 2.11. Thus all the 
dependencies of ```hudi-utilities_${scala.binary.version}``` end up as 
```xxx_2.11```.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-16 Thread GitBox
yanghua commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, 
and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237#issuecomment-575513806
 
 
   @vinothchandar OK, will review this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-541) Replace variables/comments named "data files" to "base file"

2020-01-16 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-541:
--

Assignee: Bhavani Sudha

> Replace variables/comments named "data files" to "base file"
> 
>
> Key: HUDI-541
> URL: https://issues.apache.org/jira/browse/HUDI-541
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, newbie
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.0
>
>
> Per cWiki design and arch page, we should converge on the same terminology.. 
> We have _HoodieBaseFile_.. we should ensure all variables of this type are 
> named _baseFile_ or _bf_ , as opposed to _dataFile_ or _df_. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
vinothchandar commented on issue #1212: [HUDI-509] Renaming code in sync with 
cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#issuecomment-575512172
 
 
   @bhasudha good catch. I cleaned up what you pointed out.. but there are 
still tons of variables named `dataFile` instead of `baseFile`. I created a 
JIRA for later.. 
   
   @n3nash On the suffix, I believe this is the path of least change for 
existing users. I believe most users will be using the _rt table with MOR.. So 
we can't rename that or drop the suffix there... It's better to add it to the 
RO table to make it clear.. I even added a flag to hive sync so that existing 
users can skip the _ro suffix if needed..
   
   Ideally we would have no suffix on the snapshot table and only a special _ro 
suffix... or even drop the _ro table altogether.. but this is a much larger 
change.. We can pursue this on top of what this PR does anyway 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #161

2020-01-16 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.21 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[GitHub] [incubator-hudi] vinothchandar commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-16 Thread GitBox
vinothchandar commented on issue #1237: [MINOR] Code Cleanup, remove redundant 
code, and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237#issuecomment-575491467
 
 
   @yanghua Are you able to do a quick review of this? I am dealing with some 
other release blockers.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1229: [HUDI-535] Ensure Compaction Plan is always written in .aux folder to avoid 0.5.0/0.5.1 reader-writer compatibility issues

2020-01-16 Thread GitBox
vinothchandar commented on issue #1229: [HUDI-535] Ensure Compaction Plan is 
always written in .aux folder to avoid 0.5.0/0.5.1 reader-writer compatibility 
issues
URL: https://github.com/apache/incubator-hudi/pull/1229#issuecomment-575491975
 
 
   @bvaradar any updates?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread Zijie Lu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017750#comment-17017750
 ] 

Zijie Lu commented on HUDI-538:
---

[~vinoth] And any plan for integrating hudi with flink?

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yihua opened a new pull request #1239: [MINOR] Abstract a test case class for DFS Source to make it extensible

2020-01-16 Thread GitBox
yihua opened a new pull request #1239: [MINOR] Abstract a test case class for 
DFS Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239
 
 
   ## What is the purpose of the pull request
   
   The `TestDFSSource` class contains redundant test logic for the different 
`Source`s reading from DFS.  This PR abstracts a test case class for DFS 
sources to make it extensible, so that tests for a new DFS `Source`, such as 
`CsvDFSSource` coming in HUDI-76, can be added easily.
   
   ## Brief change log
   
 - Refactored the code in `TestDFSSource` to add a new inner abstract class 
`DFSSourceTestCase` that makes the test case extensible, and switched the 
existing tests to the new class (see the sketch below).
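
   As a rough illustration of the shape of this refactoring (a hedged sketch: 
the class and method names below are illustrative, not the PR's actual code):

```java
// The shared fetch-and-assert logic lives in an abstract base, so a new DFS
// source test only has to supply its own fixture and source construction.
import java.util.List;

abstract class AbstractDFSSourceTest<R> {

  /** Write format-specific test data (JSON, Avro, CSV, ...) to the test DFS path. */
  protected abstract void writeTestData(int count);

  /** Read the records back through the Source implementation under test. */
  protected abstract List<R> fetchRecords();

  /** Shared test body reused by every concrete DFS source test. */
  public void runFetchTest(int expected) {
    writeTestData(expected);
    List<R> records = fetchRecords();
    if (records.size() != expected) {
      throw new AssertionError("expected " + expected + " records, got " + records.size());
    }
  }
}
```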
   
   ## Verify this pull request
   
   This pull request is a code cleanup in the tests, and the modified tests 
have been verified to pass.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-541) Replace variables/comments named "data files" to "base file"

2020-01-16 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-541:

Issue Type: Improvement  (was: Bug)

> Replace variables/comments named "data files" to "base file"
> 
>
> Key: HUDI-541
> URL: https://issues.apache.org/jira/browse/HUDI-541
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, newbie
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Per the cWiki design and arch page, we should converge on the same 
> terminology. We have _HoodieBaseFile_; we should ensure all variables of this 
> type are named _baseFile_ or _bf_, as opposed to _dataFile_ or _df_.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-541) Replace variables/comments named "data files" to "base file"

2020-01-16 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-541:
---

 Summary: Replace variables/comments named "data files" to "base 
file"
 Key: HUDI-541
 URL: https://issues.apache.org/jira/browse/HUDI-541
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Code Cleanup, newbie
Reporter: Vinoth Chandar
 Fix For: 0.6.0


Per the cWiki design and arch page, we should converge on the same terminology. 
We have _HoodieBaseFile_; we should ensure all variables of this type are named 
_baseFile_ or _bf_, as opposed to _dataFile_ or _df_.
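
For illustration, a sketch of the intended rename (the accessor names follow 
the HUDI-509 renaming and are illustrative here, not a prescribed diff):

{code:java}
// fileSlice is an org.apache.hudi.common.model.FileSlice instance.

// Before: variables named after "data file"
HoodieDataFile df = fileSlice.getDataFile().get();

// After: converge on the "base file" terminology
HoodieBaseFile baseFile = fileSlice.getBaseFile().get();
{code}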



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar merged pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
vinothchandar merged pull request #1212: [HUDI-509] Renaming code in sync with 
cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code 
in sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367679877
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
 ##
 @@ -249,10 +249,10 @@ private void ensurePartitionLoadedCorrectly(String 
partition) {
*
* @param statuses List of File-Status
*/
-  private Stream<HoodieDataFile> convertFileStatusesToDataFiles(FileStatus[] statuses) {
+  private Stream<HoodieBaseFile> convertFileStatusesToDataFiles(FileStatus[] statuses) {
 
 Review comment:
   convertFileStatusesToBaseFiles instead? No strong opinion on these; I think 
it should be okay to leave it as is, since the return type is indicative already.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hmatu commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-16 Thread GitBox
hmatu commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, 
and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237#issuecomment-575414268
 
 
   Big thanks for your work on this, but like 
https://github.com/apache/incubator-hudi/pull/1159, it's hard to review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-540) Incorrect archive directory path in show archived commits cli

2020-01-16 Thread hong dongdong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017639#comment-17017639
 ] 

hong dongdong commented on HUDI-540:


[~venkee14] It needs
{code:java}
@CliOption(key = {"archiveFolderPattern"}, help = "Archive Folder", unspecifiedDefaultValue = "") String folder
{code}
The command 'show archived commit stats' always works correctly with this 
CliOption.

HoodieSparkSqlWriter.scala sets archiveLogFolder to 'archived' when creating a 
new table; otherwise the default is "".

I will try to address this later.
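
A minimal sketch of the direction (assuming the folder value comes from the 
CliOption above; this is not the final patch):

{code:java}
import org.apache.hadoop.fs.Path;

class ArchiveGlobSketch {
  // Build the archive glob from the configured folder instead of hard-coding it.
  // New tables get archiveLogFolder = 'archived'; older tables may have "".
  static Path archiveGlob(String basePath, String archiveFolder) {
    String folder = archiveFolder.isEmpty() ? "" : archiveFolder + "/";
    return new Path(basePath + "/.hoodie/" + folder + ".commits_.archive*");
  }
}
{code}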

> Incorrect archive directory path in show archived commits cli
> -
>
> Key: HUDI-540
> URL: https://issues.apache.org/jira/browse/HUDI-540
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
> Environment: EMR, S3 
>Reporter: Venkatesh
>Priority: Major
>
> Archive path is specified as new Path(basePath + 
> "/.hoodie/.commits_.archive*"), but should be new Path(basePath + 
> "/.hoodie/archived/.commits_.archive*").
> We are using S3 to store the Hudi dataset, if that matters.
>  
> [https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L143]
>  
> [https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L66]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-540) Incorrect archive directory path in show archived commits cli

2020-01-16 Thread hong dongdong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hong dongdong reassigned HUDI-540:
--

Assignee: hong dongdong

> Incorrect archive directory path in show archived commits cli
> -
>
> Key: HUDI-540
> URL: https://issues.apache.org/jira/browse/HUDI-540
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI
> Environment: EMR, S3 
>Reporter: Venkatesh
>Assignee: hong dongdong
>Priority: Major
>
> Archive path is specified as new Path(basePath + 
> "/.hoodie/.commits_.archive*"), but should be new Path(basePath + 
> "/.hoodie/archived/.commits_.archive*").
> We are using S3 to store the Hudi dataset, if that matters.
>  
> [https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L143]
>  
> [https://github.com/apache/incubator-hudi/blob/a733f4ef723865738d8541282c0c7234d64668db/hudi-cli/src/main/java/org/apache/hudi/cli/commands/ArchivedCommitsCommand.java#L66]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in 
sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367687172
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
 ##
 @@ -249,10 +249,10 @@ private void ensurePartitionLoadedCorrectly(String 
partition) {
*
* @param statuses List of File-Status
*/
-  private Stream<HoodieDataFile> convertFileStatusesToDataFiles(FileStatus[] statuses) {
+  private Stream<HoodieBaseFile> convertFileStatusesToDataFiles(FileStatus[] statuses) {
 
 Review comment:
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in 
sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367687094
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/DataFileDTO.java
 ##
 @@ -36,22 +36,22 @@
   @JsonProperty("fileLen")
   private long fileLen;
 
-  public static HoodieDataFile toHoodieDataFile(DataFileDTO dto) {
+  public static HoodieBaseFile toHoodieDataFile(DataFileDTO dto) {
 if (null == dto) {
   return null;
 }
 
-HoodieDataFile dataFile = null;
+HoodieBaseFile dataFile = null;
 if (null != dto.fileStatus) {
-  dataFile = new HoodieDataFile(FileStatusDTO.toFileStatus(dto.fileStatus));
+  dataFile = new HoodieBaseFile(FileStatusDTO.toFileStatus(dto.fileStatus));
 } else {
-  dataFile = new HoodieDataFile(dto.fullPath);
+  dataFile = new HoodieBaseFile(dto.fullPath);
   dataFile.setFileLen(dto.fileLen);
 }
 return dataFile;
   }
 
-  public static DataFileDTO fromHoodieDataFile(HoodieDataFile dataFile) {
+  public static DataFileDTO fromHoodieDataFile(HoodieBaseFile dataFile) {
 
 Review comment:
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on issue #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
bhasudha commented on issue #1212: [HUDI-509] Renaming code in sync with cWiki 
restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#issuecomment-575380133
 
 
   LGTM overall. Left minor comments; I was just thinking about them. Feel free 
to ignore if it's too much refactoring.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
n3nash commented on a change in pull request #1212: [HUDI-509] Renaming code in 
sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367687060
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/DataFileDTO.java
 ##
 @@ -36,22 +36,22 @@
   @JsonProperty("fileLen")
   private long fileLen;
 
-  public static HoodieDataFile toHoodieDataFile(DataFileDTO dto) {
+  public static HoodieBaseFile toHoodieDataFile(DataFileDTO dto) {
 
 Review comment:
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] melin opened a new issue #1238: planned hudi-client not dependent on spark api?

2020-01-16 Thread GitBox
melin opened a new issue #1238: planned hudi-client not dependent on spark api?
URL: https://github.com/apache/incubator-hudi/issues/1238
 
 
   hudi-client relies heavily on the Spark API; can it be replaced?
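
   For context, the decoupling tracked in HUDI-538 usually takes a shape like 
the following purely illustrative sketch (none of these names exist in Hudi 
today):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Engine-specific parallelism hidden behind an interface, so hudi-client
// would no longer import JavaSparkContext directly.
interface HoodieEngineContext {
  <I, O> List<O> map(List<I> data, Function<I, O> func);
}

// A local/testing implementation; a Spark-backed one would delegate to
// JavaSparkContext#parallelize(...).map(...).collect() instead.
class LocalEngineContext implements HoodieEngineContext {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> func) {
    return data.stream().map(func).collect(Collectors.toList());
  }
}
```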


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread hong dongdong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017648#comment-17017648
 ] 

hong dongdong commented on HUDI-538:


[~yanghua] I am interested in this work; please @ me if anything is needed. I 
had wanted to do this before.

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-539) No FileSystem for scheme: abfss

2020-01-16 Thread hong dongdong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017644#comment-17017644
 ] 

hong dongdong commented on HUDI-539:


[~ssomuah] As far as I know, ABFS has been supported since hadoop-3.2.0; you 
may need a higher version of Hadoop.
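
For illustration, the failure described in this issue comes down to which 
Configuration the path is resolved against (a sketch; jobConf stands in for 
the cluster's actual runtime configuration):

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class AbfssResolutionSketch {
  // Resolving against a blank Configuration drops settings such as the
  // fs.abfss.impl binding that may only exist in the job's runtime
  // configuration, manifesting as "No FileSystem for scheme: abfss".
  static FileSystem resolve(Path path, Configuration jobConf) throws IOException {
    // path.getFileSystem(new Configuration()) reproduces the failure here;
    // passing the real jobConf avoids it, with hadoop-azure on the classpath.
    return path.getFileSystem(jobConf);
  }
}
{code}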

> No FileSystem for scheme: abfss
> ---
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Priority: Major
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that in {{HoodieROTablePathFilter}} it tries 
> to get a file path passing in a blank hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> Stacktrace
> java.io.IOException: No FileSystem for scheme: abfss
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
> at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
> at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-16 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017670#comment-17017670
 ] 

vinoyang commented on HUDI-538:
---

[~hongdongdong] Glad to hear this news. Welcome!

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. This makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code 
in sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367678245
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/DataFileDTO.java
 ##
 @@ -36,22 +36,22 @@
   @JsonProperty("fileLen")
   private long fileLen;
 
-  public static HoodieDataFile toHoodieDataFile(DataFileDTO dto) {
+  public static HoodieBaseFile toHoodieDataFile(DataFileDTO dto) {
 if (null == dto) {
   return null;
 }
 
-HoodieDataFile dataFile = null;
+HoodieBaseFile dataFile = null;
 if (null != dto.fileStatus) {
-  dataFile = new HoodieDataFile(FileStatusDTO.toFileStatus(dto.fileStatus));
+  dataFile = new HoodieBaseFile(FileStatusDTO.toFileStatus(dto.fileStatus));
 } else {
-  dataFile = new HoodieDataFile(dto.fullPath);
+  dataFile = new HoodieBaseFile(dto.fullPath);
   dataFile.setFileLen(dto.fileLen);
 }
 return dataFile;
   }
 
-  public static DataFileDTO fromHoodieDataFile(HoodieDataFile dataFile) {
+  public static DataFileDTO fromHoodieDataFile(HoodieBaseFile dataFile) {
 
 Review comment:
   'fromHoodieBaseFile' instead?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code in sync with cWiki restructuring

2020-01-16 Thread GitBox
bhasudha commented on a change in pull request #1212: [HUDI-509] Renaming code 
in sync with cWiki restructuring
URL: https://github.com/apache/incubator-hudi/pull/1212#discussion_r367678068
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/DataFileDTO.java
 ##
 @@ -36,22 +36,22 @@
   @JsonProperty("fileLen")
   private long fileLen;
 
-  public static HoodieDataFile toHoodieDataFile(DataFileDTO dto) {
+  public static HoodieBaseFile toHoodieDataFile(DataFileDTO dto) {
 
 Review comment:
   @vinothchandar Do we want to rename the method as well, to 'toHoodieBaseFile'?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2020-01-16 Thread GitBox
n3nash commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource 
to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-575381731
 
 
   @yanghua Are we close to calling this the first version of the test suite?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi opened a new pull request #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-16 Thread GitBox
smarthi opened a new pull request #1237: [MINOR] Code Cleanup, remove redundant 
code, and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Code cleanup, and other minor tweaks
   
   ## Brief change log
   
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367363989
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TableConfig.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+import java.util.Objects;
+
+/*
+Represents object with all the topic level overrides for multi table delta streamer execution
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
 
 Review comment:
   I get your point; I will move the class to the utilities package. Can you 
please elaborate on what you mean by TargetTableConfig? All the properties in 
this class only apply to the Source. I guess that with a few changes in the 
validation part, the same class could be used for non-Kafka sources too? Let 
me know your thoughts on this. @bvaradar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367366905
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,242 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.TableConfig;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTopics;
+  private Set<String> failedTopics;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) {
+    this.tableExecutionObjects = new ArrayList<>();
+    this.successTopics = new HashSet<>();
+    this.failedTopics = new HashSet<>();
+    this.jssc = jssc;
+    String tableConfigFile = getCustomPropsFileName(args);
+    FileSystem fs = FSUtils.getFs(tableConfigFile, jssc.hadoopConfiguration());
+    List<TableConfig> configList = UtilHelpers.readTableConfig(fs, new Path(tableConfigFile)).getConfigs();
+
+    for (TableConfig config : configList) {
+      validateTableConfigObject(config);
+      populateTableExecutionObjectList(config, args);
+    }
+  }
+
+  /*
+  validate if given object has all the necessary fields.
+  Throws IllegalArgumentException if any of the required fields are missing
+   */
+  private void validateTableConfigObject(TableConfig config) {
+    if (Strings.isNullOrEmpty(config.getDatabase()) || Strings.isNullOrEmpty(config.getTableName())
+        || Strings.isNullOrEmpty(config.getPrimaryKeyField()) || Strings.isNullOrEmpty(config.getTopic())) {
+      throw new IllegalArgumentException("Please provide valid table config arguments!");
+    }
+  }
+
+  private void populateTableExecutionObjectList(TableConfig config, String[] args) {
+    TableExecutionObject executionObject;
+    try {
+      final Config cfg = new Config();
+      String[] tableArgs = args.clone();
+      String targetBasePath = resetTarget(tableArgs, config.getDatabase(), config.getTableName());
+      JCommander cmd = new JCommander(cfg);
+      cmd.parse(tableArgs);
+      cfg.targetBasePath = targetBasePath;
+      FileSystem fs = FSUtils.getFs(cfg.targetBasePath, jssc.hadoopConfiguration());
+      TypedProperties typedProperties = UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), cfg.configs).getConfig();
+      populateIngestionProps(typedProperties, config);
+      populateSchemaProviderProps(cfg, typedProperties, config);
+      populateHiveSyncProps(cfg, typedProperties, config);
+      executionObject = new TableExecutionObject();
+      executionObject.setConfig(cfg);
+      executionObject.setProperties(typedProperties);
+      executionObject.setTableConfig(config);
+      this.tableExecutionObjects.add(executionObject);
+    } catch (Exception e) {
+      logger.error("Error while creating execution object for topic: " + config.getTopic(), e);
+      throw e;
+    }
+  

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367371389
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -156,6 +167,10 @@ public Operation convert(String value) throws 
ParameterException {
 required = true)
 public String targetBasePath;
 
+@Parameter(names = {"--base-path-prefix"},
 
 Review comment:
   Makes sense; I will add the support.
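
   For what it's worth, a minimal sketch of how the prefix could expand into a 
per-table target base path (the path convention here is an assumption, not the 
final design):

```java
// Expand --base-path-prefix into a per-table target base path, e.g.
// "s3://bucket/lake" + "db1" + "events" -> "s3://bucket/lake/db1/events".
static String resolveTargetBasePath(String basePathPrefix, String database, String tableName) {
  return basePathPrefix + "/" + database + "/" + tableName;
}
```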


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] XuQianJin-Stars commented on issue #1095: [HUDI-210] Implement prometheus metrics reporter

2020-01-16 Thread GitBox
XuQianJin-Stars commented on issue #1095: [HUDI-210] Implement prometheus 
metrics reporter
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-575115811
 
 
   Hi @leesf @lamber-ken, do you have any time to continue reviewing this PR?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-01-16 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r367363989
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TableConfig.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+import java.util.Objects;
+
+/*
+Represents object with all the topic level overrides for multi table delta streamer execution
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
 
 Review comment:
   I get your point; I will move the class to the utilities package. Can you 
please elaborate on what you mean by TargetTableConfig? All the properties in 
this class only apply to the Source. I guess that with a few changes in the 
validation part (the validateTableConfigObject function in 
HoodieMultiTableDeltaStreamer.java), the same class could be used for 
non-Kafka sources too? Let me know your thoughts on this. @bvaradar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services