Build failed in Jenkins: hudi-snapshot-deployment-0.5 #315

2020-06-20 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.38 KB...]
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [hudi] vinothchandar commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-20 Thread GitBox


vinothchandar commented on pull request #1100:
URL: https://github.com/apache/hudi/pull/1100#issuecomment-647073091


   > The plan is to setup the nightly performance build by @yanghua and a bunch 
of follow up items that probably captures the essence of what you have in mind 
for th
   
   As it stands now, is this going to add new tests to our CI? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-20 Thread GitBox


vinothchandar commented on pull request #1100:
URL: https://github.com/apache/hudi/pull/1100#issuecomment-647073129


   I can take a final pass, make any small edits myself, and merge once you do 
the renaming.







[GitHub] [hudi] vinothchandar commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-20 Thread GitBox


vinothchandar commented on pull request #1100:
URL: https://github.com/apache/hudi/pull/1100#issuecomment-647072976


   @n3nash  let’s please call this an integration test, to be consistent with the 
direction @xushiyan has been pushing towards. Unit, integration, and functional are 
standard test terms industry-wide; we have to classify this into one of these. 
hudi-test-suite is very generic IMO







[GitHub] [hudi] vinothchandar commented on issue #1747: [SUPPORT] HiveSynctool syncs wrong location

2020-06-20 Thread GitBox


vinothchandar commented on issue #1747:
URL: https://github.com/apache/hudi/issues/1747#issuecomment-647072617


   @bhasudha  can you please take a pass?







[GitHub] [hudi] vinothchandar commented on issue #933: Support for multiple level partitioning in Hudi

2020-06-20 Thread GitBox


vinothchandar commented on issue #933:
URL: https://github.com/apache/hudi/issues/933#issuecomment-647072567


   @afeldman1 http://hudi.apache.org/contributing#website sorry if this was a 
bit obscure :) site content lives in an `asf-site` branch







[GitHub] [hudi] vinothchandar commented on a change in pull request #1748: [HUDI-1029] Use FastDateFormat for parsing and formating in Timestamp…

2020-06-20 Thread GitBox


vinothchandar commented on a change in pull request #1748:
URL: https://github.com/apache/hudi/pull/1748#discussion_r443176993



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/keygen/TimestampBasedKeyGenerator.java
##
@@ -27,10 +27,10 @@
 import org.apache.hudi.utilities.exception.HoodieDeltaStreamerException;
 
 import org.apache.avro.generic.GenericRecord;
+import org.apache.commons.lang3.time.FastDateFormat;

Review comment:
   We solved a ton of bundle conflicts for commons-lang and no longer use 
it in the project. Please rework this without requiring it.
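Since the review asks for a rework without the commons-lang3 dependency, one JDK-only alternative is `java.time.format.DateTimeFormatter`, which, like `FastDateFormat`, is immutable and thread-safe. A minimal sketch (the class name and pattern below are illustrative, not the actual Hudi key generator code):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: a shared, thread-safe formatter without commons-lang3.
public class TimestampFormatSketch {
  // DateTimeFormatter is immutable, so one instance can be shared freely.
  private static final DateTimeFormatter FORMATTER =
      DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

  public static String format(LocalDateTime ts) {
    return FORMATTER.format(ts);
  }

  public static LocalDateTime parse(String s) {
    return LocalDateTime.parse(s, FORMATTER);
  }

  public static void main(String[] args) {
    LocalDateTime ts = LocalDateTime.of(2020, 6, 20, 12, 30, 0);
    String formatted = format(ts);
    System.out.println(formatted); // 20200620123000
    System.out.println(parse(formatted).equals(ts)); // true
  }
}
```

Unlike `SimpleDateFormat`, no per-call or thread-local instance is needed.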









[jira] [Commented] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-06-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141271#comment-17141271
 ] 

Vinoth Chandar commented on HUDI-1015:
--

Hi renyi, the goal here is to audit all these calls that may be listing the entire 
table (outside of cleaning and rollback, which have their own JIRAs addressing 
this) and see if we can make them more intelligent by only listing some 
partitions.

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan closed pull request #1619: [HUDI-896] Parallelize CI tests by modules

2020-06-20 Thread GitBox


xushiyan closed pull request #1619:
URL: https://github.com/apache/hudi/pull/1619


   







[GitHub] [hudi] xushiyan commented on pull request #1619: [HUDI-896] Parallelize CI tests by modules

2020-06-20 Thread GitBox


xushiyan commented on pull request #1619:
URL: https://github.com/apache/hudi/pull/1619#issuecomment-647068902


   Thank you all for the feedback. Will look into codecov; closing this for 
now.







[GitHub] [hudi] xushiyan commented on pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#issuecomment-647062786


   > @xushiyan codecov seems happy?
   
   @vinothchandar When we split some tests from unit tests to another travis 
job, we lost some coverage reporting in the final report, due to overwriting 
report data. We were discussing this in 
https://github.com/apache/hudi/pull/1619#issuecomment-628260195
   
   I just opened https://github.com/apache/hudi/pull/1753 to verify the 
solution for it. I'll address your comments soon after that.
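One common way to keep split CI jobs from overwriting each other's coverage is to upload each job's report under a distinct codecov flag so the reports are merged. A sketch only; the flag names and module paths below are illustrative, not the project's actual setup:

```yaml
# codecov.yml (illustrative): one flag per CI job, so per-job reports
# are combined in the final coverage instead of overwriting each other
flags:
  unit:
    paths:
      - "hudi-common/"
  functional:
    paths:
      - "hudi-utilities/"
```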







[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443170684



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/SharedResources.java
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+
+import java.io.IOException;
+
+public class SharedResources implements SparkProvider, DFSProvider {
+
+  private static transient SparkSession spark;
+  private static transient SQLContext sqlContext;
+  private static transient JavaSparkContext jsc;
+
+  private static transient HdfsTestService hdfsTestService;
+  private static transient MiniDFSCluster dfsCluster;
+  private static transient DistributedFileSystem dfs;
+
+  /**
+   * An indicator of the initialization status.
+   */
+  protected boolean initialized = false;
+
+  @Override
+  public SparkSession spark() {
+    return spark;
+  }
+
+  @Override
+  public SQLContext sqlContext() {
+    return sqlContext;
+  }
+
+  @Override
+  public JavaSparkContext jsc() {
+    return jsc;
+  }
+
+  @Override
+  public MiniDFSCluster dfsCluster() {
+    return dfsCluster;
+  }
+
+  @Override
+  public DistributedFileSystem dfs() {
+    return dfs;
+  }
+
+  @Override
+  public Path dfsBasePath() {
+    return dfs.getWorkingDirectory();
+  }
+
+  @BeforeEach
+  public synchronized void runBeforeEach() throws Exception {
+    initialized = spark != null && hdfsTestService != null;
+    if (!initialized) {
+      spark = SparkSession.builder()
+          .config(HoodieWriteClient.registerClasses(conf()))
+          .getOrCreate();
+      sqlContext = spark.sqlContext();
+      jsc = new JavaSparkContext(spark.sparkContext());
+
+      FileSystem.closeAll();
+      hdfsTestService = new HdfsTestService();
+      dfsCluster = hdfsTestService.start(true);
+      dfs = dfsCluster.getFileSystem();
+      dfs.mkdirs(dfs.getWorkingDirectory());
+    }
+  }
+
+  @AfterAll
+  public static synchronized void cleanUpAfterAll() throws IOException {

Review comment:
   Tried to close them all in the 
`org.apache.hudi.utilities.functional.FunctionalTestSuite#afterAll` method hook 
but couldn't trigger the method; not sure why. As those services are only to be 
shut down when all suite test classes finish, I could at least close them in a 
JVM shutdown hook.
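A minimal sketch of the JVM shutdown hook idea mentioned above (the class and method names are hypothetical, not the harness's actual API): close the shared services exactly once, when the JVM exits, instead of relying on a per-class `@AfterAll` hook.

```java
// Hypothetical sketch: shared test services closed once via a JVM shutdown hook.
public class ShutdownHookSketch {
  private static boolean closed = false;

  // Idempotent close: safe to call from the hook and from tests.
  public static synchronized void closeResources() {
    if (!closed) {
      closed = true;
      // In the real harness this is where hdfsTestService, spark, etc. would
      // be stopped, e.g. hdfsTestService.stop(); spark.stop();
      System.out.println("shared resources closed");
    }
  }

  public static synchronized boolean isClosed() {
    return closed;
  }

  public static void main(String[] args) {
    // Register once; the hook fires when the JVM exits, i.e. after every
    // test class in the suite has finished.
    Runtime.getRuntime().addShutdownHook(new Thread(ShutdownHookSketch::closeResources));
    // Test classes would run here.
  }
}
```

The trade-off is that the services then live for the whole JVM, which is exactly what a shared harness wants.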









[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443169927



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/SharedResources.java
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+
+import java.io.IOException;
+
+public class SharedResources implements SparkProvider, DFSProvider {
+
+  private static transient SparkSession spark;
+  private static transient SQLContext sqlContext;
+  private static transient JavaSparkContext jsc;
+
+  private static transient HdfsTestService hdfsTestService;
+  private static transient MiniDFSCluster dfsCluster;
+  private static transient DistributedFileSystem dfs;
+
+  /**
+   * An indicator of the initialization status.
+   */
+  protected boolean initialized = false;
+
+  @Override
+  public SparkSession spark() {
+    return spark;
+  }
+
+  @Override
+  public SQLContext sqlContext() {
+    return sqlContext;
+  }
+
+  @Override
+  public JavaSparkContext jsc() {
+    return jsc;
+  }
+
+  @Override
+  public MiniDFSCluster dfsCluster() {
+    return dfsCluster;
+  }
+
+  @Override
+  public DistributedFileSystem dfs() {
+    return dfs;
+  }
+
+  @Override
+  public Path dfsBasePath() {
+    return dfs.getWorkingDirectory();
+  }
+
+  @BeforeEach
+  public synchronized void runBeforeEach() throws Exception {
+    initialized = spark != null && hdfsTestService != null;
+    if (!initialized) {
+      spark = SparkSession.builder()
+          .config(HoodieWriteClient.registerClasses(conf()))
+          .getOrCreate();
+      sqlContext = spark.sqlContext();
+      jsc = new JavaSparkContext(spark.sparkContext());
+
+      FileSystem.closeAll();

Review comment:
   Not quite sure about the harm, but it definitely looks safer to move it to the 
beginning of the method.









[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443169749



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/DFSProvider.java
##
@@ -0,0 +1,34 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.testutils;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+
+public interface DFSProvider {

Review comment:
   Yup, it should be OK to move this in a future PR where we extend to more 
modules' functional tests.









[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443169804



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/SharedResources.java
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+
+import java.io.IOException;
+
+public class SharedResources implements SparkProvider, DFSProvider {

Review comment:
   How about `FunctionalTestHarness`?









[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443169701



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/checkpointing/TestKafkaConnectHdfsProvider.java
##
@@ -19,37 +19,39 @@
 package org.apache.hudi.utilities.checkpointing;
 
 import org.apache.hudi.common.config.TypedProperties;
-import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
 import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.exception.HoodieException;
 
 import org.apache.hadoop.conf.Configuration;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
 
 import java.io.File;
+import java.nio.file.Files;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 
-public class TestKafkaConnectHdfsProvider extends HoodieCommonTestHarness {

Review comment:
   Makes sense.









[jira] [Created] (HUDI-1034) Document info about test structure and guide

2020-06-20 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-1034:


 Summary: Document info about test structure and guide
 Key: HUDI-1034
 URL: https://issues.apache.org/jira/browse/HUDI-1034
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Docs
Reporter: Raymond Xu


Create a test guide section in the contribution guide to lay out the test 
structure and other tips for writing tests.

 

Quote [~vinothchandar]

unit - testing basic functionality at the class level, potentially using mocks. 
Expected to finish quickly.
functional - brings up the services needed and runs tests without mocking.
integration - runs a subset of functional tests, on a full-fledged environment 
with dockerized services.
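As a sketch of how such a split could be enforced in the build (illustrative only, not the project's actual configuration), JUnit 5 tags can be selected per Maven Surefire execution:

```xml
<!-- Illustrative: run only tests tagged "functional" in this execution,
     excluding anything tagged "integration" -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <groups>functional</groups>
    <excludedGroups>integration</excludedGroups>
  </configuration>
</plugin>
```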

 





[GitHub] [hudi] xushiyan commented on a change in pull request #1746: [HUDI-996] Add functional test suite

2020-06-20 Thread GitBox


xushiyan commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443169647



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieSnapshotExporter.java
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.utilities.HoodieSnapshotExporter.OutputFormatValidator;
+
+import com.beust.jcommander.ParameterException;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.NullSource;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+public class TestHoodieSnapshotExporter {

Review comment:
   @vinothchandar yes, exactly. Created 
https://issues.apache.org/jira/browse/HUDI-1034 for this.









[GitHub] [hudi] xushiyan opened a new pull request #1753: [WIP] [HUDI-896] Report test coverages by modules

2020-06-20 Thread GitBox


xushiyan opened a new pull request #1753:
URL: https://github.com/apache/hudi/pull/1753


   wip
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] bvaradar commented on a change in pull request #1690: [HUDI-908] Add some data types to HoodieTestDataGenerator and fix some bugs

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1690:
URL: https://github.com/apache/hudi/pull/1690#discussion_r443164942



##
File path: hudi-utilities/src/test/resources/delta-streamer-config/source.avsc
##
@@ -43,8 +43,41 @@
   }, {
 "name" : "end_lon",
 "type" : "double"
-  },
-  {
+  }, {
+"name" : "int_val",

Review comment:
   Minor: instead of naming based on types, can we give relevant names like 
`distance_in_meters` (int_val), `seconds_since_epoch` (long_val), 
`payment_in_dollars`, ...
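The suggestion, sketched against the schema fragment above (field names are illustrative, not final):

```json
{ "name": "distance_in_meters", "type": "int" },
{ "name": "seconds_since_epoch", "type": "long" },
{ "name": "payment_in_dollars", "type": "double" }
```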









[GitHub] [hudi] bvaradar commented on a change in pull request #1752: [WIP] [HUDI-575] Support Async Compaction for spark streaming writes to hudi table

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1752:
URL: https://github.com/apache/hudi/pull/1752#discussion_r443162781



##
File path: 
hudi-client/src/main/java/org/apache/hudi/async/AsyncCompactService.java
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.async;
+
+import org.apache.hudi.client.Compactor;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.concurrent.BlockingQueue;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.LinkedBlockingQueue;
+import java.util.concurrent.TimeUnit;
+import java.util.concurrent.locks.Condition;
+import java.util.concurrent.locks.ReentrantLock;
+import java.util.stream.IntStream;
+
+/**
+ * Async Compactor Service that runs in a separate thread. Currently, only one compactor is allowed to run at any time.
+ */
+public class AsyncCompactService extends AbstractAsyncService {
+
+  private static final long serialVersionUID = 1L;
+  private static final Logger LOG = LogManager.getLogger(AsyncCompactService.class);
+
+  /**
+   * This is the job pool used by async compaction.
+   * In the case of the deltastreamer, Spark job scheduling configs are automatically set.
+   * As the configs need to be set before the Spark context is initialized, this is not
+   * automated for Structured Streaming.
+   * https://spark.apache.org/docs/latest/job-scheduling.html

Review comment:
   https://jira.apache.org/jira/browse/HUDI-1031 to add to docs
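For context, the javadoc above refers to Spark's FAIR scheduler pools. A minimal sketch of what the deltastreamer configures automatically, using standard Spark job-scheduling properties (the allocation-file path is a placeholder, not taken from Hudi):

```
# spark-defaults.conf: these must be in place before the SparkContext
# is created, which is why this cannot be automated for Structured
# Streaming (the user owns the SparkSession there).
spark.scheduler.mode              FAIR
spark.scheduler.allocation.file   /path/to/fairscheduler.xml
```

Jobs issued from the compaction thread can then be routed to a pool at runtime via `jsc.setLocalProperty("spark.scheduler.pool", "<pool-name>")`.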





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-06-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-575:

Labels: pull-request-available  (was: )

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for Structured Streaming writes.
>  
> We need to 
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar opened a new pull request #1752: [WIP] [HUDI-575] Support Async Compaction for spark streaming writes to hudi table

2020-06-20 Thread GitBox


bvaradar opened a new pull request #1752:
URL: https://github.com/apache/hudi/pull/1752


   This PR depends on #1577. It has two commits: the first corresponds to #1577 and the second is for this PR.
   
   Contains:
   
   1. Structured Streaming Async Compaction Support
   2. Integration tests were missing Structured Streaming coverage. Added it to ITTestHoodieSanity with async compaction enabled for MOR tables.
   







[jira] [Updated] (HUDI-1033) Remove redundant CLI tests

2020-06-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1033:
-
Status: Open  (was: New)

> Remove redundant CLI tests 
> ---
>
> Key: HUDI-1033
> URL: https://issues.apache.org/jira/browse/HUDI-1033
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.0
>
>
> There are some tests like ITTestRepairsCommand vs TestRepairsCommand, 
> ITTestCleanerCommand vs TestCleanerCommand. Please consolidate if they are 
> redundant.
>  
>  





[GitHub] [hudi] bvaradar commented on pull request #1577: [HUDI-855] Run Auto Cleaner in parallel with ingestion

2020-06-20 Thread GitBox


bvaradar commented on pull request #1577:
URL: https://github.com/apache/hudi/pull/1577#issuecomment-647048778


   @vinothchandar : Ready for review. 







[GitHub] [hudi] bvaradar commented on a change in pull request #1577: [HUDI-855] Run Auto Cleaner in parallel with ingestion

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1577:
URL: https://github.com/apache/hudi/pull/1577#discussion_r443162134



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestCleansCommand.java
##
@@ -1,106 +0,0 @@
-/*

Review comment:
   @yanghua : I went ahead and removed this class. I have added a jira : 
https://issues.apache.org/jira/browse/HUDI-1033 to go through other such 
examples and see if multiple CLI tests are still needed.









[jira] [Assigned] (HUDI-1033) Remove redundant CLI tests

2020-06-20 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1033:


Assignee: vinoyang

> Remove redundant CLI tests 
> ---
>
> Key: HUDI-1033
> URL: https://issues.apache.org/jira/browse/HUDI-1033
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.0
>
>
> There are some tests like ITTestRepairsCommand vs TestRepairsCommand, 
> ITTestCleanerCommand vs TestCleanerCommand. Please consolidate if they are 
> redundant.
>  
>  





[jira] [Created] (HUDI-1033) Remove redundant CLI tests

2020-06-20 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1033:


 Summary: Remove redundant CLI tests 
 Key: HUDI-1033
 URL: https://issues.apache.org/jira/browse/HUDI-1033
 Project: Apache Hudi
  Issue Type: Task
  Components: Testing
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


There are some tests like ITTestRepairsCommand vs TestRepairsCommand, 
ITTestCleanerCommand vs TestCleanerCommand. Please consolidate if they are 
redundant.

 

 





[GitHub] [hudi] lyogev opened a new issue #1751: [SUPPORT] Hudi not working with Spark 3.0.0

2020-06-20 Thread GitBox


lyogev opened a new issue #1751:
URL: https://github.com/apache/hudi/issues/1751


   **Describe the problem you faced**
   
   Trying to run hudi with spark 3.0.0, and getting an error
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   **Expected behavior**
   
   **Environment Description**
   
   * Hudi version :
   
   
   0.5.3
   
   * Spark version :
   
   3.0.0
   
   * Hive version :
   
   2.3.7
   
   * Hadoop version :
   
   3.2.0
   
   * Storage (HDFS/S3/GCS..) :
   
   * Running on Docker? (yes/no) :
   
   yes
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```
   Caused by: java.lang.NoSuchMethodError: 'java.lang.Object org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(org.apache.spark.sql.catalyst.InternalRow)'
   at org.apache.hudi.AvroConversionUtils$.$anonfun$createRdd$1(AvroConversionUtils.scala:42)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
   at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
   at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
   at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
   at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
   at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
   at scala.collection.AbstractIterator.to(Iterator.scala:1429)
   at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
   at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
   at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
   at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
   at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1423)
   at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:127)
   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
   at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
   at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
   at java.base/java.lang.Thread.run(Unknown Source)
   ```
   
   







[GitHub] [hudi] bvaradar commented on a change in pull request #1577: [WIP] [HUDI-855] Run Auto Cleaner in parallel with ingestion

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1577:
URL: https://github.com/apache/hudi/pull/1577#discussion_r443157306



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/async/AbstractAsyncService.java
##
@@ -16,7 +16,7 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.utilities.deltastreamer;
+package org.apache.hudi.common.async;

Review comment:
   Moved to hudi-client









[GitHub] [hudi] bvaradar commented on a change in pull request #1577: [WIP] [HUDI-855] Run Auto Cleaner in parallel with ingestion

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1577:
URL: https://github.com/apache/hudi/pull/1577#discussion_r443157280



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -677,4 +714,27 @@ private void rollbackPendingCommits() {
 });
 return compactionInstantTimeOpt;
   }
+
+  /**
+   * Auto Clean service running concurrently.
+   */
+  private static class AutoCleanerService extends AbstractAsyncService {
+
+private final HoodieWriteClient writeClient;
+private final String cleanInstant;
+
+private AutoCleanerService(HoodieWriteClient writeClient, String cleanInstant) {
+  this.writeClient = writeClient;
+  this.cleanInstant = cleanInstant;
+}
+
+@Override
+protected Pair startService() {
+  ExecutorService executor = Executors.newFixedThreadPool(1);

Review comment:
   Move to constructor of AutoCleanerService
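The suggestion above can be sketched as follows. This is a hypothetical stand-alone illustration (the class name is invented and `writeClient.clean(cleanInstant)` is replaced by a placeholder), not the actual Hudi change:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch of the reviewer's suggestion: build the single-thread executor
// once in the constructor, so startService() only submits the task.
public class AutoCleanerSketch {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final String cleanInstant;

    public AutoCleanerSketch(String cleanInstant) {
        this.cleanInstant = cleanInstant;
    }

    public CompletableFuture<String> startService() {
        // Placeholder standing in for writeClient.clean(cleanInstant).
        return CompletableFuture.supplyAsync(() -> "cleaned-" + cleanInstant, executor);
    }

    public void shutdown() {
        executor.shutdown();
    }

    public static void main(String[] args) {
        AutoCleanerSketch cleaner = new AutoCleanerSketch("001");
        System.out.println(cleaner.startService().join());
        cleaner.shutdown();
    }
}
```

Creating the executor eagerly keeps `startService()` idempotent with respect to resource allocation, which is the point of the review comment.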









[GitHub] [hudi] bvaradar commented on a change in pull request #1577: [WIP] [HUDI-855] Run Auto Cleaner in parallel with ingestion

2020-06-20 Thread GitBox


bvaradar commented on a change in pull request #1577:
URL: https://github.com/apache/hudi/pull/1577#discussion_r443157218



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -339,8 +352,13 @@ protected void postCommit(HoodieCommitMetadata metadata, String instantTime,
   archiveLog.archiveIfRequired(jsc);
   if (config.isAutoClean()) {
 // Call clean to cleanup if there is anything to cleanup after the commit,
-LOG.info("Auto cleaning is enabled. Running cleaner now");
-clean(instantTime);
+if (config.isRunParallelAutoClean()) {

Review comment:
   Done









[GitHub] [hudi] wangxianghu commented on a change in pull request #1744: [HUDI-1027] Introduce TimestampBasedComplexKeyGenerator to support ti…

2020-06-20 Thread GitBox


wangxianghu commented on a change in pull request #1744:
URL: https://github.com/apache/hudi/pull/1744#discussion_r442913719



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/keygen/TestTimestampBasedKeyGenerator.java
##
@@ -91,8 +94,9 @@ public void testScalar() {
 baseRecord.put("createTime", 2L);
 
 // timezone is GMT
-properties = getBaseKeyConfig("SCALAR", "yyyy-MM-dd hh", "GMT", "days");
+properties = getBaseKeyConfig("SCALAR", "yyyy-MM-dd HH", "GMT", "days");
 HoodieKey hk5 = new TimestampBasedKeyGenerator(properties).getKey(baseRecord);
-assertEquals(hk5.getPartitionPath(), "2024-10-04 12");
+assertEquals(hk5.getPartitionPath(), "2024-10-04 00");

Review comment:
   Hi @afilipchik, would you please help me out here? I am not very familiar with SCALAR time.
   My unit test shows that the actual `partitionPath` is "2024-10-04 00", while yours is "2024-10-04 12".
   Thanks :)
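For reference, the `hh` vs `HH` difference the diff above fixes can be reproduced with plain `SimpleDateFormat` (illustrative only; `HourPatternDemo` is not a Hudi class):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// "hh" is the 1-12 clock hour, so midnight formats as "12";
// "HH" is the 0-23 hour of day, so midnight formats as "00".
public class HourPatternDemo {
    static String fmt(String pattern, Date d) {
        SimpleDateFormat sdf = new SimpleDateFormat(pattern);
        sdf.setTimeZone(TimeZone.getTimeZone("GMT"));
        return sdf.format(d);
    }

    public static void main(String[] args) {
        Date midnight = new Date(0L); // 1970-01-01 00:00:00 GMT
        System.out.println(fmt("yyyy-MM-dd hh", midnight)); // 1970-01-01 12
        System.out.println(fmt("yyyy-MM-dd HH", midnight)); // 1970-01-01 00
    }
}
```

This is why switching the pattern also changes the expected partition path for a timestamp that falls on a midnight boundary.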









[jira] [Resolved] (HUDI-1023) Add validation error messages in delta sync

2020-06-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-1023.
-
Resolution: Fixed

Fixed via master: 8a9fdd603e3e532ea5252b98205acfb8aa648795

> Add validation error messages in delta sync
> ---
>
> Key: HUDI-1023
> URL: https://issues.apache.org/jira/browse/HUDI-1023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> To print messages upon validation errors for expected configuration settings.





[jira] [Closed] (HUDI-1023) Add validation error messages in delta sync

2020-06-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-1023.
---

> Add validation error messages in delta sync
> ---
>
> Key: HUDI-1023
> URL: https://issues.apache.org/jira/browse/HUDI-1023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> To print messages upon validation errors for expected configuration settings.





[jira] [Resolved] (HUDI-696) Add unit test for CommitsCommand

2020-06-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-696.

Fix Version/s: 0.6.0
   Resolution: Fixed

Fixed via master: f3a701757b9fb838acc4fb2975f378009d71f104

> Add unit test for CommitsCommand
> 
>
> Key: HUDI-696
> URL: https://issues.apache.org/jira/browse/HUDI-696
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Add unit test for CommitsCommand in hudi-cli module





[jira] [Closed] (HUDI-696) Add unit test for CommitsCommand

2020-06-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-696.
--

> Add unit test for CommitsCommand
> 
>
> Key: HUDI-696
> URL: https://issues.apache.org/jira/browse/HUDI-696
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Add unit test for CommitsCommand in hudi-cli module





[jira] [Updated] (HUDI-696) Add unit test for CommitsCommand

2020-06-20 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-696:
---
Status: Open  (was: New)

> Add unit test for CommitsCommand
> 
>
> Key: HUDI-696
> URL: https://issues.apache.org/jira/browse/HUDI-696
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>
> Add unit test for CommitsCommand in hudi-cli module





[GitHub] [hudi] vinothchandar commented on a change in pull request #1746: [HUDI-996] Add functional test suite

2020-06-20 Thread GitBox


vinothchandar commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443131872



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/checkpointing/TestKafkaConnectHdfsProvider.java
##
@@ -19,37 +19,39 @@
 package org.apache.hudi.utilities.checkpointing;
 
 import org.apache.hudi.common.config.TypedProperties;
-import org.apache.hudi.common.testutils.HoodieCommonTestHarness;
 import org.apache.hudi.common.testutils.HoodieTestUtils;
 import org.apache.hudi.exception.HoodieException;
 
 import org.apache.hadoop.conf.Configuration;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.io.TempDir;
 
 import java.io.File;
+import java.nio.file.Files;
 
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertThrows;
 
-public class TestKafkaConnectHdfsProvider extends HoodieCommonTestHarness {
+public class TestKafkaConnectHdfsProvider {
 
-  private String topicPath = null;
-  private Configuration hadoopConf = null;
+  @TempDir
+  public java.nio.file.Path basePath;

Review comment:
   I wish Java imports had aliases like Scala does :) ...again, with a common base class, this can be avoided in every class?

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/TestHoodieSnapshotExporter.java
##
@@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities;
+
+import org.apache.hudi.utilities.HoodieSnapshotExporter.OutputFormatValidator;
+
+import com.beust.jcommander.ParameterException;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.NullSource;
+import org.junit.jupiter.params.provider.ValueSource;
+
+import static org.junit.jupiter.api.Assertions.assertDoesNotThrow;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+
+public class TestHoodieSnapshotExporter {

Review comment:
   meaning this is not bringing up any new resources to run the test...
   I assume this is the principle we will be following?
   
   unit - tests basic functionality at the class level, potentially using mocks. Expected to finish quicker.
   functional - brings up the services needed and runs tests without mocking.
   integration - runs a subset of the functional tests on a full-fledged environment with dockerized services.
   
   Might be good to add such a doc somewhere... maybe in travis.yml or even in the README, so developers understand which test is which.

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/testutils/SharedResources.java
##
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+
+import java.io.IOException;
+
+public class 

[jira] [Updated] (HUDI-1032) Remove unused classes in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Status: In Progress  (was: Open)

> Remove unused classes in HoodieCopyOnWriteTable and code clean
> --
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo had been 
> introduced as independent classes in path 
> "src/main/java/org/apache/hudi/table/action/commit",  So the old ones can be 
> removed now.





[jira] [Updated] (HUDI-1032) Remove unused classes in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Status: Open  (was: New)

> Remove unused classes in HoodieCopyOnWriteTable and code clean
> --
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo had been 
> introduced as independent classes in path 
> "src/main/java/org/apache/hudi/table/action/commit",  So the old ones can be 
> removed now.





[jira] [Updated] (HUDI-1032) Remove unused classes in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Summary: Remove unused classes in HoodieCopyOnWriteTable and code clean  
(was: Remove unused code in HoodieCopyOnWriteTable and code clean)

> Remove unused classes in HoodieCopyOnWriteTable and code clean
> --
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo had been 
> introduced as independent classes in path 
> "src/main/java/org/apache/hudi/table/action/commit",  So the old ones can be 
> removed now.





[GitHub] [hudi] vinothchandar commented on pull request #1746: [HUDI-996] Add functional test suite

2020-06-20 Thread GitBox


vinothchandar commented on pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#issuecomment-646993804


   @xushiyan codecov seems happy?







[GitHub] [hudi] wangxianghu commented on pull request #1750: [HUDI-1032]Remove unused code in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread GitBox


wangxianghu commented on pull request #1750:
URL: https://github.com/apache/hudi/pull/1750#issuecomment-646993606


   Hi @yanghua, please take a look when free







[GitHub] [hudi] vinothchandar commented on a change in pull request #1746: [HUDI-996] Add functional test suite

2020-06-20 Thread GitBox


vinothchandar commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r443130343



##
File path: scripts/run_travis_tests.sh
##
@@ -20,12 +20,13 @@ mode=$1
 sparkVersion=2.4.4
 hadoopVersion=2.7
 
-if [ "$mode" = "unit" ];
-then
+if [ "$mode" = "unit" ]; then
   echo "Running Unit Tests"
   mvn test -DskipITs=true -B
-elif [ "$mode" = "integration" ];
-then
+elif [ "$mode" = "functional" ]; then
+  echo "Running Functional Test Suite"
+  mvn test -pl hudi-utilities -Pfunctional-test-suite -B

Review comment:
   let's make the PR title reflective of this scope?









[GitHub] [hudi] vinothchandar commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

2020-06-20 Thread GitBox


vinothchandar commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-646993572


   @harishchanderramesh if you are looking for the specific config, it's 
https://hudi.apache.org/docs/configurations.html#PAYLOAD_CLASS_OPT_KEY
   
   if you are already deploying your app in a jar, all you need to do is write the class and specify its name in the config. Hope that helps!







[jira] [Updated] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1032:
-
Labels: pull-request-available  (was: )

> Remove unused code in HoodieCopyOnWriteTable and code clean
> ---
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo had been 
> introduced as independent classes in path 
> "src/main/java/org/apache/hudi/table/action/commit",  So the old ones can be 
> removed now.





[GitHub] [hudi] wangxianghu opened a new pull request #1750: [HUDI-1032]Remove unused code in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread GitBox


wangxianghu opened a new pull request #1750:
URL: https://github.com/apache/hudi/pull/1750


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Currently, SmallFile, InsertBucket, BucketType and BucketInfo had been 
introduced as independent classes in path 
"src/main/java/org/apache/hudi/table/action/commit",  So the old ones can be 
removed now.*
   
   ## Brief change log
   
   *Remove unused code in HoodieCopyOnWriteTable and code clean*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable and code clean

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Summary: Remove unused code in HoodieCopyOnWriteTable and code clean  (was: 
Remove unused code in HoodieCopyOnWriteTable)

> Remove unused code in HoodieCopyOnWriteTable and code clean
> ---
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo have been 
> introduced as independent classes under 
> "src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
> removed now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Description: Currently, SmallFile, InsertBucket, BucketType and BucketInfo 
have been introduced as independent classes under 
"src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
removed now.  (was: Currently, SmallFile, InsertBucket, and BucketInfo have been 
introduced as independent classes under 
"src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
removed now.)

> Remove unused code in HoodieCopyOnWriteTable
> 
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>
> Currently, SmallFile, InsertBucket, BucketType and BucketInfo have been 
> introduced as independent classes under 
> "src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
> removed now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1032:
--
Description: Currently, SmallFile, InsertBucket, and BucketInfo have been 
introduced as independent classes under 
"src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
removed now.

> Remove unused code in HoodieCopyOnWriteTable
> 
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>
> Currently, SmallFile, InsertBucket, and BucketInfo have been introduced as 
> independent classes under 
> "src/main/java/org/apache/hudi/table/action/commit", so the old ones can be 
> removed now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu reassigned HUDI-1032:
-

Assignee: wangxianghu

> Remove unused code in HoodieCopyOnWriteTable
> 
>
> Key: HUDI-1032
> URL: https://issues.apache.org/jira/browse/HUDI-1032
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1032) Remove unused code in HoodieCopyOnWriteTable

2020-06-20 Thread wangxianghu (Jira)
wangxianghu created HUDI-1032:
-

 Summary: Remove unused code in HoodieCopyOnWriteTable
 Key: HUDI-1032
 URL: https://issues.apache.org/jira/browse/HUDI-1032
 Project: Apache Hudi
  Issue Type: Task
  Components: Code Cleanup
Reporter: wangxianghu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-340) Increase Default max events to read from kafka source

2020-06-20 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141019#comment-17141019
 ] 

wangxianghu edited comment on HUDI-340 at 6/20/20, 12:04 PM:
-

Hi [~Pratyaksh], thanks for the feedback!

Yes, a user would probably never set such a source limit on purpose; it is just 
an example to illustrate my point that such a huge limit is still configurable 
(in a test, or by mistake), so there is still a chance a user can scan the 
entire Kafka topic in one go (we had better not assume all our users know this 
logic).

If the goal is to absolutely prevent scanning the entire Kafka topic in one go, 
then we must eliminate this possibility by setting a hard limit,

or just log a warning when the user sets a limit greater than the default value 
of *maxEventsToReadFromKafka* (the default is 5,000,000, which I think is big 
enough), though that will in turn weaken the goal. I think either way is doable, 
since setting such a huge sourceLimit is a low-probability event.

But either way, I don't think we really need the check above; it is useful only 
when the user happens to set exactly Long.MAX_VALUE or Integer.MAX_VALUE, which 
is like winning a lottery, very low probability.

WDYT?

cc [~vinoth]

 


was (Author: wangxianghu):
Hi [~Pratyaksh], thanks for the feedback!

Yes, a user would probably never set such a source limit on purpose; it is just 
an example to illustrate my point that such a huge limit is still configurable 
(in a test, or by mistake), so there is still a chance a user can scan the 
entire Kafka topic in one go (we had better not assume all our users know this 
logic).

If the goal is to absolutely prevent scanning the entire Kafka topic in one go, 
then we must eliminate this possibility by setting a hard limit,

or just log a warning when the user sets a limit greater than the default value 
of *maxEventsToReadFromKafka* (the default is 5,000,000, which I think is big 
enough), though that will in turn weaken the goal. I think either way is doable, 
since setting such a huge sourceLimit is a low-probability event too.

But either way, I don't think we really need the check above; it is useful only 
when the user happens to set exactly Long.MAX_VALUE or Integer.MAX_VALUE, which 
is like winning a lottery, very low probability.

WDYT?

cc [~vinoth]

 

> Increase Default max events to read from kafka source
> -
>
> Key: HUDI-340
> URL: https://issues.apache.org/jira/browse/HUDI-340
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, DEFAULT_MAX_EVENTS_TO_READ is set to 1M for the Kafka source in the 
> KafkaOffsetGen.java class. DeltaStreamer can handle many more incoming 
> records than this. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-396) Provide an documentation to describe how to use test suite

2020-06-20 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu reassigned HUDI-396:


Assignee: Trevorzhang  (was: wangxianghu)

> Provide an documentation to describe how to use test suite
> --
>
> Key: HUDI-396
> URL: https://issues.apache.org/jira/browse/HUDI-396
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: vinoyang
>Assignee: Trevorzhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] shenh062326 commented on pull request #1690: [HUDI-908] Add some data types to HoodieTestDataGenerator and fix some bugs

2020-06-20 Thread GitBox


shenh062326 commented on pull request #1690:
URL: https://github.com/apache/hudi/pull/1690#issuecomment-646971258


   @bvaradar  I have added all the missing types and fixed some bugs.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-340) Increase Default max events to read from kafka source

2020-06-20 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17141019#comment-17141019
 ] 

wangxianghu commented on HUDI-340:
--

Hi [~Pratyaksh], thanks for the feedback!

Yes, a user would probably never set such a source limit on purpose; it is just 
an example to illustrate my point that such a huge limit is still configurable 
(in a test, or by mistake), so there is still a chance a user can scan the 
entire Kafka topic in one go.

If the goal is to absolutely prevent scanning the entire Kafka topic in one go, 
then we must eliminate this possibility by setting a hard limit,

or just log a warning when the user sets a limit greater than the default value 
of *maxEventsToReadFromKafka* (the default is 5,000,000, which I think is big 
enough), though that will in turn weaken the goal. I think either way is doable, 
since setting such a huge sourceLimit is a low-probability event too.

But either way, I don't think we really need the check above; it is useful only 
when the user happens to set exactly Long.MAX_VALUE or Integer.MAX_VALUE, which 
is like winning a lottery, very low probability.

WDYT?

cc [~vinoth]

 

> Increase Default max events to read from kafka source
> -
>
> Key: HUDI-340
> URL: https://issues.apache.org/jira/browse/HUDI-340
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, DEFAULT_MAX_EVENTS_TO_READ is set to 1M for the Kafka source in the 
> KafkaOffsetGen.java class. DeltaStreamer can handle many more incoming 
> records than this. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-340) Increase Default max events to read from kafka source

2020-06-20 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140999#comment-17140999
 ] 

Pratyaksh Sharma commented on HUDI-340:
---

Hi [~wangxianghu], the idea behind having these checks is that one should not 
try to scan the entire Kafka topic in one go. I agree you can still set the 
value to Long.MAX_VALUE - 1, but I do not think someone would try to set such a 
source limit after checking the logic you pointed out. If you try to scan the 
entire Kafka topic, in essence you might be trying to read a really large chunk 
of data, which might cause issues. 

If you want to tune this logic further, we can discuss setting some hard limit 
on the sourceLimit so that one cannot configure source limits like 
Long.MAX_VALUE - 1, but I am not sure that is going to be a good idea. 

> Increase Default max events to read from kafka source
> -
>
> Key: HUDI-340
> URL: https://issues.apache.org/jira/browse/HUDI-340
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, DEFAULT_MAX_EVENTS_TO_READ is set to 1M for the Kafka source in the 
> KafkaOffsetGen.java class. DeltaStreamer can handle many more incoming 
> records than this. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)