[jira] [Resolved] (HUDI-902) Avoid exception for getting SchemaProvider when no new input data

2020-05-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-902.
-
Resolution: Done

> Avoid exception for getting SchemaProvider when no new input data
> -
>
> Key: HUDI-902
> URL: https://issues.apache.org/jira/browse/HUDI-902
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] xushiyan commented on pull request #1623: [MINOR] Increase heap space for surefire

2020-05-15 Thread GitBox


xushiyan commented on pull request #1623:
URL: https://github.com/apache/incubator-hudi/pull/1623#issuecomment-629587320


   @bvaradar I found this doc saying the Linux build environment has 7.5 GB of memory: 
https://docs.travis-ci.com/user/reference/overview/#virtualisation-environment-vs-operating-system
   
   Then I did a quick experiment to see the heap usage with the same surefire 
setup in this test repo: 
https://github.com/xushiyan/travis-work/blob/b47bf508a8b46429b3c302f7222073c279d4e964/pom.xml#L51
   
   With the debug output, we can see the `argLine` is indeed applied to surefire:
   https://travis-ci.org/github/xushiyan/travis-work/jobs/687693986#L2622
   
   And the test is allowed to use up to 3616.5 MB of heap:
   https://travis-ci.org/github/xushiyan/travis-work/jobs/687693986#L2625
   
   So I think we should be good with a 2 GB max heap.
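
   A minimal sketch (not from the PR) of the kind of debug check used above: print the JVM's max heap from inside a JUnit test, to confirm the surefire `argLine` (e.g. `-Xmx2g`) actually took effect.

```java
import org.junit.jupiter.api.Test;

public class HeapSizeCheck {

  @Test
  public void printMaxHeap() {
    // Runtime.maxMemory() reports the -Xmx cap the JVM was launched with.
    long maxBytes = Runtime.getRuntime().maxMemory();
    // With -Xmx2g this prints roughly 2048.0 MB (minus a little JVM overhead).
    System.out.printf("Max heap: %.1f MB%n", maxBytes / (1024.0 * 1024.0));
  }
}
```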



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: Travis CI build asf-site

2020-05-15 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4b419ad  Travis CI build asf-site
4b419ad is described below

commit 4b419adf1ec2faaafdf1319475363d9239ba612d
Author: CI 
AuthorDate: Sat May 16 04:36:26 2020 +

Travis CI build asf-site
---
 content/assets/css/main.css  |  2 +-
 content/docs/powered_by.html | 34 ++
 content/index.html   | 82 ++--
 3 files changed, 76 insertions(+), 42 deletions(-)

diff --git a/content/assets/css/main.css b/content/assets/css/main.css
index 9922c4f..23f7302 100644
--- a/content/assets/css/main.css
+++ b/content/assets/css/main.css
@@ -1 +1 @@
-table{border-color:#1ab7ea !important}.page a{color:#3b9cba !important}.page__content{font-size:17px}.page__content.releases{font-size:17px}.page__footer{font-size:15px !important}.page__footer a{color:#3b9cba !important}.page__content .notice,.page__content .notice--primary,.page__content .notice--info,.page__content .notice--warning,.page__content .notice--success,.page__content .notice--danger{font-size:0.8em !important}.page__content table{font-size:0.8em !important}.page__content ta [...]
+table{border-color:#1ab7ea !important}.page a{color:#3b9cba !important}.page__content{font-size:17px}.page__content.releases{font-size:17px}.page__footer{font-size:15px !important}.page__footer a{color:#3b9cba !important}.page__content .notice,.page__content .notice--primary,.page__content .notice--info,.page__content .notice--warning,.page__content .notice--success,.page__content .notice--danger{font-size:0.8em !important}.page__content table{font-size:0.8em !important}.page__content ta [...]
diff --git a/content/docs/powered_by.html b/content/docs/powered_by.html
index 8a7e363..f4ab420 100644
--- a/content/docs/powered_by.html
+++ b/content/docs/powered_by.html
@@ -425,6 +425,40 @@ December 2019, AWS re:Invent 2019, Las Vegas, NV, USA
  https://eng.uber.com/hoodie/ “Hoodie: Uber Engineering’s 
Incremental Processing Framework on Hadoop” - Engineering Blog By Prasanna 
Rajaperumal
 
 
+Powered by
+
+[HTML markup stripped by the mail archive: a grid of company logo images added under the new "Powered by" heading]
+
   
 
   Back to top 

diff --git a/content/index.html b/content/index.html
index f680ed7..d8296e3 100644
--- a/content/index.html
+++ b/content/index.html
@@ -163,47 +163,47 @@
   
 
 
-[HTML markup stripped by the mail archive: removed markup for the "Hudi Users" heading, the logo grid, and the "Get Started" button]
+[HTML markup stripped by the mail archive: replacement home-page markup]
   
 
 



[GitHub] [incubator-hudi] lamber-ken merged pull request #1635: [MINOR] Remove logos on home page

2020-05-15 Thread GitBox


lamber-ken merged pull request #1635:
URL: https://github.com/apache/incubator-hudi/pull/1635


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch asf-site updated: [MINOR] Remove logos on home page (#1635)

2020-05-15 Thread lamberken
This is an automated email from the ASF dual-hosted git repository.

lamberken pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 05a8ddb  [MINOR] Remove logos on home page (#1635)
05a8ddb is described below

commit 05a8ddb89a4ad8186c7d4f20b4bde806b73698f5
Author: lamber-ken 
AuthorDate: Sat May 16 12:34:32 2020 +0800

[MINOR] Remove logos on home page (#1635)
---
 docs/_docs/1_4_powered_by.md   | 25 +
 docs/_layouts/home.html| 32 
 docs/_sass/hudi_style/skins/_hudi.scss |  4 ++--
 3 files changed, 43 insertions(+), 18 deletions(-)

diff --git a/docs/_docs/1_4_powered_by.md b/docs/_docs/1_4_powered_by.md
index bee6bb9..04044d2 100644
--- a/docs/_docs/1_4_powered_by.md
+++ b/docs/_docs/1_4_powered_by.md
@@ -3,6 +3,19 @@ title: "Talks & Powered By"
 keywords: hudi, talks, presentation
 permalink: /docs/powered_by.html
 last_modified_at: 2019-12-31T15:59:57-04:00
+power_items:
+  - img_path: /assets/images/powers/uber.png
+  - img_path: /assets/images/powers/aws.jpg
+  - img_path: /assets/images/powers/alibaba.png
+  - img_path: /assets/images/powers/emis.jpg
+  - img_path: /assets/images/powers/yield.png
+  - img_path: /assets/images/powers/qq.png
+  - img_path: /assets/images/powers/tongcheng.png
+  - img_path: /assets/images/powers/yotpo.png
+  - img_path: /assets/images/powers/kyligence.png
+  - img_path: /assets/images/powers/tathastu.png
+  - img_path: /assets/images/powers/shunfeng.png
+  - img_path: /assets/images/powers/lingyue.png
 ---
 
 ## Adoption
@@ -71,3 +84,15 @@ Using Hudi at Yotpo for several usages. Firstly, integrated 
Hudi as a writer in
 
 1. ["The Case for incremental processing on 
Hadoop"](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
 - O'reilly Ideas article by Vinoth Chandar
 2. ["Hoodie: Uber Engineering's Incremental Processing Framework on 
Hadoop"](https://eng.uber.com/hoodie/) - Engineering Blog By Prasanna 
Rajaperumal
+
+## Powered by
+
+[HTML markup stripped by the mail archive: wrapper markup, and inside the loop below an image tag rendering pi.img_path]
+  {% for pi in page.power_items %}
+  {% endfor %}
+
diff --git a/docs/_layouts/home.html b/docs/_layouts/home.html
index f8174ba..fa169f9 100644
--- a/docs/_layouts/home.html
+++ b/docs/_layouts/home.html
@@ -69,25 +69,25 @@ layout: home
   
 
 
-[HTML markup stripped by the mail archive: removed the "Hudi Users" heading, the {% for pi in page.power_items %} logo loop, and the "Get Started" button from the home layout]
+[HTML markup stripped by the mail archive: replacement home-layout markup]
   
 {% include footer.html %}
   
diff --git a/docs/_sass/hudi_style/skins/_hudi.scss 
b/docs/_sass/hudi_style/skins/_hudi.scss
index 898046b..0b2f00e 100644
--- a/docs/_sass/hudi_style/skins/_hudi.scss
+++ b/docs/_sass/hudi_style/skins/_hudi.scss
@@ -60,7 +60,7 @@ table {
   &--overlay {
 position: relative;
 .wrapper {
-  padding: 3em 2em 2em 2em !important;
+  padding: 4em 2em 2em 2em !important;
 }
   }
 }
@@ -98,7 +98,7 @@ table {
 
   .power-item {
 display: inline-block;
-width: 210px;
+width: 240px;
 margin-left: 0.8em;
 margin-right: 0.8em;
 margin-bottom: 1.1em;



[incubator-hudi] branch master updated: [HUDI-902] Avoid exception when getSchemaProvider (#1584)

2020-05-15 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ada2ef  [HUDI-902] Avoid exception when getSchemaProvider (#1584)
2ada2ef is described below

commit 2ada2ef50fc373ed3083d0e7a96e5e644be52bfb
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Fri May 15 21:33:02 2020 -0700

[HUDI-902] Avoid exception when getSchemaProvider (#1584)

* When no new input data, don't throw exception for null SchemaProvider
* Return the newly added NullSchemaProvider instead
---
 .../apache/hudi/utilities/sources/InputBatch.java  | 24 --
 .../hudi/utilities/sources/TestInputBatch.java | 37 ++
 2 files changed, 59 insertions(+), 2 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/InputBatch.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/InputBatch.java
index dcf56f3..f752e0d 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/InputBatch.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/InputBatch.java
@@ -18,10 +18,14 @@
 
 package org.apache.hudi.utilities.sources;
 
+import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.utilities.schema.SchemaProvider;
 
+import org.apache.avro.Schema;
+import org.apache.spark.api.java.JavaSparkContext;
+
 public class InputBatch<T> {
 
   private final Option<T> batch;
@@ -49,9 +53,25 @@ public class InputBatch<T> {
   }
 
   public SchemaProvider getSchemaProvider() {
-    if (schemaProvider == null) {
+    if (batch.isPresent() && schemaProvider == null) {
       throw new HoodieException("Please provide a valid schema provider class!");
     }
-    return schemaProvider;
+    return Option.ofNullable(schemaProvider).orElse(new NullSchemaProvider());
+  }
+
+  public static class NullSchemaProvider extends SchemaProvider {
+
+    public NullSchemaProvider() {
+      this(null, null);
+    }
+
+    public NullSchemaProvider(TypedProperties props, JavaSparkContext jssc) {
+      super(props, jssc);
+    }
+
+    @Override
+    public Schema getSourceSchema() {
+      return Schema.create(Schema.Type.NULL);
+    }
   }
 }
diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java
new file mode 100644
index 000..752621d
--- /dev/null
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java
@@ -0,0 +1,37 @@
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.RowBasedSchemaProvider;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+
+import org.junit.jupiter.api.Test;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertSame;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+public class TestInputBatch {
+
+  @Test
+  public void getSchemaProviderShouldThrowException() {
+    final InputBatch<String> inputBatch = new InputBatch<>(Option.of("foo"), null, null);
+    Throwable t = assertThrows(HoodieException.class, inputBatch::getSchemaProvider);
+    assertEquals("Please provide a valid schema provider class!", t.getMessage());
+  }
+
+  @Test
+  public void getSchemaProviderShouldReturnNullSchemaProvider() {
+    final InputBatch<String> inputBatch = new InputBatch<>(Option.empty(), null, null);
+    SchemaProvider schemaProvider = inputBatch.getSchemaProvider();
+    assertTrue(schemaProvider instanceof InputBatch.NullSchemaProvider);
+  }
+
+  @Test
+  public void getSchemaProviderShouldReturnGivenSchemaProvider() {
+    SchemaProvider schemaProvider = new RowBasedSchemaProvider(null);
+    final InputBatch<String> inputBatch = new InputBatch<>(Option.of("foo"), null, schemaProvider);
+    assertSame(schemaProvider, inputBatch.getSchemaProvider());
+  }
+}
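
A caller-side sketch (hypothetical class, not part of the commit) of what the new behavior enables: with an empty batch, `getSchemaProvider()` returns a `NullSchemaProvider` whose source schema has type NULL, so callers can branch on it instead of catching `HoodieException`:

```java
import org.apache.avro.Schema;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.utilities.sources.InputBatch;

public class SchemaProviderCheck {

  public static void main(String[] args) {
    // Empty batch: post-patch, getSchemaProvider() returns NullSchemaProvider
    // instead of throwing HoodieException.
    InputBatch<String> batch = new InputBatch<>(Option.empty(), null, null);
    Schema schema = batch.getSchemaProvider().getSourceSchema();
    if (schema.getType() == Schema.Type.NULL) {
      System.out.println("No new input data; nothing to write this round.");
    }
  }
}
```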



[GitHub] [incubator-hudi] bvaradar merged pull request #1584: [HUDI-902] Avoid exception when getSchemaProvider

2020-05-15 Thread GitBox


bvaradar merged pull request #1584:
URL: https://github.com/apache/incubator-hudi/pull/1584


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on pull request #1635: [MINOR] Remove logos on home page

2020-05-15 Thread GitBox


vinothchandar commented on pull request #1635:
URL: https://github.com/apache/incubator-hudi/pull/1635#issuecomment-629586033


   Please land.. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1635: [MINOR] Remove logos on home page

2020-05-15 Thread GitBox


lamber-ken opened a new pull request #1635:
URL: https://github.com/apache/incubator-hudi/pull/1635


   ## What is the purpose of the pull request
   
   - Remove logos on home page
   - Move logos to powered by page.
   
   **Sync**
   https://lamber-ken.github.io
   https://lamber-ken.github.io/docs/powered_by.html
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #279

2020-05-15 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.38 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] xushiyan commented on pull request #1584: [HUDI-902] Avoid exception when getSchemaProvider

2020-05-15 Thread GitBox


xushiyan commented on pull request #1584:
URL: https://github.com/apache/incubator-hudi/pull/1584#issuecomment-629576325


   @bvaradar The CI passed. It's ready for review now. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-902) Avoid exception for getting SchemaProvider when no new input data

2020-05-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-902:

Labels: pull-request-available  (was: )

> Avoid exception for getting SchemaProvider when no new input data
> -
>
> Key: HUDI-902
> URL: https://issues.apache.org/jira/browse/HUDI-902
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-895) Reduce listing .hoodie folder when using timeline server

2020-05-15 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-895:
---

Assignee: Balaji Varadarajan

> Reduce listing .hoodie folder when using timeline server
> 
>
> Key: HUDI-895
> URL: https://issues.apache.org/jira/browse/HUDI-895
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0, 0.5.3
>
>
> Currently, we are unnecessarily listing the .hoodie folder when sending queries 
> to the timeline-server. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-858) Allow multiple operations to be executed within a single commit

2020-05-15 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-858:

Fix Version/s: (was: 0.5.2)
   (was: 0.5.1)
   0.5.3

> Allow multiple operations to be executed within a single commit
> ---
>
> Key: HUDI-858
> URL: https://issues.apache.org/jira/browse/HUDI-858
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> There are users who had been directly using RDD APIs and have relied on a 
> behavior in 0.4.x that allowed multiple write operations (upsert/bulk-insert/...) 
> to be executed within a single commit. 
> Given the Hudi commit protocol, these are generally unsafe operations, and users 
> need to handle failure scenarios. This only works with COW tables. Hudi 0.5.x 
> stopped this behavior.
> Given the importance of supporting such cases for users' migration to 
> 0.5.x, we are proposing a safety flag (disabled by default) that will allow 
> this old behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-846) Turn on incremental cleaning by default in 0.6.0

2020-05-15 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-846:

Fix Version/s: 0.5.3

> Turn on incremental cleaning by default in 0.6.0
> 
>
> Key: HUDI-846
> URL: https://issues.apache.org/jira/browse/HUDI-846
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> The incremental cleaner will track commits that have happened since the last 
> clean operation to figure out the partitions which need to be scanned for 
> cleaning. This avoids the costly scanning of all partition paths.
> Incremental cleaning is currently disabled by default. We need to enable it 
> by default in 0.6.0.
> No special handling is required for upgrade/downgrade scenarios, as 
> incremental cleaning relies on the standard format of commit metadata.
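
For illustration, a minimal sketch of opting in to this behavior ahead of the default flip; the config key is an assumption based on Hudi's cleaner settings, so verify it against your Hudi version:

```java
import java.util.Collections;

import org.apache.hudi.config.HoodieWriteConfig;

public class IncrementalCleanConfig {

  // Sketch: enable incremental cleaning explicitly while it is still
  // disabled by default. Key name assumed, not taken from this ticket.
  public static HoodieWriteConfig build(String basePath) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withProps(Collections.singletonMap("hoodie.cleaner.incremental.mode", "true"))
        .build();
  }
}
```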



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-848) Turn on embedded timeline server by default for all writes

2020-05-15 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-848:

Fix Version/s: 0.5.3

> Turn on embedded timeline server by default for all writes
> --
>
> Key: HUDI-848
> URL: https://issues.apache.org/jira/browse/HUDI-848
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0, 0.5.3
>
>
> Includes RDD level, Spark DS and DeltaStreamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on pull request #1634: [WIP] [HUDI-846][HUDI-848] Enable Incremental cleaning and embedded timeline-server by default

2020-05-15 Thread GitBox


bvaradar commented on pull request #1634:
URL: https://github.com/apache/incubator-hudi/pull/1634#issuecomment-629572248


   @bhasudha : This is another important config change for 0.5.3. I am marking 
the PR as WIP till I get the tests to succeed. After that, I will make the PR 
active.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-846) Turn on incremental cleaning by default in 0.6.0

2020-05-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-846:

Labels: pull-request-available  (was: )

> Turn on incremental cleaning by default in 0.6.0
> 
>
> Key: HUDI-846
> URL: https://issues.apache.org/jira/browse/HUDI-846
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> The incremental cleaner will track commits that have happened since the last 
> clean operation to figure out the partitions which need to be scanned for 
> cleaning. This avoids the costly scanning of all partition paths.
> Incremental cleaning is currently disabled by default. We need to enable it 
> by default in 0.6.0.
> No special handling is required for upgrade/downgrade scenarios, as 
> incremental cleaning relies on the standard format of commit metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar opened a new pull request #1634: [WIP] [HUDI-846][HUDI-848] Enable Incremental cleaning and embedded timeline-server by default

2020-05-15 Thread GitBox


bvaradar opened a new pull request #1634:
URL: https://github.com/apache/incubator-hudi/pull/1634


   This is to enable timeline-server and incremental cleaning by default



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-858) Allow multiple operations to be executed within a single commit

2020-05-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-858:

Labels: pull-request-available  (was: )

> Allow multiple operations to be executed within a single commit
> ---
>
> Key: HUDI-858
> URL: https://issues.apache.org/jira/browse/HUDI-858
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1, 0.5.2, 0.6.0
>
>
> There are users who had been directly using RDD APIs and have relied on a 
> behavior in 0.4.x that allowed multiple write operations (upsert/bulk-insert/...) 
> to be executed within a single commit. 
> Given the Hudi commit protocol, these are generally unsafe operations, and users 
> need to handle failure scenarios. This only works with COW tables. Hudi 0.5.x 
> stopped this behavior.
> Given the importance of supporting such cases for users' migration to 
> 0.5.x, we are proposing a safety flag (disabled by default) that will allow 
> this old behavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-15 Thread GitBox


bvaradar commented on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629571160


   @bhasudha : FYI: This is needed for 0.5.3 (cc @vinothchandar )



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar opened a new pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-15 Thread GitBox


bvaradar opened a new pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633


   There are users who had been directly using RDD APIs and have relied on a 
behavior in 0.4.x that allowed multiple write operations (upsert/bulk-insert/...) to 
be executed within a single commit. 
   
   Given the Hudi commit protocol, these are generally unsafe operations, and users 
need to handle failure scenarios. This only works with COW tables. Hudi 0.5.x 
stopped this behavior.
   
   Given the importance of supporting such cases for users' migration to 
0.5.x, we are proposing a safety flag (disabled by default) that will allow 
this old behavior.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-902) Avoid exception for getting SchemaProvider when no new input data

2020-05-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-902:

Status: In Progress  (was: Open)

> Avoid exception for getting SchemaProvider when no new input data
> -
>
> Key: HUDI-902
> URL: https://issues.apache.org/jira/browse/HUDI-902
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-902) Avoid exception for getting SchemaProvider when no new input data

2020-05-15 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-902:
---

 Summary: Avoid exception for getting SchemaProvider when no new 
input data
 Key: HUDI-902
 URL: https://issues.apache.org/jira/browse/HUDI-902
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: Raymond Xu
Assignee: Raymond Xu
 Fix For: 0.5.3






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] umehrot2 commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

2020-05-15 Thread GitBox


umehrot2 commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629568357


   @afilipchik Seems like the spark-avro schema converter itself generates an 
incorrect schema when we want to have **default value** as **null**. Is that 
the main concern addressed in this PR? If that's the case, we should avoid 
using the spark library for conversion altogether, and have in-house logic to 
generate the correct schema in the first place.
   
   Also, under what scenario do we see failures because of this? It would be good 
to have a test case that currently fails, for better understanding.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-112) Supporting a Collapse type of operation

2020-05-15 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108792#comment-17108792
 ] 

liwei commented on HUDI-112:


Hi Nishith Agarwal [~nishith29],

We have also met this issue, in RFC-19: 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+19+hudi+support+log+append+scenario+with+better+write+and+asynchronous+compaction

We need a mechanism that solves two issues:
1. On the write side: skip compaction for faster writes (merge-on-read can 
already solve this).
2. On compaction and read: a mechanism to collapse older, smaller files into 
larger ones while also keeping the query cost low (with merge-on-read, if we do 
not compact, real-time reads become slow).

We have one option:
1. On the write side: just write parquet, with no compaction.
2. On compaction and read: because the small files are parquet, real-time reads 
can stay fast, and users can run asynchronous compaction to collapse older, 
smaller parquet files into larger parquet files.

What is the current progress on this issue? :)

> Supporting a Collapse type of operation
> ---
>
> Key: HUDI-112
> URL: https://issues.apache.org/jira/browse/HUDI-112
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>
> Currently, for COPY_ON_WRITE tables, Hudi automatically adjusts small files by 
> packing inserts and sending them over to a particular file based on the small 
> file size limits set in the client config.
> One of the side effects of this is that the time taken to rewrite the small 
> files into larger ones is borne by the writer (or the ingestor). In cases 
> where we continuously want really low ingestion latency ( < 5 mins ), having 
> the writer enlarge the small files may not be preferable.
> It would help if there was a way for the writer to schedule a collapse sort of 
> operation that can later be picked up asynchronously by a job/thread (different 
> from the ingestor) that collapses N files into M files, thereby also enlarging 
> the file sizes. 
> The mechanism should support different strategies for scheduling collapse so 
> we can perform even smarter data layout during such rewriting, for eg., group 
> certain record_keys together in a single file from N different files to allow 
> for better query performance and more.
> MERGE_ON_READ on the other hand solves this in a different way. We can send 
> inserts to log files (for a base columnar file) and when the compaction kicks 
> in, it would automatically resize the file. However, the reader (realtime 
> query) would have to pay a small penalty here to merge the log files with the 
> base columnar files to get the freshest data. 
> In any case, we need a mechanism to collapse older smaller files into larger 
> ones while also keeping the query cost low. Creating this ticket to discuss 
> more around this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1402: [HUDI-407] Adding Simple Index

2020-05-15 Thread GitBox


codecov-io edited a comment on pull request #1402:
URL: https://github.com/apache/incubator-hudi/pull/1402#issuecomment-619680608


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=h1) 
Report
   > Merging 
[#1402](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25e0b75b3d03b6d460dc18d1a5fce7b881b0e019=desc)
 will **increase** coverage by `0.14%`.
   > The diff coverage is `88.27%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1402/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ##    master    #1402    +/-   ##
   
   + Coverage 71.81%   71.95%   +0.14% 
 Complexity 1092 1092  
   
 Files   386  390   +4 
 Lines 1660816745 +137 
 Branches   1667 1678  +11 
   
   + Hits  1192712049 +122 
   - Misses 3955 3966  +11 
   - Partials726  730   +4 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...java/org/apache/hudi/config/HoodieIndexConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUluZGV4Q29uZmlnLmphdmE=)
 | `60.25% <46.66%> (-3.24%)` | `3.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/common/util/ParquetUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvUGFycXVldFV0aWxzLmphdmE=)
 | `74.22% <77.27%> (+0.54%)` | `0.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/index/simple/HoodieSimpleIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvc2ltcGxlL0hvb2RpZVNpbXBsZUluZGV4LmphdmE=)
 | `91.89% <91.89%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...che/hudi/index/simple/HoodieGlobalSimpleIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvc2ltcGxlL0hvb2RpZUdsb2JhbFNpbXBsZUluZGV4LmphdmE=)
 | `94.28% <94.28%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...n/java/org/apache/hudi/index/HoodieIndexUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXhVdGlscy5qYXZh)
 | `95.00% <95.00%> (ø)` | `3.00 <3.00> (?)` | |
   | 
[...org/apache/hudi/client/utils/SparkConfigUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L3V0aWxzL1NwYXJrQ29uZmlnVXRpbHMuamF2YQ==)
 | `96.15% <100.00%> (+0.15%)` | `2.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `85.10% <100.00%> (+0.25%)` | `47.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/index/HoodieIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvSG9vZGllSW5kZXguamF2YQ==)
 | `89.47% <100.00%> (+1.23%)` | `3.00 <1.00> (ø)` | |
   | 
[.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh)
 | `96.96% <100.00%> (-0.40%)` | `16.00 <2.00> (-2.00)` | |
   | 
[...pache/hudi/index/bloom/HoodieGlobalBloomIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllR2xvYmFsQmxvb21JbmRleC5qYXZh)
 | `91.66% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | ... and [8 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1402/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1402?src=pr=footer).
 Last update 


[GitHub] [incubator-hudi] nsivabalan commented on pull request #1402: [HUDI-407] Adding Simple Index

2020-05-15 Thread GitBox


nsivabalan commented on pull request #1402:
URL: https://github.com/apache/incubator-hudi/pull/1402#issuecomment-629558212


   Squashed all commits to one @vinothchandar 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

2020-05-15 Thread GitBox


nsivabalan commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629554509


   @bvaradar @vinothchandar : adding the null and default logic looks good to me. 
Do you folks suggest creating a new Schema altogether to have a neat solution, 
or doing it in place with reflection as the patch does now? 
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] leesf commented on pull request #1622: [HUDI-888] fix NullPointerException

2020-05-15 Thread GitBox


leesf commented on pull request #1622:
URL: https://github.com/apache/incubator-hudi/pull/1622#issuecomment-629554166


   > cc @hddong in case this is a sign of some hardcode ports etc.
   
   I did look into the CLI test code and did not find any hardcoded ports; I 
restarted Travis three times, and the failure occurs only occasionally.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched

2020-05-15 Thread GitBox


bvaradar commented on pull request #1524:
URL: https://github.com/apache/incubator-hudi/pull/1524#issuecomment-629547495


   @afilipchik : Please take a look.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1566: [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode

2020-05-15 Thread GitBox


afilipchik commented on a change in pull request #1566:
URL: https://github.com/apache/incubator-hudi/pull/1566#discussion_r426074416



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaSet.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.schema;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaNormalization;
+
+import java.util.Set;
+
+/**
+ * Tracks already processed schemas.
+ */
+public class SchemaSet implements Serializable {
+

Review comment:
   should we add serialVersionUID? If it is not specified and anything the 
class references is shaded, it will affect the autogenerated value, which causes 
issues if there is more than one version of the class in the classpath that was 
shaded differently. 
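
A minimal sketch of the suggestion, assuming the `Set<Long>` field above: pin `serialVersionUID` explicitly so shading cannot shift the autogenerated value.

```java
import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;

public class SchemaSet implements Serializable {

  // Pinning the UID keeps a shaded and an unshaded copy of this class
  // serialization-compatible, per the review comment above.
  private static final long serialVersionUID = 1L;

  private final Set<Long> processedSchema = new HashSet<>();
}
```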

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
##
@@ -81,11 +66,22 @@ private static Schema getSchema(String registryUrl) throws 
IOException {
 
   @Override
   public Schema getSourceSchema() {
-    return schema;
+    String registryUrl = config.getString(Config.SRC_SCHEMA_REGISTRY_URL_PROP);
+    try {
+      return getSchema(registryUrl);
+    } catch (IOException ioe) {
+      throw new HoodieIOException("Error reading source schema from registry :" + registryUrl, ioe);
+    }
   }
 
   @Override
   public Schema getTargetSchema() {
-    return targetSchema;
+    String registryUrl = config.getString(Config.SRC_SCHEMA_REGISTRY_URL_PROP);
+    String targetRegistryUrl = config.getString(Config.TARGET_SCHEMA_REGISTRY_URL_PROP, registryUrl);
+    try {
+      return getSchema(targetRegistryUrl);

Review comment:
   it might result in target schema != source schema when targetRegistryUrl 
is not specified, since the schema might change between the getSourceSchema and 
getTargetSchema calls. Is that a problem? 
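
One possible way to address this (a hypothetical sketch, not in the patch; the class and interface names are made up): resolve the schema once and serve both source and target from the same snapshot when they share one registry URL.

```java
import java.io.IOException;

import org.apache.avro.Schema;

public class SchemaSnapshotProvider {

  /** Abstracts the registry call quoted above; hypothetical interface. */
  public interface SchemaFetcher {
    Schema fetch(String registryUrl) throws IOException;
  }

  private final SchemaFetcher fetcher;
  private final String registryUrl;
  private Schema cached;

  public SchemaSnapshotProvider(SchemaFetcher fetcher, String registryUrl) {
    this.fetcher = fetcher;
    this.registryUrl = registryUrl;
  }

  /** Fetches exactly once; source and target consumers share this snapshot. */
  public synchronized Schema get() throws IOException {
    if (cached == null) {
      cached = fetcher.fetch(registryUrl);
    }
    return cached;
  }
}
```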

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/Compactor.java
##
@@ -59,6 +60,10 @@ public void compact(HoodieInstant instant) throws IOException {
           "Compaction for instant (" + instant + ") failed with write errors. Errors :" + numWriteErrors);
     }
     // Commit compaction
-    compactionClient.commitCompaction(instant.getTimestamp(), res, Option.empty());
+    writeClient.commitCompaction(instant.getTimestamp(), res, Option.empty());
+  }
+
+  public void updateWriteClient(HoodieWriteClient writeClient) {

Review comment:
   is it used anywhere? 

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaSet.java
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.schema;
+
+import java.io.Serializable;
+import java.util.HashSet;
+import org.apache.avro.Schema;
+import org.apache.avro.SchemaNormalization;
+
+import java.util.Set;
+
+/**
+ * Tracks already processed schemas.
+ */
+public class SchemaSet implements Serializable {
+
+  private final Set<Long> processedSchema = new HashSet<>();

Review comment:
   will this grow indefinitely? How would we remove old schemas? 
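
One possible answer (a hypothetical sketch, not in the PR): bound the set with LRU eviction so old schema fingerprints age out instead of growing without limit.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class BoundedSchemaSet {

  private static final int MAX_ENTRIES = 1024;

  // Access-ordered LinkedHashMap with removeEldestEntry gives a simple LRU;
  // Collections.newSetFromMap turns it into a Set<Long> view.
  private final Set<Long> processedSchema = Collections.newSetFromMap(
      new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
          return size() > MAX_ENTRIES;
        }
      });

  /** Returns true if this fingerprint was not seen before. */
  public boolean markProcessed(long schemaFingerprint) {
    return processedSchema.add(schemaFingerprint);
  }
}
```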

##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/SchemaRegistryProvider.java
##
@@ -81,11 +66,22 @@ private static Schema getSchema(String registryUrl) 

[GitHub] [incubator-hudi] vingov commented on pull request #1632: [HUDI-783] Added python3 to the spark_base docker image to support pyspark

2020-05-15 Thread GitBox


vingov commented on pull request #1632:
URL: https://github.com/apache/incubator-hudi/pull/1632#issuecomment-629521740


   @bhasudha - As we discussed, I've followed the steps to test the docker 
images using a local registry; check out the detailed testing report 
[here](https://gist.github.com/vingov/347548f2df9af11892c8331123b42967)
   
   Can you please review and merge this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vingov opened a new pull request #1632: Added python3 to the spark_base docker image to support pyspark

2020-05-15 Thread GitBox


vingov opened a new pull request #1632:
URL: https://github.com/apache/incubator-hudi/pull/1632


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   This pull request adds python3 to the apachehudi/sparkbase docker image, 
which is required to run the pyspark shell command. This is part of the effort to 
add official python support to the hudi package.
   
   After merging this pull request, we can run pyspark on the sparkmaster, 
sparkworker, and sparkadhoc docker images. 
   
   ## Brief change log
   
 - *Added python-support to the hudi demo docker images to enable pyspark 
shell*
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
 - *Manually verified the change by building the docker image locally, 
registering with the local Docker registry, using the local docker image in the 
docker-compose file, started all the docker images and tested the change by 
invoking pyspark from the adhoc-1 machine via command-line.*
   
   The details of the testing can be found in this gist: 
https://gist.github.com/vingov/347548f2df9af11892c8331123b42967
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-110) Better defaults for Partition extractor for Spark DataSource and DeltaStreamer

2020-05-15 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108663#comment-17108663
 ] 

Bhavani Sudha commented on HUDI-110:


[~garyli1019] all yours. Re-assigned it to you.

> Better defaults for Partition extractor for Spark DataSource and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently SlashEncodedDayPartitionValueExtractor is the default being used. 
> This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause which has not been 
> integrated for the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.
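
For reference, a minimal sketch (plain Spark, illustrative column names, 
assumes an existing DataFrame df) of the partitionBy clause the ticket refers 
to:
{code}
import org.apache.spark.sql.SaveMode

// Plain Spark partitioning: each distinct (region, day) pair becomes a
// directory, which is the behavior the Hudi Data Source would need to honor.
df.write.format("parquet")
  .partitionBy("region", "day")
  .mode(SaveMode.Overwrite)
  .save("/tmp/out")
{code}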



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-110) Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer

2020-05-15 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-110:
--

Assignee: Yanjia Gary Li  (was: Bhavani Sudha Saktheeswaran)

> Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently SlashEncodedDayPartitionValueExtractor is the default being used. 
> This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause which has not been 
> integrated for the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on pull request #1514: [HUDI-774] Addressing incorrect Spark to Avro schema generation

2020-05-15 Thread GitBox


bvaradar commented on pull request #1514:
URL: https://github.com/apache/incubator-hudi/pull/1514#issuecomment-629480819


   Also pinging @umehrot2 to get your help in reviewing this as you are 
familiar with this part.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] afeldman1 commented on issue #933: Support for multiple level partitioning in Hudi

2020-05-15 Thread GitBox


afeldman1 commented on issue #933:
URL: https://github.com/apache/incubator-hudi/issues/933#issuecomment-629478392


   Similarly: should we also add constants for the hive sync configuration
   
   val HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY = 
"hoodie.datasource.hive_sync.partition_extractor_class"
   
   i.e. constants for:
   classOf[MultiPartKeysValueExtractor].getCanonicalName
   classOf[NonPartitionedExtractor].getCanonicalName
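   
   A sketch of what those additions could look like (the constant names here 
are hypothetical, not merged API):
   
   import org.apache.hudi.hive.{MultiPartKeysValueExtractor, NonPartitionedExtractor}
   
   val HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY = 
"hoodie.datasource.hive_sync.partition_extractor_class"
   // Hypothetical constant names for the two extractor values:
   val MULTIPART_EXTRACTOR_CLASS_OPT_VAL = classOf[MultiPartKeysValueExtractor].getCanonicalName
   val NONPARTITIONED_EXTRACTOR_CLASS_OPT_VAL = classOf[NonPartitionedExtractor].getCanonicalName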



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] xushiyan commented on pull request #1584: fix schema provider issue

2020-05-15 Thread GitBox


xushiyan commented on pull request #1584:
URL: https://github.com/apache/incubator-hudi/pull/1584#issuecomment-629472943


   Ok @bvaradar thanks for checking. I shall be able to do it in the late 
afternoon.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1584: fix schema provider issue

2020-05-15 Thread GitBox


bvaradar commented on pull request #1584:
URL: https://github.com/apache/incubator-hudi/pull/1584#issuecomment-629467290


   @xushiyan : The idea and code changes look good to me. Can you add a JIRA 
ticket and a unit test covering this change? It would be great if you could 
get this into 0.5.3. If you cannot get to it today, let me know and I will 
try to help 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] afeldman1 commented on issue #933: Support for multiple level partitioning in Hudi

2020-05-15 Thread GitBox


afeldman1 commented on issue #933:
URL: https://github.com/apache/incubator-hudi/issues/933#issuecomment-629455257


   Thank you! That works. Should this be added to the org.apache.hudi 
DataSourceOptions.scala?
   
   Right now it has:
   /**
   * Key generator class, that implements will extract the key out of 
incoming record
   *
   */
 val KEYGENERATOR_CLASS_OPT_KEY = 
"hoodie.datasource.write.keygenerator.class"
 val DEFAULT_KEYGENERATOR_CLASS_OPT_VAL = 
classOf[SimpleKeyGenerator].getName
   
   So we could add (open to opinions on the variable names):
   val COMPLEX_KEYGENERATOR_CLASS_OPT_VAL = classOf[ComplexKeyGenerator].getName
   val NOPARTITION_KEYGENERATOR_CLASS_OPT_VAL = 
classOf[NonpartitionedKeyGenerator].getName
   
   (I saw the no partition option being described in the FAQ wiki you linked to 
above)
   
   This would follow the pattern in the DataSourceWriteOptions object, so that 
it's easier to see what options are available for the parameter. I haven't 
contributed to this project yet, but if you agree with this change I can try 
to add it in.
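   
   For reference, a self-contained sketch of that proposal, following the 
existing pattern (the two new constant names are the suggestion above, not 
existing API):
   
   import org.apache.hudi.keygen.{ComplexKeyGenerator, NonpartitionedKeyGenerator, SimpleKeyGenerator}
   
   val KEYGENERATOR_CLASS_OPT_KEY = "hoodie.datasource.write.keygenerator.class"
   val DEFAULT_KEYGENERATOR_CLASS_OPT_VAL = classOf[SimpleKeyGenerator].getName
   // Proposed additions:
   val COMPLEX_KEYGENERATOR_CLASS_OPT_VAL = classOf[ComplexKeyGenerator].getName
   val NOPARTITION_KEYGENERATOR_CLASS_OPT_VAL = classOf[NonpartitionedKeyGenerator].getName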



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1524: [HUDI-801] Adding a way to post process schema after it is fetched

2020-05-15 Thread GitBox


bvaradar commented on pull request #1524:
URL: https://github.com/apache/incubator-hudi/pull/1524#issuecomment-629454559


   Rebased to get the correct view of the diff



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1566: [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode

2020-05-15 Thread GitBox


codecov-io edited a comment on pull request #1566:
URL: https://github.com/apache/incubator-hudi/pull/1566#issuecomment-619623233


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1566?src=pr=h1) 
Report
   > Merging 
[#1566](https://codecov.io/gh/apache/incubator-hudi/pull/1566?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/83796b3189570182c68a9c41e57b356124c301ca=desc)
 will **decrease** coverage by `0.01%`.
   > The diff coverage is `69.81%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1566/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1566?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1566  +/-   ##
   
   - Coverage 71.80%   71.79%   -0.02% 
   - Complexity 1087 1095   +8 
   
 Files   385  387   +2 
 Lines 1659116645  +54 
 Branches   1669 1675   +6 
   
   + Hits  1191311950  +37 
   - Misses 3949 3962  +13 
   - Partials729  733   +4 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1566?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../hudi/utilities/schema/SchemaRegistryProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFSZWdpc3RyeVByb3ZpZGVyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...a/org/apache/hudi/client/AbstractHoodieClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllQ2xpZW50LmphdmE=)
 | `75.00% <50.00%> (-3.95%)` | `6.00 <0.00> (ø)` | |
   | 
[...i/utilities/keygen/TimestampBasedKeyGenerator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2tleWdlbi9UaW1lc3RhbXBCYXNlZEtleUdlbmVyYXRvci5qYXZh)
 | `58.82% <61.11%> (-0.16%)` | `7.00 <1.00> (+2.00)` | :arrow_down: |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `78.89% <70.00%> (-1.11%)` | `11.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.93% <76.92%> (+0.48%)` | `40.00 <2.00> (+3.00)` | |
   | 
[.../client/embedded/EmbeddedTimelineServerHelper.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2VtYmVkZGVkL0VtYmVkZGVkVGltZWxpbmVTZXJ2ZXJIZWxwZXIuamF2YQ==)
 | `80.00% <80.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...apache/hudi/utilities/deltastreamer/Compactor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvQ29tcGFjdG9yLmphdmE=)
 | `68.75% <80.00%> (-8.18%)` | `3.00 <0.00> (ø)` | |
   | 
[...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==)
 | `72.41% <100.00%> (-0.17%)` | `0.00 <0.00> (ø)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `100.00% <100.00%> (ø)` | `3.00 <3.00> (?)` | |
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [5 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1566/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1566?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 

[GitHub] [incubator-hudi] bvaradar merged pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar merged pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar commented on pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#issuecomment-629445679


   Going ahead and merging this change. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch master updated: [HUDI-723] Register avro schema if infered from SQL transformation (#1518)

2020-05-15 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 25e0b75  [HUDI-723] Register avro schema if infered from SQL 
transformation (#1518)
25e0b75 is described below

commit 25e0b75b3d03b6d460dc18d1a5fce7b881b0e019
Author: Alexander Filipchik 
AuthorDate: Fri May 15 12:44:03 2020 -0700

[HUDI-723] Register avro schema if infered from SQL transformation (#1518)

* Register avro schema if infered from SQL transformation
* Make HoodieWriteClient creation done lazily always. Handle setting 
schema-provider and avro-schemas correctly when using SQL transformer

Co-authored-by: Alex Filipchik 
Co-authored-by: Balaji Varadarajan 
---
 .../hudi/utilities/deltastreamer/DeltaSync.java| 39 +
 .../utilities/schema/DelegatingSchemaProvider.java | 51 ++
 2 files changed, 73 insertions(+), 17 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
index 210c948..fd051ed 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -43,6 +43,7 @@ import org.apache.hudi.keygen.KeyGenerator;
 import org.apache.hudi.utilities.UtilHelpers;
 import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Operation;
 import org.apache.hudi.utilities.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.utilities.schema.DelegatingSchemaProvider;
 import org.apache.hudi.utilities.schema.RowBasedSchemaProvider;
 import org.apache.hudi.utilities.schema.SchemaProvider;
 import org.apache.hudi.utilities.sources.InputBatch;
@@ -98,6 +99,11 @@ public class DeltaSync implements Serializable {
   private transient SourceFormatAdapter formatAdapter;
 
   /**
+   * User Provided Schema Provider.
+   */
+  private transient SchemaProvider userProvidedSchemaProvider;
+
+  /**
* Schema provider that supplies the command for reading the input and 
writing out the target table.
*/
   private transient SchemaProvider schemaProvider;
@@ -162,20 +168,18 @@ public class DeltaSync implements Serializable {
 this.fs = fs;
 this.onInitializingHoodieWriteClient = onInitializingHoodieWriteClient;
 this.props = props;
-this.schemaProvider = schemaProvider;
+this.userProvidedSchemaProvider = schemaProvider;
 
 refreshTimeline();
+// Register User Provided schema first
+registerAvroSchemas(schemaProvider);
 
 this.transformer = 
UtilHelpers.createTransformer(cfg.transformerClassNames);
 this.keyGenerator = DataSourceUtils.createKeyGenerator(props);
 
 this.formatAdapter = new SourceFormatAdapter(
 UtilHelpers.createSource(cfg.sourceClassName, props, jssc, 
sparkSession, schemaProvider));
-
 this.conf = conf;
-
-// If schemaRegistry already resolved, setup write-client
-setupWriteClient();
   }
 
   /**
@@ -218,8 +222,7 @@ public class DeltaSync implements Serializable {
 if (null != srcRecordsWithCkpt) {
   // this is the first input batch. If schemaProvider not set, use it and 
register Avro Schema and start
   // compactor
-  if (null == schemaProvider) {
-// Set the schemaProvider if not user-provided
+  if (null == writeClient) {
 this.schemaProvider = srcRecordsWithCkpt.getKey();
 // Setup HoodieWriteClient and compaction now that we decided on schema
 setupWriteClient();
@@ -280,26 +283,28 @@ public class DeltaSync implements Serializable {
   Option> transformed =
   dataAndCheckpoint.getBatch().map(data -> 
transformer.get().apply(jssc, sparkSession, data, props));
   checkpointStr = dataAndCheckpoint.getCheckpointForNextBatch();
-  if (this.schemaProvider != null && this.schemaProvider.getTargetSchema() 
!= null) {
+  if (this.userProvidedSchemaProvider != null && 
this.userProvidedSchemaProvider.getTargetSchema() != null) {
 // If the target schema is specified through Avro schema,
 // pass in the schema for the Row-to-Avro conversion
 // to avoid nullability mismatch between Avro schema and Row schema
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
-t, this.schemaProvider.getTargetSchema(),
+t, this.userProvidedSchemaProvider.getTargetSchema(),
 HOODIE_RECORD_STRUCT_NAME, 
HOODIE_RECORD_NAMESPACE).toJavaRDD());
+schemaProvider = this.userProvidedSchemaProvider;
   } else {
+// Use Transformed Row's schema if not overridden. If target schema is 
not specified
+// default to RowBasedSchemaProvider
+

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar commented on a change in pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#discussion_r426010971



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -296,10 +296,21 @@ private void refreshTimeline() throws IOException {
 
   // Use Transformed Row's schema if not overridden. If target schema is 
not specified
   // default to RowBasedSchemaProvider
-  schemaProvider = this.schemaProvider == null || 
this.schemaProvider.getTargetSchema() == null
-  ? transformed.map(r -> (SchemaProvider) new 
RowBasedSchemaProvider(r.schema())).orElse(
-  dataAndCheckpoint.getSchemaProvider())
-  : this.schemaProvider;
+  if (this.schemaProvider == null) {
+schemaProvider =
+transformed
+.map(r -> (SchemaProvider) new 
RowBasedSchemaProvider(r.schema()))
+.orElse(dataAndCheckpoint.getSchemaProvider());
+  } else if (this.schemaProvider.getTargetSchema() == null) {
+schemaProvider =

Review comment:
   Fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1566: [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode

2020-05-15 Thread GitBox


bvaradar commented on pull request #1566:
URL: https://github.com/apache/incubator-hudi/pull/1566#issuecomment-629443388


   @pratyakshsharma : Just rebased and did some cleanup.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1566: [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode

2020-05-15 Thread GitBox


pratyakshsharma commented on pull request #1566:
URL: https://github.com/apache/incubator-hudi/pull/1566#issuecomment-629430390


   > @pratyakshsharma : I updated this PR to address comments in the interest 
of reducing the review cycle time.
   
   I went through the changes. Looks good. I guess we can close it then or is 
there anything else to be done here? @bvaradar 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1538: [HUDI-803]: added more test cases in TestHoodieAvroUtils.class

2020-05-15 Thread GitBox


pratyakshsharma commented on pull request #1538:
URL: https://github.com/apache/incubator-hudi/pull/1538#issuecomment-629419001


   @vinothchandar We can close this now :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on issue #933: Support for multiple level partitioning in Hudi

2020-05-15 Thread GitBox


vinothchandar commented on issue #933:
URL: https://github.com/apache/incubator-hudi/issues/933#issuecomment-629410586


   @afeldman1  We changed package names in 0.5.2. Can you please try 
`org.apache.hudi.keygen.ComplexKeyGenerator`?
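   
   For completeness, a minimal usage sketch (table and field names are 
illustrative; assumes an existing DataFrame df) showing multi-column 
partitioning with the relocated class:
   
   df.write.format("org.apache.hudi")
     .option("hoodie.table.name", "my_table")
     .option("hoodie.datasource.write.recordkey.field", "id")
     .option("hoodie.datasource.write.precombine.field", "ts")
     // ComplexKeyGenerator accepts a comma-separated list of partition columns
     .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
     .option("hoodie.datasource.write.partitionpath.field", "region,day")
     .mode("append")
     .save("/tmp/hudi/my_table")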



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] afeldman1 commented on issue #933: Support for multiple level partitioning in Hudi

2020-05-15 Thread GitBox


afeldman1 commented on issue #933:
URL: https://github.com/apache/incubator-hudi/issues/933#issuecomment-629409990


   It looks like org.apache.hudi.ComplexKeyGenerator no longer exists. How can 
multiple columns be used as the partition columns now?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-15 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-528.
-
Resolution: Fixed

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}
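
For context, a minimal sketch of the incremental pull that hits this path 
(paths and instant times are illustrative; option names as per the 0.5.x read 
options, so treat them as an assumption):
{code}
val incDf = spark.read.format("org.apache.hudi")
  .option("hoodie.datasource.view.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20200101000000")
  .load("/data/hudi/my_table")
// Fails as above when the newest commit in the range has no write stats.
{code}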



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-15 Thread GitBox


vinothchandar commented on pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596#issuecomment-629286244


   @bhasudha this is a good 0.5.3 candidate.. if we are crunched on time, I can 
also write the test and push/merge tonight :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on pull request #1622: [HUDI-888] fix NullPointerException

2020-05-15 Thread GitBox


vinothchandar commented on pull request #1622:
URL: https://github.com/apache/incubator-hudi/pull/1622#issuecomment-629284848


   cc @hddong in case this is a sign of some hardcoded ports etc.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-05-15 Thread GitBox


yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629263967


   @pratyakshsharma there are still conflicting files



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] EdwinGuo closed issue #1630: [SUPPORT] Latest commit does not have any schema in commit metadata

2020-05-15 Thread GitBox


EdwinGuo closed issue #1630:
URL: https://github.com/apache/incubator-hudi/issues/1630


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] EdwinGuo commented on issue #1630: [SUPPORT] Latest commit does not have any schema in commit metadata

2020-05-15 Thread GitBox


EdwinGuo commented on issue #1630:
URL: https://github.com/apache/incubator-hudi/issues/1630#issuecomment-629221132


   Ok, thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] wangxianghu commented on a change in pull request #1409: [HUDI-714]Add javadoc and comments to hudi write method link

2020-05-15 Thread GitBox


wangxianghu commented on a change in pull request #1409:
URL: https://github.com/apache/incubator-hudi/pull/1409#discussion_r425775385



##
File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
##
@@ -241,6 +241,13 @@ public static HoodieRecord 
createHoodieRecord(GenericRecord gr, Comparable order
 return new HoodieRecord<>(hKey, payload);
   }
 
+  /**
+   * Drop duplicate records from incoming records.

Review comment:
   @nsivabalan Done, thanks!





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-901) Bug Bash 0.6.0 Tracking Ticket

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-901:
-
Fix Version/s: 0.6.0

> Bug Bash 0.6.0 Tracking Ticket
> --
>
> Key: HUDI-901
> URL: https://issues.apache.org/jira/browse/HUDI-901
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> This is a tracking ticket for all bug bash 0.6.0 tickets. 
> We have done our best to assign tickets to those who might have good context 
> and to those who volunteered for the bug bash. The cursory assignment is just 
> to help you out and by no means forces you to work on a ticket. If you feel 
> you can't work on it, please unassign yourself, or swap with someone here. 
> All tickets are labelled with "bug-bash-0.6.0". If anyone feels like pitching 
> in with work you have done or are currently doing, feel free to add the 
> label, but don't remove it from existing ones. 
> Some tickets are support ones, which might need follow-up 
> questions/clarifications with the reporter of the ticket. For those, try to 
> start working right away so we can drive to completion by the end of day 10. 
> We are looking to time-box this to 10 days, planning to wrap up by the 27th 
> of May. 
> Again, we totally understand that some tickets may not be completed in time 
> for various reasons: support questions, not being able to repro locally, env 
> mismatch, being swamped with PR work, or not having cycles during these 10 
> days, etc. Let's try our best to take these to completion.
> We are all ears for any questions or clarifications. Please respond here in 
> Jira, or send an email to our mailing list in the bug bash thread.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-901) Bug Bash 0.6.0 Tracking Ticket

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-901:


Assignee: sivabalan narayanan

> Bug Bash 0.6.0 Tracking Ticket
> --
>
> Key: HUDI-901
> URL: https://issues.apache.org/jira/browse/HUDI-901
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> This is a tracking ticket for all bug bash 0.6.0 tickets. 
> We have done our best to assign tickets to those who might have good context 
> and to those who volunteered for the bug bash. The cursory assignment is just 
> to help you out and by no means forces you to work on a ticket. If you feel 
> you can't work on it, please unassign yourself, or swap with someone here. 
> All tickets are labelled with "bug-bash-0.6.0". If anyone feels like pitching 
> in with work you have done or are currently doing, feel free to add the 
> label, but don't remove it from existing ones. 
> Some tickets are support ones, which might need follow-up 
> questions/clarifications with the reporter of the ticket. For those, try to 
> start working right away so we can drive to completion by the end of day 10. 
> We are looking to time-box this to 10 days, planning to wrap up by the 27th 
> of May. 
> Again, we totally understand that some tickets may not be completed in time 
> for various reasons: support questions, not being able to repro locally, env 
> mismatch, being swamped with PR work, or not having cycles during these 10 
> days, etc. Let's try our best to take these to completion.
> We are all ears for any questions or clarifications. Please respond here in 
> Jira, or send an email to our mailing list in the bug bash thread.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-901) Bug Bash 0.6.0 Tracking Ticket

2020-05-15 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-901:


 Summary: Bug Bash 0.6.0 Tracking Ticket
 Key: HUDI-901
 URL: https://issues.apache.org/jira/browse/HUDI-901
 Project: Apache Hudi (incubating)
  Issue Type: Task
Reporter: sivabalan narayanan


This is a tracking ticket for all bug bash 0.6.0 tickets. 

We have done our best to assign tickets to those who might have good context 
and to those who volunteered for the bug bash. The cursory assignment is just 
to help you out and by no means forces you to work on a ticket. If you feel you 
can't work on it, please unassign yourself, or swap with someone here. 

All tickets are labelled with "bug-bash-0.6.0". If anyone feels like pitching 
in with work you have done or are currently doing, feel free to add the label, 
but don't remove it from existing ones. 

Some tickets are support ones, which might need follow-up 
questions/clarifications with the reporter of the ticket. For those, try to 
start working right away so we can drive to completion by the end of day 10. 

We are looking to time-box this to 10 days, planning to wrap up by the 27th of 
May. 

Again, we totally understand that some tickets may not be completed in time for 
various reasons: support questions, not being able to repro locally, env 
mismatch, being swamped with PR work, or not having cycles during these 10 
days, etc. Let's try our best to take these to completion.

We are all ears for any questions or clarifications. Please respond here in 
Jira, or send an email to our mailing list in the bug bash thread.

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1409: [HUDI-714]Add javadoc and comments to hudi write method link

2020-05-15 Thread GitBox


codecov-io edited a comment on pull request #1409:
URL: https://github.com/apache/incubator-hudi/pull/1409#issuecomment-599323873


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1409?src=pr=h1) 
Report
   > Merging 
[#1409](https://codecov.io/gh/apache/incubator-hudi/pull/1409?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/9059bce977cee98e2d65769622c46a1941c563dd=desc)
 will **decrease** coverage by `0.02%`.
   > The diff coverage is `75.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1409/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1409?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1409  +/-   ##
   
   - Coverage 71.81%   71.78%   -0.03% 
   - Complexity  294 1090 +796 
   
 Files   385  385  
 Lines 1654016600  +60 
 Branches   1661 1666   +5 
   
   + Hits  1187811917  +39 
   - Misses 3932 3955  +23 
   + Partials730  728   -2 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1409?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `77.35% <ø> (ø)` | `13.00 <0.00> (+13.00)` | |
   | 
[...in/java/org/apache/hudi/table/WorkloadProfile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvV29ya2xvYWRQcm9maWxlLmphdmE=)
 | `100.00% <ø> (ø)` | `9.00 <0.00> (+9.00)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `88.88% <ø> (ø)` | `4.00 <0.00> (+4.00)` | |
   | 
[...pache/hudi/common/table/HoodieTableMetaClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL0hvb2RpZVRhYmxlTWV0YUNsaWVudC5qYXZh)
 | `83.11% <ø> (ø)` | `30.00 <0.00> (+30.00)` | |
   | 
[...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==)
 | `55.55% <ø> (-1.15%)` | `0.00 <0.00> (ø)` | |
   | 
[...i/utilities/deltastreamer/HoodieDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllRGVsdGFTdHJlYW1lci5qYXZh)
 | `80.00% <0.00%> (ø)` | `11.00 <0.00> (ø)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `72.58% <75.00%> (+0.13%)` | `37.00 <0.00> (ø)` | |
   | 
[...s/deltastreamer/HoodieMultiTableDeltaStreamer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvSG9vZGllTXVsdGlUYWJsZURlbHRhU3RyZWFtZXIuamF2YQ==)
 | `78.39% <100.00%> (ø)` | `18.00 <0.00> (ø)` | |
   | 
[...le/action/rollback/BaseRollbackActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL0Jhc2VSb2xsYmFja0FjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `70.83% <0.00%> (-6.95%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../apache/hudi/common/table/TableSchemaResolver.java](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL1RhYmxlU2NoZW1hUmVzb2x2ZXIuamF2YQ==)
 | `56.71% <0.00%> (-5.47%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [27 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1409/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1409?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 

[GitHub] [incubator-hudi] codecov-io commented on pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


codecov-io commented on pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#issuecomment-629195380


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=h1) 
Report
   > Merging 
[#1518](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/a64afdfd17ac974e451bceb877f3d40a9c775253=desc)
 will **increase** coverage by `0.06%`.
   > The diff coverage is `94.11%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1518/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1518  +/-   ##
   
   + Coverage 71.75%   71.81%   +0.06% 
   - Complexity 1089 1092   +3 
   
 Files   385  386   +1 
 Lines 1659916608   +9 
 Branches   1668 1667   -1 
   
   + Hits  1191011927  +17 
   + Misses 3962 3955   -7 
   + Partials727  726   -1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/incubator-hudi/pull/1518/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `73.86% <90.90%> (+1.42%)` | `36.00 <1.00> (-1.00)` | :arrow_up: |
   | 
[...udi/utilities/schema/DelegatingSchemaProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1518/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9EZWxlZ2F0aW5nU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | `100.00% <100.00%> (ø)` | `3.00 <3.00> (?)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1518/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `67.56% <0.00%> (+10.81%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1518/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `80.00% <0.00%> (+40.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=footer).
 Last update 
[a64afdf...8fe14bd](https://codecov.io/gh/apache/incubator-hudi/pull/1518?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-863:


Assignee: Roland Johann

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Assignee: Roland Johann
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Currently the avro schema gets passed to 
> AvroConversionHelper.createConverterToAvro, which itself processes the passed 
> spark sql DataTypes recursively to resolve structs, arrays, etc. The 
> AvroSchema gets passed into the recursion, but without selecting the relevant 
> field and therefore the schema of that field. That leads to a null pointer 
> exception when decimal types are processed, because in that case the schema 
> of the field is retrieved by calling getField on the root schema, which is 
> not defined when we deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the avro schema and 
> derive the particular avro schema for the decimal converter creator case only.
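
A minimal sketch of the proposed direction, deriving the decimal Avro schema 
from the Spark type alone (assumes Avro's standard decimal logical type API):
{code}
import org.apache.avro.{LogicalTypes, Schema}
import org.apache.spark.sql.types.DecimalType

// Build the field-level Avro schema directly from the Spark DecimalType,
// instead of looking the field up on the (possibly nested) root schema.
def decimalAvroSchema(dt: DecimalType): Schema =
  LogicalTypes.decimal(dt.precision, dt.scale)
    .addToSchema(Schema.create(Schema.Type.BYTES))
{code}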



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-722:


Assignee: sivabalan narayanan  (was: lamber-ken)

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException: Invalid array 
> range: X to X inside the MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails, as the size of r is the same as the current level. What could be 
> causing it?
>  
> It gets executed via ParquetWriter.write(IndexedRecord), library version 
> 1.10.1. The Avro record is a very complex object (~2.5k columns, highly 
> nested, arrays of unions present).
> But what is surprising is that it fails to write the top-level field 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time], which is 
> the first top-level field in the Avro record: {"_hoodie_commit_time": 
> "20200317215711", "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-767) Support transformation when export to Hudi

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-767:


Assignee: Raymond Xu

> Support transformation when export to Hudi
> --
>
> Key: HUDI-767
> URL: https://issues.apache.org/jira/browse/HUDI-767
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.1
>
>
> Main logic described in 
> https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
> In HoodieSnapshotExporter, we could extend the feature to include 
> transformation when --output-format hudi, using a custom Transformer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-859) Improve documentation around key generators

2020-05-15 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108208#comment-17108208
 ] 

sivabalan narayanan commented on HUDI-859:
--

[~hongdongdong]: please discuss with [~Pratyaksh] what needs to be done for this. 

> Improve documentation around key generators
> ---
>
> Key: HUDI-859
> URL: https://issues.apache.org/jira/browse/HUDI-859
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Pratyaksh Sharma
>Assignee: hong dongdong
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Proper documentation is required to help users understand what all key 
> generators are currently supported, how to use them etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-859) Improve documentation around key generators

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-859:


Assignee: hong dongdong  (was: Pratyaksh Sharma)

> Improve documentation around key generators
> ---
>
> Key: HUDI-859
> URL: https://issues.apache.org/jira/browse/HUDI-859
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Pratyaksh Sharma
>Assignee: hong dongdong
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Proper documentation is required to help users understand what all key 
> generators are currently supported, how to use them etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-13) Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive cluster #553

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-13:
---

Assignee: sivabalan narayanan

> Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive 
> cluster #553
> 
>
> Key: HUDI-13
> URL: https://issues.apache.org/jira/browse/HUDI-13
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, Hive Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> https://github.com/uber/hudi/issues/553



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-4) Support for writing to EMRFS

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-4:
--

Assignee: vinoyang

> Support for writing to EMRFS
> 
>
> Key: HUDI-4
> URL: https://issues.apache.org/jira/browse/HUDI-4
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: newbie, Usability, Writer Core
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> https://github.com/uber/hudi/issues/588



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-13) Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive cluster #553

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-13?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-13:
---

Assignee: vinoyang  (was: sivabalan narayanan)

> Clarify whether the hoodie-hadoop-mr jars need to be rolled out across Hive 
> cluster #553
> 
>
> Key: HUDI-13
> URL: https://issues.apache.org/jira/browse/HUDI-13
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Docs, Hive Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> https://github.com/uber/hudi/issues/553



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-395) hudi does not support scheme s3n when wrtiing to S3

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-395:


Assignee: sivabalan narayanan  (was: Raymond Xu)

> hudi does not support scheme s3n when wrtiing to S3
> ---
>
> Key: HUDI-395
> URL: https://issues.apache.org/jira/browse/HUDI-395
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie, Spark Integration, Usability
> Environment: spark-2.4.4-bin-hadoop2.7
>Reporter: rui feng
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> When I used Hudi to create a hudi table and write to s3, I used the below 
> maven snippet, which is recommended by [https://hudi.apache.org/s3_hoodie.html]:
> <dependency>
>   <groupId>org.apache.hudi</groupId>
>   <artifactId>hudi-spark-bundle</artifactId>
>   <version>0.5.0-incubating</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>2.7.3</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk</artifactId>
>   <version>1.10.34</version>
> </dependency>
> and add the below configuration:
> sc.hadoopConfiguration.set("fs.defaultFS", "s3://niketest1")
>  sc.hadoopConfiguration.set("fs.s3.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "x")
>  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "x")
>  
> My spark version is spark-2.4.4-bin-hadoop2.7, and I run the below:
> df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath)
> val hudiOptions = Map[String,String](
>  HoodieWriteConfig.TABLE_NAME -> "hudi12",
>  DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
>  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "rider",
>  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
> val hudiTablePath = "s3://niketest1/hudi_test/hudi12"
> the exception occurs:
> java.lang.IllegalArgumentException: BlockAlignedAvroParquetWriter does not 
> support scheme s3n
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.getHoodieScheme(HoodieWrapperFileSystem.java:109)
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.convertToHoodiePath(HoodieWrapperFileSystem.java:85)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.(HoodieParquetWriter.java:57)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.newParquetStorageWriter(HoodieStorageWriterFactory.java:60)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.getStorageWriter(HoodieStorageWriterFactory.java:44)
>  at org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:70)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:137)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:120)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  
>  
> Can anyone tell me what's causing this exception? I tried using 
> org.apache.hadoop.fs.s3.S3FileSystem to replace 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem for the conf "fs.s3.impl", 
> but another exception occurred, and it seems org.apache.hadoop.fs.s3.S3FileSystem 
> fits hadoop 2.6.
>  
> Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-303) Avro schema case sensitivity testing

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-303:


Assignee: Udit Mehrotra

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case sensitive column names.
> Couple of action items:
>  * Test with different field names just differing in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we are 
> converting Avro Schema field names to lower case, to be able to verify them 
> against column names from Hive. We can consider removing the *lowercase* 
> conversion there if we verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-307) Dataframe written with Date,Timestamp, Decimal is read with same types

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-307:


Assignee: Udit Mehrotra

> Dataframe written with Date,Timestamp, Decimal is read with same types
> --
>
> Key: HUDI-307
> URL: https://issues.apache.org/jira/browse/HUDI-307
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Cosmin Iordache
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Small test for a COW table to check the persistence of Date, Timestamp, and 
> Decimal types
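
A minimal sketch (illustrative names and paths; option values are assumptions, 
not a definitive test) of the round-trip check this describes:
{code}
import java.sql.{Date, Timestamp}
import spark.implicits._

val df = Seq((1, Date.valueOf("2020-01-01"), new Timestamp(0L), BigDecimal("12.34")))
  .toDF("id", "d", "ts", "dec")
df.write.format("org.apache.hudi")
  .option("hoodie.table.name", "type_check")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "id")
  .option("hoodie.datasource.write.partitionpath.field", "d")
  .mode("overwrite")
  .save("/tmp/hudi/type_check")
// The schema read back should still report DateType, TimestampType and DecimalType.
spark.read.format("org.apache.hudi").load("/tmp/hudi/type_check/*").printSchema()
{code}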



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-110) Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-110:


Assignee: Bhavani Sudha Saktheeswaran

> Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently
> SlashEncodedDayPartitionValueExtractor is the default being used. This is not 
> a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated for the Hudi Data Source.  We need to investigate how we can leverage 
> the partitionBy clause for partitioning.
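For reference, a minimal sketch of the built-in DataSource partitionBy clause on
a plain parquet write; how to map this onto Hudi's partition-path extraction is
exactly the open question:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionBySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("partitionBy-sketch").master("local[1]").getOrCreate();
    Dataset<Row> df = spark.sql("select 1 as id, '2020-05-15' as ds");

    // Built-in behavior: one output directory per distinct value of ds.
    df.write().partitionBy("ds").mode(SaveMode.Overwrite)
        .parquet("/tmp/partition_by_sketch");
    spark.stop();
  }
}
{code}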



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-15 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108204#comment-17108204
 ] 

sivabalan narayanan commented on HUDI-494:
--

Assigning the ticket to lamber-ken. But Gary Li: feel free to assign it to 
yourself if you want to work on it. 

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.5.3
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with a parallelism of 10. 
> I am seeing a huge amount of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-473) IllegalArgumentException in QuickstartUtils

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-473:


Assignee: Bhavani Sudha Saktheeswaran

> IllegalArgumentException in QuickstartUtils 
> 
>
> Key: HUDI-473
> URL: https://issues.apache.org/jira/browse/HUDI-473
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: zhangpu
>Assignee: Bhavani Sudha Saktheeswaran
>Priority: Minor
>  Labels: bug-bash-0.6.0, starter
>
>  First call dataGen.generateInserts to write the data. Then another process 
> calls dataGen.generateUpdates, which throws the following exception:
> Exception in thread "main" java.lang.IllegalArgumentException: bound must be 
> positive
>   at java.util.Random.nextInt(Random.java:388)
>   at 
> org.apache.hudi.QuickstartUtils$DataGenerator.generateUpdates(QuickstartUtils.java:163)
> Is the design reasonable?
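For context, java.util.Random.nextInt(bound) throws exactly this exception when
bound <= 0, which is presumably what happens when generateUpdates runs in a
process that has not generated any inserts and so has an empty in-memory key
pool. A minimal sketch of the failure mode and a hypothetical guard:
{code:java}
import java.util.Random;

public class BoundGuardSketch {
  public static void main(String[] args) {
    int numExistingKeys = 0; // a fresh process has seen no generateInserts keys
    Random rand = new Random();

    // rand.nextInt(numExistingKeys) would throw
    // "IllegalArgumentException: bound must be positive" here.
    if (numExistingKeys <= 0) {
      // hypothetical guard: fail with an actionable message instead
      throw new IllegalStateException(
          "generateUpdates requires prior generateInserts in the same process");
    }
    System.out.println(rand.nextInt(numExistingKeys));
  }
}
{code}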



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-494:


Assignee: lamber-ken  (was: Yanjia Gary Li)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.5.3
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million. Both with a parallelism of 10. 
> I am seeing a huge amount of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-723:


Assignee: hong dongdong

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: hong dongdong
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if the SchemaProvider has a valid target schema. If 
> one wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-867) Graphite metrics are throwing IllegalArgumentException on continuous mode

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-867:


Assignee: Raymond Xu

> Graphite metrics are throwing IllegalArgumentException on continuous mode
> -
>
> Key: HUDI-867
> URL: https://issues.apache.org/jira/browse/HUDI-867
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: João Esteves
>Assignee: Raymond Xu
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> Hello everyone, I am trying to extract Graphite metrics from Hudi using a 
> Spark Streaming process, but the method that sends metrics is throwing 
> java.lang.IllegalArgumentException after the first microbatch, like this:
> {code:java}
> 20/05/06 11:49:25 ERROR Metrics: Failed to send metrics: 
> java.lang.IllegalArgumentException: A metric named 
> kafka_hudi.finalize.duration already exists
>   at 
> org.apache.hudi.com.codahale.metrics.MetricRegistry.register(MetricRegistry.java:97)
>   at org.apache.hudi.metrics.Metrics.registerGauge(Metrics.java:83)
>   at 
> org.apache.hudi.metrics.HoodieMetrics.updateFinalizeWriteMetrics(HoodieMetrics.java:177)
>   at 
> org.apache.hudi.HoodieWriteClient.lambda$finalizeWrite$14(HoodieWriteClient.java:1233)
>   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
>   at 
> org.apache.hudi.HoodieWriteClient.finalizeWrite(HoodieWriteClient.java:1231)
>   at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:497)
>   at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:479)
>   at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:470)
>   at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:152)
>   at 
> org.apache.hudi.HoodieStreamingSink$$anonfun$1$$anonfun$2.apply(HoodieStreamingSink.scala:51)
>   at 
> org.apache.hudi.HoodieStreamingSink$$anonfun$1$$anonfun$2.apply(HoodieStreamingSink.scala:51)
>   at scala.util.Try$.apply(Try.scala:192)
>   at 
> org.apache.hudi.HoodieStreamingSink$$anonfun$1.apply(HoodieStreamingSink.scala:50)
>   at 
> org.apache.hudi.HoodieStreamingSink$$anonfun$1.apply(HoodieStreamingSink.scala:50)
>   at 
> org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:114)
>   at 
> org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:49)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
>   at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>   at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>   at 
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
>   at 
> 
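The root of the trace is MetricRegistry.register refusing a duplicate gauge name
on the second micro-batch. A minimal get-or-reuse workaround sketch (plain
codahale-metrics imports here; Hudi shades the library under
org.apache.hudi.com.codahale, as the trace shows):
{code:java}
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;

public class IdempotentGaugeSketch {

  // register(name, metric) throws IllegalArgumentException on a duplicate
  // name, so look the gauge up first and reuse it across micro-batches.
  static Gauge<Long> getOrRegister(MetricRegistry registry, String name) {
    @SuppressWarnings("unchecked")
    Gauge<Long> existing = (Gauge<Long>) registry.getGauges().get(name);
    return existing != null ? existing : registry.register(name, (Gauge<Long>) () -> 0L);
  }

  public static void main(String[] args) {
    MetricRegistry registry = new MetricRegistry();
    getOrRegister(registry, "kafka_hudi.finalize.duration");
    getOrRegister(registry, "kafka_hudi.finalize.duration"); // no exception now
    System.out.println(registry.getGauges().keySet());
  }
}
{code}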

[jira] [Assigned] (HUDI-395) hudi does not support scheme s3n when writing to S3

2020-05-15 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-395:


Assignee: Raymond Xu  (was: leesf)

> hudi does not support scheme s3n when writing to S3
> ---
>
> Key: HUDI-395
> URL: https://issues.apache.org/jira/browse/HUDI-395
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie, Spark Integration, Usability
> Environment: spark-2.4.4-bin-hadoop2.7
>Reporter: rui feng
>Assignee: Raymond Xu
>Priority: Major
>  Labels: bug-bash-0.6.0
>
> When I use Hudi to create a hudi table and then write to S3, I used the below 
> maven snippet, which is recommended by [https://hudi.apache.org/s3_hoodie.html]
> <dependency>
>   <groupId>org.apache.hudi</groupId>
>   <artifactId>hudi-spark-bundle</artifactId>
>   <version>0.5.0-incubating</version>
> </dependency>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>2.7.3</version>
> </dependency>
> <dependency>
>   <groupId>com.amazonaws</groupId>
>   <artifactId>aws-java-sdk</artifactId>
>   <version>1.10.34</version>
> </dependency>
> and add the below configuration:
> sc.hadoopConfiguration.set("fs.defaultFS", "s3://niketest1")
>  sc.hadoopConfiguration.set("fs.s3.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "x")
>  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "xx")
>  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "x")
>  
> my spark version is spark-2.4.4-bin-hadoop2.7 and when I run below
> {color:#FF}df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Overwrite).save(hudiTablePath).{color}
> val hudiOptions = Map[String,String](
>  HoodieWriteConfig.TABLE_NAME -> "hudi12",
>  DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
>  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "rider",
>  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
> val hudiTablePath = "s3://niketest1/hudi_test/hudi12"
> the exception occur:
> {color:#FF}java.lang.IllegalArgumentException: 
> BlockAlignedAvroParquetWriter does not support scheme s3n{color}
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.getHoodieScheme(HoodieWrapperFileSystem.java:109)
>  at 
> org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.convertToHoodiePath(HoodieWrapperFileSystem.java:85)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.(HoodieParquetWriter.java:57)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.newParquetStorageWriter(HoodieStorageWriterFactory.java:60)
>  at 
> org.apache.hudi.io.storage.HoodieStorageWriterFactory.getStorageWriter(HoodieStorageWriterFactory.java:44)
>  at org.apache.hudi.io.HoodieCreateHandle.(HoodieCreateHandle.java:70)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:137)
>  at 
> org.apache.hudi.func.CopyOnWriteLazyInsertIterable$CopyOnWriteInsertHandler.consumeOneRecord(CopyOnWriteLazyInsertIterable.java:125)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:38)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:120)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  
>  
> Can anyone tell me what causes this exception? I tried to use 
> org.apache.hadoop.fs.s3.S3FileSystem to replace 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem for the conf "fs.s3.impl", 
> but another exception occurred and it seems org.apache.hadoop.fs.s3.S3FileSystem 
> fits hadoop 2.6.
>  
> Thanks in advance.
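One direction sometimes suggested for this class of scheme errors (an assumption
here, not something confirmed in this thread) is to move from the s3n connector
to s3a and use an s3a:// base path. A minimal sketch of the equivalent
configuration:
{code:java}
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class S3aConfigSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("hudi-s3a-sketch").setMaster("local[1]"));

    // s3a is the hadoop-aws successor to s3n; whether this Hudi version
    // accepts the "s3a" scheme in HoodieWrapperFileSystem is an assumption
    // to verify against the release you run.
    sc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    sc.hadoopConfiguration().set("fs.s3a.access.key", "xx");
    sc.hadoopConfiguration().set("fs.s3a.secret.key", "x");

    String hudiTablePath = "s3a://niketest1/hudi_test/hudi12"; // s3a:// instead of s3n://
    System.out.println("write the Hudi table to " + hudiTablePath);
    sc.stop();
  }
}
{code}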



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] umehrot2 commented on pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-15 Thread GitBox


umehrot2 commented on pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596#issuecomment-629173116


   @rolandjohann The fix makes sense to me. Let's add a test for decimal type 
handling, and make it nested within another type as well. Recently we have 
tried to exhaustively test the handling of as many data types as possible, by 
adding fields with those data types to the schema for the test data that Hudi 
uses across various tests: 
https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java#L95
 .  We recently added **Map** and **Array** type tests. What you can do is add 
a **decimal** field to the trip record schema as a nested type. In fact, an 
easy solution would be to change the type of **amount** in **FARE_NESTED_SCHEMA** 
from **double** to **decimal**. **double** is in any case tested through other 
fields, so this would accurately test a **decimal** field in a **nested** schema.
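   A hedged sketch of that suggestion (the fare field names follow the comment
above; the bytes encoding and precision/scale are illustrative assumptions):
   ```java
   import org.apache.avro.Schema;

   public class FareDecimalSchemaSketch {
     // "amount" as an Avro decimal logical type instead of double, keeping
     // "currency" as-is, so decimal is exercised inside a nested record.
     public static final String FARE_NESTED_SCHEMA =
         "{\"type\": \"record\", \"name\": \"fare\", \"fields\": ["
             + "{\"name\": \"amount\", \"type\": {\"type\": \"bytes\","
             + " \"logicalType\": \"decimal\", \"precision\": 10, \"scale\": 2}},"
             + "{\"name\": \"currency\", \"type\": \"string\"}]}";

     public static void main(String[] args) {
       // Parsing validates the logical type annotation.
       System.out.println(new Schema.Parser().parse(FARE_NESTED_SCHEMA));
     }
   }
   ```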



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] rolandjohann edited a comment on pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-15 Thread GitBox


rolandjohann edited a comment on pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596#issuecomment-629170944


   @vinothchandar @umehrot2 is right: only when the field is not at top level.
   This happened because the avro schema has been passed to each recursion of 
the method, but without selecting the actual field whose schema must be passed 
to the recursion.
   
   It seems that no one is using hudi with decimals in nested structs 
currently - or at least no one reports it ;)
   
   EDIT: @vinothchandar that's great. Time is currently limited, I try to 
implement a rudimentary test over the weekend. Will keep you updated



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] rolandjohann commented on pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-15 Thread GitBox


rolandjohann commented on pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596#issuecomment-629170944


   @vinothchandar @umehrot2 is right: only when the field is not at top level.
   This happened because the avro schema has been passed to each recursion of 
the method, but without selecting the actual field whose schema must be passed 
to the recursion.
   
   It seems that no one is using hudi with decimals in nested structs 
currently - or at least no one reports it ;)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] umehrot2 commented on pull request #1596: [HUDI-863] get decimal properties from derived spark DataType

2020-05-15 Thread GitBox


umehrot2 commented on pull request #1596:
URL: https://github.com/apache/incubator-hudi/pull/1596#issuecomment-629169708


   > LGTM overall.. If you can throw in a test, like you mentioned, that'd be 
great.
   > 
   > Also trying to understand the scope of the issue.. without this, does 
every decimal type conversion fail?
   
   @vinothchandar not every **decimal** conversion fails. As I understand 
from this PR, the NPE would occur when a **decimal field** is not a **top level 
field** in the avro schema, but nested within another type. @rolandjohann is 
this understanding correct ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1566: [HUDI-603]: DeltaStreamer can now fetch schema before every run in continuous mode

2020-05-15 Thread GitBox


bvaradar commented on pull request #1566:
URL: https://github.com/apache/incubator-hudi/pull/1566#issuecomment-629167984


   @pratyakshsharma : I updated this PR to address comments in the interest of 
reducing the review cycle time. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] rolandjohann commented on pull request #1622: [HUDI-888] fix NullPointerException

2020-05-15 Thread GitBox


rolandjohann commented on pull request #1622:
URL: https://github.com/apache/incubator-hudi/pull/1622#issuecomment-629165710


   Is it possible that this is a test infrastructure related issue?
   ```
   [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
7.416 s <<< FAILURE! - in 
org.apache.hudi.cli.commands.TestArchivedCommitsCommand
   [ERROR] 
org.apache.hudi.cli.commands.TestArchivedCommitsCommand.testShowArchivedCommits 
 Time elapsed: 2.116 s  <<< ERROR!
   java.net.BindException: Problem binding to [localhost:42364] 
java.net.BindException: Address already in use; For more details see:  
http://wiki.apache.org/hadoop/BindException
at 
org.apache.hudi.cli.commands.TestArchivedCommitsCommand.init(TestArchivedCommitsCommand.java:58)
   Caused by: java.net.BindException: Address already in use
at 
org.apache.hudi.cli.commands.TestArchivedCommitsCommand.init(TestArchivedCommitsCommand.java:58)
   ```
   I don't see anything related to the changes introduced by this PR. Can we 
retry the failed Travis job?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] yanghua commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-15 Thread GitBox


yanghua commented on pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-629096996


   @n3nash conflicting files.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] rolandjohann commented on issue #1625: [SUPPORT] MOR upsert table grows in size when ingesting same records

2020-05-15 Thread GitBox


rolandjohann commented on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629095378


   @bvaradar 
   After 15 runs the filesystem looks like this:
   ```bash
   $ tree -a /tmp/visitors_hudi_mor/
   
   /tmp/visitors_hudi_mor/
   ├── .hoodie
   │   ├── .20200514221320.clean.crc
   │   ├── .20200514221320.clean.inflight.crc
   │   ├── .20200514221320.clean.requested.crc
   │   ├── .20200514221407.clean.crc
   │   ├── .20200514221407.clean.inflight.crc
   │   ├── .20200514221407.clean.requested.crc
   │   ├── .20200514221449.clean.crc
   │   ├── .20200514221449.clean.inflight.crc
   │   ├── .20200514221449.clean.requested.crc
   │   ├── .20200514221539.clean.crc
   │   ├── .20200514221539.clean.inflight.crc
   │   ├── .20200514221539.clean.requested.crc
   │   ├── .20200514221623.clean.crc
   │   ├── .20200514221623.clean.inflight.crc
   │   ├── .20200514221623.clean.requested.crc
   │   ├── .20200514221623.deltacommit.crc
   │   ├── .20200514221623.deltacommit.inflight.crc
   │   ├── .20200514221623.deltacommit.requested.crc
   │   ├── .20200514221648.commit.crc
   │   ├── .20200514221648.compaction.inflight.crc
   │   ├── .20200514221648.compaction.requested.crc
   │   ├── .20200514221714.clean.crc
   │   ├── .20200514221714.clean.inflight.crc
   │   ├── .20200514221714.clean.requested.crc
   │   ├── .20200514221714.deltacommit.crc
   │   ├── .20200514221714.deltacommit.inflight.crc
   │   ├── .20200514221714.deltacommit.requested.crc
   │   ├── .20200514221759.clean.crc
   │   ├── .20200514221759.clean.inflight.crc
   │   ├── .20200514221759.clean.requested.crc
   │   ├── .20200514221759.deltacommit.crc
   │   ├── .20200514221759.deltacommit.inflight.crc
   │   ├── .20200514221759.deltacommit.requested.crc
   │   ├── .20200514221829.commit.crc
   │   ├── .20200514221829.compaction.inflight.crc
   │   ├── .20200514221829.compaction.requested.crc
   │   ├── .20200514221902.clean.crc
   │   ├── .20200514221902.clean.inflight.crc
   │   ├── .20200514221902.clean.requested.crc
   │   ├── .20200514221902.deltacommit.crc
   │   ├── .20200514221902.deltacommit.inflight.crc
   │   ├── .20200514221902.deltacommit.requested.crc
   │   ├── .20200514221947.clean.crc
   │   ├── .20200514221947.clean.inflight.crc
   │   ├── .20200514221947.clean.requested.crc
   │   ├── .20200514221947.deltacommit.crc
   │   ├── .20200514221947.deltacommit.inflight.crc
   │   ├── .20200514221947.deltacommit.requested.crc
   │   ├── .20200514222010.commit.crc
   │   ├── .20200514222010.compaction.inflight.crc
   │   ├── .20200514222010.compaction.requested.crc
   │   ├── .20200514222036.clean.crc
   │   ├── .20200514222036.clean.inflight.crc
   │   ├── .20200514222036.clean.requested.crc
   │   ├── .20200514222036.deltacommit.crc
   │   ├── .20200514222036.deltacommit.inflight.crc
   │   ├── .20200514222036.deltacommit.requested.crc
   │   ├── .20200514222122.clean.crc
   │   ├── .20200514222122.clean.inflight.crc
   │   ├── .20200514222122.clean.requested.crc
   │   ├── .20200514222122.deltacommit.crc
   │   ├── .20200514222122.deltacommit.inflight.crc
   │   ├── .20200514222122.deltacommit.requested.crc
   │   ├── .20200514222145.commit.crc
   │   ├── .20200514222145.compaction.inflight.crc
   │   ├── .20200514222145.compaction.requested.crc
   │   ├── .20200515094100.clean.crc
   │   ├── .20200515094100.clean.inflight.crc
   │   ├── .20200515094100.clean.requested.crc
   │   ├── .20200515094100.deltacommit.crc
   │   ├── .20200515094100.deltacommit.inflight.crc
   │   ├── .20200515094100.deltacommit.requested.crc
   │   ├── .20200515094159.clean.crc
   │   ├── .20200515094159.clean.inflight.crc
   │   ├── .20200515094159.clean.requested.crc
   │   ├── .20200515094159.deltacommit.crc
   │   ├── .20200515094159.deltacommit.inflight.crc
   │   ├── .20200515094159.deltacommit.requested.crc
   │   ├── .20200515094227.commit.crc
   │   ├── .20200515094227.compaction.inflight.crc
   │   ├── .20200515094227.compaction.requested.crc
   │   ├── .20200515094301.clean.crc
   │   ├── .20200515094301.clean.inflight.crc
   │   ├── .20200515094301.clean.requested.crc
   │   ├── .20200515094301.deltacommit.crc
   │   ├── .20200515094301.deltacommit.inflight.crc
   │   ├── .20200515094301.deltacommit.requested.crc
   │   ├── .20200515094401.clean.crc
   │   ├── .20200515094401.clean.inflight.crc
   │   ├── .20200515094401.clean.requested.crc
   │   ├── .20200515094401.deltacommit.crc
   │   ├── .20200515094401.deltacommit.inflight.crc
   │   ├── .20200515094401.deltacommit.requested.crc
   │   ├── .20200515094431.commit.crc
   │   ├── .20200515094431.compaction.inflight.crc
   │   ├── .20200515094431.compaction.requested.crc
   │   ├── .20200515094508.clean.crc
   │   ├── .20200515094508.clean.inflight.crc
   │   ├── .20200515094508.clean.requested.crc
   │   ├── .20200515094508.deltacommit.crc
   │   ├── 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar commented on a change in pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#discussion_r425626686



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -460,8 +471,17 @@ private void syncHive() {
* this constraint.
*/
   public void setupWriteClient() {
+setupWriteClient(schemaProvider, false);
+  }
+
+  /**
+   * Note that depending on configs and source-type, schemaProvider could 
either be eagerly or lazily created.
+   * SchemaProvider creation is a precursor to HoodieWriteClient and 
AsyncCompactor creation. This method takes care of
+   * this constraint.
+   */
+  private void setupWriteClient(SchemaProvider schemaProvider, boolean 
forceRecreate) {
 LOG.info("Setting up Hoodie Write Client");
-if ((null != schemaProvider) && (null == writeClient)) {
+if (forceRecreate || (null != schemaProvider) && (null == writeClient)) {

Review comment:
   Actually, I just realized HoodieWriteConfig schema also needs to be 
updated in this case. Trying to see if we can have a consistent point of 
creating WriteClient and SchemaProvider 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on pull request #1151: [HUDI-476] Add hudi-examples module

2020-05-15 Thread GitBox


lamber-ken commented on pull request #1151:
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-629076102


   > > Can you confirm if you have run these examples locally once and verified 
the instructions work?
   > 
   > @vinothchandar , I ran these examples locally and ensured they do work, 
but haven't tried them in a yarn-cluster mode.
   
   Will check, thanks  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-15 Thread GitBox


n3nash commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r425613654



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/DeltaWriterFactory.java
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite;
+
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.testsuite.configuration.DFSDeltaConfig;
+import org.apache.hudi.testsuite.configuration.DeltaConfig;
+import org.apache.hudi.testsuite.writer.AvroDeltaInputWriter;
+import org.apache.hudi.testsuite.writer.FileDeltaInputWriter;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+
+/**
+ * A factory to help instantiate different {@link DeltaWriterAdapter}s 
depending on the {@link DeltaOutputType} and
+ * {@link DeltaInputFormat}.
+ */
+public class DeltaWriterFactory {
+
+  private DeltaWriterFactory() {
+  }
+
+  public static DeltaWriterAdapter getDeltaWriterAdapter(DeltaConfig config, 
Integer batchId) throws IOException {
+switch (config.getDeltaOutputType()) {
+  case DFS:
+switch (config.getDeltaInputFormat()) {
+  case AVRO:
+DFSDeltaConfig dfsDeltaConfig = (DFSDeltaConfig) config;
+dfsDeltaConfig.setBatchId(batchId);
+        FileDeltaInputWriter<GenericRecord> fileDeltaInputGenerator = new AvroDeltaInputWriter(
+            dfsDeltaConfig.getConfiguration(),
+            StringUtils.join(new String[]{dfsDeltaConfig.getDeltaBasePath(),
+                dfsDeltaConfig.getBatchId().toString()}, "/"),
+            dfsDeltaConfig.getSchemaStr(), dfsDeltaConfig.getMaxFileSize());
+        DFSDeltaWriterAdapter workloadSink = new DFSDeltaWriterAdapter(fileDeltaInputGenerator);

Review comment:
   ack





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-15 Thread GitBox


n3nash commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r425613491



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/dag/nodes/InsertNode.java
##
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.dag.nodes;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.testsuite.configuration.DeltaConfig.Config;
+import org.apache.hudi.testsuite.dag.ExecutionContext;
+import org.apache.hudi.testsuite.generator.DeltaGenerator;
+import org.apache.hudi.testsuite.writer.DeltaWriter;
+
+import org.apache.spark.api.java.JavaRDD;
+
+public class InsertNode extends DagNode<JavaRDD<WriteStatus>> {
+
+  public InsertNode(Config config) {
+this.config = config;
+  }
+
+  @Override
+  public void execute(ExecutionContext executionContext) throws Exception {
+generate(executionContext.getDeltaGenerator());
+log.info("Configs => " + this.config);
+if (!config.isDisableIngest()) {
+  log.info(String.format("- inserting input data %s --", this.getName()));
+  Option<String> commitTime = executionContext.getDeltaWriter().startCommit();
+  JavaRDD<WriteStatus> writeStatus = ingest(executionContext.getDeltaWriter(), commitTime);
+  executionContext.getDeltaWriter().commit(writeStatus, commitTime);
+  this.result = writeStatus;
+}
+validate();

Review comment:
   Remnant code from before, removed now.

##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/job/HoodieTestSuiteJob.java
##
@@ -0,0 +1,188 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.job;
+
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.keygen.KeyGenerator;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.testsuite.DeltaInputFormat;
+import org.apache.hudi.testsuite.DeltaOutputType;
+import org.apache.hudi.testsuite.configuration.DFSDeltaConfig;
+import org.apache.hudi.testsuite.dag.DagUtils;
+import org.apache.hudi.testsuite.dag.WorkflowDag;
+import org.apache.hudi.testsuite.dag.WorkflowDagGenerator;
+import org.apache.hudi.testsuite.dag.scheduler.DagScheduler;
+import org.apache.hudi.testsuite.generator.DeltaGenerator;
+import org.apache.hudi.testsuite.writer.DeltaWriter;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SparkSession;
+
+import java.io.IOException;
+
+/**
+ * This is the entry point for running a Hudi Test Suite. Although 

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-15 Thread GitBox


n3nash commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r425613413



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/converter/UpdateConverter.java
##
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.converter;
+
+import org.apache.hudi.testsuite.generator.LazyRecordGeneratorIterator;
+import org.apache.hudi.testsuite.generator.UpdateGeneratorIterator;
+import org.apache.hudi.utilities.converter.Converter;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.List;
+
+/**
+ * This converter creates an update {@link GenericRecord} from an existing 
{@link GenericRecord}.
+ */
+public class UpdateConverter implements Converter {

Review comment:
   This has been lying around before we removed it, I've refactored this 
code.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar commented on a change in pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#discussion_r425603474



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -460,8 +471,17 @@ private void syncHive() {
* this constraint.
*/
   public void setupWriteClient() {
+setupWriteClient(schemaProvider, false);
+  }
+
+  /**
+   * Note that depending on configs and source-type, schemaProvider could 
either be eagerly or lazily created.
+   * SchemaProvider creation is a precursor to HoodieWriteClient and 
AsyncCompactor creation. This method takes care of
+   * this constraint.
+   */
+  private void setupWriteClient(SchemaProvider schemaProvider, boolean 
forceRecreate) {
 LOG.info("Setting up Hoodie Write Client");
-if ((null != schemaProvider) && (null == writeClient)) {
+if (forceRecreate || (null != schemaProvider) && (null == writeClient)) {

Review comment:
   I agree. We should just do schema re-registration. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1518: [HUDI-723] Register avro schema if infered from SQL transformation

2020-05-15 Thread GitBox


bvaradar commented on a change in pull request #1518:
URL: https://github.com/apache/incubator-hudi/pull/1518#discussion_r425603474



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
##
@@ -460,8 +471,17 @@ private void syncHive() {
* this constraint.
*/
   public void setupWriteClient() {
+setupWriteClient(schemaProvider, false);
+  }
+
+  /**
+   * Note that depending on configs and source-type, schemaProvider could 
either be eagerly or lazily created.
+   * SchemaProvider creation is a precursor to HoodieWriteClient and 
AsyncCompactor creation. This method takes care of
+   * this constraint.
+   */
+  private void setupWriteClient(SchemaProvider schemaProvider, boolean 
forceRecreate) {
 LOG.info("Setting up Hoodie Write Client");
-if ((null != schemaProvider) && (null == writeClient)) {
+if (forceRecreate || (null != schemaProvider) && (null == writeClient)) {

Review comment:
   I agree. We should just do schema re-registration alone and not recreate 
client.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-15 Thread GitBox


n3nash commented on a change in pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r425602987



##
File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/DFSDeltaWriterAdapter.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite;
+
+import org.apache.hudi.testsuite.writer.FileDeltaInputWriter;
+import org.apache.hudi.testsuite.writer.WriteStats;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+/**
+ * {@link org.apache.hadoop.hdfs.DistributedFileSystem} (or {@link 
org.apache.hadoop.fs.LocalFileSystem}) based delta
+ * generator.
+ */
+public class DFSDeltaWriterAdapter implements DeltaWriterAdapter<GenericRecord> {
+
+  private FileDeltaInputWriter<GenericRecord> deltaInputGenerator;
+  private List<WriteStats> metrics = new ArrayList<>();
+
+  public DFSDeltaWriterAdapter(FileDeltaInputWriter<GenericRecord> deltaInputGenerator) {
+    this.deltaInputGenerator = deltaInputGenerator;
+  }
+
+  @Override
+  public List<WriteStats> write(Iterator<GenericRecord> input) throws IOException {
+deltaInputGenerator.open();
+while (input.hasNext()) {
+  if (this.deltaInputGenerator.canWrite()) {
+this.deltaInputGenerator.writeData(input.next());
+  } else if (input.hasNext()) {
+rollOver();
+  }
+}
+close();
+return this.metrics;
+  }
+
+  public void rollOver() throws IOException {
+close();

Review comment:
   good idea





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1611: [HUDI-705]Add unit test for RollbacksCommand

2020-05-15 Thread GitBox


yanghua commented on a change in pull request #1611:
URL: https://github.com/apache/incubator-hudi/pull/1611#discussion_r425584108



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRollbacksCommand.java
##
@@ -0,0 +1,189 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieIndexConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.index.HoodieIndex;
+
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.stream.Stream;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+
+/**
+ * Test class for {@link org.apache.hudi.cli.commands.RollbacksCommand}.
+ */
+public class TestRollbacksCommand extends AbstractShellIntegrationTest {
+
+  @BeforeEach
+  public void init() throws IOException {
+String tableName = "test_table";
+String tablePath = basePath + File.separator + tableName;
+new TableCommand().createTable(
+tablePath, tableName, HoodieTableType.MERGE_ON_READ.name(),
+"", TimelineLayoutVersion.VERSION_1, 
"org.apache.hudi.common.model.HoodieAvroPayload");
+
+//Create some commits files and parquet files
+String commitTime1 = "100";
+String commitTime2 = "101";
+String commitTime3 = "102";
+HoodieTestDataGenerator.writePartitionMetadata(fs, 
HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS, tablePath);
+
+// two commit files
+HoodieTestUtils.createCommitFiles(tablePath, commitTime1, commitTime2);
+// one .inflight commit file
+HoodieTestUtils.createInflightCommitFiles(tablePath, commitTime3);
+
+// generate commit1 files
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH, commitTime1, "file-1-1");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH, commitTime1, "file-1-2");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH, commitTime1, "file-1-3");
+
+// generate commit2 files
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH, commitTime2, "file-2-1");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH, commitTime2, "file-2-2");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH, commitTime2, "file-2-3");
+
+// generate commit3 files
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH, commitTime3, "file-3-1");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH, commitTime3, "file-3-2");
+HoodieTestUtils.createDataFile(tablePath, 
HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH, commitTime3, "file-3-3");

Review comment:
   Can we introduce a `for` loop to simplify this code snippet?
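   A sketch of the loop shape being suggested, reusing the names from the
quoted snippet (the file-%d-%d numbering mirrors the literals above):
   ```java
   String[] commitTimes = {commitTime1, commitTime2, commitTime3};
   String[] partitionPaths = {
       HoodieTestDataGenerator.DEFAULT_FIRST_PARTITION_PATH,
       HoodieTestDataGenerator.DEFAULT_SECOND_PARTITION_PATH,
       HoodieTestDataGenerator.DEFAULT_THIRD_PARTITION_PATH};
   for (int c = 0; c < commitTimes.length; c++) {
     for (int p = 0; p < partitionPaths.length; p++) {
       // reproduces the file-<commit>-<partition> names written out above
       HoodieTestUtils.createDataFile(tablePath, partitionPaths[p], commitTimes[c],
           String.format("file-%d-%d", c + 1, p + 1));
     }
   }
   ```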




