[GitHub] [incubator-hudi] yihua commented on a change in pull request #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-02-25 Thread GitBox
yihua commented on a change in pull request #1165: [HUDI-76] Add CSV Source 
support for Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#discussion_r384313736
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -74,20 +74,30 @@
   public static final String[] DEFAULT_PARTITION_PATHS =
   {DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH, 
DEFAULT_THIRD_PARTITION_PATH};
   public static final int DEFAULT_PARTITION_DEPTH = 3;
-  public static final String TRIP_EXAMPLE_SCHEMA = "{\"type\": \"record\"," + 
"\"name\": \"triprec\"," + "\"fields\": [ "
+  public static final String TRIP_SCHEMA_PREFIX = "{\"type\": \"record\"," + 
"\"name\": \"triprec\"," + "\"fields\": [ "
   + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
   + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
   + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
-  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
-  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
-  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
-  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},";
+  public static final String TRIP_SCHEMA_SUFFIX = "{\"name\": 
\"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false} ]}";
+  public static final String FARE_NESTED_SCHEMA = "{\"name\": 
\"fare\",\"type\": {\"type\":\"record\", \"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},";
+  public static final String FARE_FLATTENED_SCHEMA = "{\"name\": \"fare\", 
\"type\": \"double\"},"
+  + "{\"name\": \"currency\", \"type\": \"string\"},";
+
+  public static final String TRIP_EXAMPLE_SCHEMA =
+  TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
+  public static final String TRIP_FLATTENED_SCHEMA =
 
 Review comment:
   Yes, nested schemas are not well supported in the CSV format. So to test the 
CSV source, we need to generate the test CSV data with a flattened schema.
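
   For reference, the two schema variants are composed from these constants. The 
TRIP_FLATTENED_SCHEMA assignment is cut off at the end of the hunk above, so the 
completion below is an assumption based on the new FARE_FLATTENED_SCHEMA constant:

       // Nested variant (shown in the diff): fare is a record with amount/currency sub-fields.
       public static final String TRIP_EXAMPLE_SCHEMA =
           TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
       // Flattened variant (assumed completion): fare and currency become top-level
       // columns, which CSV can represent directly.
       public static final String TRIP_FLATTENED_SCHEMA =
           TRIP_SCHEMA_PREFIX + FARE_FLATTENED_SCHEMA + TRIP_SCHEMA_SUFFIX;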




[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-25 Thread GitBox
smarthi commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r384312049
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/compact/HoodieMergeOnReadTableCompactor.java
 ##
 @@ -18,6 +18,7 @@
 
 package org.apache.hudi.io.compact;
 
+import com.google.common.collect.Sets;
 
 Review comment:
   done
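
   For context, this PR replaces Guava's Hashing usage in NumericUtils. Below is a 
minimal sketch of a Guava-free equivalent built only on java.security.MessageDigest; 
the class and method names here are illustrative, not necessarily what the PR lands:

       import java.nio.ByteBuffer;
       import java.nio.charset.StandardCharsets;
       import java.security.MessageDigest;
       import java.security.NoSuchAlgorithmException;

       public final class GuavaFreeHashing {
         // Hashes a string with a JDK digest algorithm (e.g. "MD5") and returns the
         // first 8 bytes of the digest as a long, similar in spirit to Guava's
         // Hashing.md5().hashString(...).asLong().
         public static long getMessageDigestHash(String algorithmName, String input) {
           try {
             MessageDigest md = MessageDigest.getInstance(algorithmName);
             byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
             // An MD5 digest is 16 bytes, so reading the first 8 bytes is safe.
             return ByteBuffer.wrap(digest).getLong();
           } catch (NoSuchAlgorithmException e) {
             throw new IllegalArgumentException("Unknown digest algorithm: " + algorithmName, e);
           }
         }
       }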




[GitHub] [incubator-hudi] yanghua commented on issue #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-25 Thread GitBox
yanghua commented on issue #1350: [HUDI-629]: Replace Guava's Hashing with an 
equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#issuecomment-591268945
 
 
   @leesf Would you please check it again?




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-25 Thread GitBox
yanghua commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r384297013
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/compact/HoodieMergeOnReadTableCompactor.java
 ##
 @@ -18,6 +18,7 @@
 
 package org.apache.hudi.io.compact;
 
+import com.google.common.collect.Sets;
 
 Review comment:
   Can we put this back in its original place?




[GitHub] [incubator-hudi] vinothchandar merged pull request #1353: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-25 Thread GitBox
vinothchandar merged pull request #1353: [HUDI-632] Update documentation 
(docker_demo) to mention both commit and deltacommit files
URL: https://github.com/apache/incubator-hudi/pull/1353
 
 
   




[incubator-hudi] branch asf-site updated: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files (#1353)

2020-02-25 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new ab25eed  [HUDI-632] Update documentation (docker_demo) to mention both 
commit and deltacommit files (#1353)
ab25eed is described below

commit ab25eed4e3eb577e8aca063756c8393980172a5b
Author: Vikrant Goel 
AuthorDate: Tue Feb 25 21:39:36 2020 -0800

[HUDI-632] Update documentation (docker_demo) to mention both commit and 
deltacommit files (#1353)

* Update docs to specify commit/deltacommit variance for clarity

* for docker_demo updates, revert the changes on the html and apply the changes 
to the md file instead
---
 docs/_docs/0_4_docker_demo.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_docs/0_4_docker_demo.md b/docs/_docs/0_4_docker_demo.md
index 3033371..88ead1b 100644
--- a/docs/_docs/0_4_docker_demo.md
+++ b/docs/_docs/0_4_docker_demo.md
@@ -178,7 +178,7 @@ exit
 You can use HDFS web-browser to look at the tables
 `http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow`.
 
-You can explore the new partition folder created in the table along with a 
"deltacommit"
+You can explore the new partition folder created in the table along with a 
"commit" / "deltacommit"
 file under .hoodie which signals a successful commit.
 
 There will be a similar setup when you browse the MOR table



[GitHub] [incubator-hudi] vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-25 Thread GitBox
vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-591243675
 
 
   Clo




[GitHub] [incubator-hudi] vinothchandar closed pull request #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-25 Thread GitBox
vinothchandar closed pull request #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351
 
 
   




[jira] [Updated] (HUDI-581) NOTICE needs more work as it is missing content from included 3rd party ALv2 licensed NOTICE files

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-581:

Component/s: Release & Administrative

> NOTICE needs more work as it is missing content from included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Issues were pointed out on the general@incubator mailing list; more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> We should get it fixed before the next release.





[jira] [Updated] (HUDI-634) Document breaking changes for 0.5.2 release

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-634:

Priority: Blocker  (was: Major)

> Document breaking changes for 0.5.2 release
> ---
>
> Key: HUDI-634
> URL: https://issues.apache.org/jira/browse/HUDI-634
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.5.2
>
>
> * Write Client restructuring has moved classes around (HUDI-554) 





[jira] [Updated] (HUDI-589) Fix references to Views in some of the pages. Replace with Query instead

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-589:

Priority: Blocker  (was: Minor)

> Fix references to Views in some of the pages. Replace with Query instead
> 
>
> Key: HUDI-589
> URL: https://issues.apache.org/jira/browse/HUDI-589
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> querying_data.html still has some references to 'views'. This needs to be 
> replaced with 'queries'/'query types' appropriately.
>  
>  





[jira] [Updated] (HUDI-599) Update release guide & release scripts due to the change of scala 2.12 build

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-599:

Priority: Blocker  (was: Major)

> Update release guide & release scripts due to the change of scala 2.12 build
> 
>
> Key: HUDI-599
> URL: https://issues.apache.org/jira/browse/HUDI-599
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Release & Administrative
>Reporter: leesf
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.2
>
>
> Update release guide due to the change of scala 2.12 build, PR link below
> [https://github.com/apache/incubator-hudi/pull/1293]





[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-635:

Component/s: Performance

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> Instead of <key, HoodieRecord>, we can just track <key, record location> ... Helps 
> with use-cases like HUDI-625





Build failed in Jenkins: hudi-snapshot-deployment-0.5 #199

2020-02-25 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.27 KB...]
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] satishkotha edited a comment on issue #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
satishkotha edited a comment on issue #1355: [HUDI-633] limit archive file 
block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#issuecomment-591204425
 
 
   > @satishkotha left some comments, I'm also assuming you will address the 
sizeToRead in a later PR ?
   Yes, that one is more challenging and requires a more elaborate change. If we 
can find a way to keep these .clean files from getting too large, that would be 
ideal. @nbalajee made some suggestions too. This will likely be another change.






[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
satishkotha commented on a change in pull request #1355: [HUDI-633] limit 
archive file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384242337
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 ##
 @@ -103,6 +104,8 @@
   private static final String DEFAULT_MAX_COMMITS_TO_KEEP = "30";
   private static final String DEFAULT_MIN_COMMITS_TO_KEEP = "20";
   private static final String DEFAULT_COMMITS_ARCHIVAL_BATCH_SIZE = 
String.valueOf(10);
+  // Do not read more than 6MB at a time (6MB observed p95 in prod from 2 
datasets)
 
 Review comment:
   will do




[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
satishkotha commented on a change in pull request #1355: [HUDI-633] limit 
archive file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384242297
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 ##
 @@ -103,6 +104,8 @@
   private static final String DEFAULT_MAX_COMMITS_TO_KEEP = "30";
   private static final String DEFAULT_MIN_COMMITS_TO_KEEP = "20";
   private static final String DEFAULT_COMMITS_ARCHIVAL_BATCH_SIZE = 
String.valueOf(10);
+  // Do not read more than 6MB at a time (6MB observed p95 in prod from 2 
datasets)
+  private static final String DEFAULT_COMMITS_ARCHIVAL_MEM_SIZE = 
String.valueOf(6 * 1024 * 1024);
 
 Review comment:
   Any suggestions on how to pick this number? I just looked at 2 production 
datasets and measured the size of 10 records in the archived file. The max 
noticed is 10MB; the p95 is 6MB. Let me know if you have suggestions.




[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
satishkotha commented on a change in pull request #1355: [HUDI-633] limit 
archive file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384241881
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -245,11 +249,23 @@ public void archive(List instants) throws 
HoodieCommitException {
   Schema wrapperSchema = HoodieArchivedMetaEntry.getClassSchema();
   LOG.info("Wrapper schema " + wrapperSchema.toString());
   List records = new ArrayList<>();
+  long totalInMemSize = 0;
   for (HoodieInstant hoodieInstant : instants) {
 try {
-  records.add(convertToAvroRecord(commitTimeline, hoodieInstant));
-  if (records.size() >= this.config.getCommitArchivalBatchSize()) {
+  IndexedRecord record = convertToAvroRecord(commitTimeline, 
hoodieInstant);
+  totalInMemSize += this.sizeEstimator.sizeEstimate(record);
 
 Review comment:
   The size of clean files is very uneven. If we set 500MB as the size limit and 
sample every 4 instants, there is a possibility that 3 consecutive clean files 
of 200MB each break the memory limit we set.
   
   I did some analysis; the time taken in the worst case is 680ms for a really 
large record. For small records, it's almost negligible. (Note: the max is taken 
over multiple runs; this is based on a combination of clean and commit files in 
a production dataset.)
   
   Time-taken-for-size-estimate: 1 ms. size: 584
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 681 ms. size: 144275608
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 159 ms. size: 50727808
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 129 ms. size: 38859144
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 127 ms. size: 37873704
   Time-taken-for-size-estimate: 0 ms. size: 408
   Time-taken-for-size-estimate: 0 ms. size: 400
   Time-taken-for-size-estimate: 129 ms. size: 34290648
   Time-taken-for-size-estimate: 0 ms. size: 584
   Time-taken-for-size-estimate: 3 ms. size: 921256
   
   If you think this is a concern, we can also consider using the file size as 
an estimate. (The disadvantage is that we need to make an API call to HDFS to 
get the file size.)
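
   For readers following the thread, a simplified sketch of the flush logic under 
discussion: flush a batch when either the record count or the estimated in-memory 
byte total crosses its limit. The getCommitsArchivalMemSize() accessor name is 
hypothetical, and writeToFile(...) stands for the flush in the surrounding 
archive() method:

       List<IndexedRecord> records = new ArrayList<>();
       long totalInMemSize = 0;
       for (HoodieInstant hoodieInstant : instants) {
         IndexedRecord record = convertToAvroRecord(commitTimeline, hoodieInstant);
         // Per-record estimate; see the timing numbers above for its cost.
         totalInMemSize += sizeEstimator.sizeEstimate(record);
         records.add(record);
         if (records.size() >= config.getCommitArchivalBatchSize()
             || totalInMemSize >= config.getCommitsArchivalMemSize()) {  // hypothetical accessor
           writeToFile(wrapperSchema, records);  // flush this batch to the archive log
           records.clear();
           totalInMemSize = 0;
         }
       }
       writeToFile(wrapperSchema, records);  // flush any remainder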




[jira] [Resolved] (HUDI-580) Fix incorrect license header in files

2020-02-25 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang resolved HUDI-580.
---
Resolution: Fixed

Fixed via master branch: 159cb060f0502bb01ec4cefaa743d3678b711254

> Fix incorrect license header in files
> -
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: compliance, pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Issues were pointed out on the general@incubator mailing list; more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> We should get it fixed before the next release.





[jira] [Updated] (HUDI-629) Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-25 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-629:
---
Fix Version/s: (was: 0.5.2)
   0.6.0

> Replace Guava's Hashing with an equivalent in NumericUtils.java
> ---
>
> Key: HUDI-629
> URL: https://issues.apache.org/jira/browse/HUDI-629
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.6.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>






[jira] [Updated] (HUDI-479) Eliminate use of guava if possible

2020-02-25 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-479:
---
Fix Version/s: (was: 0.5.2)
   0.6.0

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>








[GitHub] [incubator-hudi] hddong edited a comment on issue #1242: [HUDI-544] Archived commits command code cleanup

2020-02-25 Thread GitBox
hddong edited a comment on issue #1242: [HUDI-544] Archived commits command 
code cleanup
URL: https://github.com/apache/incubator-hudi/pull/1242#issuecomment-591183992
 
 
   @n3nash @vinothchandar please review this again.




[GitHub] [incubator-hudi] satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-25 Thread GitBox
satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#issuecomment-591175742
 
 
   @smarthi could you review this when you get a chance?
   
   




[jira] [Updated] (HUDI-614) EMR Presto cannot read Hudi tables saved to S3

2020-02-25 Thread Andrew Wong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated HUDI-614:
-
Description: 
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

Code I tried: 
[https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off of 
[https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
 

In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
results in the error: Could not find partitionDepth in partition metafile. 
Querying from Spark or Hive works as expected.

This problem does not occur when using HDFS.

cc: [~bhasudha]

  was:
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

Code I tried: 
[https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off of 
[https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
 

In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
results in the error: Could not find partitionDepth in partition metafile. 
Querying from Spark or Hive works as expected.

This problem does not occur in the docker env.

cc: [~bhasudha]


> EMR Presto cannot read Hudi tables saved to S3
> --
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> Code I tried: 
> [https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off 
> of 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
>  
> In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
> results in the error: Could not find partitionDepth in partition metafile. 
> Querying from Spark or Hive works as expected.
> This problem does not occur when using HDFS.
> cc: [~bhasudha]





[jira] [Updated] (HUDI-614) EMR Presto cannot read Hudi tables saved to S3

2020-02-25 Thread Andrew Wong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated HUDI-614:
-
Summary: EMR Presto cannot read Hudi tables saved to S3  (was: EMR Presto 
cannot read Hudi tables)

> EMR Presto cannot read Hudi tables saved to S3
> --
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> Code I tried: 
> [https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off 
> of 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
>  
> In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
> results in the error: Could not find partitionDepth in partition metafile. 
> Querying from Spark or Hive works as expected.
> This problem does not occur in the docker env.
> cc: [~bhasudha]





[jira] [Updated] (HUDI-614) EMR Presto cannot read Hudi tables

2020-02-25 Thread Andrew Wong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated HUDI-614:
-
Description: 
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

Code I tried: 
[https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off of 
[https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
 

In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
results in the error: Could not find partitionDepth in partition metafile. 
Querying from Spark or Hive works as expected.

This problem does not occur in the docker env.

cc: [~bhasudha]

  was:
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

Code I tried: 
[https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off of 
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.

 

In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
results in the error: Could not find partitionDepth in partition metafile. In 
docker environment Presto works fine.

cc: [~bhasudha]


> EMR Presto cannot read Hudi tables
> --
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> Code I tried: 
> [https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off 
> of 
> [https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html].
>  
> In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
> results in the error: Could not find partitionDepth in partition metafile. 
> Querying from Spark or Hive works as expected.
> This problem does not occur in the docker env.
> cc: [~bhasudha]





[jira] [Updated] (HUDI-614) EMR Presto cannot read Hudi tables

2020-02-25 Thread Andrew Wong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated HUDI-614:
-
Description: 
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

Code I tried: 
[https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off of 
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.

 

In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
results in the error: Could not find partitionDepth in partition metafile. In 
docker environment Presto works fine.

cc: [~bhasudha]

  was:
Original issue: [https://github.com/apache/incubator-hudi/issues/1329]

I made a non-partitioned Hudi table using Spark. I was able to query it with 
Spark & Hive, but when I tried querying it with Presto, I received the error 
{{Could not find partitionDepth in partition metafile}}.

I attempted this task using emr-5.28.0 in AWS. I tried using the built-in 
spark-shell with both Amazon's /usr/lib/hudi/hudi-spark-bundle.jar (following 
[https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/)]
 and the org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating jar (following 
[https://hudi.apache.org/docs/quick-start-guide.html]).

I used NonpartitionedKeyGenerator & NonPartitionedExtractor in my write 
options, according to 
[https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIuseDeltaStreamerorSparkDataSourceAPItowritetoaNon-partitionedHudidataset?].
 You can see my code in the github issue linked above.

In both cases I see the .hoodie_partition_metadata file was created in the 
table path in S3. Querying the table worked in spark-shell & hive-cli, but 
attempting to query the table in presto-cli resulted in the error, "Could not 
find partitionDepth in partition metafile".

Please look into the bug or check the documentation. If there is a problem with 
the EMR install I can contact the AWS team responsible.

cc: [~bhasudha]


> EMR Presto cannot read Hudi tables
> --
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> Code I tried: 
> [https://gist.github.com/popart/c16a4661528fe1819aa63a1fed351e0c,] based off 
> of 
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.
>  
> In AWS EMR 5.28.0 & 5.28.0, attempting to query a hudi table from Presto 
> results in the error: Could not find partitionDepth in partition metafile. In 
> docker environment Presto works fine.
> cc: [~bhasudha]





[jira] [Updated] (HUDI-614) EMR Presto cannot read Hudi tables

2020-02-25 Thread Andrew Wong (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wong updated HUDI-614:
-
Summary: EMR Presto cannot read Hudi tables  (was: 
.hoodie_partition_metadata created for non-partitioned table)

> EMR Presto cannot read Hudi tables
> --
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> I made a non-partitioned Hudi table using Spark. I was able to query it with 
> Spark & Hive, but when I tried querying it with Presto, I received the error 
> {{Could not find partitionDepth in partition metafile}}.
> I attempted this task using emr-5.28.0 in AWS. I tried using the built-in 
> spark-shell with both Amazon's /usr/lib/hudi/hudi-spark-bundle.jar (following 
> [https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/)]
>  and the org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating jar 
> (following [https://hudi.apache.org/docs/quick-start-guide.html]).
> I used NonpartitionedKeyGenerator & NonPartitionedExtractor in my write 
> options, according to 
> [https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIuseDeltaStreamerorSparkDataSourceAPItowritetoaNon-partitionedHudidataset?].
>  You can see my code in the github issue linked above.
> In both cases I see the .hoodie_partition_metadata file was created in the 
> table path in S3. Querying the table worked in spark-shell & hive-cli, but 
> attempting to query the table in presto-cli resulted in the error, "Could not 
> find partitionDepth in partition metafile".
> Please look into the bug or check the documentation. If there is a problem 
> with the EMR install I can contact the AWS team responsible.
> cc: [~bhasudha]





[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
n3nash commented on a change in pull request #1355: [HUDI-633] limit archive 
file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384113038
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 ##
 @@ -103,6 +104,8 @@
   private static final String DEFAULT_MAX_COMMITS_TO_KEEP = "30";
   private static final String DEFAULT_MIN_COMMITS_TO_KEEP = "20";
   private static final String DEFAULT_COMMITS_ARCHIVAL_BATCH_SIZE = 
String.valueOf(10);
+  // Do not read more than 6MB at a time (6MB observed p95 in prod from 2 
datasets)
 
 Review comment:
   Please remove comments about "prod" and leave a generic comment.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
n3nash commented on a change in pull request #1355: [HUDI-633] limit archive 
file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384113186
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 ##
 @@ -103,6 +104,8 @@
   private static final String DEFAULT_MAX_COMMITS_TO_KEEP = "30";
   private static final String DEFAULT_MIN_COMMITS_TO_KEEP = "20";
   private static final String DEFAULT_COMMITS_ARCHIVAL_BATCH_SIZE = 
String.valueOf(10);
+  // Do not read more than 6MB at a time (6MB observed p95 in prod from 2 
datasets)
+  private static final String DEFAULT_COMMITS_ARCHIVAL_MEM_SIZE = 
String.valueOf(6 * 1024 * 1024);
 
 Review comment:
   Also, 6MB seems very small; we should increase this size.




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-25 Thread GitBox
n3nash commented on a change in pull request #1355: [HUDI-633] limit archive 
file block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355#discussion_r384112616
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -245,11 +249,23 @@ public void archive(List instants) throws 
HoodieCommitException {
   Schema wrapperSchema = HoodieArchivedMetaEntry.getClassSchema();
   LOG.info("Wrapper schema " + wrapperSchema.toString());
   List records = new ArrayList<>();
+  long totalInMemSize = 0;
   for (HoodieInstant hoodieInstant : instants) {
 try {
-  records.add(convertToAvroRecord(commitTimeline, hoodieInstant));
-  if (records.size() >= this.config.getCommitArchivalBatchSize()) {
+  IndexedRecord record = convertToAvroRecord(commitTimeline, 
hoodieInstant);
+  totalInMemSize += this.sizeEstimator.sizeEstimate(record);
 
 Review comment:
   Why do we need to call sizeEstimate for every record? Can we call it for X 
number of records and use that average? Size estimation of an object is 
CPU-intensive and time-consuming.
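
   One possible shape of that sampling, purely as an illustration (satishkotha's 
reply elsewhere in this thread notes the caveat: clean-file sizes are very uneven, 
so a running average can badly undercount an occasional huge record):

       // Illustration only: measure every SAMPLE_INTERVAL-th record with the
       // (CPU-intensive) size estimator and reuse a running average in between.
       private static final int SAMPLE_INTERVAL = 100;  // hypothetical sampling rate
       private long seen = 0;
       private long avgRecordSize = 0;

       private long estimatedSize(IndexedRecord record) {
         if (seen % SAMPLE_INTERVAL == 0) {
           long measured = sizeEstimator.sizeEstimate(record);  // expensive, done rarely
           avgRecordSize = (seen == 0) ? measured : (avgRecordSize + measured) / 2;
         }
         seen++;
         return avgRecordSize;
       }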




[jira] [Closed] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-625.
---
Resolution: Fixed

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
>

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Status: Open  (was: New)

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stackstrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native 

[GitHub] [incubator-hudi] ramachandranms commented on issue #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-25 Thread GitBox
ramachandranms commented on issue #1347: [HUDI-627] Aggregate code coverage and 
publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#issuecomment-591009814
 
 
   @vinothchandar Codecov lets us just upload reports without ownership. Here is 
a screenshot of the admin page; it says that anyone with GitHub permissions 
will be able to modify the Codecov settings for that repo.
   
   
![image](https://user-images.githubusercontent.com/2144396/75277426-d6f3ee00-57bc-11ea-9e1f-0973c6e8d73a.png)
   




[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-25 Thread GitBox
ramachandranms commented on a change in pull request #1347: [HUDI-627] 
Aggregate code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r384053060
 
 

 ##
 File path: pom.xml
 ##
 @@ -51,6 +51,7 @@
 packaging/hudi-timeline-server-bundle
 docker/hoodie/hadoop
 hudi-integ-test
+hudi-test-coverage-aggregator
 
 Review comment:
   For now we cannot do this without adding a new module for JaCoCo. There is a 
[PR](https://github.com/jacoco/jacoco/pull/1007) out in JaCoCo OSS to avoid 
this, but it is not getting much traction. I am following it closely and will 
remove this module once that PR gets merged into a release. For now this is 
just a wrapper module that does not get deployed.




[GitHub] [incubator-hudi] vikrantgoel commented on a change in pull request #1353: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-25 Thread GitBox
vikrantgoel commented on a change in pull request #1353: [HUDI-632] Update 
documentation (docker_demo) to mention both commit and deltacommit files
URL: https://github.com/apache/incubator-hudi/pull/1353#discussion_r384043681
 
 

 ##
 File path: content/docs/docker_demo.html
 ##
 @@ -543,7 +543,7 @@ Step 2: Incrementally
 You can use HDFS web-browser to look at the tables
 http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow.
 
-You can explore the new partition folder created in the table along with a 
“deltacommit”
+You can explore the new partition folder created in the table along with a 
“commit”/“deltacommit”
 
 Review comment:
   @lamber-ken @vinothchandar Thanks, updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1357: [HUDI-636] Fix could not get sources warnings while compiling

2020-02-25 Thread GitBox
bvaradar commented on a change in pull request #1357: [HUDI-636] Fix could not 
get sources warnings while compiling
URL: https://github.com/apache/incubator-hudi/pull/1357#discussion_r384002451
 
 

 ##
 File path: packaging/hudi-hadoop-mr-bundle/pom.xml
 ##
 @@ -48,7 +48,7 @@
   shade
 
 
-  <createSourcesJar>true</createSourcesJar>
+  <createSourcesJar>false</createSourcesJar>
 
 Review comment:
   Does this mean the sources jar will no longer be generated? Instead, can you 
enable the sources jar for non-bundle packages like hudi-common, hudi-client, 
etc.?
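   
   For illustration, attaching a sources jar in a non-bundle module such as 
hudi-common could be a one-plugin change (a sketch; placement in the actual poms 
may differ):
   
   ```
   <!-- Sketch: maven-source-plugin attaches <module>-sources.jar at package time. -->
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-source-plugin</artifactId>
     <executions>
       <execution>
         <id>attach-sources</id>
         <goals><goal>jar-no-fork</goal></goals>
       </execution>
     </executions>
   </plugin>
   ```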


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar merged pull request #1356: [HUDI-580] Fix incorrect license header in files

2020-02-25 Thread GitBox
bvaradar merged pull request #1356: [HUDI-580] Fix incorrect license header in 
files
URL: https://github.com/apache/incubator-hudi/pull/1356
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (83c8ad5 -> 11fb2c2)

2020-02-25 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 83c8ad5  [HUDI-625] Fixing performance issues around DiskBasedMap & 
kryo (#1352)
 add 11fb2c2  [HUDI-580] Fix incorrect license header in files

No new revisions were added by this update.

Summary of changes:
 docker/build_local_docker_images.sh   | 2 --
 docker/compose/hadoop.env | 3 +--
 docker/demo/compaction.commands   | 3 +--
 docker/demo/config/base.properties| 3 +--
 docker/demo/config/dfs-source.properties  | 3 +--
 docker/demo/config/kafka-source.properties| 3 +--
 docker/demo/get_min_commit_time.sh| 3 +--
 docker/demo/hive-batch1.commands  | 3 +--
 docker/demo/hive-batch2-after-compaction.commands | 3 +--
 docker/demo/hive-incremental.commands | 3 +--
 docker/demo/hive-table-check.commands | 3 +--
 docker/demo/setup_demo_container.sh   | 3 +--
 docker/hoodie/hadoop/base/Dockerfile  | 3 +--
 docker/hoodie/hadoop/base/entrypoint.sh   | 2 --
 docker/hoodie/hadoop/base/export_container_ip.sh  | 3 +--
 docker/hoodie/hadoop/datanode/Dockerfile  | 3 +--
 docker/hoodie/hadoop/datanode/run_dn.sh   | 2 --
 docker/hoodie/hadoop/historyserver/Dockerfile | 3 +--
 docker/hoodie/hadoop/historyserver/run_history.sh | 2 --
 docker/hoodie/hadoop/hive_base/Dockerfile | 3 +--
 docker/hoodie/hadoop/hive_base/conf/beeline-log4j2.properties | 3 +--
 docker/hoodie/hadoop/hive_base/conf/hive-env.sh   | 3 +--
 docker/hoodie/hadoop/hive_base/conf/hive-exec-log4j2.properties   | 3 +--
 docker/hoodie/hadoop/hive_base/conf/hive-log4j2.properties| 3 +--
 docker/hoodie/hadoop/hive_base/conf/llap-daemon-log4j2.properties | 3 +--
 docker/hoodie/hadoop/hive_base/entrypoint.sh  | 2 --
 docker/hoodie/hadoop/hive_base/startup.sh | 2 --
 docker/hoodie/hadoop/namenode/Dockerfile  | 3 +--
 docker/hoodie/hadoop/namenode/run_nn.sh   | 2 --
 docker/hoodie/hadoop/prestobase/Dockerfile| 3 +--
 docker/hoodie/hadoop/prestobase/bin/entrypoint.sh | 2 --
 docker/hoodie/hadoop/prestobase/bin/mustache.sh   | 2 --
 docker/hoodie/hadoop/prestobase/etc/catalog/hive.properties   | 4 ++--
 docker/hoodie/hadoop/prestobase/etc/catalog/jmx.properties| 4 ++--
 docker/hoodie/hadoop/prestobase/etc/catalog/localfile.properties  | 4 ++--
 docker/hoodie/hadoop/prestobase/etc/coordinator.properties.mustache   | 4 ++--
 docker/hoodie/hadoop/prestobase/etc/jvm.config.mustache   | 4 ++--
 docker/hoodie/hadoop/prestobase/etc/log.properties| 4 ++--
 docker/hoodie/hadoop/prestobase/etc/node.properties.mustache  | 4 ++--
 docker/hoodie/hadoop/prestobase/etc/worker.properties.mustache| 4 ++--
 docker/hoodie/hadoop/prestobase/lib/mustache.sh   | 3 +--
 docker/hoodie/hadoop/spark_base/Dockerfile| 3 +--
 docker/hoodie/hadoop/spark_base/execute-step.sh   | 2 --
 docker/hoodie/hadoop/spark_base/finish-step.sh| 2 --
 docker/hoodie/hadoop/spark_base/wait-for-step.sh  | 2 --
 docker/hoodie/hadoop/sparkadhoc/Dockerfile| 3 +--
 docker/hoodie/hadoop/sparkadhoc/adhoc.sh  | 2 --
 docker/hoodie/hadoop/sparkmaster/Dockerfile   | 3 +--
 docker/hoodie/hadoop/sparkmaster/master.sh| 2 --
 docker/hoodie/hadoop/sparkworker/Dockerfile   | 3 +--
 docker/hoodie/hadoop/sparkworker/worker.sh| 2 --
 docker/setup_demo.sh  | 2 --
 docker/stop_demo.sh   | 2 --
 hudi-cli/conf/hudi-env.sh | 2 --
 hudi-cli/hudi-cli.sh  | 2 --
 hudi-hive/run_sync_tool.sh| 2 --
 hudi-spark/run_hoodie_app.sh  | 2 --
 .../META-INF/services/org.apache.spark.sql.sources.DataSourceRegister | 3 +--
 

[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1357: [HUDI-636] Fix could not get sources warnings while compiling

2020-02-25 Thread GitBox
lamber-ken opened a new pull request #1357: [HUDI-636] Fix could not get 
sources warnings while compiling
URL: https://github.com/apache/incubator-hudi/pull/1357
 
 
   ## What is the purpose of the pull request
   
   During the voting process on the rc1 0.5.1-incubating release[1], Justin pointed 
out that the mvn log displays "could not get sources" warnings, like
   ```
   
   [INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle 
---
   [INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the 
shaded jar.
   Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
   [WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
   [INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 
from the shaded jar.
   [INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 
from the shaded jar.
   [INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
   [INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the 
shaded jar.
   [INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded 
jar.
   [INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the 
shaded jar.
   [INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the 
shaded jar.
   [INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
   [INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
   [INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded 
jar.
   [INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
   [INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
   ```
   
   [1] 
https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E
   
   ## Brief change log
   
 - Set createSourcesJar to false (see the sketch below).
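   
   The relevant shade-plugin setting, roughly (surrounding pom structure elided):
   
   ```
   <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <configuration>
       <!-- stop the shade plugin from building/fetching sources jars -->
       <createSourcesJar>false</createSourcesJar>
     </configuration>
   </plugin>
   ```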
   
   ## Verify this pull request
   
   ```
   mvn clean package -DskipTests -DskipITs
   ```
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-636:

Labels: pull-request-available  (was: )

> Fix could not get sources warnings while compiling 
> ---
>
> Key: HUDI-636
> URL: https://issues.apache.org/jira/browse/HUDI-636
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>
> During the voting process on the rc1 0.5.1-incubating release, Justin pointed out 
> that the mvn log displays "could not get sources" warnings
>  
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> {code:java}
> [INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle 
> ---
> [INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
> jar.
> Downloading from aliyun: 
> http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from cloudera: 
> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from confluent: 
> https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-milestone: 
> https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-release: 
> https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from apache.snapshots: 
> https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> [WARNING] Could not get sources for 
> org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
> [INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 
> from the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
> the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the 
> shaded jar.
> [INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded 
> jar.
> [INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
> jar.
> [INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
> [INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
> [INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
> [INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
> [INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-636:

Status: Open  (was: New)

> Fix could not get sources warnings while compiling 
> ---
>
> Key: HUDI-636
> URL: https://issues.apache.org/jira/browse/HUDI-636
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> During the voting process on the rc1 0.5.1-incubating release, Justin pointed out 
> that the mvn log displays "could not get sources" warnings
>  
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> {code:java}
> [INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle 
> ---
> [INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
> jar.
> Downloading from aliyun: 
> http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from cloudera: 
> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from confluent: 
> https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-milestone: 
> https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-release: 
> https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from apache.snapshots: 
> https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> [WARNING] Could not get sources for 
> org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
> [INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 
> from the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
> the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the 
> shaded jar.
> [INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded 
> jar.
> [INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
> jar.
> [INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
> [INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
> [INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
> [INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
> [INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-636:

Description: 
During the voting process on the rc1 0.5.1-incubating release, Justin pointed out 
that the mvn log displays "could not get sources" warnings

 

[https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]

 
{code:java}
[INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle ---
[INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
jar.
Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
[WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
[INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
[INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded jar.
[INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the shaded 
jar.
[INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
[INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
[INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
[INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
[INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
{code}

  was:
During the voting process on rc1 0.5.1-incubating release, Justin pointed out

There are other similar warnings as well

[https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]

 
{code:java}
[INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle ---
[INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
jar.
Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
[WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
[INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
[INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded jar.
[INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the shaded 
jar.
[INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
[INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
[INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
[INFO] 

[jira] [Assigned] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-636:
---

Assignee: lamber-ken

> Fix could not get sources warnings while compiling 
> ---
>
> Key: HUDI-636
> URL: https://issues.apache.org/jira/browse/HUDI-636
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> During the voting process on the rc1 0.5.1-incubating release, Justin pointed out 
> that the mvn log displays "could not get sources" warnings
>  
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> {code:java}
> [INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle 
> ---
> [INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
> jar.
> Downloading from aliyun: 
> http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from cloudera: 
> https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from confluent: 
> https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-milestone: 
> https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from libs-release: 
> https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> Downloading from apache.snapshots: 
> https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
> [WARNING] Could not get sources for 
> org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
> [INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 
> from the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
> the shaded jar.
> [INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the 
> shaded jar.
> [INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded 
> jar.
> [INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the 
> shaded jar.
> [INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
> jar.
> [INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
> [INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
> [INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
> [INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
> [INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-636:

Description: 
During the voting process on rc1 0.5.1-incubating release, Justin pointed out

There are other similar warnings as well

[https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]

 
{code:java}
[INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle ---
[INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
jar.
Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
[WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
[INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
[INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded jar.
[INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the shaded 
jar.
[INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
[INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
[INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
[INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
[INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
{code}

  was:
{code:java}
[INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle ---
[INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
jar.
Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
[WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
[INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
[INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded jar.
[INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the shaded 
jar.
[INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
[INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
[INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
[INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
[INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
{code}


> Fix could not get sources warnings while compiling 
> ---
>
> Key: 

[jira] [Created] (HUDI-636) Fix could not get sources warnings while compiling

2020-02-25 Thread lamber-ken (Jira)
lamber-ken created HUDI-636:
---

 Summary: Fix could not get sources warnings while compiling 
 Key: HUDI-636
 URL: https://issues.apache.org/jira/browse/HUDI-636
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: lamber-ken


{code:java}
[INFO] --- maven-shade-plugin:3.1.1:shade (default) @ hudi-hadoop-mr-bundle ---
[INFO] Including org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT in the shaded 
jar.
Downloading from aliyun: 
http://maven.aliyun.com/nexus/content/groups/public/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from cloudera: 
https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from confluent: 
https://packages.confluent.io/maven/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-milestone: 
https://repo.spring.io/libs-milestone/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from libs-release: 
https://repo.spring.io/libs-release/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
Downloading from apache.snapshots: 
https://repository.apache.org/snapshots/org/apache/hudi/hudi-common/0.5.2-SNAPSHOT/hudi-common-0.5.2-SNAPSHOT-sources.jar
[WARNING] Could not get sources for 
org.apache.hudi:hudi-common:jar:0.5.2-SNAPSHOT:compile
[INFO] Excluding com.fasterxml.jackson.core:jackson-annotations:jar:2.6.7 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-databind:jar:2.6.7.1 from 
the shaded jar.
[INFO] Excluding com.fasterxml.jackson.core:jackson-core:jar:2.6.7 from the 
shaded jar.
[INFO] Excluding org.apache.httpcomponents:fluent-hc:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-logging:commons-logging:jar:1.1.3 from the shaded jar.
[INFO] Excluding org.apache.httpcomponents:httpclient:jar:4.3.6 from the shaded 
jar.
[INFO] Excluding org.apache.httpcomponents:httpcore:jar:4.3.2 from the shaded 
jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.6 from the shaded jar.
[INFO] Excluding org.rocksdb:rocksdbjni:jar:5.17.2 from the shaded jar.
[INFO] Including com.esotericsoftware:kryo-shaded:jar:4.0.2 in the shaded jar.
[INFO] Including com.esotericsoftware:minlog:jar:1.3.0 in the shaded jar.
[INFO] Including org.objenesis:objenesis:jar:2.5.1 in the shaded jar.
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-02-25 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-635:
---

 Summary: MergeHandle's DiskBasedMap entries can be thinner
 Key: HUDI-635
 URL: https://issues.apache.org/jira/browse/HUDI-635
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Writer Core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar


Instead of spilling the full HoodieRecord, we can just track the record key and 
its location ... Helps with use-cases like HUDI-625
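
A rough illustration of the intended shape (class and field names below are 
hypothetical, not the actual Hudi API):

{code:java}
// Hypothetical sketch only. The idea: the spillable map stores a thin
// (recordKey -> location) entry instead of serializing the whole
// HoodieRecord payload to disk.
class ThinSpillEntry implements java.io.Serializable {
  final String recordKey;   // from HoodieKey
  final String instantTime; // from HoodieRecordLocation
  final String fileId;      // from HoodieRecordLocation

  ThinSpillEntry(String recordKey, String instantTime, String fileId) {
    this.recordKey = recordKey;
    this.instantTime = instantTime;
    this.fileId = fileId;
  }
}
{code}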



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-25 Thread GitBox
vinothchandar commented on a change in pull request #1351: [WIP] [HUDI-625] 
Fixing performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#discussion_r383703512
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -86,6 +98,129 @@
 return (T) SERIALIZER_REF.get().deserialize(objectData);
   }
 
+  public static class HoodieKeySerializer extends Serializer<HoodieKey> 
implements Serializable {
 
 Review comment:
   These are no longer needed


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-25 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-625:

Fix Version/s: (was: 0.6.0)
   0.5.2

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stacktrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at 

[jira] [Updated] (HUDI-580) Fix incorrect license header in files

2020-02-25 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-580:

Labels: compliance pull-request-available  (was: compliance)

> Fix incorrect license header in files
> -
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: compliance, pull-request-available
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1356: [HUDI-580] Fix incorrect license header in files

2020-02-25 Thread GitBox
lamber-ken opened a new pull request #1356: [HUDI-580] Fix incorrect license 
header in files
URL: https://github.com/apache/incubator-hudi/pull/1356
 
 
   ## What is the purpose of the pull request
   
   During the voting process on the rc1 0.5.1-incubating release, Justin pointed 
out that docker/hoodie/hadoop/base/entrypoint.sh has an incorrect license header. 
Many other script files use the same license header as "entrypoint.sh".
   
   The Apache license guide[2] says "The text should be enclosed in the 
appropriate comment syntax for the file format", so the repeated "#" needs to be 
removed (see the sketch after the reference links below).
   
   Also referred to:
   https://github.com/apache/hive/blob/master/bin/init-hive-dfs.sh
   https://github.com/apache/hive/blob/master/bin/hive-config.sh
   
   https://github.com/apache/kafka/blob/trunk/bin/connect-distributed.sh
   https://github.com/apache/kafka/blob/trunk/bin/kafka-leader-election.sh
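   
   For illustration, the corrected shell-script header uses a single "#" per line, 
i.e. the standard ASF header form:
   
   ```
   # Licensed to the Apache Software Foundation (ASF) under one
   # or more contributor license agreements.  See the NOTICE file
   # distributed with this work for additional information
   # regarding copyright ownership.  The ASF licenses this file
   # to you under the Apache License, Version 2.0 (the
   # "License"); you may not use this file except in compliance
   # with the License.  You may obtain a copy of the License at
   #
   #   http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing,
   # software distributed under the License is distributed on an
   # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   # KIND, either express or implied.  See the License for the
   # specific language governing permissions and limitations
   # under the License.
   ```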
   
   ## Brief change log
   
 - Correct the license headers
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-580) Fix incorrect license header in files

2020-02-25 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-580:

Summary: Fix incorrect license header in files  (was: Incorrect license 
header in docker/hoodie/hadoop/base/entrypoint.sh)

> Fix incorrect license header in files
> -
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: compliance
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)