[jira] [Updated] (HUDI-725) Remove or rewrite init log in DeltaSync

2020-03-19 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-725:
-
Description: 
When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
initialized in turn, and both print the same init-config log line.

org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363 - 
LOG.info("Creating delta streamer with configs : " + props.toString());

org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 - LOG.info("Creating 
delta streamer with configs : " + props.toString());

So, I think the log at 
org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
rewritten.
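
For illustration, a minimal sketch of one possible rewrite of the line at 
DeltaSync.java:171 (hypothetical wording; nothing beyond the two quoted lines 
comes from the Hudi source):

```java
// Hypothetical rewrite of DeltaSync.java:171: logging a DeltaSync-specific
// message instead of repeating the HoodieDeltaStreamer one, so each init
// step is distinguishable in the logs.
LOG.info("Creating delta sync with configs : " + props.toString());
```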

  was:
When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
initialized in turn, and both print the same init-config log line.

org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363

LOG.info("Creating delta streamer with configs : " + props.toString());

org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171

LOG.info("Creating delta streamer with configs : " + props.toString());

So, I think the log at 
org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
rewritten.


> Remove or rewrite init log in DeltaSync
> 
>
> Key: HUDI-725
> URL: https://issues.apache.org/jira/browse/HUDI-725
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Trivial
>
> When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
> initialized in turn, and both print the same init-config log line.
> org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363 - 
> LOG.info("Creating delta streamer with configs : " + props.toString());
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 - 
> LOG.info("Creating delta streamer with configs : " + props.toString());
> So, I think the log at 
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
> rewritten.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-19 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-726:
-
Description: 
It seems that this method 
'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
 has never been used.

Can we delete it?

  was:
It seems that this method 
'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
 is never used.

Can we delete it?


> Delete unused method in HoodieDeltaStreamer
> ---
>
> Key: HUDI-726
> URL: https://issues.apache.org/jira/browse/HUDI-726
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Minor
>
> It seems that this method 
> 'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
>  has never been used.
> Can we delete it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-19 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-726:
-
Description: 
It seems that this method 
'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
 is never used.

Can we delete it?

  was:
It seems the method 
'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
 is never used.

Can we delete it?


> Delete unused method in HoodieDeltaStreamer
> ---
>
> Key: HUDI-726
> URL: https://issues.apache.org/jira/browse/HUDI-726
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Minor
>
> It seems that this method 
> 'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
>  is never used.
> Can we delete it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-19 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063105#comment-17063105
 ] 

wangxianghu commented on HUDI-726:
--

[~vinoth] what do you think ?

> Delete unused method in HoodieDeltaStreamer
> ---
>
> Key: HUDI-726
> URL: https://issues.apache.org/jira/browse/HUDI-726
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Minor
>
> It seems the method 
> 'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
>  is never used.
> Can we delete it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-725) Remove or rewrite init log in DeltaSync

2020-03-19 Thread wangxianghu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063104#comment-17063104
 ] 

wangxianghu commented on HUDI-725:
--

[~vinoth] what do you think ?

> Remove or rewrite init log in DeltaSync
> 
>
> Key: HUDI-725
> URL: https://issues.apache.org/jira/browse/HUDI-725
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Trivial
>
> When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
> initialized in turn, and both print the same init-config log line.
> org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363
> LOG.info("Creating delta streamer with configs : " + props.toString());
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171
> LOG.info("Creating delta streamer with configs : " + props.toString());
> So, I think the log at 
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
> rewritten.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-726) Delete unused method in HoodieDeltaStreamer

2020-03-19 Thread wangxianghu (Jira)
wangxianghu created HUDI-726:


 Summary: Delete unused method in HoodieDeltaStreamer
 Key: HUDI-726
 URL: https://issues.apache.org/jira/browse/HUDI-726
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: wangxianghu


It seems the method 
'org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#getDeltaSyncService'
 is never used.

Can we delete it?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395433296
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we 
have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   
averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
 
 Review comment:
   good point, will move to method like 
getSmallFilesForPartitions(partitionPaths, jsc)
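
   For reference, a rough sketch of what such an extracted helper might look 
like (the method name comes from the comment above; the generics, null 
handling, and enclosing UpsertPartitioner context are assumptions, not the 
merged code):

   ```java
   // Hypothetical helper extracted from assignInserts(); assumes getSmallFiles()
   // is available on the enclosing partitioner, as in the diff above.
   private Map<String, List<SmallFile>> getSmallFilesForPartitions(
       List<String> partitionPaths, JavaSparkContext jsc) {
     Map<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
     if (partitionPaths == null || partitionPaths.isEmpty()) {
       return partitionSmallFilesMap;
     }
     if (jsc != null && partitionPaths.size() > 1) {
       // Fan the per-partition listing out to the cluster, one task per
       // partition, then collect the (path -> smallFiles) pairs on the driver.
       partitionSmallFilesMap.putAll(
           jsc.parallelize(partitionPaths, partitionPaths.size())
               .mapToPair(path -> new Tuple2<>(path, getSmallFiles(path)))
               .collectAsMap());
     } else {
       // A single partition (or no Spark context): a plain driver-side loop
       // is cheaper than scheduling a Spark job.
       for (String path : partitionPaths) {
         partitionSmallFilesMap.put(path, getSmallFiles(path));
       }
     }
     return partitionSmallFilesMap;
   }
   ```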


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395433104
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we 
have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   
averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
+//Parallelize the GetSmallFile operation by using RDDs
+List<String> partitionPathsList = new ArrayList<>(partitionPaths);
+JavaRDD<String> partitionPathRdds = jsc.parallelize(partitionPathsList, partitionPathsList.size());
+List<Tuple2<String, List<SmallFile>>> partitionSmallFileTuples =
+partitionPathRdds.map(it -> new Tuple2<String, List<SmallFile>>(it, getSmallFiles(it))).collect();
+
+for (Tuple2<String, List<SmallFile>> tuple : partitionSmallFileTuples) {
+  partitionSmallFilesMap.put(tuple._1, tuple._2);
+}
+  }
+
   for (String partitionPath : partitionPaths) {
 WorkloadStat pStat = profile.getWorkloadStat(partitionPath);
 if (pStat.getNumInserts() > 0) {
 
-  List<SmallFile> smallFiles = getSmallFiles(partitionPath);
+  List<SmallFile> smallFiles;
+  if (partitionSmallFilesMap.isEmpty()) {
 
 Review comment:
   sounds good. will do


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
ffcchi commented on a change in pull request #1421: [HUDI-724] Parallelize 
getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395432973
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java
 ##
 @@ -374,16 +374,12 @@ public void finalizeWrite(JavaSparkContext jsc, String 
instantTs, List

Build failed in Jenkins: hudi-snapshot-deployment-0.5 #222

2020-03-19 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.37 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Updated] (HUDI-725) Remove or rewrite init log in DeltaSync

2020-03-19 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-725:
-
Description: 
When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
initialized in turn, and both print the same init-config log line.

org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363

LOG.info("Creating delta streamer with configs : " + props.toString());

org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171

LOG.info("Creating delta streamer with configs : " + props.toString());

So, I think the log at 
org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
rewritten.

  was:
When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
initialized in turn, and both print the same init-config log line.

org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363

LOG.info("Creating delta streamer with configs : " + props.toString());

org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171

LOG.info("Creating delta streamer with configs : " + props.toString());

So, I think the log at 
org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
rewritten.


> Remove or rewrite init log in DeltaSync
> 
>
> Key: HUDI-725
> URL: https://issues.apache.org/jira/browse/HUDI-725
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Priority: Trivial
>
> When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
> initialized in turn, and both print the same init-config log line.
> org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363
> LOG.info("Creating delta streamer with configs : " + props.toString());
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171
> LOG.info("Creating delta streamer with configs : " + props.toString());
> So, I think the log at 
> org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
> rewritten.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-725) Remove or rewrite init log in DeltaSync

2020-03-19 Thread wangxianghu (Jira)
wangxianghu created HUDI-725:


 Summary: Remove or rewrite init log in DeltaSync
 Key: HUDI-725
 URL: https://issues.apache.org/jira/browse/HUDI-725
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: wangxianghu


When HoodieDeltaStreamer is initialized, DeltaSyncService and DeltaSync are 
initialized in turn, and both print the same init-config log line.

org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java:363

LOG.info("Creating delta streamer with configs : " + props.toString());

org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171

LOG.info("Creating delta streamer with configs : " + props.toString());

So, I think the log at 
org/apache/hudi/utilities/deltastreamer/DeltaSync.java:171 can be removed or 
rewritten.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io commented on issue #1422: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
codecov-io commented on issue #1422: [HUDI-400]check upgrade from old plan to 
new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1422#issuecomment-601500333
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=h1) 
Report
   > Merging 
[#1422](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/a752b7b18c5c74c0c9115168188fa4a15a670225?src=pr=desc)
 will **increase** coverage by `0.04%`.
   > The diff coverage is `100%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1422/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=tree)
   
   ```diff
    @@             Coverage Diff              @@
    ##             master    #1422      +/-   ##
    ============================================
    + Coverage     67.51%   67.56%   +0.04%
      Complexity      255      255
    ============================================
      Files           340      340
      Lines         16499    16499
      Branches       1686     1687       +1
    ============================================
    + Hits          11139    11147       +8
    + Misses         4622     4615       -7
    + Partials        738      737       -1
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...oning/compaction/CompactionV2MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1422/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `84.21% <100%> (+21.05%)` | `0 <0> (ø)` | :arrow_down: |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1422/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <0%> (-13.52%)` | `0% <0%> (ø)` | |
   | 
[...pache/hudi/common/versioning/MetadataMigrator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1422/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvTWV0YWRhdGFNaWdyYXRvci5qYXZh)
 | `83.33% <0%> (+25%)` | `0% <0%> (ø)` | :arrow_down: |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=footer).
 Last update 
[a752b7b...bf8ecdf](https://codecov.io/gh/apache/incubator-hudi/pull/1422?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601498123
 
 
   > I feel it can go on the contributing guide.. Code reviews are also 
contributing :) .. either way is fine by me.. Draft something and share on the 
mailing list?
   
   Sure, will draft when get a chance.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-653] Add JMX Report Config to Doc (#1370)

2020-03-19 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 434e9c5  [HUDI-653] Add JMX Report Config to Doc (#1370)
434e9c5 is described below

commit 434e9c5c125b018af13ee5308631694660780700
Author: ForwardXu 
AuthorDate: Fri Mar 20 10:16:03 2020 +0800

[HUDI-653] Add JMX Report Config to Doc (#1370)
---
 docs/_docs/2_4_configurations.cn.md | 38 ++---
 docs/_docs/2_4_configurations.md| 42 +
 docs/_docs/2_6_deployment.cn.md |  4 ++--
 docs/_docs/2_6_deployment.md|  2 +-
 4 files changed, 67 insertions(+), 19 deletions(-)

diff --git a/docs/_docs/2_4_configurations.cn.md 
b/docs/_docs/2_4_configurations.cn.md
index d1cb780..1b1af05 100644
--- a/docs/_docs/2_4_configurations.cn.md
+++ b/docs/_docs/2_4_configurations.cn.md
@@ -433,29 +433,53 @@ Hudi提供了一个选项,可以通过将对该分区中的插入作为对现
 
 
 ### 指标配置
-能够将Hudi指标报告给graphite。
+配置Hudi指标报告。
 [withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) 
 Hudi会发布有关每次提交、清理、回滚等的指标。
 
- on(metricsOn = true) {#on} 
+ GRAPHITE
+
+# on(metricsOn = true) {#on}
 属性:`hoodie.metrics.on` 
 打开或关闭发送指标。默认情况下处于启用状态。
 
- withReporterType(reporterType = GRAPHITE) {#withReporterType} 
+# withReporterType(reporterType = GRAPHITE) {#withReporterType}
 属性:`hoodie.metrics.reporter.type` 
-指标报告者的类型。默认使用graphite,也是唯一支持的类型。
+指标报告者的类型。默认使用graphite。
 
- toGraphiteHost(host = localhost) {#toGraphiteHost} 
+# toGraphiteHost(host = localhost) {#toGraphiteHost}
 属性:`hoodie.metrics.graphite.host` 
 要连接的graphite主机
 
- onGraphitePort(port = 4756) {#onGraphitePort} 
+# onGraphitePort(port = 4756) {#onGraphitePort}
 属性:`hoodie.metrics.graphite.port` 
 要连接的graphite端口
 
- usePrefix(prefix = "") {#usePrefix} 
+# usePrefix(prefix = "") {#usePrefix}
 属性:`hoodie.metrics.graphite.metric.prefix` 
 适用于所有指标的标准前缀。这有助于添加如数据中心、环境等信息
+
+ JMX
+
+# on(metricsOn = true) {#on}
+属性:`hoodie.metrics.on` 
+打开或关闭发送指标。默认情况下处于启用状态。
+
+# withReporterType(reporterType = JMX) {#withReporterType}
+属性:`hoodie.metrics.reporter.type` 
+指标报告者的类型。
+
+# toJmxHost(host = localhost) {#toJmxHost}
+属性:`hoodie.metrics.jmx.host` 
+要连接的Jmx主机
+
+# onJmxPort(port = 1000-5000) {#onJmxPort}
+属性:`hoodie.metrics.graphite.port` 
+要连接的Jmx端口
+
+# usePrefix(prefix = "") {#usePrefix}
+属性:`hoodie.metrics.jmx.metric.prefix` 
+适用于所有指标的标准前缀。这有助于添加如数据中心、环境等信息
 
 ### 内存配置
 控制由Hudi内部执行的压缩和合并的内存使用情况
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index 16dacba..abb102b 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -398,29 +398,53 @@ Property: `hoodie.compaction.payload.class` 
 
 
 ### Metrics configs
-Enables reporting of Hudi metrics to graphite.
+Enables reporting on Hudi metrics.
 [withMetricsConfig](#withMetricsConfig) (HoodieMetricsConfig) 
-Hudi publishes metrics on every commit, clean, 
rollback etc.
+Hudi publishes metrics on every commit, clean, 
rollback etc.The following sections list the supported reporters.
 
- on(metricsOn = true) {#on} 
-Property: `hoodie.metrics.on` 
+ GRAPHITE
+
+# on(metricsOn = true) {#on}
+`hoodie.metrics.on` 
 Turn sending metrics on/off. on by default.
 
- withReporterType(reporterType = GRAPHITE) {#withReporterType} 
+# withReporterType(reporterType = GRAPHITE) {#withReporterType}
 Property: `hoodie.metrics.reporter.type` 
-Type of metrics reporter. Graphite is the default and 
the only value suppported.
+Type of metrics reporter.
 
- toGraphiteHost(host = localhost) {#toGraphiteHost} 
+# toGraphiteHost(host = localhost) {#toGraphiteHost}
 Property: `hoodie.metrics.graphite.host` 
 Graphite host to connect to
 
- onGraphitePort(port = 4756) {#onGraphitePort} 
+# onGraphitePort(port = 4756) {#onGraphitePort}
 Property: `hoodie.metrics.graphite.port` 
 Graphite port to connect to
 
- usePrefix(prefix = "") {#usePrefix} 
+# usePrefix(prefix = "") {#usePrefix}
 Property: `hoodie.metrics.graphite.metric.prefix` 
 Standard prefix applied to all metrics. This helps to 
add datacenter, environment information for e.g
+
+ JMX
+
+# on(metricsOn = true) {#on}
+`hoodie.metrics.on` 
+Turn sending metrics on/off. on by default.
+
+# withReporterType(reporterType = JMX) {#withReporterType}
+Property: `hoodie.metrics.reporter.type` 
+Type of metrics reporter.
+
+# toJmxHost(host = localhost) {#toJmxHost}
+Property: `hoodie.metrics.jmx.host` 
+Jmx host to connect to
+
+# onJmxPort(port = 1000-5000) {#onJmxPort}
+Property: `hoodie.metrics.jmx.port` 
+Jmx port to connect to
+
+# usePrefix(prefix = "") {#usePrefix}
+Property: `hoodie.metrics.jmx.metric.prefix` 
+Standard prefix applied to all metrics. This helps to 
add datacenter, environment information 
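
As a usage illustration (not part of the commit above): a hedged sketch of how 
the JMX options documented in this diff might be wired into a write config. 
The builder method names (on, withReporterType, toJmxHost, onJmxPort) are 
taken from the doc text itself; the surrounding HoodieWriteConfig API and the 
exact parameter types are assumptions.

```java
// Sketch only, under the assumptions stated above; basePath is a placeholder
// for the Hudi table base path.
HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
    .withPath(basePath)
    .withMetricsConfig(HoodieMetricsConfig.newBuilder()
        .on(true)                 // hoodie.metrics.on
        .withReporterType("JMX")  // hoodie.metrics.reporter.type
        .toJmxHost("localhost")   // hoodie.metrics.jmx.host
        .onJmxPort("9889")        // hoodie.metrics.jmx.port
        .build())
    .build();
```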

[GitHub] [incubator-hudi] leesf merged pull request #1370: [HUDI-653] Add JMX Report Config to Doc

2020-03-19 Thread GitBox
leesf merged pull request #1370: [HUDI-653] Add JMX Report Config to Doc
URL: https://github.com/apache/incubator-hudi/pull/1370
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063038#comment-17063038
 ] 

Vinoth Chandar commented on HUDI-724:
-

Seems legit... I have not seen this with HDFS at least... fix makes sense.. left 
comments on the PR 

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. Tracking down 
> where the time was spent on the spark driver pointed to the get-small-files 
> operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the load is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Since the partitions don't affect each other, the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, create an RDD of the partitions, 
> and eventually fan the get-small-files operations out to multiple tasks.
>  
> Screenshots attached: the gap without the improvement, and the spark stage 
> with the improvement (no gap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for 
Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601497167
 
 
   I feel it can go on the contributing guide.. Code reviews are also 
contributing :) .. either way is fine by me.. Draft something and share on the 
mailing list? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
vinothchandar commented on issue #1421: [HUDI-724] Parallelize getSmallFiles 
for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#issuecomment-601496939
 
 
   @bvaradar  to make final pass and sign off 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
vinothchandar commented on a change in pull request #1421: [HUDI-724] 
Parallelize getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395410649
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we 
have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   
averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
+//Parallelize the GetSmallFile operation by using RDDs
+List<String> partitionPathsList = new ArrayList<>(partitionPaths);
+JavaRDD<String> partitionPathRdds = jsc.parallelize(partitionPathsList, partitionPathsList.size());
+List<Tuple2<String, List<SmallFile>>> partitionSmallFileTuples =
+partitionPathRdds.map(it -> new Tuple2<String, List<SmallFile>>(it, getSmallFiles(it))).collect();
+
+for (Tuple2<String, List<SmallFile>> tuple : partitionSmallFileTuples) {
+  partitionSmallFilesMap.put(tuple._1, tuple._2);
+}
+  }
+
   for (String partitionPath : partitionPaths) {
 WorkloadStat pStat = profile.getWorkloadStat(partitionPath);
 if (pStat.getNumInserts() > 0) {
 
-  List<SmallFile> smallFiles = getSmallFiles(partitionPath);
+  List<SmallFile> smallFiles;
+  if (partitionSmallFilesMap.isEmpty()) {
 
 Review comment:
   Can we change that code so that it always lists in parallel.. i.e no need 
for the fallback in this if block... 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
vinothchandar commented on a change in pull request #1421: [HUDI-724] 
Parallelize getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395410993
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java
 ##
 @@ -374,16 +374,12 @@ public void finalizeWrite(JavaSparkContext jsc, String 
instantTs, List

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
vinothchandar commented on a change in pull request #1421: [HUDI-724] 
Parallelize getSmallFiles for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421#discussion_r395410562
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -602,18 +602,39 @@ private int addUpdateBucket(String fileIdHint) {
   return bucket;
 }
 
-private void assignInserts(WorkloadProfile profile) {
+private void assignInserts(WorkloadProfile profile, JavaSparkContext jsc) {
   // for new inserts, compute buckets depending on how many records we 
have for each partition
   Set<String> partitionPaths = profile.getPartitionPaths();
   long averageRecordSize =
   
averageBytesPerRecord(metaClient.getActiveTimeline().getCommitTimeline().filterCompletedInstants(),
   config.getCopyOnWriteRecordSizeEstimate());
   LOG.info("AvgRecordSize => " + averageRecordSize);
+
+  HashMap<String, List<SmallFile>> partitionSmallFilesMap = new HashMap<>();
+  if (jsc != null && partitionPaths.size() > 1) {
 
 Review comment:
   I think it's a reasonable thing to parallelize this.. Listing of cleaning etc 
has been parallelized like this before. Should be safe to do. 
   
   Also can we pull this block into a method? `getSmallFiles(partitionPaths)` 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-400) Add more checks to TestCompactionUtils#testUpgradeDowngrade

2020-03-19 Thread jerry (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jerry updated HUDI-400:
---
Status: In Progress  (was: Open)

> Add more checks to TestCompactionUtils#testUpgradeDowngrade
> ---
>
> Key: HUDI-400
> URL: https://issues.apache.org/jira/browse/HUDI-400
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie, Testing
>Reporter: leesf
>Assignee: jerry
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Currently, the TestCompactionUtils#testUpgradeDowngrade does not check 
> upgrade from old plan to new plan, it is proper to add some checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] zhaomin1423 opened a new pull request #1422: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
zhaomin1423 opened a new pull request #1422: [HUDI-400]check upgrade from old 
plan to new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1422
 
 
   ## What is the purpose of the pull request
   Add more tests for compaction plan upgrade
   
   ## Brief change log
   check upgrade from old plan to new plan for compaction
   
   ## Verify this pull request
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhaomin1423 closed pull request #1419: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
zhaomin1423 closed pull request #1419: [HUDI-400]check upgrade from old plan to 
new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1419
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-19 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063033#comment-17063033
 ] 

Udit Mehrotra commented on HUDI-724:


Thanks Feichi for putting this out! [~vinoth] [~vbalaji] Feichi is working on 
adopting Hudi for one of the AWS teams. He is seeing a significant performance 
improvement from parallelizing the listing of small files. We did turn on the 
embedded timeline server, but we guess it did not help much, probably because 
nothing is cached the first time? Would like to get your thoughts on whether 
this parallelization is safe to do.

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 10m
>  Remaining Estimate: 47h 50m
>
> When writing data, a gap was observed between spark stages. Tracking down 
> where the time was spent on the spark driver pointed to the get-small-files 
> operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a normal for-loop to get the list of small files for all partitions 
> that the load is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes sit idle waiting for 
> tasks.
> Since the partitions don't affect each other, the get-small-files 
> operations can be parallelized. The change I made is to pass the 
> JavaSparkContext to the UpsertPartitioner, create an RDD of the partitions, 
> and eventually fan the get-small-files operations out to multiple tasks.
>  
> Screenshots attached: the gap without the improvement, and the spark stage 
> with the improvement (no gap).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ffcchi opened a new pull request #1421: [HUDI-724] Parallelize getSmallFiles for partitions

2020-03-19 Thread GitBox
ffcchi opened a new pull request #1421: [HUDI-724] Parallelize getSmallFiles 
for partitions
URL: https://github.com/apache/incubator-hudi/pull/1421
 
 
   
   ## What is the purpose of the pull request
   
   *parallelizing the operation of getting small files for partitions when 
constructing the UpsertPartitioner for performance improvement*
   
   ## Brief change log
   
   *(for example:)*
 - *pass through JavaSparkContext*
 - *use RDD for parallelism*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601488498
 
 
   > @leesf this has happened enough times now, that we probably need a Code 
Review guide as well? wdyt
   
   Agree, I would like to update 
https://cwiki.apache.org/confluence/display/HUDI/Committer+On-boarding+Guide to 
add a Code Review guide for new committers, wdyt?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r39540
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
 
 Review comment:
   Most of it is supporting cast for HFile read and write. So, not sure how 
much re-use we could do here. Not too much value in modularizing it, because 
TestParquetInLining may have a lot of supporting cast for Parquet. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395401242
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/inline/fs/TestHFileReadWriteFlow.java
 ##
 @@ -0,0 +1,243 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.Set;
+import java.util.UUID;
+
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.FILE_SCHEME;
+import static org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.RANDOM;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getPhantomFile;
+import static 
org.apache.hudi.utilities.inline.fs.FileSystemTestUtils.getRandomOuterInMemPath;
+
+/**
+ * Tests {@link InlineFileSystem} using HFile writer and reader.
+ */
+public class TestHFileReadWriteFlow {
+
+  private final Configuration inMemoryConf;
+  private final Configuration inlineConf;
+  private final int minBlockSize = 1024;
+  private static final String LOCAL_FORMATTER = "%010d";
+  private int maxRows = 100 + RANDOM.nextInt(1000);
+  private Path generatedPath;
+
+  public TestHFileReadWriteFlow() {
+inMemoryConf = new Configuration();
+inMemoryConf.set("fs." + InMemoryFileSystem.SCHEME + ".impl", 
InMemoryFileSystem.class.getName());
+inlineConf = new Configuration();
+inlineConf.set("fs." + InlineFileSystem.SCHEME + ".impl", 
InlineFileSystem.class.getName());
+  }
+
+  @After
+  public void teardown() throws IOException {
+if (generatedPath != null) {
+  File filePath = new 
File(generatedPath.toString().substring(generatedPath.toString().indexOf(':') + 
1));
+  if (filePath.exists()) {
+FileSystemTestUtils.deleteFile(filePath);
+  }
+}
+  }
+
+  @Test
+  public void testSimpleInlineFileSystem() throws IOException {
+Path outerInMemFSPath = getRandomOuterInMemPath();
+Path outerPath = new Path(FILE_SCHEME + 
outerInMemFSPath.toString().substring(outerInMemFSPath.toString().indexOf(':')));
+generatedPath = outerPath;
+CacheConfig cacheConf = new CacheConfig(inMemoryConf);
+FSDataOutputStream fout = createFSOutput(outerInMemFSPath, inMemoryConf);
+HFileContext meta = new HFileContextBuilder()
+.withBlockSize(minBlockSize)
+.build();
+HFile.Writer writer = HFile.getWriterFactory(inMemoryConf, cacheConf)
+.withOutputStream(fout)
+.withFileContext(meta)
+.withComparator(new KeyValue.KVComparator())
+.create();
+
+writeRecords(writer);
+fout.close();
+
+byte[] inlineBytes = getBytesToInline(outerInMemFSPath);
+long startOffset = generateOuterFile(outerPath, inlineBytes);
+
+long inlineLength = inlineBytes.length;
+
+// Generate phantom inline file
+Path inlinePath = getPhantomFile(outerPath, startOffset, inlineLength);
+
+InlineFileSystem inlineFileSystem = (InlineFileSystem) 
inlinePath.getFileSystem(inlineConf);
+FSDataInputStream fin = inlineFileSystem.open(inlinePath);
+
+HFile.Reader reader = HFile.createReader(inlineFileSystem, inlinePath, 
cacheConf, inlineConf);
+// Load up the index.
+reader.loadFileInfo();
+// Get a scanner that caches and that does not use pread.
+HFileScanner scanner = 

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395401013
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link 
FileSystem} is used only in read path and does not support
+ * any write apis.
+ * 
+ * - Reading an inlined file at a given offset, length, read it out as if it 
were an independent file of that length
+ * - Inlined path is of the form 
"inlinefs:///path/to/outer/file//inline_file/?start_offset==
+ * 
+ * TODO: The reader/writer may try to use relative paths based on the 
inlinepath and it may not work. Need to handle
+ * this gracefully eg. the parquet summary metadata reading. TODO: If this 
shows promise, also support directly writing
+ * the inlined file to the underneath file without buffer
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+super.initialize(name, conf);
+this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws 
IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
+FileSystem outerFs = outerPath.getFileSystem(conf);
+FSDataInputStream outerStream = outerFs.open(outerPath, bufferSize);
+return new InlineFsDataInputStream(InLineFSUtils.startOffset(inlinePath), 
outerStream, InLineFSUtils.length(inlinePath));
+  }
+
+  @Override
+  public boolean exists(Path f) {
+try {
+  return getFileStatus(f) != null;
+} catch (Exception e) {
+  return false;
+}
+  }
+
+  @Override
+  public FileStatus getFileStatus(Path inlinePath) throws IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
+FileSystem outerFs = outerPath.getFileSystem(conf);
+FileStatus status = outerFs.getFileStatus(outerPath);
+FileStatus toReturn = new FileStatus(InLineFSUtils.length(inlinePath), 
status.isDirectory(), status.getReplication(), status.getBlockSize(),
 
 Review comment:
   have commented above on this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395400942
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path
+ * and does not support any write apis.
+ *
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ *
+ * TODO: The reader/writer may try to use relative paths based on the inline path and it may not work. Need to handle
+ * this gracefully, e.g. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffering
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
+  private Configuration conf = null;
+
+  @Override
+  public void initialize(URI name, Configuration conf) throws IOException {
+super.initialize(name, conf);
+this.conf = conf;
+  }
+
+  @Override
+  public URI getUri() {
+return URI.create(getScheme());
+  }
+
+  public String getScheme() {
+return SCHEME;
+  }
+
+  @Override
+  public FSDataInputStream open(Path inlinePath, int bufferSize) throws 
IOException {
+Path outerPath = InLineFSUtils.getOuterfilePathFromInlinePath(inlinePath, 
getScheme());
 
 Review comment:
   Within InlineFileSystem, we need to use Path in most places. So even if we 
introduce a POJO, I feel it may not be of much use. 1: all arguments to this 
class are Path, so we have to convert to the POJO before we do anything, and 
we can't store it as an instance variable either. 2: Introducing a POJO might 
help in fetching the start offset and length, which is done only once in this 
class. Not sure how much value it adds to define a POJO, but open to making 
the change if you feel so.
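
For reference, the kind of value object being debated might look like the sketch below; InlinePathInfo and its field names are hypothetical, not from the PR:

{code:java}
import org.apache.hadoop.fs.Path;

// Hypothetical POJO for the attributes encoded in an inline path; the PR
// instead re-parses the Path where needed, for the reasons given above.
final class InlinePathInfo {
  final Path outerPath;     // e.g. file:/file1
  final String outerScheme; // e.g. "file"
  final long startOffset;   // byte offset of the embedded file in the outer file
  final long length;        // byte length of the embedded file

  InlinePathInfo(Path outerPath, String outerScheme, long startOffset, long length) {
    this.outerPath = outerPath;
    this.outerScheme = outerScheme;
    this.startOffset = startOffset;
    this.length = length;
  }
}
{code}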


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer offset not handled correctly

2020-03-19 Thread GitBox
lamber-ken commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer offset 
not handled correctly
URL: https://github.com/apache/incubator-hudi/pull/1377#issuecomment-601486061
 
 
   @garyli1019 thanks very much for your detailed comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395399661
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path
+ * and does not support any write apis.
+ *
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ *
+ * TODO: The reader/writer may try to use relative paths based on the inline path and it may not work. Need to handle
+ * this gracefully, e.g. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
+ * the inlined file to the underneath file without buffering
+ */
+public class InlineFileSystem extends FileSystem {
+
+  static final String SCHEME = "inlinefs";
 
 Review comment:
   Yes, I have fixed the scheme issue. Can you check once I have addressed all 
comments?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395399488
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InlineFileSystem.java
 ##
 @@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.permission.FsPermission;
+import org.apache.hadoop.util.Progressable;
+
+import java.io.IOException;
+import java.net.URI;
+
+/**
+ * Enables reading any inline file at a given offset and length. This {@link FileSystem} is used only in read path
+ * and does not support any write apis.
+ *
+ * - Reading an inlined file at a given offset, length, read it out as if it were an independent file of that length
+ * - Inlined path is of the form "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ *
+ * TODO: The reader/writer may try to use relative paths based on the inline path and it may not work. Need to handle
+ * this gracefully, e.g. the parquet summary metadata reading. TODO: If this shows promise, also support directly writing
 
 Review comment:
   Actually, I wanted to sync up with you on this. Will sync up offline.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395398271
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format:
+ * "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
+);
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String 
outerScheme) {
 
 Review comment:
   Actually, you are right. Will fix it.
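
A worked example of the construction side, following the quoted getEmbeddedInLineFilePath and the javadoc's sample values (a sketch, not a test from the PR):

{code:java}
import org.apache.hadoop.fs.Path;

public class EmbeddedPathDemo {
  public static void main(String[] args) {
    // Inputs from the javadoc example: outer file file:/file1, scheme "file",
    // startOffset 20, length 40.
    Path outerPath = new Path("file:/file1");
    String origScheme = "file";
    long startOffset = 20;
    long length = 40;

    // Strip the scheme prefix, as the quoted substring(indexOf(":") + 1) does.
    String subPath = outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);

    // Assemble the inline path the same way the quoted method does.
    Path inlinePath = new Path("inlinefs://" + subPath + "/" + origScheme
        + "/inline_file/?start_offset=" + startOffset + "&length=" + length);

    // Expected (per the javadoc):
    // inlinefs:/file1/file/inline_file/?start_offset=20&length=40
    System.out.println(inlinePath);
  }
}
{code}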


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395395831
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format:
+ * "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
+return new Path(
+InlineFileSystem.SCHEME + "://" + subPath + "/" + origScheme + "/" + 
INLINE_FILE_STR + "/"
++ "?" + START_OFFSET_STR + EQUALS_STR + inLineStartOffset + "&" + 
LENGTH_STR + EQUALS_STR + inLineLength
+);
+  }
+
+  /**
+   * Eg input : "inlinefs:/file1/file/inline_file/?start_offset=20&length=40".
+   * Output : "file:/file1"
+   *
+   * @param inlinePath
+   * @param outerScheme
+   * @return
+   */
+  public static Path getOuterfilePathFromInlinePath(Path inlinePath, String 
outerScheme) {
+String scheme = inlinePath.getParent().getParent().getName();
+Path basePath = inlinePath.getParent().getParent().getParent();
+return new Path(basePath.toString().replaceFirst(outerScheme, scheme));
 
 Review comment:
   Sorry, I don't follow. I haven't used indexOf() in this method at all, so I 
am not sure I get your feedback.
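
To make the parent-traversal concrete, here is a worked example against the quoted method, using the javadoc's sample path (a sketch, not a test from the PR):

{code:java}
import org.apache.hadoop.fs.Path;

public class OuterPathParseDemo {
  public static void main(String[] args) {
    Path inlinePath = new Path("inlinefs:/file1/file/inline_file/?start_offset=20&length=40");

    // The grandparent directory name carries the outer scheme -> "file"
    String scheme = inlinePath.getParent().getParent().getName();

    // The great-grandparent is the outer file path, still under the inlinefs scheme
    Path basePath = inlinePath.getParent().getParent().getParent();

    // Swap "inlinefs" for the recovered scheme; expected (per the javadoc): file:/file1
    Path outerPath = new Path(basePath.toString().replaceFirst("inlinefs", scheme));
    System.out.println(outerPath);
  }
}
{code}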


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395395084
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format:
+ * "inlinefs://<path_to_outer_file>/<outer_fs_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
 
 Review comment:
   Nothing significant. I am just using it to delimit regular path info from 
inline-related attributes.
   Eg: "inlinefs://<path_to_outer_file>/s3a/inline_file/?start_offset=20&length=40". 
If you feel it's not adding much value, I can remove it.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-19 Thread GitBox
satishkotha commented on a change in pull request #1368: [HUDI-650] Modify 
handleUpdate path to validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368#discussion_r395389042
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 ##
 @@ -236,6 +233,10 @@ private boolean writeUpdateRecord(HoodieRecord 
hoodieRecord, Option hoodieRecord, 
Option indexedRecord) {
 Option recordMetadata = hoodieRecord.getData().getMetadata();
+if (!partitionPath.equals(hoodieRecord.getPartitionPath())) {
+  writeStatus.markFailure(hoodieRecord, new 
HoodieUpsertException("mismatched partition path"), recordMetadata);
 
 Review comment:
   done
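
For context, the "done" covers including the expected and actual paths in the failure message; the updated guard might read roughly like this sketch against the diff above (the merged wording may differ):

{code:java}
// Sketch only: same guard as the diff above, with expected/actual paths added.
if (!partitionPath.equals(hoodieRecord.getPartitionPath())) {
  HoodieUpsertException failureEx = new HoodieUpsertException(String.format(
      "mismatched partition path: expected=%s, actual=%s",
      partitionPath, hoodieRecord.getPartitionPath()));
  writeStatus.markFailure(hoodieRecord, failureEx, recordMetadata);
  return false;
}
{code}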


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-19 Thread GitBox
satishkotha commented on a change in pull request #1368: [HUDI-650] Modify 
handleUpdate path to validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368#discussion_r395389012
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 ##
 @@ -295,6 +294,13 @@ private Writer createLogWriter(Option 
fileSlice, String baseCommitTim
   }
 
   private void writeToBuffer(HoodieRecord record) {
+if (!partitionPath.equals(record.getPartitionPath())) {
+  writeStatus.markFailure(record, new HoodieUpsertException("mismatched 
partition path"),
 
 Review comment:
   added


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on issue #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
satishkotha commented on issue #1396: [HUDI-687] Stop incremental reader on RO 
table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#issuecomment-601473026
 
 
   > @satishkotha : Some minor comments. Will approve once you reply/address 
them. Let's also wait for @vinothchandar to take a pass
   
   @bvaradar Sounds good. I addressed your review comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395386191
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieMergeOnReadTestUtils.java
 ##
 @@ -47,27 +48,35 @@
 
   public static List getRecordsUsingInputFormat(List 
inputPaths, String basePath) {
 JobConf jobConf = new JobConf();
+return getRecordsUsingInputFormat(inputPaths, basePath, jobConf, new 
HoodieParquetRealtimeInputFormat());
+  }
+
+  public static List getRecordsUsingInputFormat(List 
inputPaths,
+   String basePath,
+   JobConf jobConf,
+   
HoodieParquetInputFormat inputFormat) {
 Schema schema = HoodieAvroUtils.addMetadataFields(
 new 
Schema.Parser().parse(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA));
-HoodieParquetRealtimeInputFormat inputFormat = new 
HoodieParquetRealtimeInputFormat();
 setPropsForInputFormat(inputFormat, jobConf, schema, basePath);
 return inputPaths.stream().map(path -> {
   setInputPath(jobConf, path);
   List records = new ArrayList<>();
   try {
 List splits = Arrays.asList(inputFormat.getSplits(jobConf, 
1));
-RecordReader recordReader = inputFormat.getRecordReader(splits.get(0), 
jobConf, null);
-Void key = (Void) recordReader.createKey();
-ArrayWritable writable = (ArrayWritable) recordReader.createValue();
-while (recordReader.next(key, writable)) {
-  GenericRecordBuilder newRecord = new GenericRecordBuilder(schema);
-  // writable returns an array with [field1, field2, 
_hoodie_commit_time,
-  // _hoodie_commit_seqno]
-  Writable[] values = writable.get();
-  schema.getFields().forEach(field -> {
-newRecord.set(field, values[2]);
-  });
-  records.add(newRecord.build());
+for (InputSplit split : splits) {
 
 Review comment:
   yes, correct.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395386043
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/string/TestHoodieActiveTimeline.java
 ##
 @@ -160,6 +159,8 @@ public void testTimelineOperations() {
 .filterCompletedInstants().findInstantsInRange("04", 
"11").getInstants().map(HoodieInstant::getTimestamp));
 HoodieTestUtils.assertStreamEquals("", Stream.of("09", "11"), 
timeline.getCommitTimeline().filterCompletedInstants()
 .findInstantsAfter("07", 
2).getInstants().map(HoodieInstant::getTimestamp));
+HoodieTestUtils.assertStreamEquals("", Stream.of("01", "03", "05"), 
timeline.getCommitTimeline().filterCompletedInstants()
 
 Review comment:
   fixed


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395386006
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -80,6 +85,12 @@
 
 public class TestMergeOnReadTable extends HoodieClientTestHarness {
 
+  private HoodieParquetInputFormat roInputFormat;
 
 Review comment:
   Thanks for the suggestion. I modified one of the existing tests on the COW 
table. Please take a look. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-601469123
 
 
   No, thank you.. This kind of stuff gives me energy to keep pushing more :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17063001#comment-17063001
 ] 

Vinoth Chandar commented on HUDI-648:
-

[~liujinhui] Actually, skipping should be supported today with the following 
deltastreamer config.. are you using deltastreamer or the spark datasource?

{code}
@Parameter(names = {"--commit-on-errors"}, description = "Commit even when some 
records failed to be written")
public Boolean commitOnErrors = false;
{code}

> Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction 
> writes
> 
>
> Key: HUDI-648
> URL: https://issues.apache.org/jira/browse/HUDI-648
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> We would like a way to hand the erroring records from writing or compaction 
> back to the users, in a separate table or log. This needs to work generically 
> across all the different writer paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-19 Thread Feichi Feng (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feichi Feng updated HUDI-724:
-
Description: 
When writing data, a gap was observed between spark stages. By tracking down 
where the time was spent on the spark driver, it turned out to be the 
get-small-files operation for partitions.

When creating the UpsertPartitioner and trying to assign insert records, it 
uses a plain for-loop to get the list of small files for all partitions that 
the load is going to write data to, and the process is very slow when there 
are a lot of partitions to go through. While the operation is running on the 
spark driver process, all other worker nodes are sitting idle waiting for 
tasks.

All those partitions are independent of each other, so the get-small-files 
operations can be parallelized. The change I made is to pass the 
JavaSparkContext to the UpsertPartitioner, create an RDD for the partitions, 
and eventually send the get-small-files operations to multiple tasks.

 

Screenshots attached for:

the gap without the improvement

the spark stage with the improvement (no gap)
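
As a sketch of the described change (the names, the parallelism cap, and the helper are illustrative, not the actual patch):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

class SmallFileLookup {

  // Distribute the per-partition small-file scans to executors instead of
  // looping over partitions on the driver.
  static Map<String, List<String>> smallFilesByPartition(
      JavaSparkContext jsc, List<String> partitionPaths) {
    int parallelism = Math.max(1, Math.min(partitionPaths.size(), 200));
    return jsc.parallelize(partitionPaths, parallelism)
        .mapToPair(partition -> new Tuple2<>(partition, findSmallFiles(partition)))
        .collectAsMap();
  }

  // Placeholder for the existing driver-side lookup, now run on executors.
  static List<String> findSmallFiles(String partition) {
    return Collections.emptyList(); // stand-in: real logic lists small files under the partition
  }
}
{code}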

  was:
When writing data, a gap was observed between spark stages. By tracking down 
where the time was spent on the spark driver, it's get-small-files operation 
for partitions.

When creating the UpsertPartitioner and trying to assign insert records, it 
uses a normal for-loop for get the list of small files for all partitions that 
the load is going to load data to, and the process is very slow when there are 
a lot of partitions to go through. While the operation is running on spark 
driver process, all other worker nodes are sitting idle waiting for tasks.

For all those partitions, they don't affect each other, so the get-small-files 
operations can be parallelized. The change I made is to pass the 
JavaSparkContext to the UpsertPartitioner, and create RDD for the partitions 
and eventually send the get small files operations to multiple tasks.


> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When writing data, a gap was observed between spark stages. By tracking down 
> where the time was spent on the spark driver, it turned out to be the 
> get-small-files operation for partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a plain for-loop to get the list of small files for all partitions 
> that the load is going to write data to, and the process is very slow when 
> there are a lot of partitions to go through. While the operation is running 
> on the spark driver process, all other worker nodes are sitting idle waiting 
> for tasks.
> All those partitions are independent of each other, so the 
> get-small-files operations can be parallelized. The change I made is to pass 
> the JavaSparkContext to the UpsertPartitioner, create an RDD for the 
> partitions, and eventually send the get-small-files operations to multiple 
> tasks.
>  
> Screenshots attached for: 
> the gap without the improvement
> the spark stage with the improvement (no gap)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] garyli1019 closed pull request #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
garyli1019 closed pull request #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
garyli1019 commented on issue #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-601468645
 
 
   OK, I will make a separate PR for the tool. Thanks to everyone who 
participated in this long discussion...


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-601468133
 
 
   > Do you see any other use case the reverse search would be useful? 
   
   No, not at the moment.. We can close this PR out if you agree. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-03-19 Thread Feichi Feng (Jira)
Feichi Feng created HUDI-724:


 Summary: Parallelize GetSmallFiles For Partitions
 Key: HUDI-724
 URL: https://issues.apache.org/jira/browse/HUDI-724
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Performance, Writer Core
Reporter: Feichi Feng
 Attachments: gap.png, nogapAfterImprovement.png

When writing data, a gap was observed between spark stages. By tracking down 
where the time was spent on the spark driver, it turned out to be the 
get-small-files operation for partitions.

When creating the UpsertPartitioner and trying to assign insert records, it 
uses a plain for-loop to get the list of small files for all partitions that 
the load is going to write data to, and the process is very slow when there 
are a lot of partitions to go through. While the operation is running on the 
spark driver process, all other worker nodes are sitting idle waiting for 
tasks.

All those partitions are independent of each other, so the get-small-files 
operations can be parallelized. The change I made is to pass the 
JavaSparkContext to the UpsertPartitioner, create an RDD for the partitions, 
and eventually send the get-small-files operations to multiple tasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of HiveDriver for DDL statements for Hive 2.x

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of 
HiveDriver for DDL statements for Hive 2.x
URL: https://github.com/apache/incubator-hudi/pull/1416#discussion_r395341832
 
 

 ##
 File path: 
hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
 ##
 @@ -363,4 +404,51 @@ public void testMultiPartitionKeySync() throws Exception {
 assertEquals("The last commit that was sycned should be updated in the 
TBLPROPERTIES", commitTime,
 hiveClient.getLastCommitTimeSynced(hiveSyncConfig.tableName).get());
   }
+
+  @Test
+  public void testSchemeFromMOR() throws Exception {
+TestUtil.hiveSyncConfig.useJdbc = this.useJdbc;
+String commitTime = "100";
+String snapshotTableName = TestUtil.hiveSyncConfig.tableName + 
HiveSyncTool.SUFFIX_SNAPSHOT_TABLE;
+TestUtil.createMORTable(commitTime, "", 5, false);
+HoodieHiveClient hiveClientRT =
+new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), 
TestUtil.fileSystem);
+
+assertFalse("Table " + TestUtil.hiveSyncConfig.tableName + 
HiveSyncTool.SUFFIX_SNAPSHOT_TABLE
++ " should not exist initially", 
hiveClientRT.doesTableExist(snapshotTableName));
+
+// Lets do the sync
+HiveSyncTool tool = new HiveSyncTool(TestUtil.hiveSyncConfig, 
TestUtil.getHiveConf(), TestUtil.fileSystem);
+tool.syncHoodieTable();
+
+assertTrue("Table " + TestUtil.hiveSyncConfig.tableName + 
HiveSyncTool.SUFFIX_SNAPSHOT_TABLE
++ " should exist after sync completes", 
hiveClientRT.doesTableExist(snapshotTableName));
+
+// Schema being read from compacted base files
+assertEquals("Hive Schema should match the table schema + partition 
field", hiveClientRT.getTableSchema(snapshotTableName).size(),
+SchemaTestUtil.getSimpleSchema().getFields().size() + 1);
 
 Review comment:
   What is the reason for adding 1 to the schema fields count? Is it to account 
for the partition field? Can you add a comment here?
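
One way the requested comment might read, as a sketch over the quoted assertion (not the merged code):

{code:java}
// The synced Hive table carries the data columns plus the partition column,
// hence the "+ 1" against the raw Avro schema's field count.
assertEquals("Hive Schema should match the table schema + partition field",
    hiveClientRT.getTableSchema(snapshotTableName).size(),
    SchemaTestUtil.getSimpleSchema().getFields().size() + 1);
{code}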


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of HiveDriver for DDL statements for Hive 2.x

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of 
HiveDriver for DDL statements for Hive 2.x
URL: https://github.com/apache/incubator-hudi/pull/1416#discussion_r395337083
 
 

 ##
 File path: 
hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -198,7 +198,8 @@ private String getPartitionClause(String partition) {
 for (String partition : partitions) {
   String partitionClause = getPartitionClause(partition);
   Path partitionPath = FSUtils.getPartitionPath(syncConfig.basePath, 
partition);
-  String fullPartitionPath = 
partitionPath.toUri().getScheme().equals(StorageSchemes.HDFS.getScheme())
+  String partitionScheme = partitionPath.toUri().getScheme();
+  String fullPartitionPath = 
StorageSchemes.HDFS.getScheme().equals(partitionScheme)
 
 Review comment:
   No, I think it does support cloud stores. This special handling was part of 
https://jira.apache.org/jira/browse/HUDI-325  to get correct behavior for HDFS  
for alter partitions.  This should be fine.  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062945#comment-17062945
 ] 

Vinoth Chandar commented on HUDI-686:
-

[~vbalaji] [~shivnarayan] Please review this information closely.. In short, we 
can support an indexing option that eliminates memory caching, but not sure if 
that will outperform the current BloomIndex. Is it worth cleaning this 
implementation up and checking it in? 

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062940#comment-17062940
 ] 

Vinoth Chandar commented on HUDI-686:
-

Timing the individual stages 

Roughly, here is how it looks..  Not sure how much more we can optimize this 
further, since the time is mostly spent inside parquet reading.

The metadata read cost, i.e. reading the footers, dominates the first stage:
{code:java}
System.err.format("LazyRangeBloomChecker: %d, %d, %d, %d, %d, %d \n",
totalCount, totalMatches, totalTimeNs, 
totalMetadataReadTimeNs, totalRangeCheckTimeNs, totalBloomCheckTimeNs);

LazyRangeBloomChecker: 18632, 5068, 481673381, 439685698, 5872344, 26426499 
LazyRangeBloomChecker: 29312, 0, 397373925, 361189515, 12336753, 3205152 
LazyRangeBloomChecker: 36422, 0, 395838972, 364965143, 6870027, 3088563 
LazyRangeBloomChecker: 32698, 21252, 502987672, 374374961, 15190330, 94478078 
LazyRangeBloomChecker: 36633, 0, 420441840, 388992165, 7971801, 5222196 
LazyRangeBloomChecker: 35919, 35919, 547982738, 382770288, 17042127, 130529090 
LazyRangeBloomChecker: 26448, 26448, 673972735, 497887634, 12918131, 150188682 
LazyRangeBloomChecker: 29827, 25338, 739789660, 568953445, 14633164, 140977007 
LazyRangeBloomChecker: 40694, 40694, 611867636, 364297491, 20609305, 206717514 
LazyRangeBloomChecker: 41515, 41515, 754657982, 379440879, 18670251, 337857948 
LazyRangeBloomChecker: 46672, 46672, 761187684, 364060859, 18887398, 359483525 
LazyRangeBloomChecker: 26931, 2360, 296764733, 275044606, 3439711, 11417543 
LazyRangeBloomChecker: 41863, 20714, 831527864, 656157121, 13784027, 143710665 
LazyRangeBloomChecker: 36429, 0, 181597122, 157965082, 5342164, 3072219 
LazyRangeBloomChecker: 45618, 0, 180005379, 154248797, 6254112, 3332647 
LazyRangeBloomChecker: 60916, 60916, 730395000, 244153313, 24926738, 439724359 
 {code}
The reading of the actual keys themselves dominates the second:
{code:java}
System.err.println("LazyKeyChecker: " + totalTimeNs + "," + totalCount + "," + 
totalReadTimeNs);

LazyKeyChecker: 32576530,2119,30998522
LazyKeyChecker: 39189497,3415,3074
LazyKeyChecker: 36683534,3726,33878272
LazyKeyChecker: 293554458,38523,264821882
LazyKeyChecker: 297414709,39263,268215304
LazyKeyChecker: 212946950,65474,169525572
LazyKeyChecker: 1047598045,65998,1003946915
LazyKeyChecker: 1048062757,66734,1003969635
LazyKeyChecker: 1041348181,74948,992863777
[Stage 141:== {code}

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] garyli1019 commented on issue #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
garyli1019 commented on issue #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-601416005
 
 
   > Given that, do we still need the ability to search for the checkpoints in 
reverse time order? 
   
   Maybe not anymore? If I have a tool to tell me where the checkpoint is, I 
can use the `--checkpoint` field directly. Do you see any other use case where 
the reverse search would be useful? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer offset not handled correctly

2020-03-19 Thread GitBox
garyli1019 commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer offset 
not handled correctly
URL: https://github.com/apache/incubator-hudi/pull/1377#issuecomment-601412133
 
 
   @vinothchandar I thought the empty checkpoint was created by a bug before, 
but if the empty checkpoint is intended, then we need to somehow handle it 
like this PR does.  
   If resetting the checkpoint when an empty checkpoint is present makes sense 
to you guys, then I think this PR should be good to go.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1416: [HUDI-717] Fixed usage of HiveDriver for DDL statements for Hive 2.x

2020-03-19 Thread GitBox
prashantwason commented on a change in pull request #1416: [HUDI-717] Fixed 
usage of HiveDriver for DDL statements for Hive 2.x
URL: https://github.com/apache/incubator-hudi/pull/1416#discussion_r395308902
 
 

 ##
 File path: 
hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -198,7 +198,8 @@ private String getPartitionClause(String partition) {
 for (String partition : partitions) {
   String partitionClause = getPartitionClause(partition);
   Path partitionPath = FSUtils.getPartitionPath(syncConfig.basePath, 
partition);
-  String fullPartitionPath = 
partitionPath.toUri().getScheme().equals(StorageSchemes.HDFS.getScheme())
+  String partitionScheme = partitionPath.toUri().getScheme();
+  String fullPartitionPath = 
StorageSchemes.HDFS.getScheme().equals(partitionScheme)
 
 Review comment:
   The original code itself does not seem to support non-HDFS sources. Are 
non-HDFS cloud stores even supported by Hive?
   
   This fix has only changed the order of comparison, as getScheme() may return 
NULL when there is no scheme prefix, like in the unit tests. The boolean result 
of the comparison remains the same.
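
The ordering trick being described is the usual constant-first equals idiom; a self-contained illustration:

{code:java}
public class SchemeCompareDemo {
  public static void main(String[] args) {
    String hdfsScheme = "hdfs";
    String partitionScheme = null; // e.g. a Path with no scheme prefix, as in the unit tests

    // The old order, partitionScheme.equals(hdfsScheme), would throw a
    // NullPointerException here. With the known-non-null constant as the
    // receiver, the comparison is safe and simply yields false.
    boolean isHdfs = hdfsScheme.equals(partitionScheme);
    System.out.println(isHdfs); // prints "false" instead of throwing
  }
}
{code}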
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1368: [HUDI-650] Modify 
handleUpdate path to validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368#discussion_r395228837
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
 ##
 @@ -295,6 +294,13 @@ private Writer createLogWriter(Option 
fileSlice, String baseCommitTim
   }
 
   private void writeToBuffer(HoodieRecord record) {
+if (!partitionPath.equals(record.getPartitionPath())) {
+  writeStatus.markFailure(record, new HoodieUpsertException("mismatched 
partition path"),
 
 Review comment:
   Can you add what the expected and actual partition path was in the exception 
message ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1368: [HUDI-650] Modify handleUpdate path to validate partitionPath

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1368: [HUDI-650] Modify 
handleUpdate path to validate partitionPath
URL: https://github.com/apache/incubator-hudi/pull/1368#discussion_r395229769
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java
 ##
 @@ -236,6 +233,10 @@ private boolean writeUpdateRecord(HoodieRecord 
hoodieRecord, Option hoodieRecord, 
Option indexedRecord) {
 Option recordMetadata = hoodieRecord.getData().getMetadata();
+if (!partitionPath.equals(hoodieRecord.getPartitionPath())) {
+  writeStatus.markFailure(hoodieRecord, new 
HoodieUpsertException("mismatched partition path"), recordMetadata);
 
 Review comment:
   Same here. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint generator helper tool

2020-03-19 Thread GitBox
vinothchandar commented on issue #1362: [WIP]HUDI-644 Implement checkpoint 
generator helper tool
URL: https://github.com/apache/incubator-hudi/pull/1362#issuecomment-601321422
 
 
   Given that, do we still need the ability to search for the checkpoints in 
reverse time order? tbh I don't see value in it, since there cannot be 
multiple writers to a hudi table anyway. 
   
   Maybe we can think about a `CheckPointProvider` abstraction where, if 
DeltaStreamer cannot find a checkpoint from the last delta commit/commit, it 
invokes `checkpointProvider.getCheckpoint()`. We can actually introduce that in 
this PR and have two implementations, as sketched below:
   
   1) (default) NoOpCheckpointProvider (throws an error if it cannot find a 
checkpoint) 
   2) ScanOlderCommitsCheckpointProvider (what you have now) 
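
A minimal Java sketch of that abstraction; only the two implementation names above come from this comment, everything else is illustrative:

{code:java}
import java.util.Optional;

// Sketch of the proposed abstraction, consulted only when DeltaStreamer
// finds no checkpoint on the latest delta commit/commit.
interface CheckpointProvider {
  Optional<String> getCheckpoint();
}

// Default: fail fast when no checkpoint can be found anywhere.
class NoOpCheckpointProvider implements CheckpointProvider {
  @Override
  public Optional<String> getCheckpoint() {
    throw new IllegalStateException("no checkpoint found on the latest commit");
  }
}

// Scan completed commits in reverse time order for the last recorded checkpoint.
class ScanOlderCommitsCheckpointProvider implements CheckpointProvider {
  @Override
  public Optional<String> getCheckpoint() {
    // stand-in: walk commit metadata newest-to-oldest and return the first
    // checkpoint value found
    return Optional.empty();
  }
}
{code}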
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395181942
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieMergeOnReadTestUtils.java
 ##
 @@ -47,27 +48,35 @@
 
   public static List getRecordsUsingInputFormat(List 
inputPaths, String basePath) {
 JobConf jobConf = new JobConf();
+return getRecordsUsingInputFormat(inputPaths, basePath, jobConf, new 
HoodieParquetRealtimeInputFormat());
+  }
+
+  public static List getRecordsUsingInputFormat(List 
inputPaths,
+   String basePath,
+   JobConf jobConf,
+   
HoodieParquetInputFormat inputFormat) {
 Schema schema = HoodieAvroUtils.addMetadataFields(
 new 
Schema.Parser().parse(HoodieTestDataGenerator.TRIP_EXAMPLE_SCHEMA));
-HoodieParquetRealtimeInputFormat inputFormat = new 
HoodieParquetRealtimeInputFormat();
 setPropsForInputFormat(inputFormat, jobConf, schema, basePath);
 return inputPaths.stream().map(path -> {
   setInputPath(jobConf, path);
   List records = new ArrayList<>();
   try {
 List splits = Arrays.asList(inputFormat.getSplits(jobConf, 
1));
-RecordReader recordReader = inputFormat.getRecordReader(splits.get(0), 
jobConf, null);
-Void key = (Void) recordReader.createKey();
-ArrayWritable writable = (ArrayWritable) recordReader.createValue();
-while (recordReader.next(key, writable)) {
-  GenericRecordBuilder newRecord = new GenericRecordBuilder(schema);
-  // writable returns an array with [field1, field2, 
_hoodie_commit_time,
-  // _hoodie_commit_seqno]
-  Writable[] values = writable.get();
-  schema.getFields().forEach(field -> {
-newRecord.set(field, values[2]);
-  });
-  records.add(newRecord.build());
+for (InputSplit split : splits) {
 
 Review comment:
   Was this working because these tests only dealt with 1 file-group? I can 
see count-based assertions using records collected from this method 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395187395
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -80,6 +85,12 @@
 
 public class TestMergeOnReadTable extends HoodieClientTestHarness {
 
+  private HoodieParquetInputFormat roInputFormat;
 
 Review comment:
   We can add similar test-cases for Copy-On-Write table with RO inputformat 
alone to make it uniform. @satishkotha : Can you look at it as part of this PR 
or a follow-up ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r395185815
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/string/TestHoodieActiveTimeline.java
 ##
 @@ -160,6 +159,8 @@ public void testTimelineOperations() {
 .filterCompletedInstants().findInstantsInRange("04", 
"11").getInstants().map(HoodieInstant::getTimestamp));
 HoodieTestUtils.assertStreamEquals("", Stream.of("09", "11"), 
timeline.getCommitTimeline().filterCompletedInstants()
 .findInstantsAfter("07", 
2).getInstants().map(HoodieInstant::getTimestamp));
+HoodieTestUtils.assertStreamEquals("", Stream.of("01", "03", "05"), 
timeline.getCommitTimeline().filterCompletedInstants()
 
 Review comment:
   nit: Add non-empty message for the assertion.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062781#comment-17062781
 ] 

Vinoth Chandar commented on HUDI-686:
-

Running a local microbenchmark, I actually found that the extra shuffling in 
the V2 implementation (it shuffles the input twice, vs caching once and 
shuffling just the keys in the current impl) makes it a tad slower.

*BloomIndexV2*

*!Screen Shot 2020-03-19 at 10.15.10 AM.png!*

*BloomIndex*

 

*!image-2020-03-19-10-17-43-048.png!*
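
To make the shuffle comparison concrete, here is a minimal, self-contained 
Spark sketch of the two patterns (toy stand-in data; illustrative only, not 
the actual index code):
{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

import scala.Tuple2;

public class ShuffleCountSketch {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext(
        new SparkConf().setAppName("shuffle-sketch").setMaster("local[2]"));

    // Stand-in for the incoming records: (recordKey, payload) pairs.
    JavaPairRDD<String, String> input = jsc.parallelizePairs(Arrays.asList(
        new Tuple2<>("key1", "payload1"),
        new Tuple2<>("key2", "payload2")));

    // Current BloomIndex pattern: cache the full input once, then shuffle
    // only the lightweight keys.
    input.persist(StorageLevel.MEMORY_AND_DISK_SER());
    input.keys().sortBy(k -> k, true, 2).count(); // one shuffle, keys only
    input.values().count();                       // served from the cache
    input.unpersist();

    // V2 pattern as described above: no caching, but the full input is
    // shuffled twice, which is what shows up in the numbers.
    input.sortByKey().count(); // first shuffle over the full input
    input.sortByKey().count(); // second shuffle over the full input

    jsc.stop();
  }
}
{code}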

 

 

 

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Attachment: image-2020-03-19-10-17-43-048.png

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Attachment: Screen Shot 2020-03-19 at 10.15.10 AM.png

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Attachment: Screen Shot 2020-03-19 at 10.15.10 AM.png

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Attachment: Screen Shot 2020-03-19 at 10.15.10 AM.png

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1419: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
codecov-io edited a comment on issue #1419: [HUDI-400]check upgrade from old 
plan to new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1419#issuecomment-601246921
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=h1) 
Report
   > Merging 
[#1419](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/a752b7b18c5c74c0c9115168188fa4a15a670225=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1419/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##            master    #1419      +/-   ##
   ============================================
   + Coverage    67.51%   67.54%   +0.03%
     Complexity     255      255
   ============================================
     Files          340      340
     Lines        16499    16499
     Branches      1686     1687       +1
   ============================================
   + Hits         11139    11144       +5
   + Misses        4622     4618       -4
   + Partials       738      737       -1
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...oning/compaction/CompactionV2MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `84.21% <100.00%> (+21.05%)` | `0.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `40.00% <0.00%> (-60.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <0.00%> (-13.52%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...pache/hudi/common/versioning/MetadataMigrator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvTWV0YWRhdGFNaWdyYXRvci5qYXZh)
 | `83.33% <0.00%> (+25.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=footer).
 Last update 
[a752b7b...08d8069](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062771#comment-17062771
 ] 

Vinoth Chandar commented on HUDI-686:
-

candidates can be as big as N * size of HoodieRecord, where N is the number of 
files in a partition? That's not as bad, right?

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Main goals here is to provide a much simpler index, without advanced 
> optimizations like auto tuned parallelism/skew handling but a better 
> out-of-experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer offset not handled correctly

2020-03-19 Thread GitBox
vinothchandar commented on issue #1377: [HUDI-663] Fix HoodieDeltaStreamer 
offset not handled correctly
URL: https://github.com/apache/incubator-hudi/pull/1377#issuecomment-601303031
 
 
   
>https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L331
   This is handling the case where the transformer filtered everything out, 
but we still want to do an empty commit to record that the offsets moved.
   
   The resets happening around Kafka should be left alone IMO; they can happen 
for a variety of reasons, like topic retention kicking in. 
   
   For this PR, I am not sure we want to special-case an empty string..? 
Can we formulate the problem we are trying to solve again?
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-19 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062761#comment-17062761
 ] 

lamber-ken commented on HUDI-686:
-

[~vinoth] thanks for bringing up this new idea. here are some concerns to consider:

1. +candidates+ may cause OOM. Although we can increase the number of partitions 
to solve it, that may impact the user's experience, because users need to think 
about it.
{quote}List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
{quote}
 2. +fileIDToBloomFilter+ is an external map that spills content to disk; we 
need to think about the ser/de performance.
{quote}this.fileIDToBloomFilter = new ExternalSpillableMap<>(10L ...);
BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
{quote}
[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
protected List<Pair<HoodieRecord<T>, String>> computeNext() {
  List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
  if (inputItr.hasNext()) {
    HoodieRecord<T> record = inputItr.next();
    try {
      initIfNeeded(record.getPartitionPath());
    } catch (IOException e) {
      throw new HoodieIOException(
          "Error reading index metadata for " + record.getPartitionPath(), e);
    }

    indexFileFilter
        .getMatchingFilesAndPartition(record.getPartitionPath(), record.getRecordKey())
        .forEach(partitionFileIdPair -> {
          BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
          if (filter.mightContain(record.getRecordKey())) {
            candidates.add(Pair.of(record, partitionFileIdPair.getRight()));
          }
        });

    if (candidates.size() == 0) {
      candidates.add(Pair.of(record, ""));
    }
  }

  return candidates;
}
{code}
 

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
vinothchandar commented on issue #1165: [HUDI-76] Add CSV Source support for 
Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601285910
 
 
   @leesf this has happened enough times now that we probably need a Code 
Review guide as well? wdyt


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-19 Thread GitBox
vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-601283565
 
 
   @yanghua LGTM. Let's roll the dice again 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1409: [HUDI-714]Add javadoc and comments to hudi write method link

2020-03-19 Thread GitBox
vinothchandar commented on issue #1409: [HUDI-714]Add javadoc and comments to 
hudi write method link
URL: https://github.com/apache/incubator-hudi/pull/1409#issuecomment-601280309
 
 
   @nsivabalan could you please review this


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] deabreu opened a new issue #1420: Broken Maven dependencies.

2020-03-19 Thread GitBox
deabreu opened a new issue #1420: Broken Maven dependencies.
URL: https://github.com/apache/incubator-hudi/issues/1420
 
 
   The following artifacts are missing from https://packages.confluent.io/maven
   when opening the hudi-spark project:
   org.apache.hudi:hudi-client::0.6.0-SNAPSHOT
   org.apache.hudi:hudi-common::0.6.0-SNAPSHOT
   org.apache.hudi:hudi-hadoop-mr::0.6.0-SNAPSHOT
   org.apache.hudi:hudi-hive-sync::0.6.0-SNAPSHOT


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1400: optimization debian package manager tweaks

2020-03-19 Thread GitBox
bvaradar commented on issue #1400: optimization debian package manager tweaks
URL: https://github.com/apache/incubator-hudi/pull/1400#issuecomment-601277565
 
 
   @Rajpratik71 : Just pinging to see if you are planning to work on this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of HiveDriver for DDL statements for Hive 2.x

2020-03-19 Thread GitBox
bvaradar commented on a change in pull request #1416: [HUDI-717] Fixed usage of 
HiveDriver for DDL statements for Hive 2.x
URL: https://github.com/apache/incubator-hudi/pull/1416#discussion_r395155031
 
 

 ##
 File path: 
hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -198,7 +198,8 @@ private String getPartitionClause(String partition) {
 for (String partition : partitions) {
   String partitionClause = getPartitionClause(partition);
   Path partitionPath = FSUtils.getPartitionPath(syncConfig.basePath, 
partition);
-  String fullPartitionPath = 
partitionPath.toUri().getScheme().equals(StorageSchemes.HDFS.getScheme())
+  String partitionScheme = partitionPath.toUri().getScheme();
+  String fullPartitionPath = 
StorageSchemes.HDFS.getScheme().equals(partitionScheme)
 
 Review comment:
   How does this work for non-HDFS filesystems like cloud stores?
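
   For what it's worth, the reordering itself looks like the null-safe, 
constant-first equals idiom: a schemeless partition path makes getScheme() 
return null, so the old order would NPE, while cloud-store schemes simply 
fail the match. A standalone illustration (not the Hudi code itself):

   ```java
   import java.net.URI;

   public class SchemeCheckSketch {
     public static void main(String[] args) {
       URI schemeless = URI.create("/warehouse/table/dt=2020-03-19");
       String partitionScheme = schemeless.getScheme(); // null for schemeless paths

       // Old order would throw: partitionScheme.equals("hdfs") -> NPE.
       // Constant-first order is null-safe:
       boolean isHdfs = "hdfs".equals(partitionScheme); // false, no NPE

       // Cloud-store schemes (s3a, gs, wasb, ...) just fail the match:
       String s3Scheme = URI.create("s3a://bucket/key").getScheme(); // "s3a"
       System.out.println("isHdfs = " + isHdfs + ", s3a is hdfs? " + "hdfs".equals(s3Scheme));
     }
   }
   ```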


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-03-19 Thread GitBox
nsivabalan commented on a change in pull request #1176: [HUDI-430] Adding 
InlineFileSystem to support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#discussion_r395120821
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/inline/fs/InLineFSUtils.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.inline.fs;
+
+import org.apache.hadoop.fs.Path;
+
+/**
+ * Utils to parse InlineFileSystem paths.
+ * Inline FS format:
+ * "inlinefs://<path_to_outer_file>/<outer_file_scheme>/inline_file/?start_offset=<start_offset>&length=<length>"
+ */
+public class InLineFSUtils {
+
+  private static final String INLINE_FILE_STR = "inline_file";
+  private static final String START_OFFSET_STR = "start_offset";
+  private static final String LENGTH_STR = "length";
+  private static final String EQUALS_STR = "=";
+
+  /**
+   * Fetch embedded inline file path from outer path.
+   * Eg
+   * Input:
+   * Path = file:/file1, origScheme: file, startOffset = 20, length = 40
+   * Output: "inlinefs:/file1/file/inline_file/?start_offset=20&length=40"
+   *
+   * @param outerPath
+   * @param origScheme
+   * @param inLineStartOffset
+   * @param inLineLength
+   * @return
+   */
+  public static Path getEmbeddedInLineFilePath(Path outerPath, String 
origScheme, long inLineStartOffset, long inLineLength) {
+String subPath = 
outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);
 
 Review comment:
   I checked Path, but couldn't find anything. toString() does include the 
scheme if present:
   
   @Override
   public String toString() {
     // we can't use uri.toString(), which escapes everything, because we want
     // illegal characters unescaped in the string, for glob processing, etc.
     StringBuilder buffer = new StringBuilder();
     if (uri.getScheme() != null) {
       buffer.append(uri.getScheme());
       buffer.append(":");
     }
     if (uri.getAuthority() != null) {
       buffer.append("//");
       buffer.append(uri.getAuthority());
     }
     if (uri.getPath() != null) {
       String path = uri.getPath();
       if (path.indexOf('/') == 0 &&
           hasWindowsDrive(path) &&    // has windows drive
           uri.getScheme() == null &&  // but no scheme
           uri.getAuthority() == null) // or authority
         path = path.substring(1);     // remove slash before drive
       buffer.append(path);
     }
     if (uri.getFragment() != null) {
       buffer.append("#");
       buffer.append(uri.getFragment());
     }
     return buffer.toString();
   }
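
   For comparison, a small standalone sketch of the substring approach next to 
a URI-based alternative (the two only agree when the path has no authority 
component):

   ```java
   import org.apache.hadoop.fs.Path;

   public class StripSchemeSketch {
     public static void main(String[] args) {
       Path outerPath = new Path("file:/tmp/file1");

       // Substring approach from the PR: drop everything up to the first ':'.
       String viaSubstring =
           outerPath.toString().substring(outerPath.toString().indexOf(":") + 1);

       // URI-based alternative: drops the scheme AND the authority, so it is
       // only equivalent when there is no authority (e.g. no bucket/host part).
       String viaUri = outerPath.toUri().getPath();

       System.out.println(viaSubstring); // /tmp/file1
       System.out.println(viaUri);       // /tmp/file1
     }
   }
   ```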


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1419: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
codecov-io edited a comment on issue #1419: [HUDI-400]check upgrade from old 
plan to new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1419#issuecomment-601246921
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=h1) 
Report
   > Merging 
[#1419](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/c40a0d4e91896dece51969f5308016ecb3aa635c=desc)
 will **increase** coverage by `0.05%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1419/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##            master    #1419      +/-   ##
   ============================================
   + Coverage    67.50%   67.56%   +0.05%
   - Complexity     230      255      +25
   ============================================
     Files          337      340       +3
     Lines        16354    16499     +145
     Branches      1671     1687      +16
   ============================================
   + Hits         11040    11147     +107
   - Misses        4579     4615      +36
   - Partials       735      737       +2
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...oning/compaction/CompactionV2MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `84.21% <100.00%> (+24.21%)` | `0.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <0.00%> (-11.82%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/common/util/NumericUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvTnVtZXJpY1V0aWxzLmphdmE=)
 | `76.47% <0.00%> (-6.87%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...va/org/apache/hudi/config/HoodieMetricsConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZU1ldHJpY3NDb25maWcuamF2YQ==)
 | `52.77% <0.00%> (-2.07%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...oning/compaction/CompactionV1MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjFNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `83.33% <0.00%> (-0.67%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh)
 | `77.59% <0.00%> (-0.37%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...mmon/versioning/clean/CleanV1MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5WMU1pZ3JhdGlvbkhhbmRsZXIuamF2YQ==)
 | `90.00% <0.00%> (-0.25%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=)
 | `92.57% <0.00%> (-0.09%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...i/common/table/view/HoodieTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh)
 | `98.27% <0.00%> (-0.09%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...main/java/org/apache/hudi/common/util/FSUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvRlNVdGlscy5qYXZh)
 | `69.27% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | ... and [24 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=continue).
   > **Legend** - [Click here to learn 

[GitHub] [incubator-hudi] codecov-io commented on issue #1419: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
codecov-io commented on issue #1419: [HUDI-400]check upgrade from old plan to 
new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1419#issuecomment-601246921
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=h1) 
Report
   > Merging 
[#1419](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/c40a0d4e91896dece51969f5308016ecb3aa635c=desc)
 will **increase** coverage by `0.05%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1419/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##            master    #1419      +/-   ##
   ============================================
   + Coverage    67.50%   67.56%   +0.05%
   - Complexity     230      255      +25
   ============================================
     Files          337      340       +3
     Lines        16354    16499     +145
     Branches      1671     1687      +16
   ============================================
   + Hits         11040    11147     +107
   - Misses        4579     4615      +36
   - Partials       735      737       +2
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...oning/compaction/CompactionV2MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjJNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `84.21% <100.00%> (+24.21%)` | `0.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <0.00%> (-11.82%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/common/util/NumericUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvTnVtZXJpY1V0aWxzLmphdmE=)
 | `76.47% <0.00%> (-6.87%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...va/org/apache/hudi/config/HoodieMetricsConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZU1ldHJpY3NDb25maWcuamF2YQ==)
 | `52.77% <0.00%> (-2.07%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...oning/compaction/CompactionV1MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY29tcGFjdGlvbi9Db21wYWN0aW9uVjFNaWdyYXRpb25IYW5kbGVyLmphdmE=)
 | `83.33% <0.00%> (-0.67%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...on/table/view/RemoteHoodieTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvUmVtb3RlSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh)
 | `77.59% <0.00%> (-0.37%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...mmon/versioning/clean/CleanV1MigrationHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3ZlcnNpb25pbmcvY2xlYW4vQ2xlYW5WMU1pZ3JhdGlvbkhhbmRsZXIuamF2YQ==)
 | `90.00% <0.00%> (-0.25%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...common/table/view/AbstractTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvQWJzdHJhY3RUYWJsZUZpbGVTeXN0ZW1WaWV3LmphdmE=)
 | `92.57% <0.00%> (-0.09%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...i/common/table/view/HoodieTableFileSystemView.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvSG9vZGllVGFibGVGaWxlU3lzdGVtVmlldy5qYXZh)
 | `98.27% <0.00%> (-0.09%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...main/java/org/apache/hudi/common/util/FSUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvRlNVdGlscy5qYXZh)
 | `69.27% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | ... and [24 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1419/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1419?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   

[GitHub] [incubator-hudi] zhaomin1423 opened a new pull request #1419: [HUDI-400]check upgrade from old plan to new plan for compaction

2020-03-19 Thread GitBox
zhaomin1423 opened a new pull request #1419: [HUDI-400]check upgrade from old 
plan to new plan for compaction
URL: https://github.com/apache/incubator-hudi/pull/1419
 
 
   
   ## What is the purpose of the pull request
   
   Add more tests for compaction plan upgrade
   
   ## Brief change log
   
 - check upgrade from old plan to new plan for compaction
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-400) Add more checks to TestCompactionUtils#testUpgradeDowngrade

2020-03-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-400:

Labels: pull-request-available  (was: )

> Add more checks to TestCompactionUtils#testUpgradeDowngrade
> ---
>
> Key: HUDI-400
> URL: https://issues.apache.org/jira/browse/HUDI-400
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie, Testing
>Reporter: leesf
>Assignee: jerry
>Priority: Minor
>  Labels: pull-request-available
>
> Currently, the TestCompactionUtils#testUpgradeDowngrade does not check 
> upgrade from old plan to new plan; it is proper to add some checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi 
Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601219527
 
 
   got it, sure.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
leesf commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601204501
 
 
   > @leesf : Thanks. I got the permission now.
   
   You are welcome, and nice shot. Just one minor tip: please merge (squash & 
merge / rebase & merge) with [HUDI-xxx] at the beginning of the commit message. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new cf765df  [HUDI-76] Add CSV Source support for Hudi Delta Streamer
 new a752b7b  Merge pull request #1165 from 
yihua/HUDI-76-deltastreamer-csv-source
cf765df is described below

commit cf765df6062f896ad6bf0a8ddf711aad19dbf59b
Author: Y Ethan Guo 
AuthorDate: Sat Jan 18 23:11:14 2020 -0800

[HUDI-76] Add CSV Source support for Hudi Delta Streamer
---
 .../hudi/common/HoodieTestDataGenerator.java   |  92 +---
 hudi-utilities/pom.xml |   5 +
 .../hudi/utilities/sources/CsvDFSSource.java   | 126 
 .../hudi/utilities/TestHoodieDeltaStreamer.java| 164 +++--
 .../apache/hudi/utilities/UtilitiesTestBase.java   |  60 +++-
 .../utilities/sources/AbstractBaseTestSource.java  |   2 +-
 .../sources/AbstractDFSSourceTestBase.java |   3 +-
 .../hudi/utilities/sources/TestCsvDFSSource.java   |  61 
 .../delta-streamer-config/source-flattened.avsc|  57 +++
 .../delta-streamer-config/target-flattened.avsc|  60 
 10 files changed, 597 insertions(+), 33 deletions(-)

diff --git 
a/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java 
b/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
index 6d86e93..3307881 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
@@ -74,20 +74,30 @@ public class HoodieTestDataGenerator {
   public static final String[] DEFAULT_PARTITION_PATHS =
   {DEFAULT_FIRST_PARTITION_PATH, DEFAULT_SECOND_PARTITION_PATH, 
DEFAULT_THIRD_PARTITION_PATH};
   public static final int DEFAULT_PARTITION_DEPTH = 3;
-  public static final String TRIP_EXAMPLE_SCHEMA = "{\"type\": \"record\"," + 
"\"name\": \"triprec\"," + "\"fields\": [ "
+  public static final String TRIP_SCHEMA_PREFIX = "{\"type\": \"record\"," + 
"\"name\": \"triprec\"," + "\"fields\": [ "
   + "{\"name\": \"timestamp\",\"type\": \"double\"}," + "{\"name\": 
\"_row_key\", \"type\": \"string\"},"
   + "{\"name\": \"rider\", \"type\": \"string\"}," + "{\"name\": 
\"driver\", \"type\": \"string\"},"
   + "{\"name\": \"begin_lat\", \"type\": \"double\"}," + "{\"name\": 
\"begin_lon\", \"type\": \"double\"},"
-  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},"
-  + "{\"name\": \"fare\",\"type\": {\"type\":\"record\", 
\"name\":\"fare\",\"fields\": ["
-  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},"
-  + "{\"name\": \"_hoodie_is_deleted\", \"type\": \"boolean\", 
\"default\": false} ]}";
+  + "{\"name\": \"end_lat\", \"type\": \"double\"}," + "{\"name\": 
\"end_lon\", \"type\": \"double\"},";
+  public static final String TRIP_SCHEMA_SUFFIX = "{\"name\": 
\"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false} ]}";
+  public static final String FARE_NESTED_SCHEMA = "{\"name\": 
\"fare\",\"type\": {\"type\":\"record\", \"name\":\"fare\",\"fields\": ["
+  + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},";
+  public static final String FARE_FLATTENED_SCHEMA = "{\"name\": \"fare\", 
\"type\": \"double\"},"
+  + "{\"name\": \"currency\", \"type\": \"string\"},";
+
+  public static final String TRIP_EXAMPLE_SCHEMA =
+  TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
+  public static final String TRIP_FLATTENED_SCHEMA =
+  TRIP_SCHEMA_PREFIX + FARE_FLATTENED_SCHEMA + TRIP_SCHEMA_SUFFIX;
+
   public static final String NULL_SCHEMA = 
Schema.create(Schema.Type.NULL).toString();
   public static final String TRIP_HIVE_COLUMN_TYPES = 
"double,string,string,string,double,double,double,double,"
  + "struct<amount:double,currency:string>,boolean";
+
   public static final Schema AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA);
   public static final Schema AVRO_SCHEMA_WITH_METADATA_FIELDS =
   HoodieAvroUtils.addMetadataFields(AVRO_SCHEMA);
+  public static final Schema FLATTENED_AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_FLATTENED_SCHEMA);
 
   private static final Random RAND = new Random(46474747);
 
@@ -115,10 +125,33 @@ public class HoodieTestDataGenerator {
   }
 
   /**
-   * Generates a new avro record of the above schema format, retaining the key 
if optionally provided.
+   * Generates a new avro record of the above nested schema format,
+   * retaining the key if optionally provided.
+   *
+   * @param key  Hoodie key.
+   * @param commitTime  Commit time to use.
+   * @return  Raw payload of a test record.
+   * @throws 

[GitHub] [incubator-hudi] nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
nsivabalan commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi 
Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601195118
 
 
   @leesf : I got the permission now. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan merged pull request #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
nsivabalan merged pull request #1165: [HUDI-76] Add CSV Source support for Hudi 
Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan edited a comment on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-03-19 Thread GitBox
nsivabalan edited a comment on issue #1165: [HUDI-76] Add CSV Source support 
for Hudi Delta Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-601195118
 
 
   @leesf : Thanks. I got the permission now. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] XuQianJin-Stars commented on issue #1370: [HUDI-653] Add JMX Report Config to Doc

2020-03-19 Thread GitBox
XuQianJin-Stars commented on issue #1370: [HUDI-653] Add JMX Report Config to 
Doc
URL: https://github.com/apache/incubator-hudi/pull/1370#issuecomment-601152532
 
 
   > @XuQianJin-Stars Would you please only update the docs under _docs and 
please not update the docs under 0.5.0/0.5.1. Thanks.
   
   Well, let me change it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1418: [HUDI-678] Make config package spark free

2020-03-19 Thread GitBox
codecov-io commented on issue #1418: [HUDI-678] Make config package spark free
URL: https://github.com/apache/incubator-hudi/pull/1418#issuecomment-601151728
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=h1) 
Report
   > Merging 
[#1418](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/99b7e9eb9ef8827c1e06b7e8621b6be6403b061e?src=pr=desc)
 will **decrease** coverage by `66.84%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1418/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=tree)
   
   ```diff
   @@             Coverage Diff               @@
   ##            master    #1418       +/-   ##
   =============================================
   - Coverage    67.47%    0.63%   -66.85%
   + Complexity     230        2      -228
   =============================================
     Files          338      296       -42
     Lines        16365    14435     -1930
     Branches      1671     1472      -199
   =============================================
   - Hits         11043       92    -10951
   - Misses        4583    14340     +9757
   + Partials       739        3      -736
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...a/org/apache/hudi/client/config/ConfigHelpers.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9Db25maWdIZWxwZXJzLmphdmE=)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...n/java/org/apache/hudi/index/hbase/HBaseIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvaGJhc2UvSEJhc2VJbmRleC5qYXZh)
 | `0% <0%> (-84.22%)` | `0 <0> (ø)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `0% <0%> (-69.78%)` | `0 <0> (ø)` | |
   | 
[.../org/apache/hudi/client/config/AbstractConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9BYnN0cmFjdENvbmZpZy5qYXZh)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `0% <0%> (-70.9%)` | `0 <0> (ø)` | |
   | 
[.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh)
 | `0% <0%> (-94.74%)` | `0 <0> (ø)` | |
   | 
[...ava/org/apache/hudi/client/config/SparkConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9TcGFya0NvbmZpZy5qYXZh)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...org/apache/hudi/common/model/HoodieEngineType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUVuZ2luZVR5cGUuamF2YQ==)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `0% <0%> (-83.85%)` | `0 <0> (ø)` | |
   | 
[...ava/org/apache/hudi/config/HoodieMemoryConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZU1lbW9yeUNvbmZpZy5qYXZh)
 | `0% <0%> (-76%)` | `0 <0> (ø)` | |
   | ... and [307 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=footer).
 Last update 
[99b7e9e...b43f27d](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1418: [HUDI-678] Make config package spark free

2020-03-19 Thread GitBox
codecov-io edited a comment on issue #1418: [HUDI-678] Make config package 
spark free
URL: https://github.com/apache/incubator-hudi/pull/1418#issuecomment-601151728
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=h1) 
Report
   > Merging 
[#1418](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/99b7e9eb9ef8827c1e06b7e8621b6be6403b061e?src=pr=desc)
 will **decrease** coverage by `66.84%`.
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1418/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=tree)
   
   ```diff
   @@             Coverage Diff               @@
   ##            master    #1418       +/-   ##
   =============================================
   - Coverage    67.47%    0.63%   -66.85%
   + Complexity     230        2      -228
   =============================================
     Files          338      296       -42
     Lines        16365    14435     -1930
     Branches      1671     1472      -199
   =============================================
   - Hits         11043       92    -10951
   - Misses        4583    14340     +9757
   + Partials       739        3      -736
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...a/org/apache/hudi/client/config/ConfigHelpers.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9Db25maWdIZWxwZXJzLmphdmE=)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...n/java/org/apache/hudi/index/hbase/HBaseIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvaGJhc2UvSEJhc2VJbmRleC5qYXZh)
 | `0% <0%> (-84.22%)` | `0 <0> (ø)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `0% <0%> (-69.78%)` | `0 <0> (ø)` | |
   | 
[.../org/apache/hudi/client/config/AbstractConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9BYnN0cmFjdENvbmZpZy5qYXZh)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `0% <0%> (-70.9%)` | `0 <0> (ø)` | |
   | 
[.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh)
 | `0% <0%> (-94.74%)` | `0 <0> (ø)` | |
   | 
[...ava/org/apache/hudi/client/config/SparkConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L2NvbmZpZy9TcGFya0NvbmZpZy5qYXZh)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...org/apache/hudi/common/model/HoodieEngineType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUVuZ2luZVR5cGUuamF2YQ==)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `0% <0%> (-83.85%)` | `0 <0> (ø)` | |
   | 
[...ava/org/apache/hudi/config/HoodieMemoryConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZU1lbW9yeUNvbmZpZy5qYXZh)
 | `0% <0%> (-76%)` | `0 <0> (ø)` | |
   | ... and [307 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1418/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=footer).
 Last update 
[99b7e9e...b43f27d](https://codecov.io/gh/apache/incubator-hudi/pull/1418?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
 

[GitHub] [incubator-hudi] leesf commented on issue #1370: [HUDI-653] Add JMX Report Config to Doc

2020-03-19 Thread GitBox
leesf commented on issue #1370: [HUDI-653] Add JMX Report Config to Doc
URL: https://github.com/apache/incubator-hudi/pull/1370#issuecomment-601144978
 
 
   @XuQianJin-Stars Would you please only update the docs under _docs and 
please not update the docs under 0.5.0/0.5.1. Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-209] Implement JMX metrics reporter (#1106)

2020-03-19 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1e321c2  [HUDI-209] Implement JMX metrics reporter (#1106)
1e321c2 is described below

commit 1e321c2fc0f9fedcbb8a0339c90a28a9be5afd64
Author: ForwardXu 
AuthorDate: Thu Mar 19 20:10:35 2020 +0800

[HUDI-209] Implement JMX metrics reporter (#1106)
---
 hudi-client/pom.xml|   4 +
 .../apache/hudi/config/HoodieMetricsConfig.java|  16 ++-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   4 +-
 .../hudi/metrics/InMemoryMetricsReporter.java  |   5 +
 .../apache/hudi/metrics/JmxMetricsReporter.java|  95 
 .../org/apache/hudi/metrics/JmxReporterServer.java | 160 +
 .../main/java/org/apache/hudi/metrics/Metrics.java |   4 +-
 .../hudi/metrics/MetricsGraphiteReporter.java  |   7 +
 .../org/apache/hudi/metrics/MetricsReporter.java   |   5 +
 .../hudi/metrics/MetricsReporterFactory.java   |   2 +-
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |  27 ++--
 pom.xml|   5 +
 12 files changed, 292 insertions(+), 42 deletions(-)

diff --git a/hudi-client/pom.xml b/hudi-client/pom.xml
index 06d6017..3cab43a 100644
--- a/hudi-client/pom.xml
+++ b/hudi-client/pom.xml
@@ -117,6 +117,10 @@
       <groupId>io.dropwizard.metrics</groupId>
       <artifactId>metrics-core</artifactId>
     </dependency>
+    <dependency>
+      <groupId>io.dropwizard.metrics</groupId>
+      <artifactId>metrics-jmx</artifactId>
+    </dependency>
 
     <dependency>
       <groupId>com.beust</groupId>
diff --git 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java
index a21e4cc..b17e935 100644
--- a/hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java
+++ b/hudi-client/src/main/java/org/apache/hudi/config/HoodieMetricsConfig.java
@@ -101,6 +101,16 @@ public class HoodieMetricsConfig extends 
DefaultHoodieConfig {
   return this;
 }
 
+public Builder toJmxHost(String host) {
+  props.setProperty(JMX_HOST, host);
+  return this;
+}
+
+public Builder onJmxPort(String port) {
+  props.setProperty(JMX_PORT, port);
+  return this;
+}
+
 public Builder usePrefix(String prefix) {
   props.setProperty(GRAPHITE_METRIC_PREFIX, prefix);
   return this;
@@ -115,8 +125,10 @@ public class HoodieMetricsConfig extends 
DefaultHoodieConfig {
   DEFAULT_GRAPHITE_SERVER_HOST);
   setDefaultOnCondition(props, !props.containsKey(GRAPHITE_SERVER_PORT), 
GRAPHITE_SERVER_PORT,
   String.valueOf(DEFAULT_GRAPHITE_SERVER_PORT));
-  setDefaultOnCondition(props, !props.containsKey(GRAPHITE_SERVER_PORT), 
GRAPHITE_SERVER_PORT,
-  String.valueOf(DEFAULT_GRAPHITE_SERVER_PORT));
+  setDefaultOnCondition(props, !props.containsKey(JMX_HOST), JMX_HOST,
+  DEFAULT_JMX_HOST);
+  setDefaultOnCondition(props, !props.containsKey(JMX_PORT), JMX_PORT,
+  String.valueOf(DEFAULT_JMX_PORT));
   return config;
 }
   }
diff --git 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index 04c1dfd..2d323a3 100644
--- a/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -500,8 +500,8 @@ public class HoodieWriteConfig extends DefaultHoodieConfig {
 return props.getProperty(HoodieMetricsConfig.JMX_HOST);
   }
 
-  public int getJmxPort() {
-return Integer.parseInt(props.getProperty(HoodieMetricsConfig.JMX_PORT));
+  public String getJmxPort() {
+return props.getProperty(HoodieMetricsConfig.JMX_PORT);
   }
 
   /**
diff --git a/hudi-client/src/main/java/org/apache/hudi/metrics/InMemoryMetricsReporter.java b/hudi-client/src/main/java/org/apache/hudi/metrics/InMemoryMetricsReporter.java
index a0221c8..a145024 100644
--- a/hudi-client/src/main/java/org/apache/hudi/metrics/InMemoryMetricsReporter.java
+++ b/hudi-client/src/main/java/org/apache/hudi/metrics/InMemoryMetricsReporter.java
@@ -35,4 +35,9 @@ public class InMemoryMetricsReporter extends MetricsReporter {
   public Closeable getReporter() {
 return null;
   }
+
+  @Override
+  public void stop() {
+
+  }
 }
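The hunk above also shows MetricsReporter gaining a stop() hook that every
reporter now overrides. A sketch of a hypothetical custom reporter against
that contract; ConsoleMetricsReporter is an invented name, and the
start()/report() hooks are assumed from the existing abstract class:

import java.io.Closeable;
import org.apache.hudi.metrics.MetricsReporter;

public class ConsoleMetricsReporter extends MetricsReporter {
  @Override
  public void start() {
    // acquire resources / schedule periodic reporting here
  }

  @Override
  public void report() {
    // push a snapshot of the metrics registry to the sink
  }

  @Override
  public Closeable getReporter() {
    return null; // nothing extra to expose, mirroring InMemoryMetricsReporter
  }

  @Override
  public void stop() {
    // release resources; this is the hook introduced by this change
  }
}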
diff --git a/hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java b/hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java
index 921dcea..c7c596c 100644
--- a/hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java
+++ b/hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java
@@ -18,45 +18,53 @@
 
 package org.apache.hudi.metrics;
 
+import com.google.common.base.Preconditions;
+import java.lang.management.ManagementFactory;
+import javax.management.MBeanServer;
 import 

[GitHub] [incubator-hudi] leesf merged pull request #1106: [HUDI-209] Implement JMX metrics reporter

2020-03-19 Thread GitBox
leesf merged pull request #1106: [HUDI-209] Implement JMX metrics reporter
URL: https://github.com/apache/incubator-hudi/pull/1106
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf merged pull request #1414: [HUDI-437] Add user-defined index config

2020-03-19 Thread GitBox
leesf merged pull request #1414: [HUDI-437] Add user-defined index config
URL: https://github.com/apache/incubator-hudi/pull/1414
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-437] Add user-defined index config (#1414)

2020-03-19 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e500cc1  [HUDI-437] Add user-defined index config (#1414)
e500cc1 is described below

commit e500cc1b0cf6ecf3dc7bae5e38d2bdc773830c48
Author: leesf <490081...@qq.com>
AuthorDate: Thu Mar 19 20:04:45 2020 +0800

[HUDI-437] Add user-defined index config (#1414)
---
 docs/_docs/0.5.0/2_4_configurations.md | 2 +-
 docs/_docs/0.5.1/2_4_configurations.md | 2 +-
 docs/_docs/2_4_configurations.cn.md| 4 
 docs/_docs/2_4_configurations.md   | 6 +-
 4 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/docs/_docs/0.5.0/2_4_configurations.md b/docs/_docs/0.5.0/2_4_configurations.md
index 009362d..72b1f71 100644
--- a/docs/_docs/0.5.0/2_4_configurations.md
+++ b/docs/_docs/0.5.0/2_4_configurations.md
@@ -222,7 +222,7 @@ Following configs control indexing behavior, which tags incoming records as eith
 
 [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) 
 This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files
-
+
  withIndexType(indexType = BLOOM) {#withIndexType}
 Property: `hoodie.index.type` 
 Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files
diff --git a/docs/_docs/0.5.1/2_4_configurations.md b/docs/_docs/0.5.1/2_4_configurations.md
index a535727..ac35a9a 100644
--- a/docs/_docs/0.5.1/2_4_configurations.md
+++ b/docs/_docs/0.5.1/2_4_configurations.md
@@ -222,7 +222,7 @@ Following configs control indexing behavior, which tags incoming records as eith
 
 [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) 
 This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files
-
+
  withIndexType(indexType = BLOOM) {#withIndexType}
 Property: `hoodie.index.type` 
 Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files
diff --git a/docs/_docs/2_4_configurations.cn.md b/docs/_docs/2_4_configurations.cn.md
index bd0ddf5..d1cb780 100644
--- a/docs/_docs/2_4_configurations.cn.md
+++ b/docs/_docs/2_4_configurations.cn.md
@@ -230,6 +230,10 @@ Hudi将有关提交、保存点、清理审核日志等的所有主要元数据
 
 [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) 
 可插入以具有外部索引(HBase)或使用存储在Parquet文件中的默认布隆过滤器(bloom filter)
+
+ withIndexClass(indexClass = "x.y.z.UserDefinedIndex") {#withIndexClass}
+属性:`hoodie.index.class` 
+用户自定义索引的全路径名,索引类必须为HoodieIndex的子类,当指定该配置时,其会优先于`hoodie.index.type`配置
 
  withIndexType(indexType = BLOOM) {#withIndexType}
 属性:`hoodie.index.type` 
diff --git a/docs/_docs/2_4_configurations.md b/docs/_docs/2_4_configurations.md
index 33c5e2c..16dacba 100644
--- a/docs/_docs/2_4_configurations.md
+++ b/docs/_docs/2_4_configurations.md
@@ -221,7 +221,11 @@ Following configs control indexing behavior, which tags incoming records as eith
 
 [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) 
 This is pluggable to have a external index (HBase) or use the default bloom filter stored in the Parquet files
-
+
+ withIndexClass(indexClass = "x.y.z.UserDefinedIndex") {#withIndexClass}
+Property: `hoodie.index.class` 
+Full path of user-defined index class and must be a subclass of HoodieIndex class. It will take precedence over the `hoodie.index.type` configuration if specified
+
  withIndexType(indexType = BLOOM) {#withIndexType}
 Property: `hoodie.index.type` 
 Type of index to use. Default is Bloom filter. Possible options are [BLOOM | HBASE | INMEMORY]. Bloom filters removes the dependency on a external system and is stored in the footer of the Parquet Data Files
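For readers wiring this option up in code rather than through a properties
file, a short sketch of the documented config; "x.y.z.UserDefinedIndex" is
the placeholder from the docs themselves, and the surrounding builder calls
are assumptions about the standard client API:

import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.config.HoodieWriteConfig;

public class UserDefinedIndexExample {
  public static void main(String[] args) {
    // hoodie.index.class must name a subclass of HoodieIndex; when set,
    // it takes precedence over hoodie.index.type.
    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
        .withPath("/tmp/hoodie/sample_table")
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexClass("x.y.z.UserDefinedIndex")
            .build())
        .build();
  }
}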



[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-19 Thread GitBox
yanghua commented on a change in pull request #1417: [HUDI-720] NOTICE file 
needs to add more content based on the NOTICE files of the ASF projects that 
hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#discussion_r394906742
 
 

 ##
 File path: NOTICE
 ##
 @@ -1,5 +1,264 @@
 Apache Hudi (incubating)
-Copyright 2019 and onwards The Apache Software Foundation
+Copyright 2019-2020 The Apache Software Foundation
 
 This product includes software developed at
 The Apache Software Foundation (http://www.apache.org/).
+
+
+
+Apache Hive
+Copyright 2008-2018 The Apache Software Foundation
+
+This product includes software developed by The Apache Software
+Foundation (http://www.apache.org/).
+
+This project includes software licensed under the JSON license.
+
+
+
+Apache SystemML
+Copyright [2015-2018] The Apache Software Foundation
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+
+Apache Spark
+Copyright 2014 and onwards The Apache Software Foundation.
+
+This product includes software developed at
+The Apache Software Foundation (http://www.apache.org/).
+
+
+Export Control Notice
+---------------------
+
+This distribution includes cryptographic software. The country in which you currently reside may have
+restrictions on the import, possession, use, and/or re-export to another country, of encryption software.
+BEFORE using any encryption software, please check your country's laws, regulations and policies concerning
+the import, possession, or use, and re-export of encryption software, to see if this is permitted. See
+<https://www.wassenaar.org/> for more information.
+
+The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this
+software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software
+using or performing cryptographic functions with asymmetric algorithms. The form and manner of this Apache
+Software Foundation distribution makes it eligible for export under the License Exception ENC Technology
+Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for
+both object code and source code.
+
+The following provides more details on the included cryptographic software:
+
+This software uses Apache Commons Crypto (https://commons.apache.org/proper/commons-crypto/) to
+support authentication, and encryption and decryption of data sent across the network between
+services.
+
+
+Metrics
+Copyright 2010-2013 Coda Hale and Yammer, Inc.
+
+This product includes software developed by Coda Hale and Yammer, Inc.
+
+This product includes code derived from the JSR-166 project (ThreadLocalRandom, Striped64,
+LongAdder), which was released with the following comments:
+
+Written by Doug Lea with assistance from members of JCP JSR-166
+Expert Group and released to the public domain, as explained at
+http://creativecommons.org/publicdomain/zero/1.0/
+
+
+
+This product includes code from https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/objectsize/ObjectSizeCalculator.java with the following license
+
+=========================================================================
+ Copyright 2011 Twitter, Inc.
+ -------------------------------------------------------------------------
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this work except in compliance with the License.
+ You may obtain a copy of the License in the LICENSE file, or at:
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+=========================================================================
+
+This product includes code from Apache Spark
 
 Review comment:
   Yes


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-19 Thread GitBox
yanghua commented on a change in pull request #1417: [HUDI-720] NOTICE file 
needs to add more content based on the NOTICE files of the ASF projects that 
hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#discussion_r394884156
 
 

 ##
 File path: NOTICE
 ##
 @@ -1,5 +1,264 @@
 
 Review comment:
   We can follow [this pattern](https://github.com/apache/arrow/blob/master/NOTICE.txt#L66)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

