[GitHub] [hudi] vinothchandar commented on pull request #1593: [WIP] [HUDI-839] Introducing rollback strategy using marker files

2020-06-27 Thread GitBox


vinothchandar commented on pull request #1593:
URL: https://github.com/apache/hudi/pull/1593#issuecomment-650699704


   Closing in favor of #1756 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar closed pull request #1593: [WIP] [HUDI-839] Introducing rollback strategy using marker files

2020-06-27 Thread GitBox


vinothchandar closed pull request #1593:
URL: https://github.com/apache/hudi/pull/1593


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-27 Thread GitBox


vinothchandar commented on pull request #1100:
URL: https://github.com/apache/hudi/pull/1100#issuecomment-650699629


   @n3nash @yanghua do you mind if I push some changes to this and land it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-06-27 Thread GitBox


vinothchandar commented on pull request #1512:
URL: https://github.com/apache/hudi/pull/1512#issuecomment-650699549


   @prashantwason @bvaradar IIUC, some of this PR overlaps with the changes you 
are making as well. Can you both clarify, so we can close or revive this as 
needed?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1767: [MINOR] Adding test to WriteClient to validate update partition path with global bloom

2020-06-27 Thread GitBox


vinothchandar commented on pull request #1767:
URL: https://github.com/apache/hudi/pull/1767#issuecomment-650698829


   @nsivabalan this is more than the 50 lines we agreed on for the MINOR prefix. 
Can you please file a JIRA, and let me know if you think this qualifies for the 
MINOR prefix?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-06-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147213#comment-17147213
 ] 

Vinoth Chandar commented on HUDI-839:
-

Yes, on it. Will have a review done by Monday.

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need to cache the input in memory.
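
To make the idea concrete, a minimal illustrative sketch of marker-based rollback (assumptions: one marker file per data file written, stored under a per-instant marker directory and named after the data file with a ".marker" suffix; this is a sketch, not the actual implementation from PR 1756):

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MarkerRollbackSketch {
      // Roll back a failed write by deleting exactly the data files its
      // markers point at, with no commit metadata read and no input caching.
      // For simplicity, this sketch assumes a flat marker directory.
      public static void rollback(FileSystem fs, Path markerDir, Path basePath) throws IOException {
        for (FileStatus marker : fs.listStatus(markerDir)) {
          String dataFileName = marker.getPath().getName().replace(".marker", "");
          fs.delete(new Path(basePath, dataFileName), false);
        }
        fs.delete(markerDir, true); // finally, remove the markers themselves
      }
    }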



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #322

2020-06-27 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.43 KB...]
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 'HUDI_home=0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Commented] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-06-27 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17147183#comment-17147183
 ] 

liwei commented on HUDI-839:


Hello [~vinoth], PR 1756 has passed all the unit tests. Can you help to review?

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need to cache the input in memory.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] lw309637554 edited a comment on pull request #1756: [HUDI-839] Adding unit test for MarkerFiles,RollbackUtils, RollbackActionExecutor for markers and filelisting

2020-06-27 Thread GitBox


lw309637554 edited a comment on pull request #1756:
URL: https://github.com/apache/hudi/pull/1756#issuecomment-650096565


   > Took a quick pass at the three test classes you have added.. LGTM.
   > Will do a detailed pass once you confirm PR is indeed ready..
   
   @vinothchandar hello, I have added more tests for end-to-end rollback, and 
all the Hudi unit tests passed.
   There are a few questions to discuss with you:
   1. For rolling back a successful commit: in HoodieWriteClient.java I removed 
the deleteMarkerDir() call in postCommit when in marker-based mode. But this 
will double the file count on DFS.
   2. If the marker files are retained: should we clean them when the data 
files are cleaned, and should we also archive the marker files in 
archiveCommitsWith?
   
   If these two questions are clarified and solved, I think the patch will be OK.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-06-27 Thread GitBox


yanghua commented on pull request #1558:
URL: https://github.com/apache/hudi/pull/1558#issuecomment-650673354


   > @yanghua Can we merge this now?
   
   Will review soon.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-349) Make cleaner retention based on time period to account for higher deviations in ingestion runs

2020-06-27 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma updated HUDI-349:
--
Status: In Progress  (was: Open)

> Make cleaner retention based on time period to account for higher deviations 
> in ingestion runs
> --
>
> Key: HUDI-349
> URL: https://issues.apache.org/jira/browse/HUDI-349
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Cleaner, newbie, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Cleaning by commits is based on the number of commits to be retained. Ingestion 
> time can vary across runs due to various factors. To provide a bound on 
> the maximum running time for a query and a consistent retention 
> period, it is better to use a retention config based on time (e.g., 12h).
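
To make this concrete, a minimal sketch of a time-based retention check (the method and window parameter are illustrative, not an actual Hudi config or API):

    import java.time.Duration;
    import java.time.Instant;

    public class TimeBasedRetentionSketch {
      // Retain a commit only if it falls inside the configured window
      // (e.g. 12h), regardless of how many commits landed in that window.
      public static boolean shouldRetain(Instant commitTime, Duration retentionWindow) {
        return commitTime.isAfter(Instant.now().minus(retentionWindow));
      }
    }

For example, shouldRetain(commitInstant, Duration.ofHours(12)) bounds the retention period at 12 hours, independent of how many ingestion runs happened in that window.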



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-349) Make cleaner retention based on time period to account for higher deviations in ingestion runs

2020-06-27 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma reassigned HUDI-349:
-

Assignee: Pratyaksh Sharma  (was: Aravind Suresh)

> Make cleaner retention based on time period to account for higher deviations 
> in ingestion runs
> --
>
> Key: HUDI-349
> URL: https://issues.apache.org/jira/browse/HUDI-349
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Cleaner, newbie, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Cleaning by commits is based on the number of commits to be retained. Ingestion 
> time can vary across runs due to various factors. To provide a bound on 
> the maximum running time for a query and a consistent retention 
> period, it is better to use a retention config based on time (e.g., 12h).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] pratyakshsharma commented on a change in pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-06-27 Thread GitBox


pratyakshsharma commented on a change in pull request #1558:
URL: https://github.com/apache/hudi/pull/1558#discussion_r446569482



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##
@@ -263,13 +265,26 @@ private static int compact(JavaSparkContext jsc, String basePath, String tableNa
   }
 
   private static int deduplicatePartitionPath(JavaSparkContext jsc, String duplicatedPartitionPath,
-      String repairedOutputPath, String basePath, String dryRun) {
+      String repairedOutputPath, String basePath, boolean dryRun, String dedupeType) {
     DedupeSparkJob job = new DedupeSparkJob(basePath, duplicatedPartitionPath, repairedOutputPath, new SQLContext(jsc),
-        FSUtils.getFs(basePath, jsc.hadoopConfiguration()));
-    job.fixDuplicates(Boolean.parseBoolean(dryRun));
+        FSUtils.getFs(basePath, jsc.hadoopConfiguration()), getDedupeType(dedupeType));
+    job.fixDuplicates(dryRun);
     return 0;
   }
 
+  private static Enumeration.Value getDedupeType(String type) {
+    switch (type) {
+      case "insertType":
+        return DeDupeType.insertType();
+      case "updateType":
+        return DeDupeType.updateType();
+      case "upsertType":
+        return DeDupeType.upsertType();
+      default:
+        throw new IllegalArgumentException("Please provide valid dedupe type!");
+    }
+  }
+

Review comment:
   Taken care. :) 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #1562: [HUDI-837]: implemented custom deserializer for AvroKafkaSource

2020-06-27 Thread GitBox


pratyakshsharma commented on pull request #1562:
URL: https://github.com/apache/hudi/pull/1562#issuecomment-650631442


   @n3nash @vinothchandar I guess we can merge this? :) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-06-27 Thread GitBox


pratyakshsharma commented on pull request #1558:
URL: https://github.com/apache/hudi/pull/1558#issuecomment-650630601


   @yanghua Can we merge this now? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on a change in pull request #1648: [HUDI-916]: added support for multiple input formats in TimestampBasedKeyGenerator

2020-06-27 Thread GitBox


pratyakshsharma commented on a change in pull request #1648:
URL: https://github.com/apache/hudi/pull/1648#discussion_r446538524



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/parser/HoodieDateTimeParser.java
##
@@ -0,0 +1,48 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen.parser;
+
+import org.joda.time.DateTimeZone;
+import org.joda.time.format.DateTimeFormatter;
+
+public interface HoodieDateTimeParser {
+
+  /**
+   * Returns the output date format in which the partition paths will be created for the hudi dataset.
+   * @return
+   */
+  String getOutputDateFormat();

Review comment:
   The purpose of not passing any params here is to use the parameters defined 
in the TimestampBasedKeyGenerator.Config class. I want that Config class to be 
the single place holding all the config parameters for timestamp-based key 
generation, and I have used those config parameters in my implementation as 
well. That said, I do not have a strong reason for not passing parameters in 
the signatures of the HoodieDateTimeParser functions. The other reason is that 
we cannot enforce the usage of a parameter just by passing it into the 
function, as someone might still change the code, and they can change the 
interface contract itself at a later point in time. I guess we can take care 
of the config parameters getting used as part of code review itself :) 
   
   Please let me know your thoughts on this. @nsivabalan 
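
To illustrate the pattern under discussion, a hypothetical sketch of a parameterless, Config-driven parser method (the property key follows TimestampBasedKeyGenerator-style naming but is shown here only as an assumption; the class deliberately does not claim to implement the full interface):

    import java.util.Properties;

    public class ConfigDrivenDateTimeParser {

      private final Properties props;

      public ConfigDrivenDateTimeParser(Properties props) {
        this.props = props;
      }

      // No method parameters: the output format is resolved internally from
      // shared configuration, mirroring getOutputDateFormat() above.
      public String getOutputDateFormat() {
        return props.getProperty(
            "hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd");
      }
    }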





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-983) Add Metrics section to asf-site

2020-06-27 Thread Hong Shen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17146989#comment-17146989
 ] 

Hong Shen commented on HUDI-983:


[~rxu] I have opened a pull request at [https://github.com/apache/hudi/pull/1769], 
but I have not added DatadogReporter since I don't have a datadog.api.key. Can 
you help to add the doc for DatadogReporter?

> Add Metrics section to asf-site
> ---
>
> Key: HUDI-983
> URL: https://issues.apache.org/jira/browse/HUDI-983
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: Hong Shen
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 0.6.0
>
>
> Document the use of the metrics system in Hudi, including all supported 
> metrics reporters.
> See the example
> https://user-images.githubusercontent.com/20113411/83055820-f5e97100-a086-11ea-9ea3-52b342aca9d4.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-708) Add unit test for TempViewCommand

2020-06-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-708:

Labels: pull-request-available  (was: )

> Add unit test for TempViewCommand
> -
>
> Key: HUDI-708
> URL: https://issues.apache.org/jira/browse/HUDI-708
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hddong opened a new pull request #1770: [HUDI-708]Add temps show and unit test for TempViewCommand

2020-06-27 Thread GitBox


hddong opened a new pull request #1770:
URL: https://github.com/apache/hudi/pull/1770


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *1. Add command `temps show`*
   *2. Add unit test for TempViewCommand*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 opened a new pull request #1769: [DOC] Add document for the use of metrics system in Hudi.

2020-06-27 Thread GitBox


shenh062326 opened a new pull request #1769:
URL: https://github.com/apache/hudi/pull/1769


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1732: [HUDI-1004] Support update metrics in HoodieDeltaStreamerMetrics

2020-06-27 Thread GitBox


shenh062326 commented on pull request #1732:
URL: https://github.com/apache/hudi/pull/1732#issuecomment-650534678


   > @leesf Given the limited scope of the pr, can we try and avoid copying 
code from other places
   
   Done.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1732: [HUDI-1004] Support update metrics in HoodieDeltaStreamerMetrics

2020-06-27 Thread GitBox


shenh062326 commented on a change in pull request #1732:
URL: https://github.com/apache/hudi/pull/1732#discussion_r446508116



##
File path: hudi-client/src/main/java/org/apache/hudi/metrics/HudiGauge.java
##
@@ -25,22 +25,21 @@
  * Similar to {@link Gauge}, but metric value is updated via calling {@link #setValue(T)} instead.
  * 
  */
-public class SimpleSettableGauge<T> implements SettableGauge<T> {
+public class HudiGauge<T> implements Gauge<T> {
   private volatile T value;
 
   /**
    * Create an instance with a default value.
    *
    * @param defaultValue default value
    */
-  public SimpleSettableGauge(T defaultValue) {
+  public HudiGauge(T defaultValue) {

Review comment:
   Change to HoodieGauge.
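
For context, a minimal usage sketch of such a settable gauge (using the HoodieGauge name suggested above; the metric name is illustrative, and the sketch relies on the setValue method referenced in the class javadoc):

    import com.codahale.metrics.MetricRegistry;

    public class GaugeUsageSketch {
      public static void main(String[] args) {
        MetricRegistry registry = new MetricRegistry();
        HoodieGauge<Long> recordsWritten = new HoodieGauge<>(0L);
        registry.register("deltastreamer.totalRecordsWritten", recordsWritten);
        recordsWritten.setValue(42L); // updated in place, unlike a computed Gauge
      }
    }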





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1768: [HUDI-1054][Peformance] Several performance fixes during finalizing writes

2020-06-27 Thread GitBox


leesf commented on a change in pull request #1768:
URL: https://github.com/apache/hudi/pull/1768#discussion_r446502715



##
File path: hudi-common/pom.xml
##
@@ -147,6 +147,16 @@
      <scope>test</scope>
    </dependency>
 
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-core_${scala.binary.version}</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.spark</groupId>
+      <artifactId>spark-sql_${scala.binary.version}</artifactId>
+    </dependency>
+

Review comment:
   `hudi-common` is a base module, and it feels a little weird that it relies 
on spark-core/spark-sql. Could we remove them and move the 
`getAllDataFilesForMarkers` method to the `hudi-client` module? wdyt? cc @bvaradar 
@vinothchandar 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1767: [MINOR] Adding test to WriteClient to validate update partition path with global bloom

2020-06-27 Thread GitBox


leesf commented on pull request #1767:
URL: https://github.com/apache/hudi/pull/1767#issuecomment-650524267


   @xushiyan Could you please review this PR, since you did the related work 
before?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1761: [MINOR] Add documentation for using multi-column table keys and for n…

2020-06-27 Thread GitBox


leesf commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r446499358



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -176,15 +176,49 @@ In some cases, you may want to migrate your existing table into Hudi beforehand.
 
 ## Datasource Writer
 
-The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
+The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
 
+**`HoodieWriteConfig`**:
+
+**TABLE_NAME** (Required)
+
+
+**`DataSourceWriteOptions`**:
+
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields can be specified using the dot notation, e.g. `a.b.c`. When using multiple columns as the primary key, use comma-separated notation, e.g. `"col1,col2,col3,etc"`. Single or multiple columns as the primary key are specified by the `KEYGENERATOR_CLASS_OPT_KEY` property.
+Default value: `"uuid"`
+
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`

Review comment:
   > Yap, seems both are necessary:
   
   The `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY` is used to sync to Hive. If you 
mean syncing to Hive during the write, it makes sense; otherwise it does not.
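
As a concrete illustration of the options being documented above, a minimal upsert sketch (table name, path, and field names are illustrative; on older releases the format name is "org.apache.hudi" rather than "hudi"):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class DatasourceUpsertSketch {
      public static void upsert(Dataset<Row> df) {
        df.write().format("hudi")
            .option("hoodie.table.name", "my_table")                                 // HoodieWriteConfig.TABLE_NAME
            .option("hoodie.datasource.write.recordkey.field", "uuid")               // RECORDKEY_FIELD_OPT_KEY
            .option("hoodie.datasource.write.partitionpath.field", "partitionpath")  // PARTITIONPATH_FIELD_OPT_KEY
            .option("hoodie.datasource.write.precombine.field", "ts")                // PRECOMBINE_FIELD_OPT_KEY
            .option("hoodie.datasource.write.operation", "upsert")                   // OPERATION_OPT_KEY
            .mode(SaveMode.Append)
            .save("/tmp/hudi/my_table");
      }
    }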





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1761: [MINOR] Add documentation for using multi-column table keys and for n…

2020-06-27 Thread GitBox


leesf commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r446499146



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -176,15 +176,49 @@ In some cases, you may want to migrate your existing table into Hudi beforehand.
 
 ## Datasource Writer
 
-The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
+The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
 
+**`HoodieWriteConfig`**:
+
+**TABLE_NAME** (Required)
+
+
+**`DataSourceWriteOptions`**:
+
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields can be specified using the dot notation, e.g. `a.b.c`. When using multiple columns as the primary key, use comma-separated notation, e.g. `"col1,col2,col3,etc"`. Single or multiple columns as the primary key are specified by the `KEYGENERATOR_CLASS_OPT_KEY` property.
+Default value: `"uuid"`
+
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`.
+Default value: `"partitionpath"`
+
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key value, the record with the largest value in the field specified here will be chosen.
+Default value: `"ts"`
+
+**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use. Note: this cannot change across writes.

Review comment:
   > Good catch, that was meant to go in the description of the next parameter, TABLE_TYPE_OPT_KEY.
   > 
   > The text can be made clearer by using the following language instead.
   > Instead of "Note: this cannot change across writes."
   > Use "Note: After the initial creation of a table, this value must stay consistent when writing to (updating) the table using the Spark SaveMode.Append mode."
   > 
   > Also, I believe that the key and partition columns cannot be changed after the table is first created either, so this note snippet can be added to those fields as well. Is this correct?
   
   Yes, `TABLE_TYPE_OPT_KEY` is not meant to be changed on subsequent writes, 
while `OPERATION_OPT_KEY` may change across different writes; maybe I 
misunderstood what you meant.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1761: [MINOR] Add documentation for using multi-column table keys and for n…

2020-06-27 Thread GitBox


leesf commented on a change in pull request #1761:
URL: https://github.com/apache/hudi/pull/1761#discussion_r446498933



##
File path: docs/_docs/2_2_writing_data.md
##
@@ -176,15 +176,49 @@ In some cases, you may want to migrate your existing table into Hudi beforehand.
 
 ## Datasource Writer
 
-The `hudi-spark` module offers the DataSource API to write (and also read) any data frame into a Hudi table.
-Following is how we can upsert a dataframe, while specifying the field names that need to be used
-for `recordKey => _row_key`, `partitionPath => partition` and `precombineKey => timestamp`
+The `hudi-spark` module offers the DataSource API to write (and read) a Spark DataFrame into a Hudi table. There are a number of options available:
 
+**`HoodieWriteConfig`**:
+
+**TABLE_NAME** (Required)
+
+
+**`DataSourceWriteOptions`**:
+
+**RECORDKEY_FIELD_OPT_KEY** (Required): Primary key field(s). Nested fields can be specified using the dot notation, e.g. `a.b.c`. When using multiple columns as the primary key, use comma-separated notation, e.g. `"col1,col2,col3,etc"`. Single or multiple columns as the primary key are specified by the `KEYGENERATOR_CLASS_OPT_KEY` property.
+Default value: `"uuid"`
+
+**PARTITIONPATH_FIELD_OPT_KEY** (Required): Columns to be used for partitioning the table. To prevent partitioning, provide an empty string as the value, e.g. `""`. Specify partitioning/no partitioning using `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY`.
+Default value: `"partitionpath"`
+
+**PRECOMBINE_FIELD_OPT_KEY** (Required): When two records have the same key value, the record with the largest value in the field specified here will be chosen.
+Default value: `"ts"`
+
+**OPERATION_OPT_KEY**: The [write operations](#write-operations) to use. Note: this cannot change across writes.

Review comment:
   In fact, there are a few ways to delete records in Hudi: 1. we could use 
"org.apache.hudi.EmptyHoodieRecordPayload" to delete a batch of records; 2. we 
could directly use the `delete` API; 3. we could use the `_hoodie_is_deleted` 
flag to delete records. Also, the `HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY` has 
nothing to do with the write path IIRC; it is only related to Hive sync.
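
For illustration, a minimal sketch of option 1 above: deleting a batch of records by writing them back with the empty payload (path, table name, and field names are illustrative; the payload class string is the one quoted in the comment):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class DeleteByPayloadSketch {
      public static void deleteBatch(Dataset<Row> recordsToDelete) {
        recordsToDelete.write().format("hudi")
            .option("hoodie.table.name", "my_table")
            .option("hoodie.datasource.write.recordkey.field", "uuid")
            .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
            .option("hoodie.datasource.write.precombine.field", "ts")
            // every record written with this payload is treated as a delete
            .option("hoodie.datasource.write.payload.class", "org.apache.hudi.EmptyHoodieRecordPayload")
            .mode(SaveMode.Append)
            .save("/tmp/hudi/my_table");
      }
    }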





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org