[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605391332
 
 
   @lamber-ken  So I switched the Java version of the Maven docker image that I 
use from Java 11 to Java 8,
   
![image](https://user-images.githubusercontent.com/8300535/77814372-e3eb5200-7086-11ea-918f-52d7e155229c.png)
   And now the issue is gone.
   
   OK, I will take this issue, make the patch, and test it.
   
   Thanks @lamber-ken 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605389857
 
 
   > So mine is (JI)I, what is your java build version?
   
   Opened a JIRA to fix this issue; if you are interested in it, feel free to take 
it over : )
   https://issues.apache.org/jira/browse/HUDI-742
   
   
   
   IMO, the following changes are OK.
   
   
![image](https://user-images.githubusercontent.com/20113411/77814247-3742bc00-70ea-11ea-860c-8405b8b88372.png)
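   As a minimal illustration of the overload issue (not necessarily the exact HUDI-742 patch): 
`Math.floorMod(long, int)` was only added in JDK 9, so a class compiled against JDK 9+ links to 
the descriptor `(JI)I`, which a Java 8 runtime cannot resolve. Keeping both operands `long` 
(descriptor `(JJ)J`) works on Java 8 as well:

```java
public class FloorModCompat {
  public static void main(String[] args) {
    long hashOfKey = 0x9E3779B97F4A7C15L;
    int numPartitions = 44;

    // Compiled with JDK 9+ (this line needs JDK 9+ to compile), the call resolves to
    // Math.floorMod(long, int) -> descriptor (JI)I, which does not exist on a Java 8
    // runtime and fails there with NoSuchMethodError.
    int partition = Math.floorMod(hashOfKey, numPartitions);

    // Widening the divisor keeps the call on Math.floorMod(long, long) -> (JJ)J,
    // which exists on Java 8 too; with a positive divisor the result still fits in an int.
    int portable = (int) Math.floorMod(hashOfKey, (long) numPartitions);

    System.out.println(partition + " " + portable);
  }
}
```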
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread lamber-ken (Jira)
lamber-ken created HUDI-742:
---

 Summary: Fix java.lang.NoSuchMethodError: 
java.lang.Math.floorMod(JI)I
 Key: HUDI-742
 URL: https://issues.apache.org/jira/browse/HUDI-742
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Spark Integration
Reporter: lamber-ken


*ISSUE* : https://github.com/apache/incubator-hudi/issues/1455

{code:java}
at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
... 49 elided
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
at 
org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:966)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2262)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2211)
at 

[jira] [Updated] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-742:

Status: Open  (was: New)

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Priority: Major
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:966)
> at 
> 

[GitHub] [incubator-hudi] EdwinGuo edited a comment on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo edited a comment on issue #1455: [SUPPORT] Hudi upsert run into 
exception:  java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605388998
 
 
   So mine is (JI)I, what is your java build version?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605388928
 
 
   @lamber-ken  Yes, I built the jar myself. Below is my Java version:
   
![image](https://user-images.githubusercontent.com/8300535/77814065-12b3f900-7084-11ea-9505-f7cef8316125.png)
   and:
   
![image](https://user-images.githubusercontent.com/8300535/77814068-1d6e8e00-7084-11ea-8d9b-239cb0b581b2.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605388998
 
 
   So mine is (JI)I


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #230

2020-03-27 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.32 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] EdwinGuo edited a comment on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo edited a comment on issue #1455: [SUPPORT] Hudi upsert run into 
exception:  java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605388928
 
 
   @lamber-ken  Yes, I built the jar myself. Below is my Java version:
   
![image](https://user-images.githubusercontent.com/8300535/77814065-12b3f900-7084-11ea-9505-f7cef8316125.png)
   and:
   
![image](https://user-images.githubusercontent.com/8300535/77814074-2eb79a80-7084-11ea-9a46-954da6816ac4.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605386569
 
 
   I checked `0.5.2-incubating`
   
![image](https://user-images.githubusercontent.com/20113411/77813629-057b2680-70e5-11ea-83f2-44c90031ad0a.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] PhatakN1 commented on issue #1458: Issue with running compaction on a MOR dataset with org.apache.hudi.payload.AWSDmsAvroPayload

2020-03-27 Thread GitBox
PhatakN1 commented on issue #1458: Issue with running compaction on a MOR 
dataset with org.apache.hudi.payload.AWSDmsAvroPayload
URL: https://github.com/apache/incubator-hudi/issues/1458#issuecomment-605385347
 
 
   I am using 0.5.1


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken edited a comment on issue #1455: [SUPPORT] Hudi upsert run into 
exception:  java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605368991
 
 
   hi @EdwinGuo, the log level is `WARN` in spark-shell by default; you can run 
`sc.setLogLevel("ERROR")`
   
   BTW, was the hudi jar built by yourself? If so, please share the result of 
   `javap -v BucketizedBloomCheckPartitioner.class | grep Math`
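   As a complementary check of the runtime side, a small sketch that verifies whether the 
running JVM provides the `(JI)I` overload at all (it was only added in JDK 9):

```java
import java.lang.reflect.Method;

public class CheckFloorModOverload {
  public static void main(String[] args) {
    try {
      // Present on JDK 9+ only; its descriptor is (JI)I.
      Method m = Math.class.getMethod("floorMod", long.class, int.class);
      System.out.println("Runtime has floorMod(long, int): " + m);
    } catch (NoSuchMethodException e) {
      // Java 8 runtimes only offer floorMod(int, int) and floorMod(long, long),
      // so a call site compiled against (JI)I would fail with NoSuchMethodError.
      System.out.println("Runtime has no floorMod(long, int)");
    }
  }
}
```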


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster of cleans run command

2020-03-27 Thread GitBox
hddong commented on a change in pull request #1452: [HUDI-740]Fix can not 
specify the sparkMaster of cleans run command
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r399609420
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
 ##
 @@ -62,7 +63,9 @@ public static void main(String[] args) throws Exception {
 
 SparkCommand cmd = SparkCommand.valueOf(command);
 
-JavaSparkContext jsc = SparkUtil.initJavaSparkConf("hoodie-cli-" + 
command);
+JavaSparkContext jsc = cmd == SparkCommand.CLEAN
 
 Review comment:
   @pratyakshsharma only the CLEAN command can specify the sparkMaster now. 
Otherwise, `sparkMaster` is not contained in the `args` of other commands.
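   A minimal sketch of the conditional initialization being discussed (plain Spark API only; 
the helper name and fallback master are illustrative, not the actual SparkUtil/SparkMain code):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CleanCommandContextSketch {

  // Only the CLEAN command carries an explicit spark master in its arguments,
  // so only in that case is the master overridden; otherwise keep a default.
  static JavaSparkContext initContext(String command, String sparkMasterOrNull) {
    SparkConf conf = new SparkConf().setAppName("hoodie-cli-" + command);
    if ("CLEAN".equals(command) && sparkMasterOrNull != null && !sparkMasterOrNull.isEmpty()) {
      conf.setMaster(sparkMasterOrNull);      // master supplied via the CLEAN command's args
    } else if (!conf.contains("spark.master")) {
      conf.setMaster("local[2]");             // illustrative fallback when nothing is configured
    }
    return new JavaSparkContext(conf);
  }

  public static void main(String[] args) {
    try (JavaSparkContext jsc = initContext("CLEAN", "local[*]")) {
      System.out.println("spark.master = " + jsc.getConf().get("spark.master"));
    }
  }
}
```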


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-723:

Status: Open  (was: New)

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if SchemaProvider has a valid target schema. If one 
> wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-723:
---

Assignee: lamber-ken

> SqlTransformer's schema sometimes is not registered. 
> -
>
> Key: HUDI-723
> URL: https://issues.apache.org/jira/browse/HUDI-723
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> If the schema is inferred from RowBasedSchemaProvider when the SQL transformer 
> is used, it also needs to be registered. 
>  
> The current way only works if SchemaProvider has a valid target schema. If one 
> wants to use the schema from the SQL transformation, the result of 
> RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
> {code:java}
> private void setupWriteClient(SchemaProvider schemaProvider) {
>   LOG.info("Setting up Hoodie Write Client");
>   registerAvroSchemas(schemaProvider);
>   HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
>   writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
>   onInitializingHoodieWriteClient.apply(writeClient);
> }
> {code}
> The existing method will not work as it is checking for:
> {code:java}
> if ((null != schemaProvider) && (null == writeClient)) {
> {code}
> and writeClient is already configured. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1458: Issue with running compaction on a MOR dataset with org.apache.hudi.payload.AWSDmsAvroPayload

2020-03-27 Thread GitBox
lamber-ken commented on issue #1458: Issue with running compaction on a MOR 
dataset with org.apache.hudi.payload.AWSDmsAvroPayload
URL: https://github.com/apache/incubator-hudi/issues/1458#issuecomment-605376759
 
 
   hi @PhatakN1, thanks for reporting this issue, which hudi version you are 
using?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-418) Bootstrap Index - Implementation

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-418:

Labels: pull-request-available  (was: )

> Bootstrap Index - Implementation
> 
>
> Key: HUDI-418
> URL: https://issues.apache.org/jira/browse/HUDI-418
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> An implementation for 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
>  is present in 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-common/src/main/java/org/apache/hudi/common/consolidated/CompositeMapFile.java]
>  
> We need to make it solid with unit-tests and cleanup. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar opened a new pull request #1459: [WIP] [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-03-27 Thread GitBox
bvaradar opened a new pull request #1459: [WIP] [HUDI-418] [HUDI-421] Bootstrap 
Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459
 
 
   ## What is the purpose of the pull request
   
   This is part of the code changes needed to support RFC-12 for efficient 
bootstrap of legacy tables. This PR contains 2 commits:
   
 * HUDI-418 - Bootstrap Index using HFile, with APIs to create and look up 
entries. Note that this implementation is different from what was initially 
proposed in RFC-12. It uses the existing HFile layout, as this layout works 
well for our use case and the implementation is battle-tested and stable. There 
are 2 HFiles maintained: one storing all bootstrap mappings at partition 
level (for all file-system view calls) and another at per file-id level (for 
compaction lookups)
  * HUDI-421 - Core changes in File System View to integrate with the Bootstrap 
Index. With this change, the upper layers that use the FileSystem View can easily 
identify and handle external base files (see the conceptual sketch below).
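   A conceptual sketch of the two lookup paths, with plain in-memory maps standing in for 
the two HFiles (all names below are illustrative, not the classes introduced by this PR):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BootstrapIndexSketch {

  // One bootstrap mapping: a Hudi file group backed by an external (source) base file.
  static class SourceFileMapping {
    final String partitionPath;
    final String fileId;
    final String sourceFilePath;

    SourceFileMapping(String partitionPath, String fileId, String sourceFilePath) {
      this.partitionPath = partitionPath;
      this.fileId = fileId;
      this.sourceFilePath = sourceFilePath;
    }

    @Override
    public String toString() {
      return partitionPath + "/" + fileId + " -> " + sourceFilePath;
    }
  }

  // Stand-in for the partition-level HFile (serves file-system view calls).
  private final Map<String, List<SourceFileMapping>> byPartition = new HashMap<>();
  // Stand-in for the per file-id HFile (serves compaction lookups).
  private final Map<String, SourceFileMapping> byFileId = new HashMap<>();

  void add(SourceFileMapping m) {
    byPartition.computeIfAbsent(m.partitionPath, k -> new ArrayList<>()).add(m);
    byFileId.put(m.fileId, m);
  }

  List<SourceFileMapping> lookupPartition(String partitionPath) {
    return byPartition.getOrDefault(partitionPath, new ArrayList<>());
  }

  SourceFileMapping lookupFileId(String fileId) {
    return byFileId.get(fileId);
  }

  public static void main(String[] args) {
    BootstrapIndexSketch idx = new BootstrapIndexSketch();
    idx.add(new SourceFileMapping("2020/03/15", "file-id-1", "s3://legacy-table/part-0001.parquet"));
    System.out.println(idx.lookupPartition("2020/03/15"));
    System.out.println(idx.lookupFileId("file-id-1"));
  }
}
```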
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-718:

Status: Open  (was: New)

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more
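A plain-Avro sketch reproducing the same class of mismatch outside parquet-avro (the schema and 
field name below are illustrative; in the report the cast happens inside AvroWriteSupport when a 
field declared as Avro fixed arrives as a Utf8 string):

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.util.Utf8;

public class FixedVsUtf8Repro {
  public static void main(String[] args) throws Exception {
    // A record with a single "fixed" field, as a decimal-backed column would be.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
            + "{\"name\":\"amount\",\"type\":"
            + "{\"type\":\"fixed\",\"name\":\"amount_fixed\",\"size\":16}}]}");

    GenericData.Record rec = new GenericData.Record(schema);
    // Old data read back as a string ends up here as Utf8 instead of GenericFixed.
    rec.put("amount", new Utf8("123.45"));

    GenericDatumWriter<GenericData.Record> writer = new GenericDatumWriter<>(schema);
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(new ByteArrayOutputStream(), null);
    // Throws: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be
    // cast to org.apache.avro.generic.GenericFixed
    writer.write(rec, enc);
  }
}
{code}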



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-718:
---

Assignee: lamber-ken

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069160#comment-17069160
 ] 

lamber-ken commented on HUDI-722:
-

Sure. Hi [~afilipchik], if you have time, can you share more context about it, 
e.g. demo code? Thanks.

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-722:
---

Assignee: lamber-ken

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-27 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-722:

Status: Open  (was: New)

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605368991
 
 
   hi @EdwinGuo, the log level is `WARN` in spark-shell by default; you can run 
`sc.setLogLevel("ERROR")`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17069083#comment-17069083
 ] 

sivabalan narayanan commented on HUDI-686:
--

Yeah, I was about to respond to lamber-ken that the intention of this index is to 
speed up simpler cases and it is definitely not intended to be one-size-fits-all. At 
higher scale, this is probably not the right index to use.

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399542571
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399542370
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399542132
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399541806
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399542054
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
 
 Review comment:
   HoodieDeltaStreamer.Config can be used in all places. I have made the changes. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399541700
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
+TableExecutionObject executionObject;
+for (String table : tablesToBeIngested) {
+  String[] 

[GitHub] [incubator-hudi] melkimohamed commented on issue #1439: Hudi class loading problem

2020-03-27 Thread GitBox
melkimohamed commented on issue #1439: Hudi class loading problem
URL: https://github.com/apache/incubator-hudi/issues/1439#issuecomment-605300589
 
 
   @lamber-ken , @bvaradar 
   Thank you for your answers. I solved the problem by changing the deployment 
method: the variable hive.reloadable.aux.jars.path does not support MapReduce 
queries such as count, max, etc.
   It works with:
   export HIVE_AUX_JARS_PATH=/usr/lib/hudi/hudi-hive-bundle-0.5.0-incubating.jar


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] melkimohamed closed issue #1439: Hudi class loading problem

2020-03-27 Thread GitBox
melkimohamed closed issue #1439: Hudi class loading problem
URL: https://github.com/apache/incubator-hudi/issues/1439
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399519630
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java
 ##
 @@ -0,0 +1,259 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.util.FSUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.UtilHelpers;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.Config;
+import org.apache.hudi.utilities.schema.SchemaRegistryProvider;
+
+import com.beust.jcommander.JCommander;
+import com.google.common.base.Strings;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+/**
+ * Wrapper over HoodieDeltaStreamer.java class.
+ * Helps with ingesting incremental data into hoodie datasets for multiple 
tables.
+ * Currently supports only COPY_ON_WRITE storage type.
+ */
+public class HoodieMultiTableDeltaStreamer {
+
+  private static Logger logger = 
LogManager.getLogger(HoodieMultiTableDeltaStreamer.class);
+
+  private List<TableExecutionObject> tableExecutionObjects;
+  private transient JavaSparkContext jssc;
+  private Set<String> successTables;
+  private Set<String> failedTables;
+
+  public HoodieMultiTableDeltaStreamer(String[] args, JavaSparkContext jssc) 
throws IOException {
+this.tableExecutionObjects = new ArrayList<>();
+this.successTables = new HashSet<>();
+this.failedTables = new HashSet<>();
+this.jssc = jssc;
+String commonPropsFile = getCommonPropsFileName(args);
+String configFolder = getConfigFolder(args);
+FileSystem fs = FSUtils.getFs(commonPropsFile, jssc.hadoopConfiguration());
+configFolder = configFolder.charAt(configFolder.length() - 1) == '/' ? 
configFolder.substring(0, configFolder.length() - 1) : configFolder;
+checkIfPropsFileAndConfigFolderExist(commonPropsFile, configFolder, fs);
+TypedProperties properties = UtilHelpers.readConfig(fs, new 
Path(commonPropsFile), new ArrayList<>()).getConfig();
+//get the tables to be ingested and their corresponding config files from 
this properties instance
+populateTableExecutionObjectList(properties, configFolder, fs, args);
+  }
+
+  private void checkIfPropsFileAndConfigFolderExist(String commonPropsFile, 
String configFolder, FileSystem fs) throws IOException {
+if (!fs.exists(new Path(commonPropsFile))) {
+  throw new IllegalArgumentException("Please provide valid common config 
file path!");
+}
+
+if (!fs.exists(new Path(configFolder))) {
+  fs.mkdirs(new Path(configFolder));
+}
+  }
+
+  private void checkIfTableConfigFileExists(String configFolder, FileSystem 
fs, String configFilePath) throws IOException {
+if (!fs.exists(new Path(configFilePath)) || !fs.isFile(new 
Path(configFilePath))) {
+  throw new IllegalArgumentException("Please provide valid table config 
file path!");
+}
+
+Path path = new Path(configFilePath);
+Path filePathInConfigFolder = new Path(configFolder, path.getName());
+if (!fs.exists(filePathInConfigFolder)) {
+  FileUtil.copy(fs, path, fs, filePathInConfigFolder, false, fs.getConf());
+}
+  }
+
+  //commonProps are passed as parameter which contain table to config file 
mapping
+  private void populateTableExecutionObjectList(TypedProperties properties, 
String configFolder, FileSystem fs, String[] args) throws IOException {
+List<String> tablesToBeIngested = getTablesToBeIngested(properties);
 
 Review comment:
   Done


This is an 

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399518706
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -155,12 +166,20 @@ public Operation convert(String value) throws 
ParameterException {
 required = true)
 public String targetBasePath;
 
+@Parameter(names = {"--base-path-prefix"},
 
 Review comment:
   Yes, this is needed for HoodieMultiTableDeltaStreamer. Now I am defining a 
separate config class for MultiTableStreamer. All unused configs can be removed 
from HoodieDeltaStreamer.Config class. 
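   
   As a rough illustration of that direction, a dedicated JCommander config class 
could look like the sketch below. This is only a hypothetical sketch, not the PR's 
actual code; the class and field names are made up for illustration.
   
   ```java
   import com.beust.jcommander.JCommander;
   import com.beust.jcommander.Parameter;
   
   // Hypothetical sketch: a separate config object for the multi-table streamer,
   // so options like --base-path-prefix no longer live in HoodieDeltaStreamer.Config.
   public class MultiTableStreamerConfigSketch {
   
     public static class Config {
       @Parameter(names = {"--base-path-prefix"},
           description = "Base path prefix under which each table's dataset is written",
           required = true)
       public String basePathPrefix;
   
       @Parameter(names = {"--props"},
           description = "Path to the common properties file listing the tables to ingest")
       public String propsFilePath;
     }
   
     public static void main(String[] args) {
       Config cfg = new Config();
       // JCommander binds the command-line flags onto the annotated fields
       JCommander cmd = new JCommander(cfg);
       cmd.parse(args);
       System.out.println("base path prefix: " + cfg.basePathPrefix);
     }
   }
   ```
   
   Keeping these flags on a separate object would let them be validated independently 
of the per-table HoodieDeltaStreamer.Config instances.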


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on issue #1150: [HUDI-288]: Add support for ingesting 
multiple kafka streams in a single DeltaStreamer deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#issuecomment-605297172
 
 
   @gdineshbabu88 Yeah, I will update the documentation for this tool. That is going 
to be my next task after taking care of the code review comments. You can expect 
this in the 0.6.0 release (which is going to be the next one). 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399504930
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/TableExecutionObject.java
 ##
 @@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.deltastreamer;
+
+import org.apache.hudi.common.util.TypedProperties;
+
+import java.util.Objects;
+
+/**
+ * Wrapper over TableConfig objects.
+ * Useful for incrementally syncing multiple tables one by one via 
HoodieMultiTableDeltaStreamer.java class.
+ */
+public class TableExecutionObject {
 
 Review comment:
   Done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399503228
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -342,6 +361,32 @@ public static void main(String[] args) throws Exception {
  */
 private transient DeltaSync deltaSync;
 
+public DeltaSyncService(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf,
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399500147
 
 

 ##
 File path: 
hudi-hive-sync/src/test/java/org/apache/hudi/hive/util/HiveTestService.java
 ##
 @@ -121,6 +121,20 @@ public HiveServer2 start() throws IOException {
 return hiveServer;
   }
 
+  public void stop() {
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] PhatakN1 opened a new issue #1458: Issue with running compaction on a MOR dataset with org.apache.hudi.payload.AWSDmsAvroPayload

2020-03-27 Thread GitBox
PhatakN1 opened a new issue #1458: Issue with running compaction on a MOR 
dataset with org.apache.hudi.payload.AWSDmsAvroPayload
URL: https://github.com/apache/incubator-hudi/issues/1458
 
 
   Hi,
   
   I have integrated HoodieDeltaStreamer with DMS using --payload-class 
org.apache.hudi.payload.AWSDmsAvroPayload in the DeltaStreamer command. When I 
try to use the hudi-cli to run compaction on the data, it fails with the 
following error.
   
   20/03/27 19:17:23 ERROR log.AbstractHoodieLogRecordScanner: Got exception 
when reading log file
   org.apache.hudi.exception.HoodieException: Unable to load class
at 
org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:47)
at 
org.apache.hudi.common.util.ReflectionUtils.loadPayload(ReflectionUtils.java:67)
at 
org.apache.hudi.common.util.SpillableMapUtils.convertToHoodieRecordPayload(SpillableMapUtils.java:116)
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processAvroDataBlock(AbstractHoodieLogRecordScanner.java:276)
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:305)
at 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:238)
at 
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:81)
at 
org.apache.hudi.io.compact.HoodieMergeOnReadTableCompactor.compact(HoodieMergeOnReadTableCompactor.java:126)
at 
org.apache.hudi.io.compact.HoodieMergeOnReadTableCompactor.lambda$compact$e841120d$1(HoodieMergeOnReadTableCompactor.java:98)
at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at 
org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
at 
org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
at 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at 
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at 
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.ClassNotFoundException: 
org.apache.hudi.payload.AWSDmsAvroPayload
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at 
org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:44)
... 32 more
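
   For context, the scanner resolves the payload class by name at runtime (see the 
ReflectionUtils frames in the trace), so the configured payload class has to be on 
the classpath of the process that actually runs the compaction. A minimal, 
hypothetical sketch of that kind of reflective lookup (not Hudi's actual code):
   
   ```java
   // Hypothetical sketch: the payload class name is resolved reflectively at runtime,
   // so a jar missing from the compaction job's classpath surfaces as the
   // ClassNotFoundException shown above.
   public class PayloadClasspathCheck {
     public static void main(String[] args) {
       String clazzName = "org.apache.hudi.payload.AWSDmsAvroPayload";
       try {
         Class<?> clazz = Class.forName(clazzName);
         System.out.println("Found " + clazz.getName() + " on the classpath");
       } catch (ClassNotFoundException e) {
         System.out.println(clazzName + " is NOT on this JVM's classpath");
       }
     }
   }
   ```
   
   Running such a check with the same jars/classpath used when submitting the 
compaction can help confirm whether the bundle carrying the payload class is 
actually being shipped to the executors.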
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399494359
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -286,24 +350,46 @@ public static void createSavepointFile(String basePath, 
String commitTime, Confi
* @return  List of {@link HoodieRecord}s
*/
   public List<HoodieRecord> generateInserts(String commitTime, Integer n, 
boolean isFlattened) {
-return generateInsertsStream(commitTime, n, 
isFlattened).collect(Collectors.toList());
+return generateInsertsStream(commitTime, n, isFlattened, 
TRIP_EXAMPLE_SCHEMA).collect(Collectors.toList());
+  }
+
+  /**
+   * Generates new inserts, uniformly across the partition paths above. It 
also updates the list of existing keys.
+   */
+  public Stream<HoodieRecord> generateInsertsStream(String commitTime, Integer 
n, String schemaStr) {
+int currSize = getNumExistingKeys(schemaStr);
+
+return IntStream.range(0, n).boxed().map(i -> {
+  String partitionPath = 
partitionPaths[rand.nextInt(partitionPaths.length)];
+  HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), 
partitionPath);
+  KeyPartition kp = new KeyPartition();
+  kp.key = key;
+  kp.partitionPath = partitionPath;
+  populateKeysBySchema(schemaStr, currSize + i, kp);
+  incrementNumExistingKeysBySchema(schemaStr);
+  try {
+return new HoodieRecord(key, generateRandomValueAsPerSchema(schemaStr, 
key, commitTime));
+  } catch (IOException e) {
+throw new HoodieIOException(e.getMessage(), e);
+  }
+});
   }
 
   /**
* Generates new inserts, uniformly across the partition paths above. It 
also updates the list of existing keys.
*/
   public Stream<HoodieRecord> generateInsertsStream(
 
 Review comment:
   Done. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy default values of fields if not present when rewriting incoming record with new schema

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1427: [HUDI-727]: Copy 
default values of fields if not present when rewriting incoming record with new 
schema
URL: https://github.com/apache/incubator-hudi/pull/1427#discussion_r399490062
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/TestHoodieAvroUtils.java
 ##
 @@ -57,4 +60,16 @@ public void testPropsPresent() {
 }
 Assert.assertTrue("column pii_col doesn't show up", piiPresent);
   }
+
+  @Test
+  public void testDefaultValue() {
+GenericRecord rec = new GenericData.Record(new 
Schema.Parser().parse(EXAMPLE_SCHEMA));
+rec.put("_row_key", "key1");
+rec.put("non_pii_col", "val1");
+rec.put("pii_col", "val2");
+rec.put("timestamp", 3.5);
 
 Review comment:
   My bad. You raised a valid point there. I have made a few changes and added more 
test cases to cover the schema evolution scenario as well. Please take a pass, 
@umehrot2 
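   
   To make the intent concrete, here is a small, hypothetical sketch (not the PR's 
implementation in HoodieAvroUtils) of rewriting a record against an evolved schema 
while letting fields missing from the incoming record fall back to their declared 
defaults:
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.avro.generic.GenericRecordBuilder;
   
   // Hypothetical sketch: copy values that exist on the incoming record and let
   // GenericRecordBuilder supply the schema defaults for the rest.
   public class RewriteWithDefaultsSketch {
   
     public static GenericRecord rewrite(GenericRecord oldRecord, Schema newSchema) {
       GenericRecordBuilder builder = new GenericRecordBuilder(newSchema);
       for (Schema.Field field : newSchema.getFields()) {
         boolean presentInOld = oldRecord.getSchema().getField(field.name()) != null;
         if (presentInOld && oldRecord.get(field.name()) != null) {
           builder.set(field, oldRecord.get(field.name()));
         }
         // Fields left unset pick up the default declared in newSchema when build()
         // runs; build() throws if a new field has neither a value nor a default.
       }
       return builder.build();
     }
   }
   ```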


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605213668
 
 
   @lamber-ken  I have recompiled the jar. A quick question: how do I turn on 
logging for Hudi in spark-shell? I don't see the custom logging in the 
spark-shell console.  Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason opened a new pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-03-27 Thread GitBox
prashantwason opened a new pull request #1457: [HUDI-741] Added checks to 
validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457
 
 
   ## What is the purpose of the pull request
   
   HUDI specific validation of schema evolution should ensure that a newer 
schema can be used for the dataset by checking that the data written using the 
old schema can be read using the new schema.
   
   Please see [HUDI-741 ](https://issues.apache.org/jira/browse/HUDI-741) for 
my detailed analysis of why this is required and how this is implemented.
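   
   For reference, the plain Avro reader/writer compatibility check that this kind of 
validation builds on can be exercised as in the sketch below. This uses stock 
org.apache.avro.SchemaCompatibility, not the HUDI-specific class added by this PR.
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaCompatibility;
   import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
   
   // Sketch of the underlying Avro check: data written with the old (writer) schema
   // must be readable with the new (reader) schema for the evolution to be accepted.
   public class SchemaEvolutionCheckSketch {
   
     static boolean canEvolve(Schema oldWriterSchema, Schema newReaderSchema) {
       return SchemaCompatibility.checkReaderWriterCompatibility(newReaderSchema, oldWriterSchema)
           .getType() == SchemaCompatibilityType.COMPATIBLE;
     }
   
     public static void main(String[] args) {
       Schema oldSchema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"r\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
       // Adding a field with a default keeps old data readable under the new schema
       Schema newSchema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"r\",\"fields\":["
               + "{\"name\":\"id\",\"type\":\"string\"},"
               + "{\"name\":\"ts\",\"type\":\"long\",\"default\":0}]}");
       System.out.println(canEvolve(oldSchema, newSchema)); // expected: true
     }
   }
   ```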
   
   ## Brief change log
   
   Code changes:
   
   1. Added a new config in HoodieWriteConfig to enable schema validation check 
(disabled by default)
   2. Moved code that reads schema from base/log files into hudi-common from 
hudi-hive-sync
   3. Added a class org.apache.hudi.common.util.SchemaCompatibility which 
implements checks for schema evolution. This is based on 
org.apache.avro.SchemaCompatibility but performs HUDI specific checks.
   
   Testing changes:
   
   4. Extended HoodieTestDataGenerator to generate records using a custom Schema
   5. Extended TestHoodieClientBase to add insertBatch API which allows 
inserting a new batch of unique records into a HUDI table
   6. Added a unit test to verify schema evolution for both COW and MOR tables.
   
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
   - Added unit test TestTableSchemaEvolution
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-741:

Labels: pull-request-available  (was: )

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason opened a new pull request #1456: [MINOR] Updated HoodieMergeOnReadTestUtils for future testing requirements.

2020-03-27 Thread GitBox
prashantwason opened a new pull request #1456: [MINOR] Updated 
HoodieMergeOnReadTestUtils for future testing requirements.
URL: https://github.com/apache/incubator-hudi/pull/1456
 
 
   ## What is the purpose of the pull request
   
   Minor changes to enhance HoodieMergeOnReadTestUtils for future use (for 
HUDI-741)
   
   ## Brief change log
   
   1. getRecordsUsingInputFormat() can take a custom Configuration which can be 
used to specify HUDI table properties (e.g. .consume.mode or 
.consume.start.timestamp)
   2. Fixed the method to return an empty List rather than raise an Exception 
if no records are found
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1450: [MINOR] Adding .codecov.yml to set exclusions for code coverage reports.

2020-03-27 Thread GitBox
prashantwason commented on a change in pull request #1450: [MINOR] Adding 
.codecov.yml to set exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1450#discussion_r399467520
 
 

 ##
 File path: .codecov.yml
 ##
 @@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# For more configuration details:
+# https://docs.codecov.io/docs/codecov-yaml
+
+# Check if this file is valid by running in bash:
+# curl -X POST --data-binary @.codecov.yml https://codecov.io/validate
+
+# Ignoring Paths
+# --
+# which folders/files to ignore
+ignore:
+  - "hudi-common/src/main/java/org/apache/hudi/avro/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java"
+  - "hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/HoodieJsonPayload"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactionAdminTool.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieWithTimelineServer.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/UpgradePayloadFromUberToApache.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/perf/TimelineServerPerf.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HiveIncrementalPuller.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/adhoc/UpgradePayloadFromUberToApache.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java"
+  - 
"hudi-client/src/main/java/org/apache/hudi/metrics/MetricsGraphiteReporter.java"
+  - 
"hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java"
 
 Review comment:
   hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java
   
   This is just a stub extension class to aid migration from com.uber to 
org.apache. Has no logic.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1450: [MINOR] Adding .codecov.yml to set exclusions for code coverage reports.

2020-03-27 Thread GitBox
prashantwason commented on a change in pull request #1450: [MINOR] Adding 
.codecov.yml to set exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1450#discussion_r399467199
 
 

 ##
 File path: .codecov.yml
 ##
 @@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# For more configuration details:
+# https://docs.codecov.io/docs/codecov-yaml
+
+# Check if this file is valid by running in bash:
+# curl -X POST --data-binary @.codecov.yml https://codecov.io/validate
+
+# Ignoring Paths
+# --
+# which folders/files to ignore
+ignore:
+  - "hudi-common/src/main/java/org/apache/hudi/avro/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java"
+  - "hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/HoodieJsonPayload"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactionAdminTool.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieWithTimelineServer.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/UpgradePayloadFromUberToApache.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/perf/TimelineServerPerf.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HiveIncrementalPuller.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/adhoc/UpgradePayloadFromUberToApache.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java"
+  - 
"hudi-client/src/main/java/org/apache/hudi/metrics/MetricsGraphiteReporter.java"
+  - 
"hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java"
 
 Review comment:
   Reason: There is no easy way to unit test these for now, and this is not 
critical code in the data path.
   
   If you have ideas on how to unit test these, then let's file an issue and 
remove them from here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1450: [MINOR] Adding .codecov.yml to set exclusions for code coverage reports.

2020-03-27 Thread GitBox
prashantwason commented on issue #1450: [MINOR] Adding .codecov.yml to set 
exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1450#issuecomment-605203860
 
 
   Exclusions are for classes which most probably will never see any unit 
testing.
   
   There is positive value in having a correct and high code coverage number, so 
we can have greater confidence when upgrading to newer HUDI releases in 
production. (Current code coverage stands at 67%.)
   
   I also suggest we enforce that new check-ins will not be accepted if the 
coverage goes down.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605197774
 
 
   hi @vinothchandar, any thoughts? IMO, it is caused by the JVM runtime mechanism.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1432: [HUDI-716] Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-27 Thread GitBox
lamber-ken commented on a change in pull request #1432: [HUDI-716] Exception: 
Not an Avro data file when running HoodieCleanClient.runClean
URL: https://github.com/apache/incubator-hudi/pull/1432#discussion_r399455231
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieCleanClient.java
 ##
 @@ -85,7 +85,11 @@ public HoodieCleanMetadata clean(String startCleanTime) 
throws HoodieIOException
 // If there are inflight(failed) or previously requested clean operation, 
first perform them
 
table.getCleanTimeline().filterInflightsAndRequested().getInstants().forEach(hoodieInstant
 -> {
   LOG.info("There were previously unfinished cleaner operations. Finishing 
Instant=" + hoodieInstant);
-  runClean(table, hoodieInstant);
+  try {
+runClean(table, hoodieInstant);
+  } catch (Exception e) {
+LOG.warn("Failed to perform previous clean operation, instant: " + 
hoodieInstant, e);
 
 Review comment:
   Sorry for the delay, I have updated the PR. Any suggestions are welcome, thanks 
@bvaradar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605151836
 
 
   And, try the following change:
   
   ```
   final int idx = Math.floorMod((int) hashOfKey, candidatePartitions.size());
   ```
   
   
![image](https://user-images.githubusercontent.com/20113411/77784847-2cf6d280-7096-11ea-81b3-9b8e814389ed.png)
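   
   For anyone hitting the same error: `Math.floorMod(long, int)` with descriptor 
`(JI)I` only exists from JDK 9 onwards, so bytecode compiled against a newer JDK 
fails with NoSuchMethodError on a Java 8 runtime. The cast keeps resolution on the 
`(int, int)` overload that Java 8 already has. A small, self-contained sketch of 
the workaround (the helper name is mine, not the actual patch):
   
   ```java
   public class FloorModCompatSketch {
   
     // Hypothetical helper: casting the long hash to int keeps the call on
     // Math.floorMod(int, int), which exists on Java 8, instead of the
     // floorMod(long, int) overload that was only added in Java 9.
     static int bucketFor(long hashOfKey, int numBuckets) {
       return Math.floorMod((int) hashOfKey, numBuckets);
     }
   
     public static void main(String[] args) {
       // A negative hash still maps into [0, numBuckets)
       System.out.println(bucketFor(-7L, 3)); // prints 2
     }
   }
   ```
   
   Note that the narrowing cast changes the value for hashes outside the int range, 
so the final patch may pick a different way to stay on a Java 8 compatible overload.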


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on issue #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
garyli1019 commented on issue #1453: HUDI-644 kafka connect checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#issuecomment-605148461
 
 
   @pratyakshsharma Thanks for reviewing. I think the scenario you mentioned 
could be similar to our last discussion. The user can do some hacky stuff based 
on their use cases, and the `--checkpoint` option would allow them to do so. We can 
provide more tooling options to generate checkpoints for DFS, Kafka, JDBC, etc., 
and the user can then combine those tools to fit their scenario. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect 
checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399431388
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/checkpoint/TestCheckPointProvider.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hudi.common.HoodieCommonTestHarness;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+
+import static org.junit.Assert.assertEquals;
+
+public class TestCheckPointProvider extends HoodieCommonTestHarness {
+  private FileSystem fs = null;
+  private String topicPath = null;
+
+  @Before
+  public void init() {
+// Prepare directories
+initPath();
+topicPath = basePath + "/topic1";
+final Configuration hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
+fs = FSUtils.getFs(basePath, hadoopConf);
+new File(topicPath).mkdirs();
+  }
+
+  @Test
+  public void testKafkaConnectHdfsProvider() throws Exception {
+// create regular kafka connect hdfs dirs
+new File(topicPath + "/year=2016/month=05/day=01/").mkdirs();
+new File(topicPath + "/year=2016/month=05/day=02/").mkdirs();
+// kafka connect tmp folder
+new File(topicPath + "/TMP").mkdirs();
+// tmp file that being written
+new File(topicPath + "/TMP/" + "topic1+0+301+400.parquet").createNewFile();
+// regular parquet files
+new File(topicPath + "/year=2016/month=05/day=01/"
++ "topic1+0+100+200.parquet").createNewFile();
 
 Review comment:
   Yes, https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
   I can add this URL to the comment. Not sure if this is the preferred way to 
document other modules.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect 
checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399429038
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/checkpoint/TestCheckPointProvider.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hudi.common.HoodieCommonTestHarness;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+
+import static org.junit.Assert.assertEquals;
+
+public class TestCheckPointProvider extends HoodieCommonTestHarness {
+  private FileSystem fs = null;
+  private String topicPath = null;
+
+  @Before
+  public void init() {
+    // Prepare directories
+    initPath();
+    topicPath = basePath + "/topic1";
+    final Configuration hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
+    fs = FSUtils.getFs(basePath, hadoopConf);
+    new File(topicPath).mkdirs();
+  }
+
+  @Test
+  public void testKafkaConnectHdfsProvider() throws Exception {
+    // create regular kafka connect hdfs dirs
+    new File(topicPath + "/year=2016/month=05/day=01/").mkdirs();
+    new File(topicPath + "/year=2016/month=05/day=02/").mkdirs();
+    // kafka connect tmp folder
+    new File(topicPath + "/TMP").mkdirs();
+    // tmp file that being written
+    new File(topicPath + "/TMP/" + "topic1+0+301+400.parquet").createNewFile();
+    // regular parquet files
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "topic1+0+100+200.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "topic1+1+100+200.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=02/"
+        + "topic1+0+201+300.parquet").createNewFile();
+    // noise parquet file
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "random_snappy_1.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=02/"
+        + "random_snappy_2.parquet").createNewFile();
+    CheckPointProvider c = new KafkaConnectHdfsProvider(new Path(topicPath), fs);
+    assertEquals(c.getCheckpoint(), "topic1,0:300,1:200");
 
 Review comment:
   Kafka Connect keeps writing files into a tmp folder. Once the cut-off time is reached, it moves the files from tmp into the desired partition, so we need to ignore the files under tmp. A small illustrative filter sketch follows below.
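   For illustration, a filter in that spirit might look like the sketch below. The temp directory is named "TMP" only because the test above uses that name; the real directory depends on the Kafka Connect HDFS sink configuration, and this is not the exact code in this PR:
   
   ```
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.fs.PathFilter;

   // Illustrative only: reject any file that sits under the connector's temp directory,
   // so half-written files never contribute to the computed checkpoint.
   public class IgnoreTmpPathFilter implements PathFilter {
     private static final String TMP_DIR_NAME = "TMP"; // assumption, taken from the test fixture

     @Override
     public boolean accept(Path path) {
       // Walk up the ancestors and reject the path if any of them is the temp folder.
       for (Path p = path; p != null; p = p.getParent()) {
         if (TMP_DIR_NAME.equals(p.getName())) {
           return false;
         }
       }
       return true;
     }
   }
   ```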


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
garyli1019 commented on a change in pull request #1453: HUDI-644 kafka connect 
checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399427876
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/checkpoint/KafkaConnectHdfsProvider.java
 ##
 @@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PathFilter;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Generate checkpoint from Kafka-Connect-HDFS.
+ */
+public class KafkaConnectHdfsProvider implements CheckPointProvider {
 
 Review comment:
   A few different options IMO:
   
   - pass the output of this tool as `--checkpoint` to DeltaStreamer in the first run
   - integrate this tool with HoodieWriteClient, the Spark writer, or HdfsImporter as an option controlled by config, so users can save the checkpoint even when they are not using DeltaStreamer and can switch over seamlessly next time (a rough sketch of this follows after the list)
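   A rough sketch of the second option, purely for discussion: it assumes the `HoodieWriteClient#commit` overload that accepts extra metadata, and it assumes "deltastreamer.checkpoint.key" is the commit-metadata key DeltaStreamer resumes from; neither is part of this PR and the names may differ on master:
   
   ```
   import java.util.HashMap;
   import java.util.Map;

   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hudi.WriteStatus;
   import org.apache.hudi.client.HoodieWriteClient;
   import org.apache.hudi.common.util.Option;
   import org.apache.hudi.utilities.sources.checkpoint.KafkaConnectHdfsProvider;
   import org.apache.spark.api.java.JavaRDD;

   public class CheckpointSeedingSketch {
     // Commit through the write client with the Kafka-Connect-derived checkpoint attached,
     // so a later DeltaStreamer run can pick it up and switch over seamlessly.
     public static void commitWithCheckpoint(HoodieWriteClient<?> client, String instantTime,
         JavaRDD<WriteStatus> writeStatuses, Path topicPath, FileSystem fs) throws Exception {
       Map<String, String> extra = new HashMap<>();
       extra.put("deltastreamer.checkpoint.key", // assumed key, see note above
           new KafkaConnectHdfsProvider(topicPath, fs).getCheckpoint());
       client.commit(instantTime, writeStatuses, Option.of(extra));
     }
   }
   ```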


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1420: Broken Maven dependencies.

2020-03-27 Thread GitBox
bvaradar commented on issue #1420: Broken Maven dependencies.
URL: https://github.com/apache/incubator-hudi/issues/1420#issuecomment-605122681
 
 
   Closing this due to inactivity.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar closed issue #1420: Broken Maven dependencies.

2020-03-27 Thread GitBox
bvaradar closed issue #1420: Broken Maven dependencies.
URL: https://github.com/apache/incubator-hudi/issues/1420
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar commented on issue #1439: Hudi class loading problem

2020-03-27 Thread GitBox
bvaradar commented on issue #1439: Hudi class loading problem
URL: https://github.com/apache/incubator-hudi/issues/1439#issuecomment-605121815
 
 
   @melkimohamed  : Let us know if you are still encountering issues.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-27 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r399408154
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.func.LazyIterableIterator;
+import org.apache.hudi.table.HoodieTable;
+
+import com.clearspring.analytics.util.Lists;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.PairFunction;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields(record key and partition path) 
from parquet and joins with incoming records to find the
+ * tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.persist(config.getSimpleIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Step 2: Load all involved files as  pairs
+List affectedPartitionPathList = 
partitionRecordKeyPairRDD.map(tuple -> tuple._1).distinct().collect();
+JavaRDD> fileInfoList = jsc.parallelize(
+loadAllFilesForPartitions(affectedPartitionPathList, jsc, 
hoodieTable)).sortBy(Tuple2::_1, true, config.getSimpleIndexParallelism());
+
+// Step 3: Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD = 
findMatchingFilesForRecordKeys(fileInfoList, partitionRecordKeyPairRDD, 
hoodieTable);
+
+// Step 4: Tag the incoming records, as inserts or updates, by joining 
with existing record keys
+JavaRDD> taggedRecordRDD = 
tagLocationBacktoRecords(keyFilenamePairRDD, recordRDD);
+
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.unpersist(); // unpersist the input Record RDD
+}
+return taggedRecordRDD;
+  }
+
+  @Override
+  public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
+ HoodieTable hoodieTable) {
+return writeStatusRDD;
+  }
+
+  /**
+   * Lookup the location for each record key and return the 
pair for all record keys already
+   * present and drop the record keys if not present.
+   */
+  @Override
+  protected JavaPairRDD lookupIndex(
+  JavaPairRDD partitionRecordKeyPairRDD, final 
JavaSparkContext jsc,
+  final HoodieTable hoodieTable) {
+// Obtain records per partition, in the incoming records
+Map recordsPerPartition = 
partitionRecordKeyPairRDD.countByKey();
+List affectedPartitionPathList = new 

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-27 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r399408154
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.func.LazyIterableIterator;
+import org.apache.hudi.table.HoodieTable;
+
+import com.clearspring.analytics.util.Lists;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.FlatMapFunction;
+import org.apache.spark.api.java.function.PairFunction;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields(record key and partition path) 
from parquet and joins with incoming records to find the
+ * tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.persist(config.getSimpleIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Step 2: Load all involved files as  pairs
+List affectedPartitionPathList = 
partitionRecordKeyPairRDD.map(tuple -> tuple._1).distinct().collect();
+JavaRDD> fileInfoList = jsc.parallelize(
+loadAllFilesForPartitions(affectedPartitionPathList, jsc, 
hoodieTable)).sortBy(Tuple2::_1, true, config.getSimpleIndexParallelism());
+
+// Step 3: Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD = 
findMatchingFilesForRecordKeys(fileInfoList, partitionRecordKeyPairRDD, 
hoodieTable);
+
+// Step 4: Tag the incoming records, as inserts or updates, by joining 
with existing record keys
+JavaRDD> taggedRecordRDD = 
tagLocationBacktoRecords(keyFilenamePairRDD, recordRDD);
+
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.unpersist(); // unpersist the input Record RDD
+}
+return taggedRecordRDD;
+  }
+
+  @Override
+  public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
+ HoodieTable hoodieTable) {
+return writeStatusRDD;
+  }
+
+  /**
+   * Lookup the location for each record key and return the 
pair for all record keys already
+   * present and drop the record keys if not present.
+   */
+  @Override
+  protected JavaPairRDD lookupIndex(
+  JavaPairRDD partitionRecordKeyPairRDD, final 
JavaSparkContext jsc,
+  final HoodieTable hoodieTable) {
+// Obtain records per partition, in the incoming records
+Map recordsPerPartition = 
partitionRecordKeyPairRDD.countByKey();
+List affectedPartitionPathList = new 

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-27 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r399405882
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,244 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.clearspring.analytics.util.Lists;
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.Function;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
 
 Review comment:
   I did some refactoring in HoodieBloomIndex to reuse by SimpleIndex. You can 
check it out. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-27 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r399407043
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.WriteStatus;
+import org.apache.hudi.common.model.HoodieDataFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.Optional;
+import org.apache.spark.api.java.function.PairFunction;
+import org.apache.spark.storage.StorageLevel;
+
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+
+/**
+ * A simple index which reads interested fields from parquet and joins with 
incoming records to find the tagged location
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieBloomIndex {
+
+  private static final Logger LOG = 
LogManager.getLogger(HoodieSimpleIndex.class);
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD partitionRecordKeyPairRDD =
+hoodieKeys.mapToPair(key -> new Tuple2<>(key.getPartitionPath(), 
key.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD recordKeyLocationRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+JavaPairRDD keyHoodieKeyPairRDD = 
hoodieKeys.mapToPair(key -> new Tuple2<>(key, null));
+
+return 
keyHoodieKeyPairRDD.leftOuterJoin(recordKeyLocationRDD).mapToPair(keyLoc -> {
+  Option> partitionPathFileidPair;
+  if (keyLoc._2._2.isPresent()) {
+partitionPathFileidPair = 
Option.of(Pair.of(keyLoc._1().getPartitionPath(), 
keyLoc._2._2.get().getFileId()));
+  } else {
+partitionPathFileidPair = Option.empty();
+  }
+  return new Tuple2<>(keyLoc._1, partitionPathFileidPair);
+});
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+
+// Step 0: cache the input record RDD
+if (config.getBloomIndexUseCaching()) {
+  recordRDD.persist(config.getBloomIndexInputStorageLevel());
+}
+
+// Step 1: Extract out thinner JavaPairRDD of (partitionPath, recordKey)
+JavaPairRDD partitionRecordKeyPairRDD =
+recordRDD.mapToPair(record -> new Tuple2<>(record.getPartitionPath(), 
record.getRecordKey()));
+
+// Lookup indexes for all the partition/recordkey pair
+JavaPairRDD keyFilenamePairRDD =
+lookupIndex(partitionRecordKeyPairRDD, jsc, hoodieTable);
+
+// Cache the result, for subsequent stages.
+if (config.getBloomIndexUseCaching()) {
+  

[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605107232
 
 
   I think this issue is caused by how the JVM resolves the method descriptor in the bytecode.
   
   From the JVM specification, combined with the following stack trace, we know that:
   `I: int`, `J: long`
   
   ```
   java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
   ```
   
   
![image](https://user-images.githubusercontent.com/20113411/77780070-6592ae00-708e-11ea-9980-2e79a0f8055c.png)
   
   https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.3.2-200
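   Decoding the descriptor: `(JI)I` is a method that takes a `long` and an `int` and returns an `int`, i.e. `int Math.floorMod(long, int)`. That overload was only added in Java 9, so bytecode compiled against a newer JDK fails on a Java 8 runtime with exactly this error. A minimal illustration of the binding difference (not Hudi code, just a sketch):
   
   ```
   public class FloorModDescriptorDemo {
     public static void main(String[] args) {
       long hashOfKey = 1234567890123L; // J
       int numPartitions = 7;           // I

       // Same source, different bytecode:
       //  - compiled on JDK 9+ this binds to int Math.floorMod(long, int) -> descriptor (JI)I,
       //    an overload that does not exist on a Java 8 runtime (hence NoSuchMethodError there);
       //  - compiled on JDK 8 the int argument is widened and the call binds to
       //    long Math.floorMod(long, long) -> descriptor (JJ)J, which runs fine everywhere.
       int idx = (int) Math.floorMod(hashOfKey, numPartitions);
       System.out.println(idx);
     }
   }
   ```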


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-03-27 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r399404807
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
 ##
 @@ -118,9 +121,10 @@ protected HoodieIndex(HoodieWriteConfig config) {
   /**
* Each index type should implement it's own logic to release any resources 
acquired during the process.
*/
-  public void close() {}
+  public void close() {
+  }
 
   public enum IndexType {
-HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM
+HBASE, INMEMORY, BLOOM, GLOBAL_BLOOM, SIMPLE
 
 Review comment:
   Sure, will add it in a separate patch.
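   For reference, once `SIMPLE` is wired up end to end, a user would opt into it roughly like the sketch below; the builder names are assumed from the existing index configs and are not part of this diff:
   
   ```
   import org.apache.hudi.config.HoodieIndexConfig;
   import org.apache.hudi.config.HoodieWriteConfig;
   import org.apache.hudi.index.HoodieIndex;

   public class SimpleIndexConfigExample {
     // Illustrative only: select the new index type through the write config builder.
     public static HoodieWriteConfig simpleIndexConfig(String basePath) {
       return HoodieWriteConfig.newBuilder()
           .withPath(basePath)
           .withIndexConfig(HoodieIndexConfig.newBuilder()
               .withIndexType(HoodieIndex.IndexType.SIMPLE)
               .build())
           .build();
     }
   }
   ```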


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605104105
 
 
   @lamber-ken, ok, I will clone the repo, add the logging, and get back to you.
   Thanks


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken edited a comment on issue #1455: [SUPPORT] Hudi upsert run into 
exception:  java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605100639
 
 
   Welcome @EdwinGuo. To figure it out, we need to add some log statements.
   ```
   git clone g...@github.com:apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
-Drat.skip=true -T 2C
   ```
   
   
   
   BucketizedBloomCheckPartitioner#getPartition
   
   ```
   @Override
   public int getPartition(Object key) {
     final Pair parts = (Pair) key;
     final long hashOfKey = NumericUtils.getMessageDigestHash("MD5", parts.getRight());
     final List candidatePartitions = fileGroupToPartitions.get(parts.getLeft());

     int idx = 0;
     try {
       idx = (int) Math.floorMod(hashOfKey, candidatePartitions.size());
     } catch (Exception e) {
       LOG.error("hashOfKey: " + hashOfKey + ",  " + candidatePartitions.size());
       throw new RuntimeException(e);
     }

     assert idx >= 0;
     return candidatePartitions.get(idx);
   }
   ```
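   One possible Java-8-compatible tweak for the `floorMod` call above would be to widen the divisor in the source, so the call stays on the `(JJ)J` overload regardless of the compiling JDK; whether this is the change that should land is up for discussion:
   
   ```
   idx = (int) Math.floorMod(hashOfKey, (long) candidatePartitions.size());
   ```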


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605100639
 
 
   Welcome @EdwinGuo. To figure it out, we need to add some log statements.
   ```
   git clone g...@github.com:apache/incubator-hudi.git
   mvn clean install -DskipTests -DskipITs -Dcheckstyle.skip=true 
-Drat.skip=true -T 2C
   ```
   
![image](https://user-images.githubusercontent.com/20113411/8674-501c8480-708c-11ea-9abd-a2b704fc4ac1.png)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605095463
 
 
   Thanks for replying @lamber-ken.
   1: java -version (from the edge node, which is the EMR master in my case)
   openjdk version "1.8.0_242"
   OpenJDK Runtime Environment (build 1.8.0_242-b08)
   OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
   2: Client mode.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068840#comment-17068840
 ] 

Vinoth Chandar commented on HUDI-686:
-

>May I know why we need external spillableMap? why can't we use regular map. I 
>don't know the benefits of external spillable map if all entries could be held 
>in memory. Here too, one executor will have to hold at max all file infos for 
>one partition only right? So, memory is bounded here too in my understanding. 

Just to handle the case where the bloom filters won't fit in memory.

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-03-27 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068837#comment-17068837
 ] 

Vinoth Chandar commented on HUDI-686:
-

>if the input data is large, need to increase partitions, "candidates" contains 
>all datas for per partition
No, candidates only contains the candidate files per key.

>if increase partitions, it will cause duplicate loading of the same 
>partition(e.g populateFileIDs() && populateRangeAndBloomFilters())
It will. That's why we auto-tune everything in BloomIndexV1, but then it needs some memory caching. The idea here is to make this work well for simpler cases and to have an option that does not rely on memory caching.


> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
lamber-ken commented on issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455#issuecomment-605086930
 
 
   Hi, someone met the same issue before; we need more context:
   1. JDK version?
   2. Local or cluster mode?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399372702
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -84,26 +87,35 @@
   + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},";
   public static final String FARE_FLATTENED_SCHEMA = "{\"name\": \"fare\", 
\"type\": \"double\"},"
   + "{\"name\": \"currency\", \"type\": \"string\"},";
-
   public static final String TRIP_EXAMPLE_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
   public static final String TRIP_FLATTENED_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_FLATTENED_SCHEMA + TRIP_SCHEMA_SUFFIX;
 
+  public static String TRIP_UBER_EXAMPLE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"tripuberrec\",\"fields\":["
+  + 
"{\"name\":\"timestamp\",\"type\":\"double\"},{\"name\":\"_row_key\",\"type\":\"string\"},{\"name\":\"rider\",\"type\":\"string\"},"
+  + 
"{\"name\":\"driver\",\"type\":\"string\"},{\"name\":\"fare\",\"type\":\"double\"},{\"name\":
 \"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false}]}";
+  public static String TRIP_FG_EXAMPLE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"tripfgrec\",\"fields\":["
+  + 
"{\"name\":\"timestamp\",\"type\":\"double\"},{\"name\":\"_row_key\",\"type\":\"string\"},{\"name\":\"rider\",\"type\":\"string\"},"
+  + 
"{\"name\":\"driver\",\"type\":\"string\"},{\"name\":\"fare\",\"type\":\"double\"},{\"name\":
 \"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false}]}";
+
   public static final String NULL_SCHEMA = 
Schema.create(Schema.Type.NULL).toString();
   public static final String TRIP_HIVE_COLUMN_TYPES = 
"double,string,string,string,double,double,double,double,"
-  + 
"struct,boolean";
-
+  + "struct,boolean";
   public static final Schema AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA);
   public static final Schema AVRO_SCHEMA_WITH_METADATA_FIELDS =
   HoodieAvroUtils.addMetadataFields(AVRO_SCHEMA);
+  public static Schema avroFgSchema = new 
Schema.Parser().parse(TRIP_FG_EXAMPLE_SCHEMA);
 
 Review comment:
   Actually, those are the initials of my current organisation. While testing I named it like that and forgot to modify it later. Will change the name. :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r399371617
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -84,26 +87,35 @@
   + "{\"name\": \"amount\",\"type\": \"double\"},{\"name\": \"currency\", 
\"type\": \"string\"}]}},";
   public static final String FARE_FLATTENED_SCHEMA = "{\"name\": \"fare\", 
\"type\": \"double\"},"
   + "{\"name\": \"currency\", \"type\": \"string\"},";
-
   public static final String TRIP_EXAMPLE_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_NESTED_SCHEMA + TRIP_SCHEMA_SUFFIX;
   public static final String TRIP_FLATTENED_SCHEMA =
   TRIP_SCHEMA_PREFIX + FARE_FLATTENED_SCHEMA + TRIP_SCHEMA_SUFFIX;
 
+  public static String TRIP_UBER_EXAMPLE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"tripuberrec\",\"fields\":["
+  + 
"{\"name\":\"timestamp\",\"type\":\"double\"},{\"name\":\"_row_key\",\"type\":\"string\"},{\"name\":\"rider\",\"type\":\"string\"},"
+  + 
"{\"name\":\"driver\",\"type\":\"string\"},{\"name\":\"fare\",\"type\":\"double\"},{\"name\":
 \"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false}]}";
+  public static String TRIP_FG_EXAMPLE_SCHEMA = 
"{\"type\":\"record\",\"name\":\"tripfgrec\",\"fields\":["
+  + 
"{\"name\":\"timestamp\",\"type\":\"double\"},{\"name\":\"_row_key\",\"type\":\"string\"},{\"name\":\"rider\",\"type\":\"string\"},"
+  + 
"{\"name\":\"driver\",\"type\":\"string\"},{\"name\":\"fare\",\"type\":\"double\"},{\"name\":
 \"_hoodie_is_deleted\", \"type\": \"boolean\", \"default\": false}]}";
+
   public static final String NULL_SCHEMA = 
Schema.create(Schema.Type.NULL).toString();
   public static final String TRIP_HIVE_COLUMN_TYPES = 
"double,string,string,string,double,double,double,double,"
-  + 
"struct,boolean";
-
+  + "struct,boolean";
   public static final Schema AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_EXAMPLE_SCHEMA);
   public static final Schema AVRO_SCHEMA_WITH_METADATA_FIELDS =
   HoodieAvroUtils.addMetadataFields(AVRO_SCHEMA);
+  public static Schema avroFgSchema = new 
Schema.Parser().parse(TRIP_FG_EXAMPLE_SCHEMA);
+  public static Schema avroUberSchema = new 
Schema.Parser().parse(TRIP_UBER_EXAMPLE_SCHEMA);
   public static final Schema FLATTENED_AVRO_SCHEMA = new 
Schema.Parser().parse(TRIP_FLATTENED_SCHEMA);
 
-  private static final Random RAND = new Random(46474747);
+  private static Random rand = new Random(46474747);
 
 Review comment:
   Fixed. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1450: [MINOR] Adding .codecov.yml to set exclusions for code coverage reports.

2020-03-27 Thread GitBox
vinothchandar commented on issue #1450: [MINOR] Adding .codecov.yml to set 
exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1450#issuecomment-605075969
 
 
   Would love to understand the rationale for excluding some core classes here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1450: [MINOR] Adding .codecov.yml to set exclusions for code coverage reports.

2020-03-27 Thread GitBox
vinothchandar commented on a change in pull request #1450: [MINOR] Adding 
.codecov.yml to set exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1450#discussion_r399364951
 
 

 ##
 File path: .codecov.yml
 ##
 @@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# For more configuration details:
+# https://docs.codecov.io/docs/codecov-yaml
+
+# Check if this file is valid by running in bash:
+# curl -X POST --data-binary @.codecov.yml https://codecov.io/validate
+
+# Ignoring Paths
+# --
+# which folders/files to ignore
+ignore:
+  - "hudi-common/src/main/java/org/apache/hudi/avro/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java"
+  - "hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/HoodieJsonPayload"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactionAdminTool.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieWithTimelineServer.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/UpgradePayloadFromUberToApache.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/perf/TimelineServerPerf.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/HiveIncrementalPuller.java"
+  - 
"hudi-utilities/src/main/java/org/apache/hudi/utilities/adhoc/UpgradePayloadFromUberToApache.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java"
+  - 
"hudi-client/src/main/java/org/apache/hudi/metrics/MetricsGraphiteReporter.java"
+  - 
"hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java"
 
 Review comment:
   These are not good to exclude, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-737) Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder

2020-03-27 Thread Suneel Marthi (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068801#comment-17068801
 ] 

Suneel Marthi commented on HUDI-737:


This has been fixed in the PR for HUDI-479

> Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder
> 
>
> Key: HUDI-737
> URL: https://issues.apache.org/jira/browse/HUDI-737
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-737) Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder

2020-03-27 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-737:
---
Status: In Progress  (was: Open)

> Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder
> 
>
> Key: HUDI-737
> URL: https://issues.apache.org/jira/browse/HUDI-737
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-737) Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder

2020-03-27 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-737:
---
Status: Open  (was: New)

> Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder
> 
>
> Key: HUDI-737
> URL: https://issues.apache.org/jira/browse/HUDI-737
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-737) Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder

2020-03-27 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-737:
---
Fix Version/s: 0.6.0

> Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder
> 
>
> Key: HUDI-737
> URL: https://issues.apache.org/jira/browse/HUDI-737
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-737) Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder

2020-03-27 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-737:
---
Status: Patch Available  (was: In Progress)

> Simplify/Eliminate need for CollectionUtils#Maps/MapsBuilder
> 
>
> Key: HUDI-737
> URL: https://issues.apache.org/jira/browse/HUDI-737
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1159: [HUDI-479] Eliminate or Minimize use of Guava if possible

2020-03-27 Thread GitBox
vinothchandar commented on issue #1159: [HUDI-479] Eliminate or Minimize use of 
Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#issuecomment-605046821
 
 
   If it's already used elsewhere, never mind. We can deal with it later.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1159: [HUDI-479] Eliminate or Minimize use of Guava if possible

2020-03-27 Thread GitBox
smarthi commented on a change in pull request #1159: [HUDI-479] Eliminate or 
Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r399324299
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java
 ##
 @@ -91,4 +94,29 @@ public static void writeStringToFile(String str, String 
filePath) throws IOExcep
 out.flush();
 out.close();
   }
+
+  /**
+   * Closes a {@link Closeable}, with control over whether an {@code 
IOException} may be thrown.
+   * @param closeable the {@code Closeable} object to be closed, or null,
+   *  in which case this method does nothing.
+   * @param swallowIOException if true, don't propagate IO exceptions thrown 
by the {@code close} methods.
+   *
+   * @throws IOException if {@code swallowIOException} is false and {@code 
close} throws an {@code IOException}.
+   */
+  public static void close(@Nullable Closeable closeable, boolean 
swallowIOException)
+  throws IOException {
+if (closeable == null) {
+  return;
+}
+try {
+  closeable.close();
+} catch (IOException e) {
+  if (!swallowIOException) {
+throw e;
+  }
+}
+  }
+
+  /** Maximum loop count when creating temp directories. */
+  private static final int TEMP_DIR_ATTEMPTS = 1;
 
 Review comment:
   I think we do: this is getting called in Metrics.java. But I guess it could be dispensed with if we modify Metrics.java; I haven't looked at it closely enough and will take that up.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on issue #1159: [HUDI-479] Eliminate or Minimize use of Guava if possible

2020-03-27 Thread GitBox
smarthi commented on issue #1159: [HUDI-479] Eliminate or Minimize use of Guava 
if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#issuecomment-605043851
 
 
   > ```
   > [INFO] There is 1 error reported by Checkstyle 8.18 with 
style/checkstyle.xml ruleset.
   > 593[ERROR] 
src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java:[33] (imports) 
AvoidStarImport: Using the '.*' form of import should be avoided - java.util.*.
   > ```
   > 
   > there are check style failures? to be resolved?
   
   Resolved


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1159: [HUDI-479] Eliminate or Minimize use of Guava if possible

2020-03-27 Thread GitBox
vinothchandar commented on issue #1159: [HUDI-479] Eliminate or Minimize use of 
Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#issuecomment-605037325
 
 
   ```
   [INFO] There is 1 error reported by Checkstyle 8.18 with 
style/checkstyle.xml ruleset.
   593[ERROR] 
src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java:[33] (imports) 
AvoidStarImport: Using the '.*' form of import should be avoided - java.util.*.
   ```
   
   There are checkstyle failures? Are they to be resolved?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1159: [HUDI-479] Eliminate or Minimize use of Guava if possible

2020-03-27 Thread GitBox
vinothchandar commented on a change in pull request #1159: [HUDI-479] Eliminate 
or Minimize use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#discussion_r399312059
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/FileIOUtils.java
 ##
 @@ -91,4 +94,29 @@ public static void writeStringToFile(String str, String 
filePath) throws IOExcep
 out.flush();
 out.close();
   }
+
+  /**
+   * Closes a {@link Closeable}, with control over whether an {@code 
IOException} may be thrown.
+   * @param closeable the {@code Closeable} object to be closed, or null,
+   *  in which case this method does nothing.
+   * @param swallowIOException if true, don't propagate IO exceptions thrown 
by the {@code close} methods.
+   *
+   * @throws IOException if {@code swallowIOException} is false and {@code 
close} throws an {@code IOException}.
+   */
+  public static void close(@Nullable Closeable closeable, boolean 
swallowIOException)
+  throws IOException {
+if (closeable == null) {
+  return;
+}
+try {
+  closeable.close();
+} catch (IOException e) {
+  if (!swallowIOException) {
+throw e;
+  }
+}
+  }
+
+  /** Maximum loop count when creating temp directories. */
+  private static final int TEMP_DIR_ATTEMPTS = 1;
 
 Review comment:
   We don't need this anymore, correct?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] EdwinGuo opened a new issue #1455: [SUPPORT] Hudi upsert run into exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-03-27 Thread GitBox
EdwinGuo opened a new issue #1455: [SUPPORT] Hudi upsert run into exception:  
java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
URL: https://github.com/apache/incubator-hudi/issues/1455
 
 
   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   Yes
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   Not able to upsert data after upgrading the hudi jar to 0.5.2; exception: java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   from console: open spark shell:
   spark-shell --master yarn --deploy-mode client --driver-memory 512M 
--num-executors 1 --executor-memory 12G --executor-cores 5 --jars 
/usr/lib/spark/jars/httpclient-4.5.9.jar,s3://my-s3-bucket/hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar,/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client-*,/usr/share/aws/aws-java-sdk/aws-java-sdk-glue-*
 --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
   
   val hudi_options = Map(
   "hoodie.table.name"-> "table1",
   "hoodie.datasource.write.recordkey.field"-> "c1,c2,c3,c4",
   "hoodie.datasource.write.partitionpath.field"-> "c5",
   "hoodie.datasource.write.table.name"-> "table1",
   "hoodie.datasource.write.operation"-> "upsert",
   "hoodie.datasource.write.precombine.field"-> "c6",
   "hoodie.datasource.write.keygenerator.class"-> 
"org.apache.hudi.keygen.ComplexKeyGenerator",
   "hoodie.datasource.hive_sync.jdbcurl"-> 
"jdbc:hive2://localhost:1",
   "hoodie.datasource.hive_sync.database"-> "db",
   "hoodie.datasource.hive_sync.enable"-> "true",
   "hoodie.datasource.hive_sync.table"-> "tablel",
   "hoodie.datasource.hive_sync.partition_fields"-> "c5",
   "hoodie.datasource.hive_sync.partition_extractor_class"-> 
"org.apache.hudi.hive.MultiPartKeysValueExtractor",
   "hoodie.upsert.shuffle.parallelism"-> "20",
   "hoodie.insert.shuffle.parallelism"-> "20")
   
val df = spark.sparkContext.parallelize(List(("x1", "x2", "x3", "x4", "x5", 
100, "x7"), ("y1", "y2", "y3", "y4", "y5", 100, "y7"))).toDF("c1", "c2", "c3", 
"c4", "c5", "c6", "c7")
   
   
df.write.format("hudi").options(hudi_options).mode("append").save("s3://my-write-bucket/prefix1/prefix2")
   
   **Expected behavior**
   To successfully upsert the data to S3 with no exception.
   
   **Environment Description**
   I'm running Spark with Hudi on AWS EMR 5.29.0, with a freshly compiled Hudi jar: https://github.com/apache/incubator-hudi/commit/41202da7788193da77f1ae4b784127bb93eaae2c.
   
   * Hudi version :
   0.5.2: 
https://github.com/apache/incubator-hudi/commit/41202da7788193da77f1ae4b784127bb93eaae2c
   
   * Spark version :
   2.4.4
   * Hive version :
   Hive 2.3.6
   * Hadoop version :
   Hadoop 2.8.5
   
   * Storage (HDFS/S3/GCS..) :
   S3
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   Add any other context about the problem here.
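   
   For context: Math.floorMod(long, int) (descriptor (JI)I) only exists on Java 9+, while Java 8 only has floorMod(int, int) and floorMod(long, long), so a jar compiled on a newer JDK without targeting Java 8 can fail exactly like this on a Java 8 runtime. A minimal, hypothetical sketch of a Java-8-safe call site (not the actual Hudi code):
   
   ```java
   public class FloorModJava8Example {
     public static void main(String[] args) {
       long timestamp = 1585267200123L;
       int numBuckets = 16;
   
       // Compiled on JDK 9+, the line below binds to Math.floorMod(long, int), which is
       // missing on Java 8 and throws NoSuchMethodError at runtime:
       //   int bucket = Math.floorMod(timestamp, numBuckets);
   
       // Java-8-compatible alternative: use the (long, long) overload and narrow the result.
       int bucket = (int) Math.floorMod(timestamp, (long) numBuckets);
       System.out.println("bucket = " + bucket);
     }
   }
   ```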
   
   **Stacktrace**
   
   ```
   at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
   at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
   at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
   at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
   at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
   ```

[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1452: [HUDI-740]Fix can not specify the sparkMaster of cleans run command

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1452: [HUDI-740]Fix can 
not specify the sparkMaster of cleans run command
URL: https://github.com/apache/incubator-hudi/pull/1452#discussion_r399232900
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
 ##
 @@ -62,7 +63,9 @@ public static void main(String[] args) throws Exception {
 
     SparkCommand cmd = SparkCommand.valueOf(command);
 
-    JavaSparkContext jsc = SparkUtil.initJavaSparkConf("hoodie-cli-" + command);
+    JavaSparkContext jsc = cmd == SparkCommand.CLEAN
 
 Review comment:
   Why do you want to implement it specifically for the CLEAN command? Any specific reason? Basically, I could not understand the purpose of this PR. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
pratyakshsharma commented on issue #1453: HUDI-644 kafka connect checkpoint 
provider
URL: https://github.com/apache/incubator-hudi/pull/1453#issuecomment-604972381
 
 
   Thank you for making this tool @garyli1019 
   
   Have you thought of a use case where the user might want to switch the source more than once? Basically something like a user running 2 pipelines - pipeline A (using DeltaStreamer consuming from Kafka, say) and pipeline B (using some other reconciler writing to some other path). Whenever the user finds A is having data-quality issues, they would like A to consume from the output of B using the DFS source, and then switch A back to Kafka - and this can happen n number of times. I would also like to hear from @vinothchandar on this use case. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1438: How to get the file name corresponding to HoodieKey through the GlobalBloomIndex

2020-03-27 Thread GitBox
nsivabalan commented on issue #1438: How to get the file name corresponding to 
HoodieKey through the GlobalBloomIndex 
URL: https://github.com/apache/incubator-hudi/issues/1438#issuecomment-604970042
 
 
   Hi @loagosad: I guess you are looking for the fileId. In the test code you have given, the records were never tagged; you need to tag them first and then commit. After that, fetchRecordLocations should give you the fileId for each record. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka 
connect checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399224698
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/checkpoint/TestCheckPointProvider.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hudi.common.HoodieCommonTestHarness;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+
+import static org.junit.Assert.assertEquals;
+
+public class TestCheckPointProvider extends HoodieCommonTestHarness {
+  private FileSystem fs = null;
+  private String topicPath = null;
+
+  @Before
+  public void init() {
+    // Prepare directories
+    initPath();
+    topicPath = basePath + "/topic1";
+    final Configuration hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
+    fs = FSUtils.getFs(basePath, hadoopConf);
+    new File(topicPath).mkdirs();
+  }
+
+  @Test
+  public void testKafkaConnectHdfsProvider() throws Exception {
+    // create regular kafka connect hdfs dirs
+    new File(topicPath + "/year=2016/month=05/day=01/").mkdirs();
+    new File(topicPath + "/year=2016/month=05/day=02/").mkdirs();
+    // kafka connect tmp folder
+    new File(topicPath + "/TMP").mkdirs();
+    // tmp file that being written
+    new File(topicPath + "/TMP/" + "topic1+0+301+400.parquet").createNewFile();
+    // regular parquet files
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "topic1+0+100+200.parquet").createNewFile();
 
 Review comment:
   Is this some standard format for these files maintained by kafka-connect, like {topic}+{partition}+{lowerOffset}+{upperOffset}.parquet? If so, can you share some documentation? Basically I would like to understand these files. 
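   
   For what it's worth, the fixture names above follow the <topic>+<partition>+<startOffset>+<endOffset>.parquet shape; a minimal parsing sketch (the regex is illustrative, not necessarily what the PR uses):
   
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;
   
   public class KafkaConnectFileNameExample {
     // Illustrative pattern for names like "topic1+0+100+200.parquet".
     private static final Pattern FILE_NAME =
         Pattern.compile("([\\w-]+)\\+(\\d+)\\+(\\d+)\\+(\\d+)\\.parquet");
   
     public static void main(String[] args) {
       Matcher m = FILE_NAME.matcher("topic1+0+100+200.parquet");
       if (m.matches()) {
         System.out.println("topic=" + m.group(1) + " partition=" + m.group(2)
             + " startOffset=" + m.group(3) + " endOffset=" + m.group(4));
       }
     }
   }
   ```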


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka 
connect checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399218307
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/checkpoint/TestCheckPointProvider.java
 ##
 @@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hudi.common.HoodieCommonTestHarness;
+import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.util.FSUtils;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.junit.Before;
+import org.junit.Test;
+
+import java.io.File;
+
+import static org.junit.Assert.assertEquals;
+
+public class TestCheckPointProvider extends HoodieCommonTestHarness {
+  private FileSystem fs = null;
+  private String topicPath = null;
+
+  @Before
+  public void init() {
+    // Prepare directories
+    initPath();
+    topicPath = basePath + "/topic1";
+    final Configuration hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
+    fs = FSUtils.getFs(basePath, hadoopConf);
+    new File(topicPath).mkdirs();
+  }
+
+  @Test
+  public void testKafkaConnectHdfsProvider() throws Exception {
+    // create regular kafka connect hdfs dirs
+    new File(topicPath + "/year=2016/month=05/day=01/").mkdirs();
+    new File(topicPath + "/year=2016/month=05/day=02/").mkdirs();
+    // kafka connect tmp folder
+    new File(topicPath + "/TMP").mkdirs();
+    // tmp file that being written
+    new File(topicPath + "/TMP/" + "topic1+0+301+400.parquet").createNewFile();
+    // regular parquet files
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "topic1+0+100+200.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "topic1+1+100+200.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=02/"
+        + "topic1+0+201+300.parquet").createNewFile();
+    // noise parquet file
+    new File(topicPath + "/year=2016/month=05/day=01/"
+        + "random_snappy_1.parquet").createNewFile();
+    new File(topicPath + "/year=2016/month=05/day=02/"
+        + "random_snappy_2.parquet").createNewFile();
+    CheckPointProvider c = new KafkaConnectHdfsProvider(new Path(topicPath), fs);
+    assertEquals(c.getCheckpoint(), "topic1,0:300,1:200");
 
 Review comment:
   for partition 0, the offset should be 400? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-03-27 Thread GitBox
pratyakshsharma commented on a change in pull request #1453: HUDI-644 kafka 
connect checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#discussion_r399206469
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/checkpoint/KafkaConnectHdfsProvider.java
 ##
 @@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources.checkpoint;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PathFilter;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Generate checkpoint from Kafka-Connect-HDFS.
+ */
+public class KafkaConnectHdfsProvider implements CheckPointProvider {
 
 Review comment:
   How do you plan to integrate this class with DeltaStreamer? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068589#comment-17068589
 ] 

Pratyaksh Sharma commented on HUDI-741:
---

Hi [~pwason], I am already working with [~nishith29] to implement a 
schema-service which will take care of all these cases. We will come up with an 
RFC soon. Once we file all the tasks, then we all can work together to achieve 
this. 

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068374#comment-17068374
 ] 

Prashant Wason commented on HUDI-741:
-

Background
--

When a record is read from the parquet file, it uses the schema stored in the 
parquet file. This should always succeed as the record was written using the 
same schema (stored in the footer of the parquet file). The same is true for 
records read from the LOG files which also save their AVRO schema.

Once the record is read, it is converted into a GenericRecord with the 
writerSchema (schema provided to the HoodieWriteConfig). This step fails 
(raises exception) if the writerSchema (evolved) is incompatible with the 
schema in the parquet file.  

 

Checking schema compatibility
---

org.apache.avro.SchemaCompatibility is a class which compares two AVRO schemas 
to ensure they are compatible. It has the concept of "reader" and "writer" 
schemas. 
 - writer schema: the schema with which the avro record was serialized into 
bytes
 - reader schema: the schema using which we are reading the serialized bytes

A reader schema is deemed compatible with the writer schema if all fields of 
the reader schema can be populated from the writer schema. Hence, this will 
consider the following two schemas as compatible:

 

Writer Schema

{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "_hoodie_commit_time", "type": "string" },
    { "name": "_hoodie_commit_seqno", "type": "string" }
  ]
}

Reader Schema

{
  "type": "record",
  "name": "triprec",
  "fields": [
    { "name": "_hoodie_commit_time", "type": "string" }
  ]
}

 

When reading the bytes using the above reader schema, one will get a record which is missing the _hoodie_commit_seqno field, but there won't be any exception.

If HUDI were to use the Reader Schema as an "evolved" schema (say, due to a bug), then there can be inadvertent data corruption/loss as conversion will fail.

So it is not possible to use org.apache.avro.SchemaCompatibility directly to 
perform the schema compatibility check. This class needs to be modified to 
implement this check.
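
A minimal sketch of the behaviour described above (assuming Avro's SchemaCompatibility API is on the classpath), showing why the check cannot be used as-is:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class ReaderWriterCompatibilityDemo {
  public static void main(String[] args) {
    Schema writer = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"triprec\",\"fields\":["
            + "{\"name\":\"_hoodie_commit_time\",\"type\":\"string\"},"
            + "{\"name\":\"_hoodie_commit_seqno\",\"type\":\"string\"}]}");
    Schema reader = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"triprec\",\"fields\":["
            + "{\"name\":\"_hoodie_commit_time\",\"type\":\"string\"}]}");

    // Avro only asks whether every reader field can be filled from the writer's data,
    // so dropping _hoodie_commit_seqno is still reported as COMPATIBLE.
    System.out.println(
        SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType());
  }
}
{code}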

 

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068364#comment-17068364
 ] 

Prashant Wason commented on HUDI-741:
-

Implementation notes:

hudi-hive-sync module already has code which reads the latest schema from a 
HUDI table. I have moved that code to hudi-common so it can be used within 
HoodieWriteClient as hudi-common is a dependency of hudi-client.

SchemaCompatibility needs to be implemented separately (explained in detail in 
the next comment)

The unit test requires a way to generate records using various schemas. HoodieTestDataGenerator is hardcoded to use TRIP_EXAMPLE_SCHEMA, so it has to be modified to take the schema as a parameter.

 

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068359#comment-17068359
 ] 

Prashant Wason commented on HUDI-741:
-

Since inserts/updates are performed using the HoodieWriteClient, the schema 
check can be implemented there. There are two steps involved.

Step 1.  Read the latest schema from the dataset. This is the schema used to 
write data in the last commit.

Step 2: Validate the HoodieWriteConfig's writeSchema (new schema) against the 
schema retrieved in step 1 (existing schema)
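
A rough, hypothetical sketch of such a guard; requiring compatibility in both directions is one possible rule (it also catches the dropped-field case discussed above), not necessarily what this issue will end up implementing:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaEvolutionGuard {

  // Step 2: validate the new writer schema against the latest table schema from step 1.
  public static void validate(Schema existingTableSchema, Schema newWriterSchema) {
    // The new schema must be able to read data written with the old schema...
    boolean newReadsOld = isCompatible(newWriterSchema, existingTableSchema);
    // ...and the old schema should still read data written with the new one,
    // which rejects accidentally dropped fields.
    boolean oldReadsNew = isCompatible(existingTableSchema, newWriterSchema);
    if (!newReadsOld || !oldReadsNew) {
      throw new IllegalArgumentException("Incompatible schema evolution detected");
    }
  }

  private static boolean isCompatible(Schema reader, Schema writer) {
    return SchemaCompatibility.checkReaderWriterCompatibility(reader, writer).getType()
        == SchemaCompatibilityType.COMPATIBLE;
  }
}
{code}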

  

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17068356#comment-17068356
 ] 

Prashant Wason commented on HUDI-741:
-

Inserts and Updates to the HUDI table should validate that the current 
writeSchema is compatible.

A new schema can be incompatible with the older schema for various reasons:
1. A schema field was deleted (intentionally or due to a bug)
2. A schema field was added but does not have a default value
3. A field's type was changed (e.g. string to int)



Data ingestion with such an incompatible schema should not be allowed, as it will affect reading of the data as well as future ingestion (e.g. after the buggy schema is reverted).

 

Current issues:

For COW tables:
1. Inserts to a new partition with an incompatible schema are allowed (since there are no existing parquet files, no merge is done)

For MOR tables:
1. Inserts to a new partition with an incompatible schema are allowed (since there are no existing parquet files, no merge is done)
2. Inserts to a new partition with an incompatible schema are allowed (a LOG file may be created with a HoodieAvroDataBlock)
3. Appends to an existing LOG file with an incompatible schema are allowed (a new HoodieAvroDataBlock is added)
4. Updates with an incompatible schema are allowed (a new HoodieAvroDataBlock is added)

 

 

 

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-741:

Status: In Progress  (was: Open)

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-741) Fix Hoodie's schema evolution checks

2020-03-27 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-741:

Status: Open  (was: New)

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> HUDI requires a Schema to be specified in HoodieWriteConfig and is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is required each time new data is ingested into a HUDI 
> dataset, schema can be evolved over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

