[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084562#comment-17084562
 ] 

Vinoth Chandar commented on HUDI-69:


Switching to V2 for writing is tricky... the V2 datasource API takes a lot of 
control away... e.g., we need to define a bunch of things at the individual 
task level, and we can't shuffle and do indexing etc. like we do today... This 
needs more thought IMO 
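To make the concern concrete, here is a minimal, hypothetical sketch (not Hudi code), assuming Spark 2.4's DataSource V2 writer API: each executor task gets its own `DataWriter`, so cross-partition work such as the shuffle/indexing Hudi does today would have to happen before records ever reach these writers.

```java
import java.io.IOException;

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

// Illustrative only: a per-task writer in the Spark 2.4 V2 API. Only this
// task's records are visible here, which is why global indexing/shuffling
// would need to move out of the write path.
class HypotheticalTaskWriter implements DataWriter<InternalRow> {

  @Override
  public void write(InternalRow record) throws IOException {
    // buffer or write this task's records only
  }

  @Override
  public WriterCommitMessage commit() throws IOException {
    // per-task commit; the driver later sees one message per task
    return new WriterCommitMessage() {};
  }

  @Override
  public void abort() throws IOException {
    // best-effort cleanup of this task's partial output
  }
}
```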

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084560#comment-17084560
 ] 

Vinoth Chandar commented on HUDI-69:


I was wondering if we can just wrap the FileFormat (Parquet/ORC both have 
formats inside Spark), reuse its record reader for reading parquet/orc -> Row, 
and also use our existing LogReader classes to read the log blocks as Row 
(instead of GenericRecord.. or we can for now do GenericRecord -> Row).. This 
means we need to redesign our CompactedRecordScanner etc. classes to be generic 
and not implicitly assume they are merging Avro/ArrayWritable per se. Must be doable.
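A minimal sketch of that direction, with made-up class and method names (these are not existing Hudi APIs): a merge scanner that is generic over the record representation (e.g. Spark `Row`) instead of assuming Avro `GenericRecord`/`ArrayWritable`.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical: R could be Spark's Row, Avro GenericRecord, ArrayWritable, etc.
interface RecordMerger<R> {
  String recordKey(R record);   // extract the Hudi record key from a record
  R merge(R older, R newer);    // payload-level merge of two versions
}

class GenericCompactedScanner<R> {
  private final Map<String, R> logUpdates = new HashMap<>();

  // Fold log-block records (already decoded to R by the log reader) into a map.
  void scanLogRecords(Iterator<R> logRecords, RecordMerger<R> merger) {
    while (logRecords.hasNext()) {
      R rec = logRecords.next();
      logUpdates.merge(merger.recordKey(rec), rec, merger::merge);
    }
  }

  // Merge a base-file record (read via the wrapped Parquet/ORC FileFormat
  // reader) against any pending log update for the same key.
  R mergeWithBaseRecord(R baseRecord, RecordMerger<R> merger) {
    R update = logUpdates.remove(merger.recordKey(baseRecord));
    return update == null ? baseRecord : merger.merge(baseRecord, update);
  }
}
```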

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-798) Migrate Mockito to work with Junit 5

2020-04-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-798:

Status: In Progress  (was: Open)

> Migrate Mockito to work with Junit 5
> 
>
> Key: HUDI-798
> URL: https://issues.apache.org/jira/browse/HUDI-798
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-798) Migrate Mockito to work with Junit 5

2020-04-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-798:

Status: Open  (was: New)

> Migrate Mockito to work with Junit 5
> 
>
> Key: HUDI-798
> URL: https://issues.apache.org/jira/browse/HUDI-798
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-798) Migrate Mockito to work with Junit 5

2020-04-15 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-798:
---

 Summary: Migrate Mockito to work with Junit 5
 Key: HUDI-798
 URL: https://issues.apache.org/jira/browse/HUDI-798
 Project: Apache Hudi (incubating)
  Issue Type: Test
  Components: Testing
Reporter: Raymond Xu
Assignee: Raymond Xu
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator

2020-04-15 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-765:
---

Assignee: Yanjia Gary Li

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Assignee: Yanjia Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-765) Implement OrcReaderIterator

2020-04-15 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-765:
---

Assignee: (was: lamber-ken)

> Implement OrcReaderIterator
> ---
>
> Key: HUDI-765
> URL: https://issues.apache.org/jira/browse/HUDI-765
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-15 Thread GitBox
lamber-ken commented on issue #1498: Migrating parquet table to hudi issue 
[SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-614400699
 
 
   hi @vinothchandar, already in our slack channel.
   
![image](https://user-images.githubusercontent.com/20113411/79412692-aec48680-7fd8-11ea-8483-11a2ff742e25.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #249

2020-04-15 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.37 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] bvaradar commented on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

2020-04-15 Thread GitBox
bvaradar commented on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-614381927
 
 
   @vinothchandar This fix should reflect only in the inflight commit metadata, and 
rollback works in the way you mentioned. As it stands now, the current rollback 
logic can benefit from this change: we can optimize the rollback logic to only 
look at the partitions in the inflight file for rollbacks instead of scanning 
all partitions. 
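For illustration, a small hypothetical sketch of that optimization (placeholder types, not actual Hudi classes): take the partition paths recorded in the inflight metadata and restrict the rollback scan to those, falling back to a full listing when the inflight metadata carries no partition info.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: scope rollback to the partitions the inflight delta commit
// actually wrote to, instead of listing every partition under the base path.
class RollbackScoping {

  interface InflightCommitMetadata {
    // partition path -> write stats recorded while the commit was inflight
    Map<String, List<Object>> getPartitionToWriteStats();
  }

  static Set<String> partitionsToRollback(InflightCommitMetadata inflight,
                                          Set<String> allPartitions) {
    Set<String> touched = inflight.getPartitionToWriteStats().keySet();
    // Fall back to scanning everything for inflight files written before this fix.
    return touched.isEmpty() ? allPartitions : touched;
  }
}
```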


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-04-15 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-409.
---
Fix Version/s: (was: 0.6.0)
   0.5.2
   Resolution: Fixed

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Ramachandran M S
>Priority: Major
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-04-15 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084495#comment-17084495
 ] 

Vinoth Chandar commented on HUDI-409:
-

[~msramachandran] I was not aware of this conversation.. So if you and 
[~nagarwal] have been over this, I am good to go.. I reopened this since we 
did not implement a secure hash per se.. 

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Ramachandran M S
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-15 Thread GitBox
vinothchandar commented on issue #1498: Migrating parquet table to hudi issue 
[SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-614378189
 
 
   @ahmed-elfar do you mind joining our slack channel, and we can chat 1-1 to 
set up a time.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

2020-04-15 Thread GitBox
vinothchandar commented on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-614376368
 
 
   @n3nash can you actually help review this.. does this pose any issues for 
rollback today? (IIUC for rollback of inserts, we will list and delete files 
written as of that delta commit instant..) the delta commit ultimately will have 
the insert info... So really, this is just a completeness fix with no real impact 
on the operations we do? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] liujianhuiouc commented on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

2020-04-15 Thread GitBox
liujianhuiouc commented on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-614374608
 
 
   @lamber-ken please help to merge


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1224: [HUDI-397] Normalize log print statement

2020-04-15 Thread GitBox
lamber-ken commented on issue #1224: [HUDI-397] Normalize log print statement
URL: https://github.com/apache/incubator-hudi/pull/1224#issuecomment-614372674
 
 
   hi @wangxianghu, you can use the `git rebase` operation and try again


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows when record key is binary

2020-04-15 Thread GitBox
lamber-ken commented on issue #1488: [SUPPORT] Hudi table has only five rows 
when record key is binary
URL: https://github.com/apache/incubator-hudi/issues/1488#issuecomment-614370054
 
 
   hi @jvaesteves, are you still facing this problem? : )
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1517: Use appropriate FS when loading configs

2020-04-15 Thread GitBox
lamber-ken commented on issue #1517: Use appropriate FS when loading configs
URL: https://github.com/apache/incubator-hudi/pull/1517#issuecomment-614368160
 
 
   hi @afilipchik, thanks for your contribution. We need to add `[MINOR]` to the 
title for minor changes.
   
   e.g. `[MINOR] Use appropriate FS when loading configs`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 53M records

2020-04-15 Thread GitBox
lamber-ken commented on issue #1491: [SUPPORT] OutOfMemoryError during upsert 
53M records
URL: https://github.com/apache/incubator-hudi/issues/1491#issuecomment-614366584
 
 
   hi @tverdokhlebd, are you still facing this problem? Or can you ask your 
colleagues to try again?
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.

2020-04-15 Thread GitBox
lamber-ken commented on issue #1520: [HUDI-797] Small performance improvement 
for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-614365314
 
 
   > @lamber-ken could you please review this
   
   OK : )


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1519: [HUDI-777] Update Deltastreamer param description for --target-table

2020-04-15 Thread GitBox
lamber-ken commented on issue #1519: [HUDI-777] Update Deltastreamer param 
description for --target-table
URL: https://github.com/apache/incubator-hudi/pull/1519#issuecomment-614364497
 
 
   Welcome @iftachsc  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (1689d01 -> 37f91cb)

2020-04-15 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 1689d01  Merge branch 'master' into hudi_test_suite_refactor
 discard a096128  more fixes
 discard d2d2866  more fixes
 discard b519057  Build fixes after rebase
 discard ab98c40  Fix Compilation Issues + Port Bug Fixes
 discard 1450004  [HUDI-394] Provide a basic implementation of test suite
 add a189878  [HUDI-394] Provide a basic implementation of test suite
 add cf551bd  Fix Compilation Issues + Port Bug Fixes
 add d244c1b  Build fixes after rebase
 add 652a0d6  more fixes
 add 37f91cb  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (1689d01)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (37f91cb)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../src/test/resources/log4j-surefire.properties   |  2 +-
 .../deltastreamer/HoodieDeltaStreamer.java | 48 +++---
 2 files changed, 24 insertions(+), 26 deletions(-)



[incubator-hudi] branch hudi_test_suite_refactor updated (a096128 -> 1689d01)

2020-04-15 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from a096128  more fixes
 add c0f96e0  [HUDI-687] Stop incremental reader on RO table when there is 
a pending compaction (#1396)
 add a464a29  [HUDI-700]Add unit test for FileSystemViewCommand (#1490)
 add 5d717a2  [HUDI-782] Add support of Aliyun object storage service. 
(#1506)
 add 6d7ca2c  [HUDI-727]: Copy default values of fields if not present when 
rewriting incoming record with new schema (#1427)
 add 447ba3b  [MINOR] Disabling flaky test in InlineFileSystem (#1510)
 add 17bf930  [HUDI-770] Organize upsert/insert API implementation under a 
single package (#1495)
 add 661b0b3  [HUDI-761] Refactoring rollback and restore actions using the 
ActionExecutor abstraction (#1492)
 add 644c1cc  [HUDI-698]Add unit test for CleansCommand (#1449)
 add 14d4fea  [HUDI-759] Integrate checkpoint provider with delta streamer 
(#1486)
 add d65efe6  [HUDI-780] Migrate test cases to Junit 5 (#1504)
 add 9ca710c  [HUDI-777] Updated description for --target-table parameter 
(#1519)
 add 1689d01  Merge branch 'master' into hudi_test_suite_refactor

No new revisions were added by this update.

Summary of changes:
 hudi-cli/pom.xml   |  56 +++
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  65 +++
 .../apache/hudi/cli/commands/CleansCommand.java|  15 +-
 .../hudi/cli/commands/FileSystemViewCommand.java   |  47 +-
 .../org/apache/hudi/cli/commands/SparkMain.java|   4 +-
 .../hudi/cli/commands/TestCleansCommand.java   | 183 
 .../cli/commands/TestFileSystemViewCommand.java| 267 
 .../apache/hudi/cli/integ/ITTestCleansCommand.java | 101 +
 .../src/test/resources/clean.properties|   5 +-
 hudi-client/pom.xml|  24 ++
 .../hudi/client/AbstractHoodieWriteClient.java | 138 +-
 .../org/apache/hudi/client/HoodieWriteClient.java  | 475 +
 ...kException.java => HoodieRestoreException.java} |   8 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 428 +++
 .../apache/hudi/table/HoodieMergeOnReadTable.java  | 376 +++-
 .../java/org/apache/hudi/table/HoodieTable.java| 131 --
 .../hudi/table/action/BaseActionExecutor.java  |   5 +-
 .../table/action/clean/CleanActionExecutor.java|   8 +-
 .../action/commit/BaseCommitActionExecutor.java| 291 +
 .../hudi/table/action/commit/BucketInfo.java   |  28 +-
 .../hudi/table/action/commit/BucketType.java   |   6 +-
 .../commit/BulkInsertCommitActionExecutor.java |  60 +++
 .../hudi/table/action/commit/BulkInsertHelper.java |  84 
 .../BulkInsertPreppedCommitActionExecutor.java |  61 +++
 .../table/action/commit/CommitActionExecutor.java  | 176 
 .../action/commit/DeleteCommitActionExecutor.java} |  31 +-
 .../hudi/table/action/commit/DeleteHelper.java |  96 +
 .../table/action/commit/HoodieWriteMetadata.java   | 104 +
 .../hudi/table/action/commit/InsertBucket.java |  30 +-
 .../action/commit/InsertCommitActionExecutor.java} |  33 +-
 .../commit/InsertPreppedCommitActionExecutor.java} |  31 +-
 .../apache/hudi/table/action/commit/SmallFile.java |  29 +-
 .../action/commit/UpsertCommitActionExecutor.java} |  33 +-
 .../table/action/commit/UpsertPartitioner.java | 316 ++
 .../commit/UpsertPreppedCommitActionExecutor.java} |  31 +-
 .../hudi/table/action/commit/WriteHelper.java  | 105 +
 .../BulkInsertDeltaCommitActionExecutor.java   |  62 +++
 ...BulkInsertPreppedDeltaCommitActionExecutor.java |  63 +++
 .../DeleteDeltaCommitActionExecutor.java}  |  33 +-
 .../deltacommit/DeltaCommitActionExecutor.java |  94 
 .../InsertDeltaCommitActionExecutor.java   |  49 +++
 .../InsertPreppedDeltaCommitActionExecutor.java}   |  31 +-
 .../UpsertDeltaCommitActionExecutor.java   |  49 +++
 .../deltacommit/UpsertDeltaCommitPartitioner.java  | 142 ++
 .../UpsertPreppedDeltaCommitActionExecutor.java}   |  31 +-
 .../action/restore/BaseRestoreActionExecutor.java  | 111 +
 .../restore/CopyOnWriteRestoreActionExecutor.java  |  57 +++
 .../restore/MergeOnReadRestoreActionExecutor.java  |  64 +++
 .../rollback/BaseRollbackActionExecutor.java   | 192 +
 .../CopyOnWriteRollbackActionExecutor.java |  94 
 .../MergeOnReadRollbackActionExecutor.java | 241 +++
 .../{ => action}/rollback/RollbackHelper.java  |   2 +-
 .../{ => action}/rollback/RollbackRequest.java |   2 +-
 .../org/apache/hudi/client/TestClientRollback.java |  16 +-
 .../apache/hudi/client/TestHoodieClientBase.java   |   3 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |  48 ++-
 .../hudi/common/HoodieMergeOnReadTestUtils.java|  40 

[jira] [Resolved] (HUDI-780) Make JUnit 5 co-exist in the project

2020-04-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-780.
-
Resolution: Done

> Make JUnit 5 co-exist in the project
> 
>
> Key: HUDI-780
> URL: https://issues.apache.org/jira/browse/HUDI-780
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> - Add Junit 5
> - Convert a Junit 4 testcase to use 5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-780) Make JUnit 5 co-exist in the project

2020-04-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-780:

Status: Closed  (was: Patch Available)

> Make JUnit 5 co-exist in the project
> 
>
> Key: HUDI-780
> URL: https://issues.apache.org/jira/browse/HUDI-780
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> - Add Junit 5
> - Convert a Junit 4 testcase to use 5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-780) Make JUnit 5 co-exist in the project

2020-04-15 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reopened HUDI-780:
-

> Make JUnit 5 co-exist in the project
> 
>
> Key: HUDI-780
> URL: https://issues.apache.org/jira/browse/HUDI-780
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> - Add Junit 5
> - Convert a Junit 4 testcase to use 5



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1515: [HUDI-795] Ignoring missing aux folder

2020-04-15 Thread GitBox
afilipchik commented on a change in pull request #1515: [HUDI-795] Ignoring 
missing aux folder
URL: https://github.com/apache/incubator-hudi/pull/1515#discussion_r409208032
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java
 ##
 @@ -219,14 +220,23 @@ private boolean 
deleteArchivedInstants(List archivedInstants) thr
* @throws IOException in case of error
*/
   private boolean deleteAllInstantsOlderorEqualsInAuxMetaFolder(HoodieInstant 
thresholdInstant) throws IOException {
-List instants = metaClient.scanHoodieInstantsFromFileSystem(
-new Path(metaClient.getMetaAuxiliaryPath()), 
HoodieActiveTimeline.VALID_EXTENSIONS_IN_ACTIVE_TIMELINE, false);
+List instants = null;
+boolean success = true;
+try {
+  instants =
+  metaClient.scanHoodieInstantsFromFileSystem(
+  new Path(metaClient.getMetaAuxiliaryPath()),
+  HoodieActiveTimeline.VALID_EXTENSIONS_IN_ACTIVE_TIMELINE,
+  false);
+} catch (FileNotFoundException e) {
 
 Review comment:
   not sure about it. I did see that the folder disappears/reappears sometimes. It 
can be related to the way the GCS connector treats the situation when all the 
files in the folder are removed. 
   
   On GCS, folders are not real, they are just logical. So, a file with the name 
gs://bucket/folder/file doesn't live in a folder "folder"; it just has a long 
"/"-separated name. I saw some logic in the GCS FS Java implementation that 
creates a fake "/" file when everything is removed from the logical folder. So, 
my speculation is that if this call fails or is not enabled, removing all the 
files from the folder will end up with the folder being removed as well. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1517: Use appropriate FS when loading configs

2020-04-15 Thread GitBox
afilipchik commented on a change in pull request #1517: Use appropriate FS when 
loading configs
URL: https://github.com/apache/incubator-hudi/pull/1517#discussion_r409206422
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -391,7 +391,9 @@ public DeltaSyncService(Config cfg, JavaSparkContext jssc, 
FileSystem fs, HiveCo
   ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != 
Operation.UPSERT,
   "'--filter-dupes' needs to be disabled when '--op' is 'UPSERT' to 
ensure updates are not missed.");
 
-  this.props = properties != null ? properties : 
UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), 
cfg.configs).getConfig();
+  this.props = properties != null ? properties : UtilHelpers.readConfig(
+  FSUtils.getFs(cfg.propsFilePath, jssc.hadoopConfiguration()),
 
 Review comment:
   I think it is already doing that: 
   ` FileSystem fs = FSUtils.getFs(commonPropsFile, 
jssc.hadoopConfiguration());`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] afilipchik commented on a change in pull request #1517: Use appropriate FS when loading configs

2020-04-15 Thread GitBox
afilipchik commented on a change in pull request #1517: Use appropriate FS when 
loading configs
URL: https://github.com/apache/incubator-hudi/pull/1517#discussion_r409205392
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -391,7 +391,9 @@ public DeltaSyncService(Config cfg, JavaSparkContext jssc, 
FileSystem fs, HiveCo
   ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != 
Operation.UPSERT,
   "'--filter-dupes' needs to be disabled when '--op' is 'UPSERT' to 
ensure updates are not missed.");
 
-  this.props = properties != null ? properties : 
UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), 
cfg.configs).getConfig();
+  this.props = properties != null ? properties : UtilHelpers.readConfig(
+  FSUtils.getFs(cfg.propsFilePath, jssc.hadoopConfiguration()),
 
 Review comment:
   sure


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-15 Thread GitBox
codecov-io edited a comment on issue #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#issuecomment-613236143
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=h1) 
Report
   > Merging 
[#1500](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/d65efe659d36dfb80be16a6389c5aba2f8701c39=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1500/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=tree)
   
   ```diff
    @@              Coverage Diff              @@
    ##             master    #1500      +/-   ##
    ============================================
    + Coverage     72.17%   72.20%   +0.03%     
      Complexity      294      294              
    ============================================
      Files           373      373              
      Lines         16282    16293      +11     
      Branches       1638     1639       +1     
    ============================================
    + Hits          11751    11764      +13     
    + Misses         3795     3793       -2     
      Partials        736      736              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `84.58% <100.00%> (+0.20%)` | `0.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==)
 | `56.70% <100.00%> (+6.13%)` | `0.00 <0.00> (ø)` | |
   | 
[...a/org/apache/hudi/common/util/collection/Pair.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvY29sbGVjdGlvbi9QYWlyLmphdmE=)
 | `72.00% <0.00%> (-4.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...che/hudi/common/util/BufferedRandomAccessFile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQnVmZmVyZWRSYW5kb21BY2Nlc3NGaWxlLmphdmE=)
 | `55.26% <0.00%> (+0.87%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=footer).
 Last update 
[d65efe6...f8a756e](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-15 Thread GitBox
kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r409198506
 
 

 ##
 File path: hudi-spark/src/test/java/DataSourceUtilsTest.java
 ##
 @@ -17,18 +17,58 @@
  */
 
 import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.NoOpBulkInsertPartitioner;
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericData;
 import org.apache.avro.generic.GenericRecord;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runner.RunWith;
 
 Review comment:
   I'm using Mockito; until Mockito is upgraded to a version that supports JUnit 5, 
it doesn't seem to work.
   I think it's the same as 
https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/test/java/org/apache/hudi/common/table/view/TestPriorityBasedFileSystemView.java#L54
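   For reference, once Mockito is on a version that ships the `mockito-junit-jupiter` extension, the JUnit 4 runner-based setup above can move to the JUnit 5 style sketched below. This is a generic example, not Hudi test code; the `List` collaborator is purely for illustration.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.when;

import java.util.List;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

// JUnit 5 replacement for @RunWith(MockitoJUnitRunner.class):
// the MockitoExtension initializes @Mock fields before each test.
@ExtendWith(MockitoExtension.class)
class ExampleMockitoJUnit5Test {

  @Mock
  private List<String> collaborator;

  @Test
  void mockIsInjectedByExtension() {
    when(collaborator.get(0)).thenReturn("hudi");
    assertEquals("hudi", collaborator.get(0));
  }
}
```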
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-15 Thread GitBox
codecov-io commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-614333053
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=h1) 
Report
   > Merging 
[#1457](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/661b0b3bab85c8fd958a9e256dab99790ea92713=desc)
 will **increase** coverage by `0.09%`.
   > The diff coverage is `71.60%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1457/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=tree)
   
   ```diff
    @@              Coverage Diff              @@
    ##             master    #1457      +/-   ##
    ============================================
    + Coverage     72.20%   72.29%   +0.09%     
    - Complexity      289      294       +5     
    ============================================
      Files           372      374       +2     
      Lines         16267    16369     +102     
      Branches       1637     1654      +17     
    ============================================
    + Hits          11745    11834      +89     
    - Misses         3787     3799      +12     
    - Partials        735      736       +1     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...in/java/org/apache/hudi/io/HoodieAppendHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllQXBwZW5kSGFuZGxlLmphdmE=)
 | `84.17% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/hive/util/HiveSchemaUtil.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1oaXZlLXN5bmMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGl2ZS91dGlsL0hpdmVTY2hlbWFVdGlsLmphdmE=)
 | `68.10% <50.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[.../apache/hudi/common/table/TableSchemaResolver.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL1RhYmxlU2NoZW1hUmVzb2x2ZXIuamF2YQ==)
 | `63.86% <63.86%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...in/java/org/apache/hudi/hive/HoodieHiveClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1oaXZlLXN5bmMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGl2ZS9Ib29kaWVIaXZlQ2xpZW50LmphdmE=)
 | `74.63% <75.00%> (+2.43%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `70.24% <96.55%> (+2.28%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `84.64% <100.00%> (+0.27%)` | `0.00 <0.00> (ø)` | |
   | 
[...ain/java/org/apache/hudi/io/HoodieWriteHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllV3JpdGVIYW5kbGUuamF2YQ==)
 | `74.46% <100.00%> (-0.54%)` | `0.00 <0.00> (ø)` | |
   | 
[...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvYXZyby9Ib29kaWVBdnJvVXRpbHMuamF2YQ==)
 | `93.13% <100.00%> (+1.13%)` | `0.00 <0.00> (ø)` | |
   | 
[...c/main/java/org/apache/hudi/hive/HiveSyncTool.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS1oaXZlLXN5bmMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGl2ZS9IaXZlU3luY1Rvb2wuamF2YQ==)
 | `74.15% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...lities/checkpointing/KafkaConnectHdfsProvider.java](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvS2Fma2FDb25uZWN0SGRmc1Byb3ZpZGVyLmphdmE=)
 | `89.28% <0.00%> (-3.03%)` | `14.00% <0.00%> (+2.00%)` | :arrow_down: |
   | ... and [13 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1457/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1457?src=pr=footer).
 Last 

[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-15 Thread GitBox
xushiyan commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r409173783
 
 

 ##
 File path: hudi-spark/src/test/java/DataSourceUtilsTest.java
 ##
 @@ -17,18 +17,58 @@
  */
 
 import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.NoOpBulkInsertPartitioner;
 
 import org.apache.avro.Schema;
 import org.apache.avro.generic.GenericData;
 import org.apache.avro.generic.GenericRecord;
-import org.junit.jupiter.api.Test;
+import org.apache.spark.api.java.JavaRDD;
+import org.junit.Before;
+import org.junit.Test;
+import org.junit.runner.RunWith;
 
 Review comment:
   @kwondw We're moving to JUnit 5.. could you please switch to the JUnit 5 API? 
Thank you.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-15 Thread GitBox
vinothchandar commented on issue #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512#issuecomment-614308722
 
 
   Fair enough.. Hypothetically, we could support a world where base files 
across partitions are in different formats.. So given it's not a real problem 
today.. we can deal with it down the line? 
   
   btw there are also things like index.type which should go there today and 
are not.. IIRC there is a ticket for that as well.. the main challenge there is 
how we would apply this change to the hoodie.properties file.. (don't think 
overwriting will be atomic...) also, does this change fix older tables 
automatically.. If we can flesh those out, that may be good. wdyt 
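As a rough illustration of the atomicity concern, one hypothetical way to apply such a change would be to stage the updated properties in a temp file and rename it over hoodie.properties. Even that has caveats: Hadoop's `FileSystem.rename` will not overwrite an existing destination, and rename is not atomic on every object store, so this is a sketch of the problem rather than a settled design. Class and method names below are placeholders, not Hudi code.

```java
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch: stage updated properties in a temp file, then swap it in.
// The delete + rename below leaves a window with no properties file, which is
// exactly the kind of non-atomicity the comment above is worried about.
final class PropertiesFileUpdater {

  static void update(FileSystem fs, Path propsFile, Properties updated) throws IOException {
    Path tmp = new Path(propsFile.getParent(), propsFile.getName() + ".tmp");
    try (FSDataOutputStream out = fs.create(tmp, true)) {
      updated.store(out, "updated table properties");
    }
    fs.delete(propsFile, false);   // rename won't overwrite an existing file
    if (!fs.rename(tmp, propsFile)) {
      throw new IOException("Could not move " + tmp + " to " + propsFile);
    }
  }
}
```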


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (5803025 -> a096128)

2020-04-15 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard 5803025  more fixes
 add a096128  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (5803025)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (a096128)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../hudi/testsuite/reader/TestDFSHoodieDatasetInputReader.java   | 7 ---
 .../src/test/resources/log4j-surefire-quiet.properties   | 1 +
 .../java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java  | 9 +
 3 files changed, 10 insertions(+), 7 deletions(-)



[GitHub] [incubator-hudi] vinothchandar commented on issue #1520: [HUDI-797] Small performance improvement for rewriting records.

2020-04-15 Thread GitBox
vinothchandar commented on issue #1520: [HUDI-797] Small performance 
improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520#issuecomment-614301101
 
 
   @lamber-ken could you please review this 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-15 Thread GitBox
prashantwason commented on a change in pull request #1476: [HUDI-757] Added 
hudi-cli command to export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#discussion_r409158527
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
 ##
 @@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieArchivedMetaEntry;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.avro.model.HoodieSavepointMetadata;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.avro.specific.SpecificData;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * CLI commands to export various information from a HUDI dataset.
+ *
+ * "export instants": Export Instants and their metadata from the Timeline to 
a local
+ *directory specified by the parameter --localFolder
+ *  The instants are exported in the json format.
+ */
+@Component
+public class ExportCommand implements CommandMarker {
+
+  @CliCommand(value = "export instants", help = "Export Instants and their 
metadata from the Timeline")
+  public String exportInstants(
+  @CliOption(key = {"limit"}, help = "Limit Instants", 
unspecifiedDefaultValue = "-1") final Integer limit,
+  @CliOption(key = {"actions"}, help = "Comma seperated list of Instant 
actions to export",
+unspecifiedDefaultValue = 
"clean,commit,deltacommit,rollback,savepoint,restore") final String filter,
+  @CliOption(key = {"desc"}, help = "Ordering", unspecifiedDefaultValue = 
"false") final boolean descending,
+  @CliOption(key = {"localFolder"}, help = "Local Folder to export to", 
mandatory = true) String localFolder)
+  throws Exception {
+
+final String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+final Path archivePath = new Path(basePath + 
"/.hoodie/.commits_.archive*");
+final Set actionSet = new 
HashSet(Arrays.asList(filter.split(",")));
+int numExports = limit == -1 ? Integer.MAX_VALUE : limit;
+int numCopied = 0;
+
+if (! new File(localFolder).isDirectory()) {
+  throw new RuntimeException(localFolder + " is not a valid local 
directory");
+}
+
+// The non archived instants can be listed from the Timeline.
+HoodieTimeline timeline = 
HoodieCLI.getTableMetaClient().getActiveTimeline().filterCompletedInstants()
+.filter(i -> actionSet.contains(i.getAction()));
+List nonArchivedInstants = 
timeline.getInstants().collect(Collectors.toList());
+
+// Archived instants are in the commit archive files
+FileStatus[] statuses = FSUtils.getFs(basePath, 
HoodieCLI.conf).globStatus(archivePath);
+List archivedStatuses = Arrays.stream(statuses).sorted((f1, 
f2) -> 

[incubator-hudi] branch master updated (d65efe6 -> 9ca710c)

2020-04-15 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from d65efe6  [HUDI-780] Migrate test cases to Junit 5 (#1504)
 add 9ca710c  [HUDI-777] Updated description for --target-table parameter 
(#1519)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java| 2 +-
 .../hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1519: [HUDI-777] Update Deltastreamer param description for --target-table

2020-04-15 Thread GitBox
vinothchandar merged pull request #1519: [HUDI-777] Update Deltastreamer param 
description for --target-table
URL: https://github.com/apache/incubator-hudi/pull/1519
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-04-15 Thread Ramachandran M S (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084405#comment-17084405
 ] 

Ramachandran M S commented on HUDI-409:
---

[~vinoth] can you please give more details (test cases) to reproduce if there 
are any errors. I fixed this as part of 0.5.2 and added tests to simulate the 
failure situations. Highly interested in knowing what case is crashing the 
reader now.

 

Or if this is about adding a magic header, [~nagarwal] and I discussed it and 
decided it was not required, for the following reasons:
 * The current header + magic hash (#HUDI#) + footer itself can be used to detect 
corruptions
 * We discussed a solution where just modifying the current reader logic can avoid 
a clash of data with the header (details are in the 
[PR|https://issues.apache.org/jira/browse/HUDI-409])
 * Modifying/adding a new magic header would involve a considerable amount of work 
to maintain different versions of log files in production at the same time, and 
could possibly introduce some bugs

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Ramachandran M S
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-797:

Labels: pull-request-available  (was: )

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason opened a new pull request #1520: [HUDI-797] Small performance improvement for rewriting records.

2020-04-15 Thread GitBox
prashantwason opened a new pull request #1520: [HUDI-797] Small performance 
improvement for rewriting records.
URL: https://github.com/apache/incubator-hudi/pull/1520
 
 
   When adding the HUDI metadata fields to the schema, the position of the 
incoming schema fields is retained. This allows us to look up fields in the 
existing record and assign them to the new record using field positions (array 
index lookup). This is faster than looking up fields by name (HashMap-based lookup).
   
   ## What is the purpose of the pull request
   
   Small performance improvement for rewriting records during the ingestion 
phase. [HUDI-797](https://issues.apache.org/jira/browse/HUDI-797) has the 
details on the use case and improvements. 
   
   ## Brief change log
   
   1. Modified HoodieAvroUtils.addMetadataFields to retain the field positions.
   2. Added a new rewriting function HoodieAvroUtils.rewriteHoodieRecord which 
rewrites using field positions rather than field names.
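   
   A rough sketch of the idea behind these two changes (illustrative only, not 
the actual HoodieAvroUtils code; it assumes Avro 1.8-style APIs and models the 
metadata fields as plain string fields):
   
```java
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.Schema.Field;

public class MetadataFieldsSketch {

  // Append the metadata fields *after* the incoming fields so that every
  // incoming field keeps its original position in the merged (HUDI) schema.
  public static Schema addMetadataFields(Schema incoming, List<String> metaFieldNames) {
    List<Field> fields = new ArrayList<>();
    for (Field f : incoming.getFields()) {
      // Field positions follow insertion order, so incoming fields stay at 0..n-1.
      fields.add(new Field(f.name(), f.schema(), f.doc(), f.defaultVal()));
    }
    for (String name : metaFieldNames) {
      fields.add(new Field(name, Schema.create(Schema.Type.STRING), "hudi metadata field", (Object) null));
    }
    Schema merged = Schema.createRecord(incoming.getName(), incoming.getDoc(),
        incoming.getNamespace(), incoming.isError());
    merged.setFields(fields);
    return merged;
  }
}
```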
   
   ## Verify this pull request
   
   This change is verified automatically by all the HUDI client tests which 
ingest data into HUDI.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084382#comment-17084382
 ] 

Prashant Wason commented on HUDI-797:
-

My testing with two schemas (large / small) while rewriting the existing 
parquet files has shown about a 29 usec improvement per record for the large 
schema and a 6 usec improvement for the smaller schema. 

For 1 billion records the large-schema case translates into a saving of about 
29,000 seconds (29 usec x 1e9 records), or roughly 8 hours of CPU time.

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084380#comment-17084380
 ] 

Prashant Wason commented on HUDI-797:
-

The conversion step uses a HashMap to find the field in the incoming record by 
its name. HashMap lookups are slow (relative to array index lookups).

One potential improvement is to look up fields using the integer position of the 
field in the AVRO schema. 

How will this work:
--

Assume the incoming schema has 2 fields. On parsing this schema, the Schema 
object will have 2 fields with positions 0 and 1. When we rewrite the schema to 
create the HUDI schema 
([HoodieAvroUtils::addMetadataFields()|[https://github.com/apache/incubator-hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java#L103]])
 we will add some extra fields. The HUDI schema may then no longer have the 2 
incoming fields at positions 0 and 1. This is the reason we currently need to 
look up the fields by their name.

Suppose that in addMetadataFields we ensure that the incoming schema fields 
retain their positions (by adding the HUDI metadata fields at the end). Then we 
can easily look up fields by their position and assign them by position too.
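
To illustrate the idea, here is a minimal sketch (not the actual rewriteRecord 
implementation) of name-based vs position-based copying using plain Avro 
GenericRecords; it assumes the HUDI schema keeps the incoming fields at 
positions 0..n-1:

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteSketch {

  // Name-based copy: every field is resolved by name (hash-based lookup).
  public static GenericRecord rewriteByName(GenericRecord record, Schema hudiSchema) {
    GenericRecord newRecord = new GenericData.Record(hudiSchema);
    for (Schema.Field f : record.getSchema().getFields()) {
      newRecord.put(f.name(), record.get(f.name()));
    }
    return newRecord;
  }

  // Position-based copy: valid only when addMetadataFields keeps the incoming
  // fields at positions 0..n-1, so index i refers to the same field in both schemas.
  public static GenericRecord rewriteByPosition(GenericRecord record, Schema hudiSchema) {
    GenericRecord newRecord = new GenericData.Record(hudiSchema);
    int numIncomingFields = record.getSchema().getFields().size();
    for (int i = 0; i < numIncomingFields; i++) {
      newRecord.put(i, record.get(i));
    }
    return newRecord;
  }
}
{code}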

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084378#comment-17084378
 ] 

Prashant Wason commented on HUDI-797:
-

The HoodieAvroUtils::rewriteRecord function has two parts:
 # Conversion: Convert the record to the newSchema
 # Validation: Validate that the converted record is valid as per the newSchema 
(uses the GenericData.get().validate function from org.apache.avro)

I tested the time taken by the above two parts with a large schema (50 fields) 
while converting 100 million records.

Current: Test takes total 74217 msec (1.484 usec / record)
No validation: Test takes total 37168 msec (0.743 usec / record)

This shows that 50% of the time is spent in the conversion part and 50% in the 
validation part. This is totally dependent on the schema though - large 
complicated schemas (unions, records, etc.) take longer to verify due to the 
increased number of fields to validate or the complexity of validating various 
field types (primitive field types are trivial to check using the instanceof 
operator).
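
For reference, a rough sketch of how the two phases can be timed separately 
(illustrative only - the copy loop and schema handling are simplified; the 
validation call is the same GenericData.get().validate mentioned above):

{code:java}
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

public class RewriteTimingSketch {

  // Accumulates the time spent copying fields vs validating the rewritten record.
  public static void time(List<GenericRecord> records, Schema newSchema) {
    long convertNanos = 0;
    long validateNanos = 0;
    int invalid = 0;
    for (GenericRecord record : records) {
      long t0 = System.nanoTime();
      GenericRecord rewritten = new GenericData.Record(newSchema);
      for (Schema.Field f : record.getSchema().getFields()) {
        rewritten.put(f.name(), record.get(f.name()));
      }
      long t1 = System.nanoTime();
      if (!GenericData.get().validate(newSchema, rewritten)) {
        invalid++;
      }
      long t2 = System.nanoTime();
      convertNanos += (t1 - t0);
      validateNanos += (t2 - t1);
    }
    System.out.printf("conversion: %d ms, validation: %d ms, invalid records: %d%n",
        convertNanos / 1_000_000, validateNanos / 1_000_000, invalid);
  }
}
{code}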

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-797:

Status: Open  (was: New)

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-797:
---

Assignee: Prashant Wason

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-797:

Status: In Progress  (was: Open)

> Improve performance of rewriting AVRO records in 
> HoodieAvroUtils::rewriteRecord
> ---
>
> Key: HUDI-797
> URL: https://issues.apache.org/jira/browse/HUDI-797
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Major
>
> Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO 
> encoded records. These records have a [schema 
> |https://avro.apache.org/docs/current/spec.html]which is determined by the 
> dataset user and provided to HUDI during the writing process (as part of 
> HUDIWriteConfig). The records are finally saved in [parquet 
> |https://parquet.apache.org/]files which include the schema (in parquet 
> format) in the footer of individual files.
>  
> HUDI design requires addition of some metadata fields to all incoming records 
> to aid in book-keeping and indexing. To achieve this, the incoming schema 
> needs to be modified by adding the HUDI metadata fields and is called the 
> HUDI schema for the dataset. Each incoming record is then re-written to 
> translate it from the incoming schema into the HUDI schema. Re-writing the 
> incoming records to a new schema is reasonably fast, as it looks up all fields 
> in the incoming record and adds them to a new record, but this lookup takes 
> place for each and every incoming record. 
> When ingesting large datasets (billions of records) or a large number of 
> datasets, even small improvements in this CPU-bound conversion can translate 
> into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-797) Improve performance of rewriting AVRO records in HoodieAvroUtils::rewriteRecord

2020-04-15 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-797:
---

 Summary: Improve performance of rewriting AVRO records in 
HoodieAvroUtils::rewriteRecord
 Key: HUDI-797
 URL: https://issues.apache.org/jira/browse/HUDI-797
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Prashant Wason


Data is ingested into a [HUDI |https://hudi.apache.org/]dataset as AVRO encoded 
records. These records have a [schema 
|https://avro.apache.org/docs/current/spec.html]which is determined by the 
dataset user and provided to HUDI during the writing process (as part of 
HUDIWriteConfig). The records are finally saved in [parquet 
|https://parquet.apache.org/]files which include the schema (in parquet format) 
in the footer of individual files.

 

HUDI design requires addition of some metadata fields to all incoming records 
to aid in book-keeping and indexing. To achieve this, the incoming schema needs 
to be modified by adding the HUDI metadata fields and is called the HUDI schema 
for the dataset. Each incoming record is then re-written to translate it from 
the incoming schema into the HUDI schema. Re-writing the incoming records to a 
new schema is reasonably fast, as it looks up all fields in the incoming record 
and adds them to a new record, but this lookup takes place for each and every 
incoming record. 

When ingesting large datasets (billions of records) or a large number of 
datasets, even small improvements in this CPU-bound conversion can translate 
into a notable improvement in compute efficiency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-741) Fix Hoodie's schema evolution checks

2020-04-15 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084372#comment-17084372
 ] 

Prashant Wason commented on HUDI-741:
-

Sure. Will work on the blog post soon.

> Fix Hoodie's schema evolution checks
> 
>
> Key: HUDI-741
> URL: https://issues.apache.org/jira/browse/HUDI-741
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 120h
>  Time Spent: 10m
>  Remaining Estimate: 119h 50m
>
> HUDI requires a Schema to be specified in HoodieWriteConfig, which is used by 
> the HoodieWriteClient to create the records. The schema is also saved in the 
> data files (parquet format) and log files (avro format).
> Since a schema is provided each time new data is ingested into a HUDI 
> dataset, the schema can evolve over time. But HUDI should ensure that the 
> evolved schema is compatible with the older schema.
> HUDI specific validation of schema evolution should ensure that a newer 
> schema can be used for the dataset by checking that the data written using 
> the old schema can be read using the new schema.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-15 Thread GitBox
lamber-ken commented on a change in pull request #1476: [HUDI-757] Added 
hudi-cli command to export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#discussion_r409110579
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
 ##
 @@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieArchivedMetaEntry;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.avro.model.HoodieSavepointMetadata;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.avro.specific.SpecificData;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * CLI commands to export various information from a HUDI dataset.
+ *
+ * "export instants": Export Instants and their metadata from the Timeline to 
a local
+ *directory specified by the parameter --localFolder
+ *  The instants are exported in the json format.
+ */
+@Component
+public class ExportCommand implements CommandMarker {
+
+  @CliCommand(value = "export instants", help = "Export Instants and their 
metadata from the Timeline")
+  public String exportInstants(
+  @CliOption(key = {"limit"}, help = "Limit Instants", 
unspecifiedDefaultValue = "-1") final Integer limit,
+  @CliOption(key = {"actions"}, help = "Comma seperated list of Instant 
actions to export",
+unspecifiedDefaultValue = 
"clean,commit,deltacommit,rollback,savepoint,restore") final String filter,
+  @CliOption(key = {"desc"}, help = "Ordering", unspecifiedDefaultValue = 
"false") final boolean descending,
+  @CliOption(key = {"localFolder"}, help = "Local Folder to export to", 
mandatory = true) String localFolder)
+  throws Exception {
+
+final String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+final Path archivePath = new Path(basePath + 
"/.hoodie/.commits_.archive*");
+final Set<String> actionSet = new 
HashSet<>(Arrays.asList(filter.split(",")));
+int numExports = limit == -1 ? Integer.MAX_VALUE : limit;
+int numCopied = 0;
+
+if (! new File(localFolder).isDirectory()) {
+  throw new RuntimeException(localFolder + " is not a valid local 
directory");
+}
+
+// The non archived instants can be listed from the Timeline.
+HoodieTimeline timeline = 
HoodieCLI.getTableMetaClient().getActiveTimeline().filterCompletedInstants()
+.filter(i -> actionSet.contains(i.getAction()));
+List<HoodieInstant> nonArchivedInstants = 
timeline.getInstants().collect(Collectors.toList());
+
+// Archived instants are in the commit archive files
+FileStatus[] statuses = FSUtils.getFs(basePath, 
HoodieCLI.conf).globStatus(archivePath);
+List<FileStatus> archivedStatuses = Arrays.stream(statuses).sorted((f1, 
f2) -> 

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-15 Thread GitBox
lamber-ken commented on a change in pull request #1476: [HUDI-757] Added 
hudi-cli command to export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#discussion_r409110372
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
 ##
 @@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieArchivedMetaEntry;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.avro.model.HoodieSavepointMetadata;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.avro.specific.SpecificData;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * CLI commands to export various information from a HUDI dataset.
+ *
+ * "export instants": Export Instants and their metadata from the Timeline to 
a local
+ *directory specified by the parameter --localFolder
+ *  The instants are exported in the json format.
+ */
+@Component
+public class ExportCommand implements CommandMarker {
+
+  @CliCommand(value = "export instants", help = "Export Instants and their 
metadata from the Timeline")
+  public String exportInstants(
+  @CliOption(key = {"limit"}, help = "Limit Instants", 
unspecifiedDefaultValue = "-1") final Integer limit,
+  @CliOption(key = {"actions"}, help = "Comma seperated list of Instant 
actions to export",
+unspecifiedDefaultValue = 
"clean,commit,deltacommit,rollback,savepoint,restore") final String filter,
+  @CliOption(key = {"desc"}, help = "Ordering", unspecifiedDefaultValue = 
"false") final boolean descending,
+  @CliOption(key = {"localFolder"}, help = "Local Folder to export to", 
mandatory = true) String localFolder)
+  throws Exception {
+
+final String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+final Path archivePath = new Path(basePath + 
"/.hoodie/.commits_.archive*");
+final Set<String> actionSet = new 
HashSet<>(Arrays.asList(filter.split(",")));
+int numExports = limit == -1 ? Integer.MAX_VALUE : limit;
+int numCopied = 0;
+
+if (! new File(localFolder).isDirectory()) {
+  throw new RuntimeException(localFolder + " is not a valid local 
directory");
+}
+
+// The non archived instants can be listed from the Timeline.
+HoodieTimeline timeline = 
HoodieCLI.getTableMetaClient().getActiveTimeline().filterCompletedInstants()
+.filter(i -> actionSet.contains(i.getAction()));
+List<HoodieInstant> nonArchivedInstants = 
timeline.getInstants().collect(Collectors.toList());
+
+// Archived instants are in the commit archive files
+FileStatus[] statuses = FSUtils.getFs(basePath, 
HoodieCLI.conf).globStatus(archivePath);
+List<FileStatus> archivedStatuses = Arrays.stream(statuses).sorted((f1, 
f2) -> 

[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-15 Thread GitBox
lamber-ken edited a comment on issue #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512#issuecomment-614255323
 
 
   hi @vinothchandar @pratyakshsharma, thanks very much for reviewing this. 
   
   The base file format for each hudi dataset can't be changed across writes, 
so we need to store this option in hoodie.properties. This is the first step 
towards supporting multiple data formats (read / write options).
   
   
![image](https://user-images.githubusercontent.com/20113411/79383212-63d64f00-7f97-11ea-8fc1-9fdfe34a911d.png)
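   
   As a rough illustration (not the actual Hudi code - the path handling, 
default and error message here are made up), enforcing the stored option could 
look like this, reading `.hoodie/hoodie.properties` with plain java.util.Properties:
   
```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

public class BaseFileFormatCheck {

  private static final String KEY = "hoodie.table.base.file.format";

  // Fails fast if a writer asks for a base file format different from the one
  // recorded in hoodie.properties when the table was created.
  public static void checkFormat(String basePath, String requestedFormat) throws Exception {
    Path propsFile = Paths.get(basePath, ".hoodie", "hoodie.properties");
    Properties props = new Properties();
    try (InputStream in = Files.newInputStream(propsFile)) {
      props.load(in);
    }
    // Assume parquet when the option is absent (older tables).
    String storedFormat = props.getProperty(KEY, "PARQUET");
    if (!storedFormat.equalsIgnoreCase(requestedFormat)) {
      throw new IllegalArgumentException("Table was created with base file format "
          + storedFormat + ", cannot write with " + requestedFormat);
    }
  }
}
```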
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-15 Thread GitBox
lamber-ken commented on issue #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512#issuecomment-614255323
 
 
   hi @vinothchandar @pratyakshsharma, thanks very much for reviewing this. 
   
   The base file format for each hudi dataset can't be changed across writes, 
so we need to store this option in hoodie.properties.
   
   
![image](https://user-images.githubusercontent.com/20113411/79383212-63d64f00-7f97-11ea-8fc1-9fdfe34a911d.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-15 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r409104802
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SchemaCompatibility.java
 ##
 @@ -0,0 +1,566 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.avro.AvroRuntimeException;
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.Schema.Type;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * NOTE: This code is copied from org.apache.avro.SchemaCompatibility and 
changed for HUDI use case.
 
 Review comment:
   Not required any longer as I have removed the copied file. Please see the 
latest diff.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-15 Thread GitBox
prashantwason commented on a change in pull request #1457: [HUDI-741] Added 
checks to validate Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#discussion_r409104956
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SchemaUtil.java
 ##
 @@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import java.io.IOException;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.hudi.common.table.log.block.HoodieLogBlock;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.InvalidTableException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.format.converter.ParquetMetadataConverter;
+import org.apache.parquet.hadoop.ParquetFileReader;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.apache.parquet.schema.MessageType;
+
+/**
+ * Utilities to read Schema from data files and log files.
+ */
+public class SchemaUtil {
 
 Review comment:
   Unit tests implemented.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1457: [HUDI-741] Added checks to validate Hoodie's schema evolution.

2020-04-15 Thread GitBox
prashantwason commented on issue #1457: [HUDI-741] Added checks to validate 
Hoodie's schema evolution.
URL: https://github.com/apache/incubator-hudi/pull/1457#issuecomment-614253179
 
 
   I have reworked the schema compatibility check code to avoid copying the 
entire avro.SchemaCompatibility class. I took out the relevant portion and it 
worked. I think the checks are now simpler and clearly defined within 
TableSchemaResolver.isSchemaCompatible(...). 
   
   No need for NOTICE and LICENSE updates as we are not copying the 
avro.SchemaCompatibility class.
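   
   For reference, the core reader/writer check can be expressed with Avro's own 
SchemaCompatibility utility (new schema as reader, old schema as writer); this 
is only a sketch of the concept, not the code in this PR:
   
```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;
import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;

public class SchemaEvolutionCheckSketch {

  // Returns true if data written with oldSchema can be read using newSchema,
  // i.e. the evolved schema is backwards compatible for this dataset.
  public static boolean isSchemaCompatible(Schema oldSchema, Schema newSchema) {
    SchemaCompatibility.SchemaPairCompatibility result =
        SchemaCompatibility.checkReaderWriterCompatibility(newSchema, oldSchema);
    return result.getType() == SchemaCompatibilityType.COMPATIBLE;
  }
}
```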
   
   All unit tests have been completed too. So this is now good for a final 
review.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hmatu commented on issue #1432: [HUDI-716] Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-04-15 Thread GitBox
hmatu commented on issue #1432: [HUDI-716] Exception: Not an Avro data file 
when running HoodieCleanClient.runClean
URL: https://github.com/apache/incubator-hudi/pull/1432#issuecomment-614241591
 
 
   @lamber-ken thanks for the changes, I have been running into the same issue 
and this fix worked!  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar merged pull request #1504: [HUDI-780] Migrate test cases to Junit 5

2020-04-15 Thread GitBox
vinothchandar merged pull request #1504: [HUDI-780] Migrate test cases to Junit 
5
URL: https://github.com/apache/incubator-hudi/pull/1504
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-780] Migrate test cases to Junit 5 (#1504)

2020-04-15 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new d65efe6  [HUDI-780] Migrate test cases to Junit 5 (#1504)
d65efe6 is described below

commit d65efe659d36dfb80be16a6389c5aba2f8701c39
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Apr 15 12:35:01 2020 -0700

[HUDI-780] Migrate test cases to Junit 5 (#1504)
---
 hudi-cli/pom.xml   |  31 +++
 hudi-client/pom.xml|  24 ++
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |   4 +-
 .../org/apache/hudi/metrics/TestHoodieMetrics.java |  15 +-
 hudi-common/pom.xml|  24 ++
 .../org/apache/hudi/avro/TestHoodieAvroUtils.java  |  28 +-
 .../hudi/avro/TestHoodieAvroWriteSupport.java  |  21 +-
 hudi-hadoop-mr/pom.xml |  24 ++
 .../apache/hudi/hadoop/InputPathHandlerTest.java   |  28 +-
 .../org/apache/hudi/hadoop/TestAnnotation.java |   4 +-
 .../hudi/hadoop/TestRecordReaderValueIterator.java |  13 +-
 hudi-hive-sync/pom.xml |  26 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java | 297 ++---
 .../test/java/org/apache/hudi/hive/TestUtil.java   |   2 +-
 hudi-integ-test/pom.xml|  24 +-
 .../java/org/apache/hudi/integ/ITTestBase.java |  18 +-
 .../org/apache/hudi/integ/ITTestHoodieDemo.java|  20 +-
 .../org/apache/hudi/integ/ITTestHoodieSanity.java  |  33 +--
 hudi-spark/pom.xml |  29 +-
 hudi-spark/src/test/java/DataSourceUtilsTest.java  |   6 +-
 hudi-spark/src/test/scala/TestDataSource.scala |  21 +-
 .../src/test/scala/TestDataSourceDefaults.scala|  18 +-
 hudi-timeline-service/pom.xml  |  22 +-
 hudi-utilities/pom.xml |  24 ++
 .../transform/TestChainedTransformer.java  |  14 +-
 pom.xml|  34 ++-
 26 files changed, 522 insertions(+), 282 deletions(-)

diff --git a/hudi-cli/pom.xml b/hudi-cli/pom.xml
index 6aecd3b..2f0f4b4 100644
--- a/hudi-cli/pom.xml
+++ b/hudi-cli/pom.xml
@@ -248,5 +248,36 @@
   org.apache.hadoop
   hadoop-hdfs
 
+
+
+
+  org.junit.jupiter
+  junit-jupiter-api
+  test
+
+
+
+  org.junit.jupiter
+  junit-jupiter-engine
+  test
+
+
+
+  org.junit.vintage
+  junit-vintage-engine
+  test
+
+
+
+  org.junit.jupiter
+  junit-jupiter-params
+  test
+
+
+
+  org.mockito
+  mockito-all
+  test
+
   
 
diff --git a/hudi-client/pom.xml b/hudi-client/pom.xml
index 7650555..9c716f2 100644
--- a/hudi-client/pom.xml
+++ b/hudi-client/pom.xml
@@ -240,6 +240,30 @@
 
 
 
+  org.junit.jupiter
+  junit-jupiter-api
+  test
+
+
+
+  org.junit.jupiter
+  junit-jupiter-engine
+  test
+
+
+
+  org.junit.vintage
+  junit-vintage-engine
+  test
+
+
+
+  org.junit.jupiter
+  junit-jupiter-params
+  test
+
+
+
   org.mockito
   mockito-all
   test
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java 
b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
index fa0f1da..063f64e 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
@@ -20,10 +20,10 @@ package org.apache.hudi.metrics;
 
 import org.apache.hudi.config.HoodieWriteConfig;
 
-import org.junit.Test;
+import org.junit.jupiter.api.Test;
 
 import static org.apache.hudi.metrics.Metrics.registerGauge;
-import static org.junit.Assert.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.mockito.Mockito.mock;
 import static org.mockito.Mockito.when;
 
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieMetrics.java 
b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieMetrics.java
index 9716dd6..41842b1 100755
--- a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieMetrics.java
+++ b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieMetrics.java
@@ -22,22 +22,23 @@ import org.apache.hudi.common.model.HoodieCommitMetadata;
 import org.apache.hudi.config.HoodieWriteConfig;
 
 import com.codahale.metrics.Timer;
-import org.junit.Before;
-import org.junit.Test;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.Test;
 
-import java.util.Arrays;
 import java.util.Random;
+import java.util.stream.Stream;
 
 import static org.apache.hudi.metrics.Metrics.registerGauge;
-import static org.junit.Assert.assertEquals;
-import static 

[jira] [Updated] (HUDI-777) Update Deltastreamer param description for --target-table

2020-04-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-777:

Labels: docs newbie pull-request-available  (was: docs newbie)

> Update Deltastreamer param description for --target-table
> -
>
> Key: HUDI-777
> URL: https://issues.apache.org/jira/browse/HUDI-777
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs, newbie
>Reporter: Iftach Schonbaum
>Assignee: Iftach Schonbaum
>Priority: Minor
>  Labels: docs, newbie, pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> --target-table and both of the pair of keys:
> *hoodie.datasource.hive_sync.table*
> *hoodie.datasource.hive_sync.database*
> all need to be configured, i.e. if one of them is dropped there is a 
> complaint to provide it - but according to the docs they aim at the same thing.
> --target-table
>   name of the target table in Hive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] iftachsc opened a new pull request #1519: [HUDI-777] Update Deltastreamer param description for --target-table

2020-04-15 Thread GitBox
iftachsc opened a new pull request #1519: [HUDI-777] Update Deltastreamer param 
description for --target-table
URL: https://github.com/apache/incubator-hudi/pull/1519
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (c7bf608 -> 5803025)

2020-04-15 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard c7bf608  more fixes
 add 5803025  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (c7bf608)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (5803025)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/testsuite/generator/DeltaGenerator.java   |  2 +-
 .../testsuite/reader/DFSHoodieDatasetInputReader.java |  6 +++---
 .../apache/hudi/testsuite/job/TestHoodieTestSuiteJob.java |  2 +-
 .../testsuite/reader/TestDFSHoodieDatasetInputReader.java | 15 ---
 .../hudi-test-suite-config}/complex-workflow-dag-mor.yaml | 12 ++--
 .../src/test/resources/log4j-surefire.properties  |  1 +
 6 files changed, 20 insertions(+), 18 deletions(-)
 copy {docker/demo/config/test-suite => 
hudi-test-suite/src/test/resources/hudi-test-suite-config}/complex-workflow-dag-mor.yaml
 (80%)



[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1515: [HUDI-795] Ignoring missing aux folde

2020-04-15 Thread GitBox
pratyakshsharma commented on a change in pull request #1515: [HUDI-795] 
Ignoring missing aux folde
URL: https://github.com/apache/incubator-hudi/pull/1515#discussion_r408978903
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/HoodieCommitArchiveLog.java
 ##
 @@ -219,14 +220,23 @@ private boolean 
deleteArchivedInstants(List archivedInstants) thr
* @throws IOException in case of error
*/
   private boolean deleteAllInstantsOlderorEqualsInAuxMetaFolder(HoodieInstant 
thresholdInstant) throws IOException {
-List<HoodieInstant> instants = metaClient.scanHoodieInstantsFromFileSystem(
-new Path(metaClient.getMetaAuxiliaryPath()), 
HoodieActiveTimeline.VALID_EXTENSIONS_IN_ACTIVE_TIMELINE, false);
+List<HoodieInstant> instants = null;
+boolean success = true;
+try {
+  instants =
+  metaClient.scanHoodieInstantsFromFileSystem(
+  new Path(metaClient.getMetaAuxiliaryPath()),
+  HoodieActiveTimeline.VALID_EXTENSIONS_IN_ACTIVE_TIMELINE,
+  false);
+} catch (FileNotFoundException e) {
 
 Review comment:
   Can you please provide specific pointers on how the folder gets removed on 
GCS? Is it GCS specific? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1517: Use appropriate FS when loading configs

2020-04-15 Thread GitBox
pratyakshsharma commented on a change in pull request #1517: Use appropriate FS 
when loading configs
URL: https://github.com/apache/incubator-hudi/pull/1517#discussion_r408964846
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -391,7 +391,9 @@ public DeltaSyncService(Config cfg, JavaSparkContext jssc, 
FileSystem fs, HiveCo
   ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != 
Operation.UPSERT,
   "'--filter-dupes' needs to be disabled when '--op' is 'UPSERT' to 
ensure updates are not missed.");
 
-  this.props = properties != null ? properties : 
UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), 
cfg.configs).getConfig();
+  this.props = properties != null ? properties : UtilHelpers.readConfig(
+  FSUtils.getFs(cfg.propsFilePath, jssc.hadoopConfiguration()),
 
 Review comment:
   Let us make it flexible in HoodieMultiTableDeltaStreamer as well? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-15 Thread GitBox
vinothchandar commented on issue #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#issuecomment-614125874
 
 
   Could you please rebase against the latest master and get this to a state where 
only the necessary changes are in this one? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi issue [SUPPORT]

2020-04-15 Thread GitBox
ahmed-elfar edited a comment on issue #1498: Migrating parquet table to hudi 
issue [SUPPORT]
URL: https://github.com/apache/incubator-hudi/issues/1498#issuecomment-613563197
 
 
   @vinothchandar I apologize for the delayed response, and thanks again for 
your help and detailed answers.
   
   > Is it possible to share the data generation tool with us or point us to 
reproducing this ourselves locally? We can go much faster if we are able to 
repro this ourselves..
   
   Sure, this is the public repo for generating the data: 
https://github.com/gregrahn/tpch-kit. It provides the information you need for 
data generation, sizes, etc.
   
   You can use this command to generate lineitem at scale factor 10 (roughly 10 GB):
   `DSS_PATH=/output/path ./dbgen -s 10 -T L`
   
   ---
   
   > Schema for lineitem
   
   Adding more details and updating the schema screenshot mentioned in the previous 
comment (a sketch of a Spark write using these keys follows below):
   
   **RECORDKEY_FIELD_OPT_KEY**: composite (l_linenumber, l_orderkey)
   **PARTITIONPATH_FIELD_OPT_KEY**: optional; default (non-partitioned), or 
l_shipmode
   **PRECOMBINE_FIELD_OPT_KEY**: l_commitdate, or a newly generated timestamp 
column **last_updated**.
   
   ![Screenshot from 2020-04-14 17-54-43](https://user-images.githubusercontent.com/20902425/79246806-f680cc00-7e79-11ea-8edc-bd711d8492ff.png)
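   For reference, a minimal sketch of how these keys could be passed in a Spark datasource write; the paths, table name and the key generator class name are assumptions on my side (the key generator class may differ across Hudi releases), not taken from our actual job:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class LineitemBulkInsertSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("lineitem-bulk-insert").getOrCreate();

    // Source parquet converted from dbgen output; path is illustrative.
    Dataset<Row> lineitem = spark.read().parquet("/data/tpch/sf10/lineitem");

    lineitem.write()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "lineitem_hudi")
        // Composite record key, hence a complex key generator (assumed class name).
        .option("hoodie.datasource.write.recordkey.field", "l_linenumber,l_orderkey")
        .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
        // Either no partitioning or l_shipmode, as described above.
        .option("hoodie.datasource.write.partitionpath.field", "l_shipmode")
        .option("hoodie.datasource.write.precombine.field", "l_commitdate")
        .option("hoodie.datasource.write.operation", "bulk_insert")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/lineitem_hudi");
  }
}
```

   With `bulk_insert` and append mode, each run adds a new commit to the same table path, which is what the partition-by-partition loading described further below relies on.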
   
   
   This is the **official documentation** for the dataset definitions, schema, 
queries and the business logic behind the queries: 
http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.18.0.pdf
   
   ---
   
   >  @bvaradar & @umehrot2 will have the ability to seamlessly bootstrap the 
data into hudi without rewriting in the next release.
   
   Are we talking about the proposal mentioned at 
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+%3A+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi?
   
   We need more clarification regarding this approach.
   
   ---
   
   > you ll also have the ability to do a one-time bulk_insert for last N 
partitions to get the upsert performance benefits as we discussed above
   
   One of the attempts mentioned in the first comment might be similar; I will 
explain it in detail so you can check whether it should work for now, i.e. 
**whether it produces a valid Hudi table**:
   
   So consider an input table of roughly 1 TB in Parquet format, either 
**partitioned** or **non-partitioned**, with Spark resources of 256 GB RAM and 32 
cores:
   
   **Case non-partitioned** 
   
   - We use the suggested/recommended partition column(s) (we pick the first column 
in the partition path), then project this partition column and apply distinct, 
which provides the filter values to pass to the next stage of the pipeline.
   - The next step submits sequential Spark applications, each filtering the 
input data on one of those values, producing a DataFrame for a single 
partition.
   - Write (**bulk-insert**) each filtered DataFrame to the Hudi table with the 
provided partition column(s) using save mode **append**.
   - The Hudi table is thus written partition by partition.
   - Query the Hudi table to check whether it is a valid table; it looks valid.
   
   **Case partitioned**: same as above, but with faster filter operations (a rough 
sketch of the loop follows below). 
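   Purely as a sketch of what I mean, shown as a loop inside a single application for brevity, whereas we actually submit one Spark application per filter value; the column names, paths and Hudi options mirror the earlier illustrative example:

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class PartitionByPartitionLoadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("lineitem-partitioned-load").getOrCreate();
    Dataset<Row> lineitem = spark.read().parquet("/data/tpch/sf10/lineitem");

    // Step 1: distinct values of the chosen partition column drive the loop.
    List<Row> partitions = lineitem.select("l_shipmode").distinct().collectAsList();

    // Steps 2-4: one bulk_insert per partition value, appending into the same Hudi table path.
    for (Row r : partitions) {
      String shipMode = r.getString(0);
      lineitem.filter(lineitem.col("l_shipmode").equalTo(shipMode))
          .write()
          .format("org.apache.hudi")
          .option("hoodie.table.name", "lineitem_hudi")
          .option("hoodie.datasource.write.recordkey.field", "l_linenumber,l_orderkey")
          .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
          .option("hoodie.datasource.write.partitionpath.field", "l_shipmode")
          .option("hoodie.datasource.write.precombine.field", "l_commitdate")
          .option("hoodie.datasource.write.operation", "bulk_insert")
          .mode(SaveMode.Append)
          .save("/tmp/hudi/lineitem_hudi");
    }
  }
}
```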
   
   **Pros**: 
   - Avoids a lot of disk spilling and GC hits. 
   - Uses fewer resources for the initial load.
   
   **Cons**: 
   - No time improvement if you have enough resources to load the table 
at once. 
   - We end up with a partitioned table, which might not be needed in some of 
our use cases.
   
   **Questions**:
   - Is this approach valid, or will it impact upsert operations in the 
future?
   
   
   ---
   
   > I would be happy to jump on a call with you folks and get this moving 
along.. I am also very excited to work with a user like yourself and move the 
perf aspects of the project along more..
   
   We are excited as well to have a call; please let me know how we can 
proceed with scheduling the meeting.
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (a7a3655 -> c7bf608)

2020-04-15 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard a7a3655  more fixes
 add c7bf608  more fixes

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (a7a3655)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (c7bf608)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/hive/HiveSyncTool.java |  6 +-
 .../hudi/testsuite/dag/SimpleWorkflowDagGenerator.java   |  2 +-
 .../apache/hudi/testsuite/dag/HiveSyncDagGenerator.java  |  2 +-
 .../hudi/testsuite/job/TestHoodieTestSuiteJob.java   |  2 +-
 .../hudi-test-suite-config/complex-workflow-dag-cow.yaml | 16 
 5 files changed, 16 insertions(+), 12 deletions(-)