[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
codecov-io edited a comment on issue #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600535972
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=h1) 
Report
   > Merging 
[#1417](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/0a4902eccece1df959946fcb7379a94fc5fe0784=desc)
 will **increase** coverage by `0.05%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1417/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #1417      +/-   ##
   =============================================
   + Coverage     67.73%    67.78%     +0.05%
   - Complexity      243       245         +2
   =============================================
     Files           338       338
     Lines         16383     16386         +3
     Branches       1675      1677         +2
   =============================================
   + Hits          11097     11108        +11
   + Misses         4546      4539         -7
   + Partials        740       739         -1
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/utilities/HoodieSnapshotExporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1417/diff?src=pr=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0hvb2RpZVNuYXBzaG90RXhwb3J0ZXIuamF2YQ==)
 | `83.49% <0.00%> (+6.49%)` | `22.00% <0.00%> (+2.00%)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1417/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `75.00% <0.00%> (+50.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=footer).
 Last update 
[0a4902e...1f0f81f](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services



Build failed in Jenkins: hudi-snapshot-deployment-0.5 #221

2020-03-18 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394767796
 
 

 ##
 File path: .travis.yml
 ##
 @@ -0,0 +1,40 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi 
https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
 
 Review comment:
   Double checked: it always says "Switched to a new branch 'pr'", which means 
that Travis doesn't cache it.
   
   https://www.travis-ci.org/github/lamber-ken/hdocs/builds/664223886
   https://www.travis-ci.org/github/lamber-ken/hdocs/builds/664223922
   ```
   $ git checkout -b pr
   Switched to a new branch 'pr'
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-723) SqlTransformer's schema sometimes is not registered.

2020-03-18 Thread Alexander Filipchik (Jira)
Alexander Filipchik created HUDI-723:


 Summary: SqlTransformer's schema sometimes is not registered. 
 Key: HUDI-723
 URL: https://issues.apache.org/jira/browse/HUDI-723
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: DeltaStreamer
Reporter: Alexander Filipchik
 Fix For: 0.6.0


If the schema is inferred from RowBasedSchemaProvider when the SQL transformer is 
used, it also needs to be registered.

 

The current approach only works if the SchemaProvider has a valid target schema. If 
one wants to use the schema from the SQL transformation, the result of 
RowBasedSchemaProvider.getTargetSchema needs to be passed into something like:
{code:java}
private void setupWriteClient(SchemaProvider schemaProvider) {
  LOG.info("Setting up Hoodie Write Client");
  registerAvroSchemas(schemaProvider);
  HoodieWriteConfig hoodieCfg = getHoodieClientConfig(schemaProvider);
  writeClient = new HoodieWriteClient<>(jssc, hoodieCfg, true);
  onInitializingHoodieWriteClient.apply(writeClient);
}
{code}
The existing method will not work, because it checks for:
{code:java}
if ((null != schemaProvider) && (null == writeClient)) {
{code}
and writeClient is already configured at that point.
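
For illustration, a hedged sketch of one way the fix could look. It assumes the 
DeltaSync-style fields and helpers from the snippet above (SchemaProvider, 
registerAvroSchemas, writeClient) and is not the actual Hudi patch; the method 
name here is made up:
{code:java}
// Hedged sketch only, not the actual Hudi change: register the schema inferred
// from the SQL transformation (e.g. RowBasedSchemaProvider.getTargetSchema)
// even though writeClient has already been configured.
private void registerInferredSchema(SchemaProvider inferredSchemaProvider) {
  if (inferredSchemaProvider != null && inferredSchemaProvider.getTargetSchema() != null) {
    // Same helper that setupWriteClient uses above; the guard no longer
    // requires writeClient to be null.
    registerAvroSchemas(inferredSchemaProvider);
  }
}
{code}
The key difference from the existing check is that registration no longer depends 
on writeClient being null.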

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Udit Mehrotra (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062247#comment-17062247
 ] 

Udit Mehrotra commented on HUDI-721:


This could be related to [https://github.com/apache/incubator-hudi/pull/1406], 
where we are fixing an issue with arrays of structs as well as map types. So if 
you are using complex types, this could possibly be the cause. That said, I have 
only ever seen this error thrown from Hudi code, not from Spark. You may want to 
pull in that patch and retry. If it still doesn't work, a short reproduction 
would help; I would be happy to take a look.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blames it on Spark parquet to avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in the: AvroConversionHelper.
> What happens: when complexes type is extracted using SqlTransformer (using 
> select bla fro ) where bla is complex type with arrays of struct, Kryo 
> serialization breaks with :
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> 

[jira] [Updated] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-18 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-722:
-
Description: 
Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
range: X to X inside MessageColumnIORecordConsumer.addBinary call.

Specifically: getColumnWriter().write(value, r[currentLevel], 
currentColumnIO.getDefinitionLevel());

fails as size of r is the same as current level. What can be causing it?

 

It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
unions present).

But what is surprising is that it fails to write top level field: 
PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
"_hoodie_commit_seqno": "20200317215711_0_650",

  was:
Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
range: X to X inside MessageColumnIORecordConsumer.addBinary call.

Specifically: getColumnWriter().write(value, r[currentLevel], 
currentColumnIO.getDefinitionLevel());

fails as size of r is the same as current level. What can be causing it?

 

It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
1.10.1 Avro is a very complex object (~2.5k columns, highly nested).

But what is surprising is that it fails to write top level field: 
PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
"_hoodie_commit_seqno": "20200317215711_0_650",


> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException : Invalid array 
> range: X to X inside MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails as size of r is the same as current level. What can be causing it?
>  
> It gets executed via: ParquetWriter.write(IndexedRecord) Library version: 
> 1.10.1 Avro is a very complex object (~2.5k columns, highly nested, arrays of 
> unions present).
> But what is surprising is that it fails to write top level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top level field in Avro: {"_hoodie_commit_time": "20200317215711", 
> "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2020-03-18 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062242#comment-17062242
 ] 

liujinhui commented on HUDI-648:


Hello, I also encountered this problem recently. Occasionally the Kafka data 
received by Hudi is bad, consumption reports an error, and the bad records need 
to be skipped. It seems that they cannot be skipped at the moment. Do you have a 
solution for this?

[~vinoth]

> Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction 
> writes
> 
>
> Key: HUDI-648
> URL: https://issues.apache.org/jira/browse/HUDI-648
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> We would like a way to hand the erroring records from writing or compaction 
> back to the users, in a separate table or log. This needs to work generically 
> across all the different writer paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on issue #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600954141
 
 
   > @lamber-ken lets then add the note in a separate PR first.. then proceed 
to merge this?
   
   Hi, during the testing phase, the build result is pushed to the `test-content` folder.
   So, IMO, we can test the whole flow first.
   
   
![image](https://user-images.githubusercontent.com/20113411/77025334-c86bb180-69cb-11ea-8d5d-e5085a491209.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2020-03-18 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-718:
-
Fix Version/s: 0.6.0

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Dataset was created using hudi 0.5 and now trying to migrate it to the latest 
> master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-721:
-
Component/s: Common Core

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blames it on Spark parquet to avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in the: AvroConversionHelper.
> What happens: when complexes type is extracted using SqlTransformer (using 
> select bla fro ) where bla is complex type with arrays of struct, Kryo 
> serialization breaks with :
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> 

[jira] [Updated] (HUDI-719) Exception during clean phase: Found org.apache.hudi.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan

2020-03-18 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-719:
-
Fix Version/s: 0.6.0

> Exception during clean phase: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan
> --
>
> Key: HUDI-719
> URL: https://issues.apache.org/jira/browse/HUDI-719
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> Dataset is written using 0.5 moving to the latest master:
>  
> Exception in thread "main" org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
>  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>  at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
>  at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
>  at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
>  at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>  at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>  at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>  at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:397)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-03-18 Thread Alexander Filipchik (Jira)
Alexander Filipchik created HUDI-722:


 Summary: IndexOutOfBoundsException in 
MessageColumnIORecordConsumer.addBinary when writing parquet
 Key: HUDI-722
 URL: https://issues.apache.org/jira/browse/HUDI-722
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Writer Core
Reporter: Alexander Filipchik
 Fix For: 0.6.0


Some writes fail with java.lang.IndexOutOfBoundsException: Invalid array 
range: X to X inside the MessageColumnIORecordConsumer.addBinary call.

Specifically:

getColumnWriter().write(value, r[currentLevel], currentColumnIO.getDefinitionLevel());

fails because the size of r is the same as the current level. What could be 
causing it?

 

It gets executed via ParquetWriter.write(IndexedRecord). Library version: 
1.10.1. The Avro record is a very complex object (~2.5k columns, highly nested).

What is surprising is that it fails on a top-level field, 
PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time], which is 
the first top-level field in the Avro record: {"_hoodie_commit_time": "20200317215711", 
"_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-721:

Fix Version/s: 0.6.0

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blames it on Spark parquet to avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in the: AvroConversionHelper.
> What happens: when complexes type is extracted using SqlTransformer (using 
> select bla fro ) where bla is complex type with arrays of struct, Kryo 
> serialization breaks with :
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
>   at 
> 

[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062237#comment-17062237
 ] 

Vinoth Chandar commented on HUDI-721:
-

cc [~uditme], would love to get your thoughts on this as well.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.0
>
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blames it on Spark parquet to avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in the: AvroConversionHelper.
> What happens: when complexes type is extracted using SqlTransformer (using 
> select bla fro ) where bla is complex type with arrays of struct, Kryo 
> serialization breaks with :
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on issue #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600912161
 
 
   @lamber-ken let's add the note in a separate PR first, then proceed to 
merge this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062144#comment-17062144
 ] 

Alexander Filipchik commented on HUDI-721:
--

[~vbalaji] ^^^

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>
> hi,
> was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blames it on Spark parquet to avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in the: AvroConversionHelper.
> What happens: when complexes type is extracted using SqlTransformer (using 
> select bla fro ) where bla is complex type with arrays of struct, Kryo 
> serialization breaks with :
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
>   

[GitHub] [incubator-hudi] yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600900820
 
 
   @vinothchandar Replied to you on the ML.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
yanghua commented on issue #1412: [HUDI-504] Restructuring and auto-generation 
of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600890300
 
 
   > May be we add a small note somewhere that 0.5.2 is unreleased? 
   
   +1 to have this note
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-721:
-
Description: 
Hi,

I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema 
generator (the convertStructTypeToAvroSchema method), but after some debugging 
I'm pretty sure the issue is somewhere in AvroConversionHelper.

What happens: when a complex type is extracted using SqlTransformer (using 
select bla fro ) where bla is a complex type with arrays of structs, Kryo 
serialization breaks with:

 
{code:java}
28701 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler  
- ResultStage 1 (isEmpty at DeltaSync.java:337) failed in 12.146 s due to Job 
aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
org.apache.avro.UnresolvedUnionException: Not in union 
at 
org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
at 
org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
at 
org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
at 
org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:351)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:456)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 

[jira] [Comment Edited] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062025#comment-17062025
 ] 

Alexander Filipchik edited comment on HUDI-721 at 3/18/20, 8:27 PM:


Also, putting the old code in AvroConversionHelper solves the issue, but ONLY when 
the Databricks SchemaConverters is used. Using the Spark one does not work.
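
To make the data shape concrete, here is a minimal sketch (illustrative only, not Hudi code; it assumes Spark 2.4.x with the external spark-avro module on the classpath, pasted into spark-shell) of a column holding an array of structs and the StructType-to-Avro-schema conversion path being discussed:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.avro.SchemaConverters

val spark = SparkSession.builder().appName("hudi-721-sketch").master("local[1]").getOrCreate()
import spark.implicits._

// A column that is an array of structs, i.e. the "complex type with arrays of structs"
// described in the issue.
case class Item(id: Long, tag: String)
case class Record(name: String, items: Seq[Item])

val df = Seq(Record("a", Seq(Item(1L, "x"), Item(2L, "y")))).toDF()
df.createOrReplaceTempView("src")

// SqlTransformer-style projection of the complex column.
val projected = spark.sql("select name, items from src")

// StructType -> Avro schema conversion using Spark's own converter. Nested nullable fields
// become unions with null; the reported UnresolvedUnionException surfaces when a record
// written against such a schema does not match any branch of the union.
val avroSchema = SchemaConverters.toAvroType(projected.schema, nullable = false)
println(avroSchema.toString(true))
{code}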


was (Author: afilipchik):
Also, putting the old code in AvroConversionHelper solves the issue, even when the new 
Spark SchemaConverters is used.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>
> hi,
> I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in AvroConversionHelper.
>  
> What happens: when a complex type is extracted using SqlTransformer (using 
> select bla from ) where bla is a complex type with arrays of structs, Kryo 
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)

[jira] [Comment Edited] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062025#comment-17062025
 ] 

Alexander Filipchik edited comment on HUDI-721 at 3/18/20, 8:11 PM:


Also, putting the old code in AvroConversionHelper solves the issue, even when the new 
Spark SchemaConverters is used.


was (Author: afilipchik):
Also, putting the old code in AvroConversionHelper solves the issue.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>
> hi,
> I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in AvroConversionHelper.
>  
> What happens: when a complex type is extracted using SqlTransformer (using 
> select bla from ) where bla is a complex type with arrays of structs, Kryo 
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> 

[jira] [Commented] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062025#comment-17062025
 ] 

Alexander Filipchik commented on HUDI-721:
--

Also, putting the old code in AvroConversionHelper solves the issue.

> AvroConversionUtils is broken for complex types in 0.6
> --
>
> Key: HUDI-721
> URL: https://issues.apache.org/jira/browse/HUDI-721
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>
> hi,
> I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
> AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema 
> generator (convertStructTypeToAvroSchema method), but after some debugging 
> I'm pretty sure the issue is somewhere in AvroConversionHelper.
>  
> What happens: when a complex type is extracted using SqlTransformer (using 
> select bla from ) where bla is a complex type with arrays of structs, Kryo 
> serialization breaks with:
>  
> {code:java}
> 28701 [dag-scheduler-event-loop] INFO  
> org.apache.spark.scheduler.DAGScheduler  - ResultStage 1 (isEmpty at 
> DeltaSync.java:337) failed in 12.146 s due to Job aborted due to stage 
> failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 
> 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
> org.apache.avro.UnresolvedUnionException: Not in union 
>   at 
> org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
>   at 
> org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
>   at 
> org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
>   at 
> org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
>   at 
> org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
>   at 
> 

[jira] [Created] (HUDI-721) AvroConversionUtils is broken for complex types in 0.6

2020-03-18 Thread Alexander Filipchik (Jira)
Alexander Filipchik created HUDI-721:


 Summary: AvroConversionUtils is broken for complex types in 0.6
 Key: HUDI-721
 URL: https://issues.apache.org/jira/browse/HUDI-721
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Alexander Filipchik


hi,

I was working on the upgrade from 0.5 to 0.6 and hit a bug in 
AvroConversionUtils. I originally blamed it on the Spark parquet-to-avro schema 
generator (convertStructTypeToAvroSchema method), but after some debugging I'm 
pretty sure the issue is somewhere in AvroConversionHelper.

 

What happens: when a complex type is extracted using SqlTransformer (using 
select bla from ) where bla is a complex type with arrays of structs, Kryo 
serialization breaks with:

 
{code:java}
28701 [dag-scheduler-event-loop] INFO  org.apache.spark.scheduler.DAGScheduler  
- ResultStage 1 (isEmpty at DeltaSync.java:337) failed in 12.146 s due to Job 
aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent 
failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): 
org.apache.avro.UnresolvedUnionException: Not in union 
at 
org.apache.avro.generic.GenericData.resolveUnion(GenericData.java:740)
at 
org.apache.avro.generic.GenericDatumWriter.resolveUnion(GenericDatumWriter.java:205)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:123)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:192)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:120)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:125)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:166)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:156)
at 
org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:118)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:75)
at 
org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:62)
at 
org.apache.spark.serializer.GenericAvroSerializer.serializeDatum(GenericAvroSerializer.scala:125)
at 
org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:159)
at 
org.apache.spark.serializer.GenericAvroSerializer.write(GenericAvroSerializer.scala:47)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:361)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:302)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:351)
at 

[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394555849
 
 

 ##
 File path: .travis.yml
 ##
 @@ -0,0 +1,40 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi 
https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
 
 Review comment:
   It's too late today, will double check it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061977#comment-17061977
 ] 

Balaji Varadarajan commented on HUDI-716:
-

Spoke with Alex to debug this issue. It has to do with a failed clean operation 
before the upgrade to 0.5.[1,2]. Here is the context:

```

Before 0.5.1, Hudi handled the clean action differently from the commit action. If a 
clean failed during the final step, there would be .clean.inflight files (empty or 
corrupted) lying around, but pre-0.5.1 code did not care. From 0.5.1 onwards, we 
handle the clean action consistently, like commit: Hudi first scans all files 
and stores the plan (atomically) in .clean.requested and .clean.inflight before 
triggering the actual clean. If there are any intermittent failures, a subsequent 
clean reads the cleaner plan again and retries.

 

In this case, the cleaner is reading an empty .clean.inflight file and looking for 
the cleaner plan, but the file is empty.

```

One workaround would be to catch the exception in the section below and delete the 
pending clean:

```

// If there are inflight(failed) or previously requested clean operation, first 
perform them
table.getCleanTimeline().filterInflightsAndRequested().getInstants().forEach(hoodieInstant
 -> {
 LOG.info("There were previously unfinished cleaner operations. Finishing 
Instant=" + hoodieInstant);
 runClean(table, hoodieInstant);
});

```
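
The suggestion above is a code-level fix (catch the exception around runClean and delete the pending clean). As a complementary, manual check — a sketch only, assuming the table metadata lives under basePath/.hoodie and pending cleans are named <instant>.clean.requested / <instant>.clean.inflight as described above — the following spark-shell snippet lists the candidates so the empty/corrupted one can be found and removed:

```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Adjust to the affected table.
val basePath = "file:///tmp/hudi_mor_table"
val metaDir  = new Path(basePath + "/.hoodie")
val fs = FileSystem.get(metaDir.toUri, new Configuration())

// Zero-length .clean.requested / .clean.inflight files are the suspects: their cleaner plan
// cannot be deserialized ("Not an Avro data file").
fs.listStatus(metaDir)
  .filter(s => s.getPath.getName.endsWith(".clean.requested") || s.getPath.getName.endsWith(".clean.inflight"))
  .foreach(s => println(s"${s.getPath.getName} -> ${s.getLen} bytes"))
```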

 

 

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394553164
 
 

 ##
 File path: .travis.yml
 ##
 @@ -0,0 +1,40 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi 
https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
 
 Review comment:
   Okay, just to be safe, do you want to account for that scenario? We can 
also file a follow-up JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394540086
 
 

 ##
 File path: .travis.yml
 ##
 @@ -0,0 +1,40 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi 
https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
 
 Review comment:
   We didn't specify the cache config like below, so IMO, Travis will not 
cache it.
   
   ```
   cache:
 directories:
   - "$HOME/.m2"
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394536044
 
 

 ##
 File path: docs/_includes/nav_list
 ##
 @@ -18,20 +18,19 @@
 {% assign menu_label = "文档菜单" %}
 {% endif %}
 {% elsif page.version == "0.5.1" %}
-{% assign navigation = site.data.navigation["0.5.1_docs"] %}
+{% assign navigation = site.data.navigation["0.5.1_docs"] %}
 
 Review comment:
   Here is a bug; it'll cause the build to fail.
   
   
![image](https://user-images.githubusercontent.com/20113411/76991189-22e11f80-6984-11ea-845b-1a697c8bc391.png)
   
   Update
   
   
![image](https://user-images.githubusercontent.com/20113411/76991411-83705c80-6984-11ea-8bc9-ab052e1f009b.png)
   
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
lamber-ken commented on issue #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600773995
 
 
   > @lamber-ken neat. so it will build and not push, for PRs.
   > 
   > Once landed, it will build again and republish?
   
   Right


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on issue #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600772516
 
 
   Minor comments only. This approach seems safe to me. 
   cc @yanghua, since we are in the middle of 0.5.2: if we land this, then 
the docs will get published? Maybe we add a small note somewhere that 0.5.2 is 
unreleased? (Or maybe it's not a big deal.) I'll let vino take the call.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394530282
 
 

 ##
 File path: docs/_includes/nav_list
 ##
 @@ -18,20 +18,19 @@
 {% assign menu_label = "文档菜单" %}
 {% endif %}
 {% elsif page.version == "0.5.1" %}
-{% assign navigation = site.data.navigation["0.5.1_docs"] %}
+{% assign navigation = site.data.navigation["0.5.1_docs"] %}
 
 Review comment:
   why these changes? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on a change in pull request #1412: [HUDI-504] 
Restructuring and auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#discussion_r394529953
 
 

 ##
 File path: .travis.yml
 ##
 @@ -0,0 +1,40 @@
+language: ruby
+rvm:
+  - 2.6.3
+
+env:
+  global:
+- GIT_USER="CI BOT"
+- GIT_EMAIL="ci...@hudi.apache.org"
+- GIT_REPO="apache"
+- GIT_PROJECT="incubator-hudi"
+- GIT_BRANCH="asf-site"
+- DOCS_ROOT="`pwd`/docs"
+
+before_install:
+  - git config --global user.name ${GIT_USER}
+  - git config --global user.email ${GIT_EMAIL}
+  - git remote add hudi 
https://${GIT_TOKEN}@github.com/${GIT_REPO}/${GIT_PROJECT}.git
+  - git checkout -b pr
 
 Review comment:
   Is there any chance that this branch already exists? (e.g. Travis caching 
the git clone, etc.?)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1412: [HUDI-504] Restructuring and auto-generation of docs

2020-03-18 Thread GitBox
vinothchandar commented on issue #1412: [HUDI-504] Restructuring and 
auto-generation of docs
URL: https://github.com/apache/incubator-hudi/pull/1412#issuecomment-600770553
 
 
   @lamber-ken neat. so it will build and not push, for PRs.
   
   Once landed, it will build again and republish? 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1414: [HUDI-437] Add user-defined index config

2020-03-18 Thread GitBox
vinothchandar commented on a change in pull request #1414: [HUDI-437] Add 
user-defined index config
URL: https://github.com/apache/incubator-hudi/pull/1414#discussion_r394524721
 
 

 ##
 File path: docs/configurations.cn.md
 ##
 @@ -223,7 +223,11 @@ Following configs control indexing behavior, which tags 
incoming records as eith
 
 [withIndexConfig](#withIndexConfig) (HoodieIndexConfig) 
 This is pluggable to have a external index (HBase) or 
use the default bloom filter stored in the Parquet files
-
+
+# withIndexClass(indexClass = "x.y.z.UserDefinedIndex") {#withIndexClass}
+Property: `hoodie.index.class` 
+Full path of user-defined index class and must 
extends HoodieIndex class. It will take precedence over the `hoodie.index.type` 
configuration if specified
 
 Review comment:
   must extend?
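   
   For context, a minimal sketch of how the new option would be wired in from the datasource write path (the class name is the doc's own placeholder, not a real implementation; `df`, `tableName`, and `basePath` are assumed to be defined as in the other snippets in this thread, and forwarding of arbitrary hoodie.* options into the write config is assumed):
   
   ```
   // Sketch only: "x.y.z.UserDefinedIndex" is the placeholder from the doc.
   df.write.format("org.apache.hudi").
     option("hoodie.index.class", "x.y.z.UserDefinedIndex").   // takes precedence over hoodie.index.type
     option("hoodie.datasource.write.recordkey.field", "name").
     option("hoodie.datasource.write.precombine.field", "ts").
     option("hoodie.table.name", tableName).
     mode("Append").
     save(basePath)
   ```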


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061933#comment-17061933
 ] 

Vinoth Chandar commented on HUDI-716:
-

I think it happens only when there is an inflight .clean file and the upgrade 
happens? 
[~vbalaji] 

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061925#comment-17061925
 ] 

lamber-ken edited comment on HUDI-716 at 3/18/20, 5:29 PM:
---

I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))
for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}
 

*Step2: upgrade to hudi 0.5.1*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))

for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}
 

*Step3: upgrade to hudi master*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
--driver-memory 6G \
--packages org.apache.spark:spark-avro_2.11:2.4.4 \
--jars `ls 
packaging/hudi-spark-bundle/target/hudi-spark-bundle_*.*-*.*.*-SNAPSHOT.jar` \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))

for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
 

[jira] [Comment Edited] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061925#comment-17061925
 ] 

lamber-ken edited comment on HUDI-716 at 3/18/20, 5:21 PM:
---

I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))
for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}
 

*Step2: upgrade to hudi 0.5.1*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))

for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}


was (Author: lamber-ken):
I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate old data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 

[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061925#comment-17061925
 ] 

lamber-ken commented on HUDI-716:
-

I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate old data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))
for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> 

[jira] [Comment Edited] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061925#comment-17061925
 ] 

lamber-ken edited comment on HUDI-716 at 3/18/20, 5:18 PM:
---

I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate old data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))
for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}


was (Author: lamber-ken):
I tried to reproduce it, but it works ok.

*Step1: Use hudi 0.5.0 to generate old data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.4.4-bin-hadoop2.7
${SPARK_HOME}/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle:0.5.0-incubating \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'

val tableName = "hudi_mor_table"
val basePath = "file:///tmp/hudi_mor_table"
var datas = List("""{ "name": "kenken", "ts": "qwer", "age": 12, "location": 
"latitude"}""")

val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
mode("Overwrite").
save(basePath)

var datas = List.tabulate(30)(i => List(s"""{ "name": "kenken${i}", "ts": 
"zasz", "age": 123, "location": "latitude"}"""))
for (data <- datas) {
  val df = spark.read.json(spark.sparkContext.parallelize(data, 2))
  df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", tableName).
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
mode("Append").
save(basePath)
}

spark.read.format("org.apache.hudi").load(basePath + "/*/").show()
{code}

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>

[jira] [Resolved] (HUDI-344) Hudi Dataset Snapshot Exporter

2020-03-18 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-344.
-
Resolution: Implemented

The main features were implemented. Some future improvements or fixes are linked 
in the related issues, hence resolving this ticket.

> Hudi Dataset Snapshot Exporter
> --
>
> Key: HUDI-344
> URL: https://issues.apache.org/jira/browse/HUDI-344
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Utilities
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> A dataset exporter tool for snapshotting. See 
> [RFC-9|https://cwiki.apache.org/confluence/display/HUDI/RFC-9%3A+Hudi+Dataset+Snapshot+Exporter]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061888#comment-17061888
 ] 

lamber-ken commented on HUDI-716:
-

[~vinoth] willing to take it. :) 

 

Hi [~afilipchik], waiting for you; we need more detailed information, as Balaji 
mentioned. Thanks.

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  
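
For context, the "Not an Avro data file" in the trace above is Avro's magic-byte
check: DataFileReader rejects any file that does not start with the Avro header,
e.g. a zero-length or otherwise non-Avro cleaner metadata file. A minimal
standalone sketch (not Hudi code; the file name is made up) that surfaces the same
message:
{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class NotAvroRepro {
  public static void main(String[] args) throws IOException {
    // an empty file has no Avro magic bytes, similar to an empty or
    // non-Avro clean metadata file under .hoodie
    File empty = File.createTempFile("20200117010101", ".clean.requested");
    try {
      DataFileReader.openReader(empty, new GenericDatumReader<GenericRecord>());
    } catch (IOException e) {
      System.out.println(e.getMessage()); // prints: Not an Avro data file
    }
  }
}
{code}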



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-03-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-716:

Status: Open  (was: New)

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.6.0
>
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync 
> once. Shutting down20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error 
> running delta sync once. Shutting 
> downorg.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 month old) and crashing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600719966
 
 
   Please see my response on the mailing list. I think we should file a LEGAL 
JIRA and get this sorted out.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua edited a comment on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua edited a comment on issue #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600692484
 
 
   > are we just collecting the NOTICE files for all dependencies? is that the 
intention?
   
   @vinothchandar  No, IMO, it is not about dependencies. It is about the fact that 
we have declared "This product includes code from Apache xxx" in our LICENSE 
files. It seems those "xxx" are what Justin meant by "bundled several ASF projects".
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600692484
 
 
   > are we just collecting the NOTICE files for all dependencies? is that the 
intention?
   
   No, IMO, it is not about dependencies. It is about the fact that we have declared 
"This product includes code from Apache xxx" in our LICENSE files. It seems 
those "xxx" are what Justin meant by "bundled several ASF projects".
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
vinothchandar commented on issue #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600685735
 
 
   are we just collecting the NOTICE files for all dependencies? is that the 
intention?  
   
   We already have the NOTICE appender transformer for the shade plugin; I was 
midway into understanding whether it works and whether or not it's correct. That 
should give us a concatenated NOTICE file from all dependencies.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1106: [HUDI-209] Implement JMX metrics reporter

2020-03-18 Thread GitBox
codecov-io edited a comment on issue #1106: [HUDI-209] Implement JMX metrics 
reporter
URL: https://github.com/apache/incubator-hudi/pull/1106#issuecomment-593074963
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=h1) 
Report
   > Merging 
[#1106](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/779edc068865898049569da0fe750574f93a0dca=desc)
 will **decrease** coverage by `0.37%`.
   > The diff coverage is `3.66%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1106/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1106  +/-   ##
   
   - Coverage 67.78%   67.41%   -0.38% 
 Complexity  245  245  
   
 Files   338  339   +1 
 Lines 1638616474  +88 
 Branches   1677 1682   +5 
   
   - Hits  1110811106   -2 
   - Misses 4539 4630  +91 
   + Partials739  738   -1 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `83.91% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `40.00% <0.00%> (-35.00%)` | `0.00 <0.00> (ø)` | |
   | 
[...va/org/apache/hudi/metrics/JmxMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9KbXhNZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...ava/org/apache/hudi/metrics/JmxReporterServer.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9KbXhSZXBvcnRlclNlcnZlci5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...g/apache/hudi/metrics/MetricsGraphiteReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzR3JhcGhpdGVSZXBvcnRlci5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `100.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...rg/apache/hudi/metrics/MetricsReporterFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXJGYWN0b3J5LmphdmE=)
 | `46.15% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...va/org/apache/hudi/config/HoodieMetricsConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZU1ldHJpY3NDb25maWcuamF2YQ==)
 | `52.77% <42.85%> (-2.07%)` | `0.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/metrics/Metrics.java](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzLmphdmE=)
 | `56.75% <50.00%> (-11.82%)` | `0.00 <0.00> (ø)` | |
   | ... and [2 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1106/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=footer).
 Last update 
[779edc0...183cbdd](https://codecov.io/gh/apache/incubator-hudi/pull/1106?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and 

[GitHub] [incubator-hudi] yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600659531
 
 
   @shaofengshi Can you help to review this PR? thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-03-18 Thread GitBox
s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r394359235
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/NumericUtils.java
 ##
 @@ -31,4 +38,27 @@ public static String humanReadableByteCount(double bytes) {
 String pre = "KMGTPE".charAt(exp - 1) + "";
 return String.format("%.1f %sB", bytes / Math.pow(1024, exp), pre);
   }
+
+  public static long getMessageDigestHash(final String algorithmName, final 
String string) {
+MessageDigest md;
+try {
+  md = MessageDigest.getInstance(algorithmName);
+} catch (NoSuchAlgorithmException e) {
+  throw new HoodieException(e);
+}
+return 
asLong(Objects.requireNonNull(md).digest(string.getBytes(StandardCharsets.UTF_8)));
+  }
+
+  public static long asLong(byte[] bytes) {
+ValidationUtils.checkState(bytes.length >= 8, "HashCode#asLong() requires 
>= 8 bytes.");
+return padToLong(bytes);
+  }
+
+  public static long padToLong(byte[] bytes) {
+long retVal = (bytes[0] & 0xFF);
 
 Review comment:
   @vinothchandar this is a nitpick comment. There is nothing wrong with 0xFF, but 
it is not really needed, right?
   I may be wrong here, but my understanding is that 
   `long retVal = bytes[0]`
   and 
   `long retVal = (bytes[0] & 0xFF)` are the same.
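
   For what it's worth, a quick standalone check (a sketch, not from this review 
thread) suggests the two only agree for non-negative bytes: widening a `byte` to 
`long` sign-extends, while `& 0xFF` keeps just the low eight bits.
   ```
   public class MaskCheck {
     public static void main(String[] args) {
       byte b = (byte) 0x9C;          // -100 as a signed byte
       long withoutMask = b;          // sign-extended: -100 (0xFFFFFFFFFFFFFF9C)
       long withMask = (b & 0xFF);    // zero-extended: 156 (0x000000000000009C)
       System.out.println(withoutMask + " vs " + withMask);
     }
   }
   ```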


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-03-18 Thread GitBox
s-sanjay commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r392642073
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/NumericUtils.java
 ##
 @@ -31,4 +38,27 @@ public static String humanReadableByteCount(double bytes) {
 String pre = "KMGTPE".charAt(exp - 1) + "";
 return String.format("%.1f %sB", bytes / Math.pow(1024, exp), pre);
   }
+
+  public static long getMessageDigestHash(final String algorithmName, final 
String string) {
+MessageDigest md;
+try {
+  md = MessageDigest.getInstance(algorithmName);
+} catch (NoSuchAlgorithmException e) {
+  throw new HoodieException(e);
+}
+return 
asLong(Objects.requireNonNull(md).digest(string.getBytes(StandardCharsets.UTF_8)));
+  }
+
+  public static long asLong(byte[] bytes) {
+ValidationUtils.checkState(bytes.length >= 8, "HashCode#asLong() requires 
>= 8 bytes.");
+return padToLong(bytes);
+  }
+
+  public static long padToLong(byte[] bytes) {
+long retVal = (bytes[0] & 0xFF);
+for (int i = 1; i < Math.min(bytes.length, 8); i++) {
 
 Review comment:
   Wondering, instead of making this public, if we can make it private and then 
test the asLong method (a quick test sketch follows the snippet below).
   
   Also, would it help readability if we unroll the for loop like this?
   ```
   byte[] padded = Arrays.copyOf(bytes, 8);
   // mask each byte and widen to long before shifting, so negative bytes don't
   // sign-extend and shifts of 32 bits or more aren't truncated to int shifts
   long retVal = (padded[0] & 0xFFL);
   retVal |= (padded[1] & 0xFFL) << 8;
   retVal |= (padded[2] & 0xFFL) << 16;
   retVal |= (padded[3] & 0xFFL) << 24;
   retVal |= (padded[4] & 0xFFL) << 32;
   retVal |= (padded[5] & 0xFFL) << 40;
   retVal |= (padded[6] & 0xFFL) << 48;
   retVal |= (padded[7] & 0xFFL) << 56;
   ```
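
   A possible shape for such a test (a sketch only; the class name and expected 
value are not from this PR, and the expected value assumes byte i lands in bits 
i*8..i*8+7, as in Guava's `HashCode#asLong`):
   ```
   import static org.junit.Assert.assertEquals;

   import org.apache.hudi.common.util.NumericUtils;
   import org.junit.Test;

   public class TestNumericUtilsAsLong {

     @Test
     public void asLongPacksFirstEightBytesAsUnsignedLittleEndian() {
       // bytes beyond the first eight should not change the result
       byte[] bytes = new byte[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
       assertEquals(0x0807060504030201L, NumericUtils.asLong(bytes));
     }
   }
   ```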


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf commented on issue #1106: [HUDI-209] Implement JMX metrics reporter

2020-03-18 Thread GitBox
leesf commented on issue #1106: [HUDI-209] Implement JMX metrics reporter
URL: https://github.com/apache/incubator-hudi/pull/1106#issuecomment-600569126
 
 
   @XuQianJin-Stars Please rebase to the master and I think we would be home


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-344] Add partitioner param to Exporter (#1405)

2020-03-18 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 779edc0  [HUDI-344] Add partitioner param to Exporter (#1405)
779edc0 is described below

commit 779edc068865898049569da0fe750574f93a0dca
Author: Raymond Xu <2701446+xushi...@users.noreply.github.com>
AuthorDate: Wed Mar 18 04:24:04 2020 -0700

[HUDI-344] Add partitioner param to Exporter (#1405)
---
 .../hudi/utilities/HoodieSnapshotExporter.java | 126 +
 .../hudi/utilities/TestHoodieSnapshotExporter.java | 110 --
 2 files changed, 178 insertions(+), 58 deletions(-)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
index b58b5d3..c39daa7 100644
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
@@ -18,16 +18,9 @@
 
 package org.apache.hudi.utilities;
 
-import com.beust.jcommander.JCommander;
-import com.beust.jcommander.Parameter;
-
-import org.apache.hadoop.fs.FileStatus;
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.fs.FileUtil;
-import org.apache.hadoop.fs.Path;
-import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.SerializableConfiguration;
 import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.HoodieTimeline;
@@ -36,6 +29,18 @@ import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
 import org.apache.hudi.common.util.FSUtils;
 import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ReflectionUtils;
+import org.apache.hudi.common.util.StringUtils;
+
+import com.beust.jcommander.IValueValidator;
+import com.beust.jcommander.JCommander;
+import com.beust.jcommander.Parameter;
+import com.beust.jcommander.ParameterException;
+import com.google.common.collect.ImmutableList;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
 import org.apache.spark.api.java.JavaSparkContext;
@@ -45,41 +50,66 @@ import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.SaveMode;
 import org.apache.spark.sql.SparkSession;
-import org.apache.spark.sql.execution.datasources.DataSource;
-
-import scala.Tuple2;
-import scala.collection.JavaConversions;
 
 import java.io.IOException;
 import java.io.Serializable;
 import java.util.ArrayList;
-import java.util.Arrays;
 import java.util.List;
 import java.util.stream.Collectors;
 
+import scala.Tuple2;
+import scala.collection.JavaConversions;
+
 /**
  * Export the latest records of Hudi dataset to a set of external files (e.g., 
plain parquet files).
  *
  * @experimental This export is an experimental tool. If you want to export 
hudi to hudi, please use HoodieSnapshotCopier.
  */
 public class HoodieSnapshotExporter {
+
+  @FunctionalInterface
+  public interface Partitioner {
+
+DataFrameWriter partition(Dataset source);
+
+  }
+
   private static final Logger LOG = 
LogManager.getLogger(HoodieSnapshotExporter.class);
 
+  public static class OutputFormatValidator implements IValueValidator 
{
+
+static final String HUDI = "hudi";
+static final List FORMATS = ImmutableList.of("json", "parquet", 
HUDI);
+
+@Override
+public void validate(String name, String value) {
+  if (value == null || !FORMATS.contains(value)) {
+throw new ParameterException(
+String.format("Invalid output format: value:%s: supported 
formats:%s", value, FORMATS));
+  }
+}
+  }
+
   public static class Config implements Serializable {
+
 @Parameter(names = {"--source-base-path"}, description = "Base path for 
the source Hudi dataset to be snapshotted", required = true)
-String sourceBasePath = null;
+String sourceBasePath;
 
-@Parameter(names = {"--target-base-path"}, description = "Base path for 
the target output files (snapshots)", required = true)
-String targetOutputPath = null;
+@Parameter(names = {"--target-output-path"}, description = "Base path for 
the target output files (snapshots)", required = true)
+String targetOutputPath;
 
-@Parameter(names = {"--output-format"}, description = "e.g. Hudi or 
Parquet", required = true)
+@Parameter(names = {"--output-format"}, description = "Output 

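For illustration, a minimal sketch of a user-supplied Partitioner for the new 
interface added above (the generic parameters and the "region" column are 
assumptions, since the archived diff is truncated; the ReflectionUtils import 
suggests such a class would be loaded by name):

import org.apache.hudi.utilities.HoodieSnapshotExporter;

import org.apache.spark.sql.DataFrameWriter;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RegionPartitioner implements HoodieSnapshotExporter.Partitioner {

  // write the exported snapshot into one output directory per "region" value
  @Override
  public DataFrameWriter<Row> partition(Dataset<Row> source) {
    return source.repartition(source.col("region"))
        .write()
        .partitionBy("region");
  }
}
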
[GitHub] [incubator-hudi] leesf merged pull request #1405: [HUDI-344] Add partitioner param to Exporter

2020-03-18 Thread GitBox
leesf merged pull request #1405: [HUDI-344] Add partitioner param to Exporter
URL: https://github.com/apache/incubator-hudi/pull/1405
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
codecov-io commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600535972
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=h1) 
Report
   > Merging 
[#1417](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/0a4902eccece1df959946fcb7379a94fc5fe0784=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1417/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1417  +/-   ##
   
   + Coverage 67.73%   67.74%   +0.01% 
 Complexity  243  243  
   
 Files   338  338  
 Lines 1638316383  
 Branches   1675 1675  
   
   + Hits  1109711099   +2 
   + Misses 4546 4544   -2 
 Partials740  740  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...g/apache/hudi/metrics/InMemoryMetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1417/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9Jbk1lbW9yeU1ldHJpY3NSZXBvcnRlci5qYXZh)
 | `75.00% <0.00%> (+50.00%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=footer).
 Last update 
[0a4902e...f700c3b](https://codecov.io/gh/apache/incubator-hudi/pull/1417?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua commented on issue #1417: [HUDI-720] NOTICE file needs to add more 
content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417#issuecomment-600522459
 
 
   It would be better to have @steveblackmon review this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-720) NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-720:

Labels: pull-request-available  (was: )

> NOTICE file needs to add more content based on the NOTICE files of the ASF 
> projects that hudi bundles
> -
>
> Key: HUDI-720
> URL: https://issues.apache.org/jira/browse/HUDI-720
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Blocker
>  Labels: pull-request-available
>
> Based on Justin's suggestion on the general@ voting thread[1]. The NOTICE file 
> needs more work.
> [1]: 
> http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua opened a new pull request #1417: [HUDI-720] NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread GitBox
yanghua opened a new pull request #1417: [HUDI-720] NOTICE file needs to add 
more content based on the NOTICE files of the ASF projects that hudi bundles
URL: https://github.com/apache/incubator-hudi/pull/1417
 
 
   
   
   ## What is the purpose of the pull request
   
   *This pull request adds more content based on the NOTICE files of the ASF 
projects that hudi bundles*
   
   ## Brief change log
   
 - *NOTICE file needs to add more content based on the NOTICE files of the 
ASF projects that hudi bundles*
   
   ## Verify this pull request
   
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-720) NOTICE file needs to add more content based on the NOTICE files of the ASF projects that hudi bundles

2020-03-18 Thread vinoyang (Jira)
vinoyang created HUDI-720:
-

 Summary: NOTICE file needs to add more content based on the NOTICE 
files of the ASF projects that hudi bundles
 Key: HUDI-720
 URL: https://issues.apache.org/jira/browse/HUDI-720
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Release & Administrative
Reporter: vinoyang
Assignee: vinoyang


Based on Justin's suggestion on the general@ voting thread[1]. The NOTICE file 
needs more work.

[1]: 
http://mail-archives.apache.org/mod_mbox/incubator-general/202003.mbox/%3C932F44A0-1CEE-4549-896B-70FB61EAA034%40classsoftware.com%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2020-03-18 Thread cdmikechen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061447#comment-17061447
 ] 

cdmikechen commented on HUDI-83:


[~vinoth] Sorry for the delay. I have recently been busy with some other things, but I 
may have some new ideas about the problem of Hive not reading the correct 
timestamp type. I'll take a look soon and run a verification.

In Hudi 0.5.1, I temporarily modified the code in Hudi and Hive for the timestamp 
type to solve this problem. I also want to do it by extending a Hive class, in an 
appropriate way, instead of changing the Hive source code.

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration, Usability
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> [https://github.com/apache/incubator-hudi/issues/543] ; related issues 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)