[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-18 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575977167
 
 
   `TestCsvDFSSource` will be added once #1239 is merged.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1250: [HUDI-557] Additional job for supporting multiple version docs

2020-01-18 Thread GitBox
lamber-ken commented on issue #1250: [HUDI-557] Additional job for supporting 
multiple version docs
URL: https://github.com/apache/incubator-hudi/pull/1250#issuecomment-575976889
 
 
   ## Main things:
   - Add versions to `previous_docs`.
   - Add `0.5.0_cn_docs` in `navigation.yml` file.
   - Support an out-of-date version tip.
   - Quick review, https://lamber-ken.github.io
   
   
   
   
![image](https://user-images.githubusercontent.com/20113411/72677042-e94e7e00-3ad2-11ea-885f-7f085a23c8cb.png)
   
   
![image](https://user-images.githubusercontent.com/20113411/72677047-fbc8b780-3ad2-11ea-9316-7ede45c4acd9.png)
   
   
![image](https://user-images.githubusercontent.com/20113411/72677063-3599be00-3ad3-11ea-8bf1-6401cd6810f7.png)
   
   
   
   
   
   
   




[GitHub] [incubator-hudi] yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta Streamer

2020-01-18 Thread GitBox
yihua commented on issue #1165: [HUDI-76] Add CSV Source support for Hudi Delta 
Streamer
URL: https://github.com/apache/incubator-hudi/pull/1165#issuecomment-575976539
 
 
   @bvaradar This PR is ready for review.
   
   @leesf @vinothchandar Feel free to also review this PR.  I'm not sure if we 
can merge this PR by the release cut.  If not, we can add this feature to the 
next release.
   
   Thanks @UZi5136225 for helping test the functionality of this PR and reporting the [issue](https://issues.apache.org/jira/browse/HUDI-552) of corrupt data generated from DeltaStreamer with text files (CSV format with no header line).  The latter has been fixed in another [PR](https://github.com/apache/incubator-hudi/pull/1246).




[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Description: 
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet files in Spark, all fields are automatically [converted to be nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] for compatibility reasons.  If the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to convert each Row to an Avro record.  The `dataType` has nullable fields based on Spark logic, even though the field names are identical to those in the source Avro schema.  Thus the resulting Avro records from the conversion have a different schema (differing only in nullability) compared to the source schema file.  Before inserting the records, other operations use the source schema file, causing serialization/deserialization failures because of this schema mismatch.
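
The nullability flip is easy to reproduce outside of Hudi as well.  The snippet below is a minimal sketch (the local SparkSession and the scratch output path are only for illustration) showing how a field declared non-nullable ends up nullable after a Parquet round trip; this nullable `dataType` is then what `AvroConversionUtils.createRdd` uses for the Row-to-Avro conversion:

{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class ParquetNullabilityDemo {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]").appName("parquet-nullability-demo").getOrCreate();

    // "id" is declared non-nullable, mirroring a required field in the source Avro schema.
    StructType schema = new StructType()
        .add("id", DataTypes.LongType, false)
        .add("name", DataTypes.StringType, true);
    List<Row> rows = Arrays.asList(RowFactory.create(1L, "a"), RowFactory.create(2L, "b"));
    Dataset<Row> df = spark.createDataFrame(rows, schema);

    String out = "/tmp/parquet-nullability-demo";  // hypothetical scratch path
    df.write().mode("overwrite").parquet(out);

    // After the Parquet round trip, every field is reported as nullable = true,
    // so the DataFrame schema no longer matches the original required Avro fields.
    spark.read().parquet(out).printSchema();

    spark.stop();
  }
}
{code}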

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

 

Note that for some Avro schemas, the DeltaStreamer sync may succeed but generate corrupt data.  This behavior was originally reported by [~liujinhui].

  was:
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet 
files in Spark, all fields are automatically [converted to be 
nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
for compatibility reasons.  If the source Avro schema has non-null fields, 
`AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to 
convert the Row to Avro record.  The `dataType` has nullable fields based on 
Spark logic, even though the field names are identical as the source Avro 
schema.  Thus the resulting Avro records from the conversion have different 
schema (only nullability difference) compared to the source schema file.  
Before inserting the records, there are other operations using the source 
schema file, causing failure of serialization/deserialization because of this 
schema mismatch.

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!


> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical as the 

[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Description: 
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet files in Spark, all fields are automatically [converted to be nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] for compatibility reasons.  If the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to convert each Row to an Avro record.  The `dataType` has nullable fields based on Spark logic, even though the field names are identical to those in the source Avro schema.  Thus the resulting Avro records from the conversion have a different schema (differing only in nullability) compared to the source schema file.  Before inserting the records, other operations use the source schema file, causing serialization/deserialization failures because of this schema mismatch.

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

 

Note that for some Avro schemas, the DeltaStreamer sync may succeed but generate corrupt data.  This behavior of generating corrupt data was originally reported by [~liujinhui].

  was:
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet 
files in Spark, all fields are automatically [converted to be 
nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
for compatibility reasons.  If the source Avro schema has non-null fields, 
`AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to 
convert the Row to Avro record.  The `dataType` has nullable fields based on 
Spark logic, even though the field names are identical as the source Avro 
schema.  Thus the resulting Avro records from the conversion have different 
schema (only nullability difference) compared to the source schema file.  
Before inserting the records, there are other operations using the source 
schema file, causing failure of serialization/deserialization because of this 
schema mismatch.

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

 

Note that for some Avro schema, the DeltaStreamer sync may succeed but generate 
corrupt data.  This behavior is originally reported by [~liujinhui].


> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses 

[jira] [Updated] (HUDI-543) Carefully draft release notes for 0.5.1 with all breaking/user impacting changes

2020-01-18 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-543:
---
Status: In Progress  (was: Open)

> Carefully draft release notes for 0.5.1 with all breaking/user impacting 
> changes
> 
>
> Key: HUDI-543
> URL: https://issues.apache.org/jira/browse/HUDI-543
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Call out all breaking changes:
>  * Spark 2.4 support drop, avro version change etc.
>  * Need for shading custom Payloads 
>  * --packages for spark-shell 
>  * key generator changes 
>  * _ro suffix for read optimized views.. 
>  
> Also need to call out major release highlights (quoting docs/blogs as 
> available)
>  * better delete support
>  * dynamic bloom filters
>  * DMS support
>  
>  
> I am also linking the different JIRAs as subtasks





[jira] [Updated] (HUDI-556) Check if any license attribution is needed for PR 1233

2020-01-18 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-556:
---
Status: Open  (was: New)

> Check if any license attribution is needed for PR 1233
> --
>
> Key: HUDI-556
> URL: https://issues.apache.org/jira/browse/HUDI-556
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> We need to resolve this before we prepare a vote





[jira] [Updated] (HUDI-556) Check if any license attribution is needed for PR 1233

2020-01-18 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-556:
---
Status: In Progress  (was: Open)

> Check if any license attribution is needed for PR 1233
> --
>
> Key: HUDI-556
> URL: https://issues.apache.org/jira/browse/HUDI-556
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release  Administrative
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> We need to resolve this before we prepare a vote





[jira] [Updated] (HUDI-557) Additional job for supporting multiple version docs

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-557:

Labels: pull-request-available  (was: )

> Additional job for supporting multiple version docs
> ---
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>  Labels: pull-request-available
>






[GitHub] [incubator-hudi] lamber-ken closed pull request #1251: [HUDI-557] Additional job for supporting multiple version docs

2020-01-18 Thread GitBox
lamber-ken closed pull request #1251: [HUDI-557] Additional job for supporting 
multiple version docs
URL: https://github.com/apache/incubator-hudi/pull/1251
 
 
   




[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1250: [HUDI-557] Additional job for supporting multiple version docs

2020-01-18 Thread GitBox
lamber-ken opened a new pull request #1250: [HUDI-557] Additional job for 
supporting multiple version docs
URL: https://github.com/apache/incubator-hudi/pull/1250
 
 
   ## What is the purpose of the pull request
   
   Additional job for supporting multiple version docs.
   
   ## Brief change log
   
 - *Additional job for supporting multiple version docs.*
   
   ## Verify this pull request
   
   This pull request is a web doc cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1251: [HUDI-557] Additional job for supporting multiple version docs

2020-01-18 Thread GitBox
lamber-ken opened a new pull request #1251: [HUDI-557] Additional job for 
supporting multiple version docs
URL: https://github.com/apache/incubator-hudi/pull/1251
 
 
   ## What is the purpose of the pull request
   
   Additional job for supporting multiple version docs.
   
   ## Brief change log
   
 - *Additional job for supporting multiple version docs.*
   
   ## Verify this pull request
   
   This pull request is a web doc cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #1239: [HUDI-551] Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread GitBox
yihua commented on a change in pull request #1239: [HUDI-551] Abstract a test 
case class for DFS Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239#discussion_r368268103
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/AbstractDFSSourceTestBase.java
 ##
 @@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.AvroConversionUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.utilities.UtilitiesTestBase;
+import org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter;
+import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * An abstract test base for {@link Source} using DFS as the file system.
+ */
+public abstract class AbstractDFSSourceTestBase extends UtilitiesTestBase {
+
+  FilebasedSchemaProvider schemaProvider;
+  String dfsRoot;
+  String fileSuffix;
+  int fileCount = 1;
+  HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+
+  @BeforeClass
+  public static void initClass() throws Exception {
+    UtilitiesTestBase.initClass();
+  }
+
+  @AfterClass
+  public static void cleanupClass() throws Exception {
+    UtilitiesTestBase.cleanupClass();
+  }
+
+  @Before
+  public void setup() throws Exception {
+    super.setup();
+    schemaProvider = new FilebasedSchemaProvider(Helpers.setupSchemaOnDFS(), jsc);
+  }
+
+  @After
+  public void teardown() throws Exception {
+    super.teardown();
+  }
+
+  /**
+   * Prepares the specific {@link Source} to test, by passing in necessary configurations.
+   *
+   * @return A {@link Source} using DFS as the file system.
+   */
+  abstract Source prepareDFSSource();
+
+  /**
+   * Writes test data, i.e., a {@link List} of {@link HoodieRecord}, to a file on DFS.
+   *
+   * @param records Test data.
+   * @param path    The path in {@link Path} of the file to write.
+   * @throws IOException
+   */
+  abstract void writeNewDataToFile(List<HoodieRecord> records, Path path) throws IOException;
+
+  /**
+   * Generates a batch of test data and writes the data to a file.  This can be called multiple times to generate multiple files.
+   *
+   * @return The {@link Path} of the file.
+   * @throws IOException
+   */
+  Path generateOneFile() throws IOException {
+    Path path = new Path(dfsRoot, fileCount + fileSuffix);
+    switch (fileCount) {
+      case 1:
+        writeNewDataToFile(dataGenerator.generateInserts("000", 100), path);
+        fileCount++;
+        return path;
+      case 2:
+        writeNewDataToFile(dataGenerator.generateInserts("001", 1), path);
+        fileCount++;
+        return path;
+      default:
+        return null;
+    }
+  }
 
 Review comment:
   In my latest commit, as we discussed, I parameterized this method to take 
the file name, commit time String and the number of records to generate, to 
make it easier to understand and use.
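   
   For anyone reading this thread without the latest commit, a rough sketch of that parameterized shape (parameter and method names here are illustrative, not necessarily the exact ones in the commit), reusing the fields and helpers already defined in this class:
   
```java
  /**
   * Writes one batch of generated test data to a new file under dfsRoot.
   *
   * @param filename   Name of the file to create (illustrative parameter).
   * @param commitTime Commit time stamped on the generated records.
   * @param n          Number of records to generate.
   * @return The {@link Path} of the written file.
   */
  Path generateOneFile(String filename, String commitTime, int n) throws IOException {
    Path path = new Path(dfsRoot, filename + fileSuffix);
    writeNewDataToFile(dataGenerator.generateInserts(commitTime, n), path);
    return path;
  }
```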




[jira] [Updated] (HUDI-557) Additional job based on supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-557:

Summary: Additional job based on supporting multiple version docs  (was: 
Additional job base on supporting multiple version docs)

> Additional job based on supporting multiple version docs
> 
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>






[jira] [Updated] (HUDI-557) Additional job for supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-557:

Summary: Additional job for supporting multiple version docs  (was: 
Additional job on supporting multiple version docs)

> Additional job for supporting multiple version docs
> ---
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>






[jira] [Updated] (HUDI-557) Additional job on supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-557:

Summary: Additional job on supporting multiple version docs  (was: 
Additional job based on supporting multiple version docs)

> Additional job on supporting multiple version docs
> --
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>






[jira] [Updated] (HUDI-557) Additional job base on supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-557:

Status: Open  (was: New)

> Additional job base on supporting multiple version docs
> ---
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>






[jira] [Assigned] (HUDI-557) Additional job base on supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-557:
---

Assignee: lamber-ken

> Additional job base on supporting multiple version docs
> ---
>
> Key: HUDI-557
> URL: https://issues.apache.org/jira/browse/HUDI-557
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Critical
>






[jira] [Commented] (HUDI-545) Throw NoSuchElementException when init IncrementalRelation

2020-01-18 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018803#comment-17018803
 ] 

lamber-ken commented on HUDI-545:
-

Thanks for your detailed steps [~ashsih], I will work on this. :)

> Throw NoSuchElementException when init IncrementalRelation
> --
>
> Key: HUDI-545
> URL: https://issues.apache.org/jira/browse/HUDI-545
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
> Attachments: image-2020-01-17-10-03-23-268.png, 
> image-2020-01-17-10-07-31-105.png
>
>
> If there is an empty commit in HUDI storage then Incremental Pulling throws 
> "java.util.NoSuchElementException". 
> {code:java}
> 20/01/16 19:22:49 ERROR Client: Application diagnostics message: User class 
> threw exception: java.util.NoSuchElementException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
> at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
> at 
> org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
> at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> at 
> com.amazon.finautopolarisdataplanesparkemr.emr.java.SparkHiveDataLoadCoreHudiReader.readData(SparkHiveDataLoadCoreHudiReader.java:147)
> at 
> com.amazon.finautopolarisdataplanesparkemr.emr.java.SparkHiveDataLoadCoreHudiReader.start(SparkHiveDataLoadCoreHudiReader.java:73)
> at 
> com.amazon.finautopolarisdataplanesparkemr.emr.java.SparkHiveDataLoadCoreHudiReader.main(SparkHiveDataLoadCoreHudiReader.java:36)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:684)
> Exception in thread "main" org.apache.spark.SparkException: Application 
> application_1579116139216_0082 finished with failed status
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
> at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
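
For context, the stack trace above boils down to calling `next()` on the iterator of an empty map without checking `hasNext()` first; an empty commit leaves the commit-metadata map with nothing to iterate over.  A minimal, Hudi-independent sketch of that failure mode and the obvious guard (the map name below is a stand-in, not the actual field in `IncrementalRelation`):

{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

public class EmptyCommitIteratorDemo {
  public static void main(String[] args) {
    // Stand-in for the per-commit metadata map; an empty commit leaves it empty.
    Map<String, String> commitFiles = new HashMap<>();

    Iterator<String> it = commitFiles.values().iterator();
    try {
      it.next();  // throws java.util.NoSuchElementException, as in the stack trace above
    } catch (NoSuchElementException e) {
      System.out.println("next() on an empty iterator: " + e);
    }

    // Guarded version: only consume the iterator when it actually has elements.
    if (it.hasNext()) {
      System.out.println(it.next());
    } else {
      System.out.println("empty commit: nothing to read, skip instead of failing");
    }
  }
}
{code}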





[jira] [Created] (HUDI-557) Additional job base on supporting multiple version docs

2020-01-18 Thread lamber-ken (Jira)
lamber-ken created HUDI-557:
---

 Summary: Additional job base on supporting multiple version docs
 Key: HUDI-557
 URL: https://issues.apache.org/jira/browse/HUDI-557
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: lamber-ken








[GitHub] [incubator-hudi] lamber-ken commented on issue #1249: [HUDI-555] Support for separate 0.5.0 site docs + instructions for future

2020-01-18 Thread GitBox
lamber-ken commented on issue #1249: [HUDI-555] Support for separate 0.5.0 site 
docs + instructions for future
URL: https://github.com/apache/incubator-hudi/pull/1249#issuecomment-575969720
 
 
   I can do some additional work based on this.




[GitHub] [incubator-hudi] yihua commented on a change in pull request #1239: [HUDI-551] Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread GitBox
yihua commented on a change in pull request #1239: [HUDI-551] Abstract a test 
case class for DFS Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239#discussion_r368266524
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/AbstractDFSSourceTestBase.java
 ##
 @@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.AvroConversionUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.utilities.UtilitiesTestBase;
+import org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter;
+import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * An abstract test base for {@link Source} using DFS as the file system.
+ */
+public abstract class AbstractDFSSourceTestBase extends UtilitiesTestBase {
+
+  FilebasedSchemaProvider schemaProvider;
+  String dfsRoot;
+  String fileSuffix;
+  int fileCount = 1;
+  HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+
+  @BeforeClass
+  public static void initClass() throws Exception {
+    UtilitiesTestBase.initClass();
+  }
+
+  @AfterClass
+  public static void cleanupClass() throws Exception {
+    UtilitiesTestBase.cleanupClass();
+  }
+
+  @Before
+  public void setup() throws Exception {
+    super.setup();
+    schemaProvider = new FilebasedSchemaProvider(Helpers.setupSchemaOnDFS(), jsc);
+  }
+
+  @After
+  public void teardown() throws Exception {
+    super.teardown();
+  }
+
+  /**
+   * Prepares the specific {@link Source} to test, by passing in necessary configurations.
+   *
+   * @return A {@link Source} using DFS as the file system.
+   */
+  abstract Source prepareDFSSource();
+
+  /**
+   * Writes test data, i.e., a {@link List} of {@link HoodieRecord}, to a file on DFS.
+   *
+   * @param records Test data.
+   * @param path    The path in {@link Path} of the file to write.
+   * @throws IOException
+   */
+  abstract void writeNewDataToFile(List<HoodieRecord> records, Path path) throws IOException;
+
+  /**
+   * Generates a batch of test data and writes the data to a file.  This can be called multiple times to generate multiple files.
+   *
+   * @return The {@link Path} of the file.
+   * @throws IOException
+   */
+  Path generateOneFile() throws IOException {
+    Path path = new Path(dfsRoot, fileCount + fileSuffix);
+    switch (fileCount) {
+      case 1:
+        writeNewDataToFile(dataGenerator.generateInserts("000", 100), path);
+        fileCount++;
+        return path;
+      case 2:
+        writeNewDataToFile(dataGenerator.generateInserts("001", 1), path);
+        fileCount++;
+        return path;
+      default:
+        return null;
+    }
+  }
 
 Review comment:
   Each time it's called, the method generates a new file for a batch of data.  
Only two batches are considered.  Any suggestions to make it better?




[GitHub] [incubator-hudi] leesf commented on a change in pull request #1239: [HUDI-551] Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread GitBox
leesf commented on a change in pull request #1239: [HUDI-551] Abstract a test 
case class for DFS Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239#discussion_r368264818
 
 

 ##
 File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/AbstractDFSSourceTestBase.java
 ##
 @@ -0,0 +1,186 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.sources;
+
+import org.apache.hudi.AvroConversionUtils;
+import org.apache.hudi.common.HoodieTestDataGenerator;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.utilities.UtilitiesTestBase;
+import org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter;
+import org.apache.hudi.utilities.schema.FilebasedSchemaProvider;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.LocatedFileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.RemoteIterator;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * An abstract test base for {@link Source} using DFS as the file system.
+ */
+public abstract class AbstractDFSSourceTestBase extends UtilitiesTestBase {
+
+  FilebasedSchemaProvider schemaProvider;
+  String dfsRoot;
+  String fileSuffix;
+  int fileCount = 1;
+  HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
+
+  @BeforeClass
+  public static void initClass() throws Exception {
+    UtilitiesTestBase.initClass();
+  }
+
+  @AfterClass
+  public static void cleanupClass() throws Exception {
+    UtilitiesTestBase.cleanupClass();
+  }
+
+  @Before
+  public void setup() throws Exception {
+    super.setup();
+    schemaProvider = new FilebasedSchemaProvider(Helpers.setupSchemaOnDFS(), jsc);
+  }
+
+  @After
+  public void teardown() throws Exception {
+    super.teardown();
+  }
+
+  /**
+   * Prepares the specific {@link Source} to test, by passing in necessary configurations.
+   *
+   * @return A {@link Source} using DFS as the file system.
+   */
+  abstract Source prepareDFSSource();
+
+  /**
+   * Writes test data, i.e., a {@link List} of {@link HoodieRecord}, to a file on DFS.
+   *
+   * @param records Test data.
+   * @param path    The path in {@link Path} of the file to write.
+   * @throws IOException
+   */
+  abstract void writeNewDataToFile(List<HoodieRecord> records, Path path) throws IOException;
+
+  /**
+   * Generates a batch of test data and writes the data to a file.  This can be called multiple times to generate multiple files.
+   *
+   * @return The {@link Path} of the file.
+   * @throws IOException
+   */
+  Path generateOneFile() throws IOException {
+    Path path = new Path(dfsRoot, fileCount + fileSuffix);
+    switch (fileCount) {
+      case 1:
+        writeNewDataToFile(dataGenerator.generateInserts("000", 100), path);
+        fileCount++;
+        return path;
+      case 2:
+        writeNewDataToFile(dataGenerator.generateInserts("001", 1), path);
+        fileCount++;
+        return path;
+      default:
+        return null;
+    }
+  }
 
 Review comment:
   I don't quite get the point of this method; it feels a bit tricky.




[jira] [Created] (HUDI-556) Check if any license attribution is needed for PR 1233

2020-01-18 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-556:
---

 Summary: Check if any license attribution is needed for PR 1233
 Key: HUDI-556
 URL: https://issues.apache.org/jira/browse/HUDI-556
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Release  Administrative
Reporter: Vinoth Chandar
Assignee: leesf
 Fix For: 0.5.1


We need to resolve this before we prepare a vote





[GitHub] [incubator-hudi] leesf commented on issue #1239: [HUDI-551] Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread GitBox
leesf commented on issue #1239: [HUDI-551] Abstract a test case class for DFS 
Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239#issuecomment-575963722
 
 
   > > Looks like the auto code reformatting does not reorder the imports.
   > 
   > This is a pain atm. yes.
   > 
   > @yanghua is this good to go?
   
   Will help to review since @yanghua has some other stuff today.




[GitHub] [incubator-hudi] leesf commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-18 Thread GitBox
leesf commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, 
and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237#issuecomment-575963643
 
 
   > given the large number of files touched here, can we land this after the 
freeze..
   
   +1 to merge this after this release.




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #163

2020-01-18 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.04 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4:
bin
boot
conf
lib
LICENSE
NOTICE
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark_2.11[jar]
[INFO] hudi-utilities_2.11[jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle_2.11 [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle_2.11

[jira] [Commented] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018760#comment-17018760
 ] 

Vinoth Chandar commented on HUDI-555:
-

Done with a cut.. New PRs can now land... [~xleesf] we will still regenerate 
the site once again after the release is pushed.. 

[https://hudi.apache.org/docs/0.5.0-quick-start-guide.html]

i.e. please don't regenerate the site with new content until the release is voted on and we have the artifacts.. 

Also as you go through this, please keep the release guide up to date

 

cc [~vbalaji]

> Move 0.5.0 release docs into a separate page hierarchy 
> ---
>
> Key: HUDI-555
> URL: https://issues.apache.org/jira/browse/HUDI-555
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [~xleesf] I am going to try to move the existing docs manually to a new sub 
> structure, linked off the `Releases` page.. This way, users who can't yet 
> upgrade also will be able to access the old docs.. 
>  
> [~bhasudha] this needs to happen before you or [~shivnarayan] land the 
> changes to docs, based on current code.  Just FYI . I will take care of the 
> orchestration 





[jira] [Resolved] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-555.
-
Resolution: Fixed

> Move 0.5.0 release docs into a separate page hierarchy 
> ---
>
> Key: HUDI-555
> URL: https://issues.apache.org/jira/browse/HUDI-555
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [~xleesf] I am going to try to move the existing docs manually to a new sub 
> structure, linked off the `Releases` page.. This way, users who can't yet 
> upgrade also will be able to access the old docs.. 
>  
> [~bhasudha] this needs to happen before you or [~shivnarayan] land the 
> changes to docs, based on current code.  Just FYI . I will take care of the 
> orchestration 





[jira] [Updated] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-555:

Status: Open  (was: New)

> Move 0.5.0 release docs into a separate page hierarchy 
> ---
>
> Key: HUDI-555
> URL: https://issues.apache.org/jira/browse/HUDI-555
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [~xleesf] I am going to try to move the existing docs manually to a new sub 
> structure, linked off the `Releases` page.. This way, users who can't yet 
> upgrade also will be able to access the old docs.. 
>  
> [~bhasudha] this needs to happen before you or [~shivnarayan] land the 
> changes to docs, based on current code.  Just FYI . I will take care of the 
> orchestration 





[GitHub] [incubator-hudi] vinothchandar merged pull request #1249: [HUDI-555] Support for separate 0.5.0 site docs + instructions for future

2020-01-18 Thread GitBox
vinothchandar merged pull request #1249: [HUDI-555] Support for separate 0.5.0 
site docs + instructions for future
URL: https://github.com/apache/incubator-hudi/pull/1249
 
 
   




[jira] [Assigned] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-555:
---

Assignee: Vinoth Chandar

> Move 0.5.0 release docs into a separate page hierarchy 
> ---
>
> Key: HUDI-555
> URL: https://issues.apache.org/jira/browse/HUDI-555
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [~xleesf] I am going to try to move the existing docs manually to a new sub 
> structure, linked off the `Releases` page.. This way, users who can't yet 
> upgrade also will be able to access the old docs.. 
>  
> [~bhasudha] this needs to happen before you or [~shivnarayan] land the 
> changes to docs, based on current code.  Just FYI . I will take care of the 
> orchestration 





[jira] [Updated] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-555:

Labels: pull-request-available  (was: )

> Move 0.5.0 release docs into a separate page hierarchy 
> ---
>
> Key: HUDI-555
> URL: https://issues.apache.org/jira/browse/HUDI-555
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Docs
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> [~xleesf] I am going to try to move the existing docs manually to a new sub 
> structure, linked off the `Releases` page.. This way, users who can't yet 
> upgrade also will be able to access the old docs.. 
>  
> [~bhasudha] this needs to happen before you or [~shivnarayan] land the 
> changes to docs, based on current code.  Just FYI . I will take care of the 
> orchestration 





[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1249: [HUDI-555] Support for separate 0.5.0 site docs + instructions for future

2020-01-18 Thread GitBox
vinothchandar opened a new pull request #1249: [HUDI-555] Support for separate 
0.5.0 site docs + instructions for future
URL: https://github.com/apache/incubator-hudi/pull/1249
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] hddong edited a comment on issue #1242: HUDI-544 Adjust the read and write path of archive

2020-01-18 Thread GitBox
hddong edited a comment on issue #1242: HUDI-544 Adjust the read and write path 
of archive
URL: https://github.com/apache/incubator-hudi/pull/1242#issuecomment-575956786
 
 
   @vinothchandar It's ok now, please have a review when you are free.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1242: HUDI-544 Adjust the read and write path of archive

2020-01-18 Thread GitBox
hddong commented on issue #1242: HUDI-544 Adjust the read and write path of 
archive
URL: https://github.com/apache/incubator-hudi/pull/1242#issuecomment-575956786
 
 
   @vinothchandar It's ok now; please review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo resolved HUDI-552.

Resolution: Fixed

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical to the source Avro 
> schema.  Thus the resulting Avro records from the conversion have different 
> schema (only nullability difference) compared to the source schema file.  
> Before inserting the records, there are other operations using the source 
> schema file, causing failure of serialization/deserialization because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> !Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!
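
For readers who want to see the nullability flip described above in isolation, here is a minimal, self-contained spark-shell sketch; it is not the Hudi fix itself, and the column names and the /tmp path are made up purely for illustration:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nullability-demo").getOrCreate()
import spark.implicits._

// Two primitive-typed columns: Spark marks these non-nullable when the DataFrame is created.
val df = Seq((1L, 4.5), (2L, 7.9)).toDF("timestamp", "fare")
df.schema.fields.foreach(f => println(s"before parquet: ${f.name} nullable=${f.nullable}"))

// After a parquet round trip every field comes back nullable, which is exactly the
// difference that made the DataFrame-derived Avro schema diverge from a user-provided
// source schema with non-null fields.
df.write.mode("overwrite").parquet("/tmp/hudi_nullability_demo")
val readBack = spark.read.parquet("/tmp/hudi_nullability_demo")
readBack.schema.fields.foreach(f => println(s"after parquet:  ${f.name} nullable=${f.nullable}"))
```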



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar merged pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread GitBox
vinothchandar merged pull request #1246: [HUDI-552] Fix the schema mismatch in 
Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yihua commented on issue #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread GitBox
yihua commented on issue #1246: [HUDI-552] Fix the schema mismatch in 
Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246#issuecomment-575950765
 
 
   @vinothchandar I fixed the tests.  Locally they passed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-479) Eliminate use of guava if possible

2020-01-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-479:
---
Fix Version/s: 0.5.2

> Eliminate use of guava if possible
> --
>
> Key: HUDI-479
> URL: https://issues.apache.org/jira/browse/HUDI-479
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi commented on issue #1245: [MINOR] Replace Collection.size > 0 with Collection.isEmpty()

2020-01-18 Thread GitBox
smarthi commented on issue #1245: [MINOR] Replace Collection.size > 0 with 
Collection.isEmpty()
URL: https://github.com/apache/incubator-hudi/pull/1245#issuecomment-575941333
 
 
   > I am not sure if this is really improving things. It's very subjective. I'd 
argue reading `!isEmpty()` is harder than a simple positive check..
   > 
   > Not sure if I want to really make this change
   
   I agree; maybe consider rebasing this PR once PR #1159 is merged after the 
0.5.1 release. PR #1159 introduces CollectionUtils, so you could add this as a 
method there.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018718#comment-17018718
 ] 

Vinoth Chandar commented on HUDI-403:
-

[~vbalaji] I will take a stab at this..  documenting some general principles 

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-403) Publish a deployment guide talking about deployment options, upgrading etc

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-403:
---

Assignee: Vinoth Chandar  (was: Balaji Varadarajan)

> Publish a deployment guide talking about deployment options, upgrading etc
> --
>
> Key: HUDI-403
> URL: https://issues.apache.org/jira/browse/HUDI-403
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Things to cover 
>  # Upgrade readers first, Upgrade writers next, Principles of compatibility 
> followed
>  # DeltaStreamer Deployment models
>  # Scheduling Compactions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-555) Move 0.5.0 release docs into a separate page hierarchy

2020-01-18 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-555:
---

 Summary: Move 0.5.0 release docs into a separate page hierarchy 
 Key: HUDI-555
 URL: https://issues.apache.org/jira/browse/HUDI-555
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Docs
Reporter: Vinoth Chandar
 Fix For: 0.5.1


[~xleesf] I am going to try to move the existing docs manually to a new sub 
structure, linked off the `Releases` page.. This way, users who can't yet 
upgrade also will be able to access the old docs.. 

 

[~bhasudha] this needs to happen before you or [~shivnarayan] land the changes 
to docs, based on current code.  Just FYI . I will take care of the 
orchestration 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-552:

Status: Open  (was: New)

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical to the source Avro 
> schema.  Thus the resulting Avro records from the conversion have different 
> schema (only nullability difference) compared to the source schema file.  
> Before inserting the records, there are other operations using the source 
> schema file, causing failure of serialization/deserialization because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> !Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-547) Call out changes in package names due to scala cross compiling support

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-547:

Status: Open  (was: New)

> Call out changes in package names due to scala cross compiling support
> --
>
> Key: HUDI-547
> URL: https://issues.apache.org/jira/browse/HUDI-547
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Two versions of each of the below packages need to be built. 
> hudi-spark becomes hudi-spark_2.11 and hudi-spark_2.12
> hudi-utilities becomes hudi-utilities_2.11 and hudi-utilities_2.12
> hudi-spark-bundle becomes hudi-spark-bundle_2.11 and hudi-spark-bundle_2.12
> hudi-utilities-bundle becomes hudi-utilities-bundle_2.11 and 
> hudi-utilities-bundle_2.12
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-551) Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-551:

Status: Open  (was: New)

> Abstract a test case class for DFS Source to make it extensible
> ---
>
> Key: HUDI-551
> URL: https://issues.apache.org/jira/browse/HUDI-551
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Testing
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
>
> * Create a new class {{AbstractDFSSourceTestBase}} based on 
> {{DFSSourceTestCase}} in the last commit
>  * The common test logic still resides in {{AbstractDFSSourceTestBase}}
>  * For each DFS Source class, extend from {{AbstractDFSSourceTestBase}} to 
> add source-specific test logic
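
As a purely illustrative sketch of that layout (only the class names come from the ticket; the method names and the CSV fixture below are assumptions):

```
// Shared test logic lives in the abstract base; each DFS source test only
// supplies its format-specific pieces. Method names here are illustrative.
abstract class AbstractDFSSourceTestBase {
  def fileSuffix: String                        // e.g. ".csv", ".json", ".parquet"
  def writeNewDataToFile(path: String): Unit    // format-specific fixture writer

  // Common test flow reused by every concrete DFS source test.
  def testReadingFromSource(): Unit = {
    val inputPath = s"/tmp/dfs-source-test/input$fileSuffix"
    writeNewDataToFile(inputPath)
    // ... point the source under test at inputPath, fetch a batch and assert on it ...
  }
}

// Example of a concrete subclass; TestCsvDFSSource is the CSV test referenced
// elsewhere in this thread, but its body here is only a placeholder.
class TestCsvDFSSource extends AbstractDFSSourceTestBase {
  override def fileSuffix: String = ".csv"
  override def writeNewDataToFile(path: String): Unit = {
    // write a small CSV fixture to `path` (omitted here)
  }
}
```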



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-550) Add to Release Notes : Configuration Value change for Kafka Reset Offset Strategies

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-550:

Status: Open  (was: New)

> Add to Release Notes : Configuration Value change for Kafka Reset Offset 
> Strategies
> ---
>
> Key: HUDI-550
> URL: https://issues.apache.org/jira/browse/HUDI-550
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> Enum values are changed for configuring Kafka reset offset strategies in 
> DeltaStreamer:
>    LARGEST -> LATEST
>   SMALLEST -> EARLIEST
>  
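
Purely as an illustration of what the rename means for a user's DeltaStreamer properties; the property key below is the standard Kafka consumer key and is an assumption here, so verify the exact key against the Kafka source config in hudi-utilities before relying on it:

```
val props = new java.util.Properties()
// Before this release (old enum names):
//   props.setProperty("auto.offset.reset", "LARGEST")
//   props.setProperty("auto.offset.reset", "SMALLEST")
// From this release on (Kafka-style names):
props.setProperty("auto.offset.reset", "LATEST")      // was LARGEST
// props.setProperty("auto.offset.reset", "EARLIEST") // was SMALLEST
```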



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-548) Update HowToRelease Page to release hudi with both versions of scala

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-548:

Status: Open  (was: New)

> Update HowToRelease Page to release hudi with both versions of scala
> 
>
> Key: HUDI-548
> URL: https://issues.apache.org/jira/browse/HUDI-548
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Blocker
> Fix For: 0.5.1
>
>
> We will be changing the names of the packages hudi-spark, hudi-utilities, 
> hudi-spark-bundle and hudi-utilities-bundle to include the Scala version. As part 
> of building jars when releasing, we would have to run "mvn clean install xxx" 
> twice: once without any additional settings to build the Scala 2.11 versions, and 
> then run
> dev/change-scala-version 2.12
> mvn -Pscala-2.12 clean install
> for 2.12
>  
> We need to update the ReleaseNotes with this information



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
vinothchandar commented on a change in pull request #1248: Adding delete docs 
to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248651
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   Lets have it after incremental query.. deletes will conclude the flow of 
writing and reading nicely


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
nsivabalan commented on a change in pull request #1248: Adding delete docs to 
QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248588
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   I can move it as the last section. Hope thats fine. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
vinothchandar commented on a change in pull request #1248: Adding delete docs 
to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248589
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   lets just delete a few existing records? and show that.. you can use 
`.limit(2)` to say get just 2 records out of the existing table and delete it 
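
A sketch of what that suggestion could look like inside the quickstart's spark-shell session. It assumes the session state from the snippets above (tableName, basePath, dataGen, the datasource option imports), the generateDeletes helper added by the quick-start util PR (#1225), and that the datasource writer accepts a "delete" value for its operation option in this release:

```
// pick two existing records from the read-optimized view registered earlier
val toDelete = spark.sql("select uuid, partitionPath from hudi_ro_table").limit(2)

// turn them into delete records and write them back with the delete operation
val deletes = dataGen.generateDeletes(toDelete.collectAsList())
val deleteDF = spark.read.json(spark.sparkContext.parallelize(deletes, 2))
deleteDF.write.format("org.apache.hudi").
  options(getQuickstartWriteConfigs).
  option(OPERATION_OPT_KEY, "delete").
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid").
  option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
  option(TABLE_NAME, tableName).
  mode(Append).
  save(basePath)
```

Querying hudi_ro_table again should then return two fewer records.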


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
nsivabalan commented on a change in pull request #1248: Adding delete docs to 
QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248561
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   So, if I do the same with initial insert batch, then all records will be 
deleted. But don't want to disrupt the flow for rest of the quick start.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
nsivabalan commented on a change in pull request #1248: Adding delete docs to 
QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368248364
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   I am deleting an entire batch of inserts and hence thought will do a new 
batch of inserts and delete the entire batch. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
vinothchandar commented on issue #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-575928283
 
 
   yes.. we will update the site as we release 0.5.1. It's on my and Sudha's 
plate 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
vinothchandar commented on a change in pull request #1248: Adding delete docs 
to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368242317
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
+val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
+df.write.format("org.apache.hudi").
+options(getQuickstartWriteConfigs).
+option(PRECOMBINE_FIELD_OPT_KEY, "ts").
+option(RECORDKEY_FIELD_OPT_KEY, "uuid").
+option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
+option(TABLE_NAME, tableName).
+mode(Overwrite).
+save(basePath);
+
+// Fetch the rider value for the batch of records inserted just now
+val roDeleteViewDF = spark.
+read.
+format("org.apache.hudi").
+load(basePath + "/*/*/*/*")
+roDeleteViewDF.registerTempTable("hudi_ro_table")
+spark.sql("select distinct rider from  hudi_ro_table where").show()
+
+// replace the rider value in below query to a value from above. "rider-213" 
is first batch and "rider-284" is second batch.
+val ds = spark.sql("select uuid, partitionPath from hudi_ro_table where rider 
= 'rider-284'")
+
+// issue deletes
 
 Review comment:
   ideally,. we just have this part and remove everything above this line, to 
keep the quickstart small 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
vinothchandar commented on a change in pull request #1248: Adding delete docs 
to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#discussion_r368242273
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -109,6 +109,57 @@ Notice that the save mode is now `Append`. In general, 
always use append mode un
 [Querying](#query) the data again will now show updated trips. Each write 
operation generates a new 
[commit](http://hudi.incubator.apache.org/concepts.html) 
 denoted by the timestamp. Look for changes in `_hoodie_commit_time`, `rider`, 
`driver` fields for the same `_hoodie_record_key`s in previous commit. 
 
+## Delete data {#deletes}
+Delete records for the HoodieKeys passed in. Lets first generate a new batch 
of insert and delete the same. Query to verify
+that all records are deleted.
+
+```
+val inserts = convertToStringList(dataGen.generateInserts(10))
 
 Review comment:
   could we avoid doing the insert again?  can we not reuse from the 
insert/update done above? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
nsivabalan commented on issue #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248#issuecomment-575927177
 
 
   @bhasudha @vinothchandar : while I am at it, do you think we can fix 
the Spark set-up instructions in the quick start: "Hudi works with Spark-2.x 
versions. You can follow instructions here for setting up spark. From the 
extracted directory run spark-shell with Hudi as:"
   
   We should add a note that the locally installed Spark version and the Spark 
version passed in --packages need to match. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan opened a new pull request #1248: Adding delete docs to QuickStart

2020-01-18 Thread GitBox
nsivabalan opened a new pull request #1248: Adding delete docs to QuickStart
URL: https://github.com/apache/incubator-hudi/pull/1248
 
 
   - Adding delete docs to QuickStart
   
   ## Verify this pull request
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   Verified locally.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1237: [MINOR] Code Cleanup, remove redundant code, and other changes

2020-01-18 Thread GitBox
vinothchandar commented on issue #1237: [MINOR] Code Cleanup, remove redundant 
code, and other changes
URL: https://github.com/apache/incubator-hudi/pull/1237#issuecomment-575926740
 
 
   given the large number of files touched here, can we land this after the 
freeze..  
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1230: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType

2020-01-18 Thread GitBox
vinothchandar commented on issue #1230: java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType
URL: https://github.com/apache/incubator-hudi/issues/1230#issuecomment-575926043
 
 
   Logical types are only in avro 1.8 IIUC.. we are dropping support for Spark 
versions below 2.4 anyway.. they are a couple of years old at this point.. That 
said, we can think of providing more standard workarounds.. If one of you is 
interested in putting together a small blog that talks about how to make Spark 
2.3 or 2.2 work with master, that would be great 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1225: [MINOR] Adding util methods to assist in adding deletion support to Quick Start

2020-01-18 Thread GitBox
vinothchandar commented on issue #1225: [MINOR] Adding util methods to assist 
in adding deletion support to Quick Start
URL: https://github.com/apache/incubator-hudi/pull/1225#issuecomment-575925772
 
 
   Goes both ways, PMC/committers should check. 
   
   But it's clearly documented here 
https://hudi.apache.org/contributing.html#contributing-code that the 
contributor should squash .. Please read this more carefully


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1230: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType

2020-01-18 Thread GitBox
lamber-ken commented on issue #1230: java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType
URL: https://github.com/apache/incubator-hudi/issues/1230#issuecomment-575925039
 
 
   Could the `maven-shade` plugin reduce this effect? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1228: No FileSystem for scheme: abfss

2020-01-18 Thread GitBox
vinothchandar commented on issue #1228: No FileSystem for scheme: abfss
URL: https://github.com/apache/incubator-hudi/issues/1228#issuecomment-575924686
 
 
   >The problem is a blank Hadoop Configuration is passed in in 
HoodieROTablePathFilter so it never picks up any settings in my hadoop 
environment.
   
   I think you are referring to 
   ```
   if (fs == null) {
     fs = path.getFileSystem(new Configuration());
   }
   ```
   
   the path filter is instantiated by the query engine.. and if it does not add 
all the configs to class path, it will be empty.. Let me triage this and move 
it to JIRA


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar closed issue #1230: java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType

2020-01-18 Thread GitBox
vinothchandar closed issue #1230: java.lang.NoSuchMethodError: 
org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType
URL: https://github.com/apache/incubator-hudi/issues/1230
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar closed issue #1070: How to do the bulk update .?

2020-01-18 Thread GitBox
vinothchandar closed issue #1070: How to do the bulk update .?
URL: https://github.com/apache/incubator-hudi/issues/1070
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar closed issue #1061: Hudi output just parquet even-though input is snappy.parquet

2020-01-18 Thread GitBox
vinothchandar closed issue #1061: Hudi output just parquet even-though input is 
snappy.parquet
URL: https://github.com/apache/incubator-hudi/issues/1061
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1225: [MINOR] Adding util methods to assist in adding deletion support to Quick Start

2020-01-18 Thread GitBox
nsivabalan commented on issue #1225: [MINOR] Adding util methods to assist in 
adding deletion support to Quick Start
URL: https://github.com/apache/incubator-hudi/pull/1225#issuecomment-575924333
 
 
   @vinothchandar : yeah, I thought the person who is merging the PR would 
squash. I didn't know that the author of the PR is expected to squash.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018664#comment-17018664
 ] 

Vinoth Chandar commented on HUDI-538:
-

+1 [~yanghua] , I added a second task for moving classes around based on your 
changes..

Core issue we need a solution for IMO is the following .. (if we solve this, 
rest is more or less easy)  I will illustrate using Spark (since my 
understanding of Flink is somewhat limited atm) ..

 

So, even for Spark I would like the writing to be done via _RDD_ or 
_DataFrame_ routes, but the current code converts the dataframe into RDDs to 
perform writes. This has some performance side-effects (surprisingly, :P) 

 

1) If you take a single class like _HoodieWriteClient_, then it currently does 
something like `hoodieRecordRDD.map().sort()` internally.. if we want to 
support Flink DataStream or Spark DataFrame as the object, then we need to 
somehow define an abstraction like `HoodieExecutionContext`  which will have 
a common set of map(T) -> T, sortBy(T) -> T, filter(), repartition() methods? 
There will be subclasses like _HoodieSparkRDDExecutionContext,_ 
_HoodieSparkDataFrameExecutionContext_, 
_HoodieFlinkDataStreamExecutionContext_ which will implement them 
in engine specific ways and hand back the transformed T object? 

 

2) Right now, we work with _HoodieRecord_, as the record level abstraction.. 
i.e we eagerly parse the input into a HoodieKey (String recordKey, String 
partitionPath) and HoodieRecordPayload. The key is needed during indexing, and 
the payload is needed to precombine duplicates within a batch (may be spark 
specific)/combine incoming record with whats stored in the table during 
writing.. We need a way to do these lazily by pushing the key extraction 
function into the entire writing path. 

 

I think we should deeply think about these issues.. have concrete approaches 
before we embark more deeply.. We will hit these issues.. 
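
To make point 1 above concrete, here is a purely illustrative sketch of such an abstraction; the trait name and the method set mirror the comment, `T` stands for the engine's collection handle (RDD, DataFrame, DataStream) and `R` for the record type, and none of this is an agreed design:

```
// T = engine-specific handle (e.g. RDD[R], Dataset[R], DataStream[R]); R = record type.
// Engine-specific subclasses (HoodieSparkRDDExecutionContext, HoodieSparkDataFrameExecutionContext,
// HoodieFlinkDataStreamExecutionContext, ...) would implement these in engine-specific
// ways and hand back the transformed handle.
trait HoodieExecutionContext[R, T] {
  def map(data: T)(fn: R => R): T
  def sortBy[K: Ordering](data: T)(key: R => K): T
  def filter(data: T)(predicate: R => Boolean): T
  def repartition(data: T, parallelism: Int): T
}
```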

 

 

 

 

 

 

 

 

 

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018664#comment-17018664
 ] 

Vinoth Chandar edited comment on HUDI-538 at 1/18/20 6:05 PM:
--

+1 [~yanghua] , I added a second task for moving classes around based on your 
changes..

Core issue we need a solution for IMO is the following .. (if we solve this, 
rest is more or less easy)  I will illustrate using Spark (since my 
understanding of Flink is somewhat limited atm) ..

 

So, even for Spark I would like the writing to be done via _RDD_ or 
_DataFrame_ routes, but the current code converts the dataframe into RDDs to 
perform writes. This has some performance side-effects (surprisingly, :P) 

 

1) If you take a single class like _HoodieWriteClient_, then it currently does 
something like `hoodieRecordRDD.map().sort()` internally.. if we want to 
support Flink DataStream or Spark DataFrame as the object, then we need to 
somehow define an abstraction like `HoodieExecutionContext`  which will have 
a common set of map(T) -> T, sortBy(T) -> T, filter(), repartition() methods? 
There will be subclasses like _HoodieSparkRDDExecutionContext,_ 
_HoodieSparkDataFrameExecutionContext_, 
_HoodieFlinkDataStreamExecutionContext_ which will implement them 
in engine specific ways and hand back the transformed T object? 

 

2) Right now, we work with _HoodieRecord_, as the record level abstraction.. 
i.e we eagerly parse the input into a HoodieKey (String recordKey, String 
partitionPath) and HoodieRecordPayload. The key is needed during indexing, and 
the payload is needed to precombine duplicates within a batch (may be spark 
specific)/combine incoming record with whats stored in the table during 
writing.. We need a way to do these lazily by pushing the key extraction 
function into the entire writing path. 

 

I think we should deeply think about these issues.. have concrete approaches 
before we embark more deeply.. We will hit these issues.. 

 

 

 


was (Author: vc):
+1 [~yanghua] , I added a second task for moving classes around based on your 
changes..

Core issue we need a solution for IMO is the following .. (if we solve this, 
rest is more or less easy)  I will illustrate using Spark (since my 
understanding of Flink is somewhat limited atm) ..

 

So, even for Spark I would like the writing to be done via _RDD_ or 
_DataFrame_ routes, but the current code converts the dataframe into RDDs to 
perform writes. This has some performance side-effects (surprisingly, :P) 

 

1) If you take a single class like _HoodieWriteClient_, then it currently does 
something like `hoodieRecordRDD.map().sort()` internally.. if we want to 
support Flink DataStream or Spark DataFrame as the object, then we need to 
somehow define an abstraction like `HoodieExecutionContext`  which will have 
a common set of map(T) -> T, sortBy(T) -> T, filter(), repartition() methods? 
There will be subclasses like _HoodieSparkRDDExecutionContext,_ 
_HoodieSparkDataFrameExecutionContext_, 
_HoodieFlinkDataStreamExecutionContext_ which will implement them 
in engine specific ways and hand back the transformed T object? 

 

2) Right now, we work with _HoodieRecord_, as the record level abstraction.. 
i.e we eagerly parse the input into a HoodieKey (String recordKey, String 
partitionPath) and HoodieRecordPayload. The key is needed during indexing, and 
the payload is needed to precombine duplicates within a batch (may be spark 
specific)/combine incoming record with whats stored in the table during 
writing.. We need a way to do these lazily by pushing the key extraction 
function into the entire writing path. 

 

I think we should deeply think about these issues.. have concrete approaches 
before we embark more deeply.. We will hit these issues.. 

 

 

 

 

 

 

 

 

 

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-538) Restructuring hudi client module for multi engine support

2020-01-18 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-538:

Component/s: Code Cleanup

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi (incubating)
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Priority: Major
>
> Hudi is currently tightly coupled with the Spark framework. It caused the 
> integration with other computing engine more difficult. We plan to decouple 
> it with Spark. This umbrella issue used to track this work.
> Some thoughts wrote here: 
> https://docs.google.com/document/d/1Q9w_4K6xzGbUrtTS0gAlzNYOmRXjzNUdbbe0q59PX9w/edit?usp=sharing
> The feature branch is {{restructure-hudi-client}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-554) Restructure code/packages to move more code back into hudi-writer-common

2020-01-18 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-554:
---

 Summary: Restructure code/packages  to move more code back into 
hudi-writer-common
 Key: HUDI-554
 URL: https://issues.apache.org/jira/browse/HUDI-554
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Code Cleanup
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1233: [HUDI-335] Improvements to DiskBasedMap used by ExternalSpillableMap,…

2020-01-18 Thread GitBox
vinothchandar commented on a change in pull request #1233: [HUDI-335] 
Improvements to DiskBasedMap used by ExternalSpillableMap,…
URL: https://github.com/apache/incubator-hudi/pull/1233#discussion_r368238975
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,411 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.log4j.Logger;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.nio.ByteBuffer;
+
+/**
+ * Use a private buffer for the read/write/seek operations of the 
RandomAccessFile
 
 Review comment:
   @n3nash @nbalajee is this our own implementation? 
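
For context, a rough sketch of the general buffering technique that javadoc line describes is below; it covers the read path only and is not the implementation under review:

```
import java.io.RandomAccessFile

// Bytes are served from `buf`; the underlying file is touched only when the
// requested position falls outside the currently buffered window.
class BufferedRandomReader(path: String, bufSize: Int = 1 << 16) extends AutoCloseable {
  private val file = new RandomAccessFile(path, "r")
  private val buf = new Array[Byte](bufSize)
  private var bufStart = 0L   // file offset of buf(0)
  private var bufLen = 0      // number of valid bytes currently in buf
  private var pos = 0L        // logical read position

  def seek(offset: Long): Unit = pos = offset   // cheap: no I/O until the next read

  def read(): Int = {
    if (pos < bufStart || pos >= bufStart + bufLen) refill()
    if (bufLen <= 0) return -1                  // end of file
    val b = buf((pos - bufStart).toInt) & 0xff
    pos += 1
    b
  }

  private def refill(): Unit = {
    file.seek(pos)
    bufStart = pos
    bufLen = file.read(buf)                     // -1 at EOF
  }

  override def close(): Unit = file.close()
}
```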


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-553) Building/Running Hudi on higher java versions

2020-01-18 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-553:
---

 Summary: Building/Running Hudi on higher java versions
 Key: HUDI-553
 URL: https://issues.apache.org/jira/browse/HUDI-553
 Project: Apache Hudi (incubating)
  Issue Type: Task
  Components: Usability
Reporter: Vinoth Chandar
 Fix For: 0.6.0


[https://github.com/apache/incubator-hudi/issues/1235] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1235: Compilation errors when following docker demo and building Hudi. Missing some java packages

2020-01-18 Thread GitBox
vinothchandar commented on issue #1235: Compilation errors when following 
docker demo and building Hudi. Missing some java packages 
URL: https://github.com/apache/incubator-hudi/issues/1235#issuecomment-575922045
 
 
   https://issues.apache.org/jira/browse/HUDI-553 to track this for longer 
term.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1159: [WIP][HUDI-479] Eliminate or Minimize use of Guava if possible

2020-01-18 Thread GitBox
vinothchandar commented on issue #1159: [WIP][HUDI-479] Eliminate or Minimize 
use of Guava if possible
URL: https://github.com/apache/incubator-hudi/pull/1159#issuecomment-575921472
 
 
   May need to rebase. Will do a full review post code freeze.. Trying to 
minimize large changes at last min 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (5471d8f -> 3f4966d)

2020-01-18 Thread smarthi
This is an automated email from the ASF dual-hosted git repository.

smarthi pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 5471d8f  [MINOR] Add toString method to TimelineLayoutVersion to make 
it more readable (#1244)
 add 3f4966d  [MINOR] Fix PMC in DOAP] (#1247)

No new revisions were added by this update.

Summary of changes:
 doap_HUDI.rdf | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] smarthi merged pull request #1247: [MINOR] Fix PMC in DOAP]

2020-01-18 Thread GitBox
smarthi merged pull request #1247: [MINOR] Fix PMC in DOAP]
URL: https://github.com/apache/incubator-hudi/pull/1247
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on issue #1247: [MINOR] Fix PMC in DOAP]

2020-01-18 Thread GitBox
smarthi commented on issue #1247: [MINOR] Fix PMC in DOAP]
URL: https://github.com/apache/incubator-hudi/pull/1247#issuecomment-575909446
 
 
   Merging this, very trivial change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi opened a new pull request #1247: [MINOR] Fix PMC in DOAP]

2020-01-18 Thread GitBox
smarthi opened a new pull request #1247: [MINOR] Fix PMC in DOAP]
URL: https://github.com/apache/incubator-hudi/pull/1247
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-552:

Labels: pull-request-available  (was: )

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical as the source Avro 
> schema.  Thus the resulting Avro records from the conversion have different 
> schema (only nullability difference) compared to the source schema file.  
> Before inserting the records, there are other operations using the source 
> schema file, causing failure of serialization/deserialization because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> !Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!





[GitHub] [incubator-hudi] yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread GitBox
yihua opened a new pull request #1246: [HUDI-552] Fix the schema mismatch in 
Row-to-Avro conversion
URL: https://github.com/apache/incubator-hudi/pull/1246
 
 
   ## What is the purpose of the pull request
   
   This PR addresses [HUDI-552](https://issues.apache.org/jira/browse/HUDI-552).  When using the `FilebasedSchemaProvider` to provide the source/target schema in Avro while ingesting data with the same columns from `RowSource`, the DeltaStreamer failed.  The root cause is that when writing parquet files, Spark automatically marks all fields as nullable for compatibility reasons.  If the source Avro schema has non-null fields, `AvroConversionUtils.createRdd` still uses the schema from the Dataframe to convert each Row to an Avro record, resulting in a schema that differs only in nullability.
   
   To fix this issue, the Avro schema, if present, is passed to the conversion function to reconstruct the correct StructType for the conversion.
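   A minimal sketch of that idea, assuming the spark-avro `SchemaConverters` API (Spark 2.4+) and using hypothetical helper names rather than the actual Hudi code:
   
```
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch only: rebuild the Catalyst schema from the provided Avro schema
// (which keeps non-null fields) instead of trusting df.schema, whose fields
// Spark may have relaxed to nullable.
object AvroSchemaAwareConversion {

  // Derive a StructType from an Avro schema string.
  def structTypeFor(avroSchemaJson: String): StructType =
    SchemaConverters.toSqlType(new Schema.Parser().parse(avroSchemaJson))
      .dataType.asInstanceOf[StructType]

  // Re-apply the Avro-derived schema before converting rows to Avro records.
  def withSourceSchema(spark: SparkSession, df: DataFrame, avroSchemaJson: String): DataFrame =
    spark.createDataFrame(df.rdd, structTypeFor(avroSchemaJson))
}
```
   
   In the actual change the schema is threaded through `DeltaSync.readFromSource` and `AvroConversionUtils.createRdd`, as listed in the change log below.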
   
   ## Brief change log
   
 - Passed the Avro schema to `createRdd` to generate the correct StructType 
for conversion in `DeltaSync.readFromSource` and `AvroConversionUtils.createRdd`
 - Added new tests to make sure the logic is correct (before this schema 
fix some of the new tests failed)
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - Added tests in `TestHoodieDeltaStreamer` to test the 
`HoodieDeltaStreamer` with `ParquetDFSSource` under different configurations
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] yihua commented on issue #1239: [HUDI-551] Abstract a test case class for DFS Source to make it extensible

2020-01-18 Thread GitBox
yihua commented on issue #1239: [HUDI-551] Abstract a test case class for DFS 
Source to make it extensible
URL: https://github.com/apache/incubator-hudi/pull/1239#issuecomment-575878732
 
 
   @yanghua The PR is ready for a final review.  I'll squash the commits before 
merging.




[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Description: 
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet 
files in Spark, all fields are automatically [converted to be 
nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
for compatibility reasons.  If the source Avro schema has non-null fields, 
`AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to 
convert each Row to an Avro record.  That `dataType` has nullable fields per 
Spark's logic, even though the field names are identical to those in the source 
Avro schema.  Thus the Avro records produced by the conversion have a schema 
that differs from the source schema file in nullability alone.  Before the 
records are inserted, other operations use the source schema file, and this 
schema mismatch causes serialization/deserialization failures.
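
A minimal sketch of the Parquet nullability behaviour described above (illustrative code only, not from the Hudi codebase; the field names and output path are hypothetical):

```
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object ParquetNullabilityDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("parquet-nullability").getOrCreate()

    // Two non-null fields, mirroring a source Avro schema without null unions.
    val schema = StructType(Seq(
      StructField("timestamp", DoubleType, nullable = false),
      StructField("_row_key", StringType, nullable = false)))
    val df = spark.createDataFrame(
      spark.sparkContext.parallelize(Seq(Row(0.0, "key1"))), schema)

    df.printSchema()   // both fields: nullable = false
    df.write.mode("overwrite").parquet("/tmp/nullability_demo")
    // After the Parquet round trip, both fields come back as nullable = true.
    spark.read.parquet("/tmp/nullability_demo").printSchema()

    spark.stop()
  }
}
```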

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

!Screen Shot 2020-01-18 at 12.31.23 AM.png|width=844,height=349!

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

  was:
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet 
files in Spark, all fields are automatically [converted to be 
nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
for compatibility reasons.  If the source Avro schema has non-null fields, 
`AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to 
convert the Row to Avro record.  The `dataType` has nullable fields based on 
Spark logic, even though the field names are identical as the source Avro 
schema.  Thus the resulting Avro records from the conversion have different 
schema (only nullability difference) compared to the source schema file.  
Before inserting the records, there are other operations using the source 
schema file, causing failure of serialization/deserialization because of this 
schema mismatch.

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

```
{"type":"record","name":"triprec","fields":[\{"name":"timestamp","type":"double"},\{"name":"_row_key","type":"string"},\{"name":"rider","type":"string"},\{"name":"driver","type":"string"},\{"name":"begin_lat","type":"double"},\{"name":"begin_lon","type":"double"},\{"name":"end_lat","type":"double"},\{"name":"end_lon","type":"double"},\{"name":"fare","type":{"type":"record","name":"fare","fields":[{"name":"amount","type":"double"},\{"name":"currency","type":"string"}]}},\{"name":"_hoodie_is_deleted","type":"boolean","default":false}]}

```

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!


> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null 

[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Attachment: Screen Shot 2020-01-18 at 12.31.23 AM.png

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png, 
> Screen Shot 2020-01-18 at 12.31.23 AM.png
>
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical as the source Avro 
> schema.  Thus the resulting Avro records from the conversion have different 
> schema (only nullability difference) compared to the source schema file.  
> Before inserting the records, there are other operations using the source 
> schema file, causing failure of serialization/deserialization because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> ```
> {"type":"record","name":"triprec","fields":[\{"name":"timestamp","type":"double"},\{"name":"_row_key","type":"string"},\{"name":"rider","type":"string"},\{"name":"driver","type":"string"},\{"name":"begin_lat","type":"double"},\{"name":"begin_lon","type":"double"},\{"name":"end_lat","type":"double"},\{"name":"end_lon","type":"double"},\{"name":"fare","type":{"type":"record","name":"fare","fields":[{"name":"amount","type":"double"},\{"name":"currency","type":"string"}]}},\{"name":"_hoodie_is_deleted","type":"boolean","default":false}]}
> ```
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!





[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Description: 
When using the `FilebasedSchemaProvider` to provide the source schema in Avro, 
while ingesting data from `ParquetDFSSource` with the same schema, the 
DeltaStreamer failed.  A new test case is added below to demonstrate the error:

!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

Based on further investigation, the root cause is that when writing parquet 
files in Spark, all fields are automatically [converted to be 
nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
for compatibility reasons.  If the source Avro schema has non-null fields, 
`AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe to 
convert each Row to an Avro record.  That `dataType` has nullable fields per 
Spark's logic, even though the field names are identical to those in the source 
Avro schema.  Thus the Avro records produced by the conversion have a schema 
that differs from the source schema file in nullability alone.  Before the 
records are inserted, other operations use the source schema file, and this 
schema mismatch causes serialization/deserialization failures.

 

The following screenshot shows the modified Avro schema in 
`AvroConversionUtils.createRdd`.  The original source schema file is:

```
{"type":"record","name":"triprec","fields":[\{"name":"timestamp","type":"double"},\{"name":"_row_key","type":"string"},\{"name":"rider","type":"string"},\{"name":"driver","type":"string"},\{"name":"begin_lat","type":"double"},\{"name":"begin_lon","type":"double"},\{"name":"end_lat","type":"double"},\{"name":"end_lon","type":"double"},\{"name":"fare","type":{"type":"record","name":"fare","fields":[{"name":"amount","type":"double"},\{"name":"currency","type":"string"}]}},\{"name":"_hoodie_is_deleted","type":"boolean","default":false}]}

```

 

!Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!

  was:
!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

!Screen Shot 2020-01-18 at 12.15.09 AM.png!


> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png
>
>
> When using the `FilebasedSchemaProvider` to provide the source schema in 
> Avro, while ingesting data from `ParquetDFSSource` with the same schema, the 
> DeltaStreamer failed.  A new test case is added below to demonstrate the 
> error:
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> Based on further investigation, the root cause is that when writing parquet 
> files in Spark, all fields are automatically [converted to be 
> nullable|https://spark.apache.org/docs/latest/sql-data-sources-parquet.html] 
> for compatibility reasons.  If the source Avro schema has non-null fields, 
> `AvroConversionUtils.createRdd` still uses the `dataType` from the Dataframe 
> to convert the Row to Avro record.  The `dataType` has nullable fields based 
> on Spark logic, even though the field names are identical as the source Avro 
> schema.  Thus the resulting Avro records from the conversion have different 
> schema (only nullability difference) compared to the source schema file.  
> Before inserting the records, there are other operations using the source 
> schema file, causing failure of serialization/deserialization because of this 
> schema mismatch.
>  
> The following screenshot shows the modified Avro schema in 
> `AvroConversionUtils.createRdd`.  The original source schema file is:
> ```
> {"type":"record","name":"triprec","fields":[\{"name":"timestamp","type":"double"},\{"name":"_row_key","type":"string"},\{"name":"rider","type":"string"},\{"name":"driver","type":"string"},\{"name":"begin_lat","type":"double"},\{"name":"begin_lon","type":"double"},\{"name":"end_lat","type":"double"},\{"name":"end_lon","type":"double"},\{"name":"fare","type":{"type":"record","name":"fare","fields":[{"name":"amount","type":"double"},\{"name":"currency","type":"string"}]}},\{"name":"_hoodie_is_deleted","type":"boolean","default":false}]}
> ```
>  
> !Screen Shot 2020-01-18 at 12.15.09 AM.png|width=850,height=471!




[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Description: 
!Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!

!Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!

!Screen Shot 2020-01-18 at 12.15.09 AM.png!

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png
>
>
> !Screen Shot 2020-01-18 at 12.12.58 AM.png|width=543,height=392!
> !Screen Shot 2020-01-18 at 12.13.08 AM.png|width=546,height=165!
> !Screen Shot 2020-01-18 at 12.15.09 AM.png!





[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Attachment: Screen Shot 2020-01-18 at 12.15.09 AM.png

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png
>
>






[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Attachment: Screen Shot 2020-01-18 at 12.12.58 AM.png

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png
>
>






[jira] [Updated] (HUDI-552) Fix the schema mismatch in Row-to-Avro conversion

2020-01-18 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-552:
---
Attachment: Screen Shot 2020-01-18 at 12.13.08 AM.png

> Fix the schema mismatch in Row-to-Avro conversion
> -
>
> Key: HUDI-552
> URL: https://issues.apache.org/jira/browse/HUDI-552
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2020-01-18 at 12.12.58 AM.png, Screen Shot 
> 2020-01-18 at 12.13.08 AM.png, Screen Shot 2020-01-18 at 12.15.09 AM.png
>
>



