[GitHub] [incubator-hudi] xushiyan commented on pull request #1572: [HUDI-836] Implement datadog metrics reporter

2020-05-10 Thread GitBox


xushiyan commented on pull request #1572:
URL: https://github.com/apache/incubator-hudi/pull/1572#issuecomment-626460368


   @yanghua In the last commit, I added a new config class 
`HoodieMetricsDatadogConfig` just for Datadog-related configs and reverted the 
previous change to `HoodieMetricsConfig`. I think it'd be better to refactor 
the configs of the two other reporter types in a separate PR to minimize risk. If 
the new config class looks good, I'll do a separate PR for the other two, 
probably with some test cases, too.
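
   For readers following the config discussion, below is a minimal, self-contained sketch of what a reporter-specific config class can look like. The class name suffix, property keys, and defaults are illustrative assumptions, not necessarily the contents of the actual `HoodieMetricsDatadogConfig`.

```java
import java.util.Properties;

// Hedged sketch only: a dedicated config holder for one reporter type,
// kept separate from the generic metrics configs.
public class DatadogReporterConfigSketch {

  // Hypothetical property keys (illustrative, not confirmed Hudi keys).
  public static final String DATADOG_API_SITE = "hoodie.metrics.datadog.api.site"; // e.g. "US" or "EU"
  public static final String DATADOG_API_KEY = "hoodie.metrics.datadog.api.key";
  public static final String DATADOG_REPORT_PERIOD_SECONDS = "hoodie.metrics.datadog.report.period.seconds";

  private final Properties props;

  public DatadogReporterConfigSketch(Properties props) {
    this.props = props;
  }

  public String getApiSite() {
    return props.getProperty(DATADOG_API_SITE, "US");
  }

  public String getApiKey() {
    return props.getProperty(DATADOG_API_KEY);
  }

  public int getReportPeriodSeconds() {
    return Integer.parseInt(props.getProperty(DATADOG_REPORT_PERIOD_SECONDS, "30"));
  }
}
```

   Keeping the Datadog keys in their own class is what lets the two existing reporter types be refactored later without touching this one.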



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #1572: [HUDI-836] Implement datadog metrics reporter

2020-05-10 Thread GitBox


xushiyan commented on a change in pull request #1572:
URL: https://github.com/apache/incubator-hudi/pull/1572#discussion_r422769043



##
File path: 
hudi-client/src/test/java/org/apache/hudi/metrics/datadog/TestDatadogHttpClient.java
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.datadog;
+
+import org.apache.hudi.metrics.datadog.DatadogHttpClient.ApiSite;
+
+import org.apache.http.StatusLine;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.impl.client.CloseableHttpClient;
+import org.apache.log4j.AppenderSkeleton;
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.log4j.spi.LoggingEvent;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+import org.mockito.ArgumentCaptor;
+import org.mockito.Captor;
+import org.mockito.Mock;
+import org.mockito.junit.jupiter.MockitoExtension;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.verify;
+import static org.mockito.Mockito.when;
+
+@ExtendWith(MockitoExtension.class)
+public class TestDatadogHttpClient {
+
+  @Mock
+  AppenderSkeleton appender;
+
+  @Captor
+  ArgumentCaptor logCaptor;
+
+  @Mock
+  CloseableHttpClient httpClient;
+
+  @Mock
+  CloseableHttpResponse httpResponse;
+
+  @Mock
+  StatusLine statusLine;
+
+  private void mockResponse(int statusCode) {
+    when(statusLine.getStatusCode()).thenReturn(statusCode);
+    when(httpResponse.getStatusLine()).thenReturn(statusLine);
+    try {
+      when(httpClient.execute(any())).thenReturn(httpResponse);
+    } catch (IOException e) {
+      fail(e.getMessage(), e);
+    }
+  }
+
+  @Test
+  public void validateApiKey_shouldThrowException_whenRequestFailed() throws IOException {

Review comment:
   Yup, changed to camelCase naming.
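
   For clarity on the naming change being discussed, here is a tiny illustrative JUnit 5 sketch contrasting the two conventions. Both method bodies are empty placeholders, and the second name is an assumed example of the prevailing style, not the actual renamed method.

```java
import org.junit.jupiter.api.Test;

// Illustration only: the two test-naming conventions under discussion.
public class NamingStyleSketch {

  // Underscore/BDD style, as originally written in the quoted diff:
  @Test
  public void validateApiKey_shouldThrowException_whenRequestFailed() { }

  // camelCase ("hump") style, the unified convention referred to in the comment:
  @Test
  public void testValidateApiKeyThrowsExceptionWhenRequestFailed() { }
}
```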





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #274

2020-05-10 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.38 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Assigned] (HUDI-877) Restructure hudi-hive-sync implement hudi-common-sync

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-877:
--

Assignee: liwei

> Restructure hudi-hive-sync  implement  hudi-common-sync
> ---
>
> Key: HUDI-877
> URL: https://issues.apache.org/jira/browse/HUDI-877
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-878) hudi-spark support the multi catalog conf

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-878:
--

Assignee: liwei

> hudi-spark support the multi catalog conf
> -
>
> Key: HUDI-878
> URL: https://issues.apache.org/jira/browse/HUDI-878
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-879) hudi-utilities add the hudi-common-sync metrics

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-879:
--

Assignee: liwei

> hudi-utilities add the hudi-common-sync metrics
> ---
>
> Key: HUDI-879
> URL: https://issues.apache.org/jira/browse/HUDI-879
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Utilities
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-876) Restructure code/packages to implement hudi-common-sync

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-876:
--

Assignee: liwei

> Restructure code/packages to implement hudi-common-sync
> ---
>
> Key: HUDI-876
> URL: https://issues.apache.org/jira/browse/HUDI-876
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-875) Introduce a new pom module named hudi-common-sync

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-875:
--

Assignee: liwei

> Introduce a new pom module named hudi-common-sync
> -
>
> Key: HUDI-875
> URL: https://issues.apache.org/jira/browse/HUDI-875
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-879) hudi-utilities add the hudi-common-sync metrics

2020-05-10 Thread liwei (Jira)
liwei created HUDI-879:
--

 Summary: hudi-utilities add the hudi-common-sync metrics
 Key: HUDI-879
 URL: https://issues.apache.org/jira/browse/HUDI-879
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Utilities
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-877) Restructure hudi-hive-sync implement hudi-common-sync

2020-05-10 Thread liwei (Jira)
liwei created HUDI-877:
--

 Summary: Restructure hudi-hive-sync  implement  hudi-common-sync
 Key: HUDI-877
 URL: https://issues.apache.org/jira/browse/HUDI-877
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-878) hudi-spark support the multi catalog conf

2020-05-10 Thread liwei (Jira)
liwei created HUDI-878:
--

 Summary: hudi-spark support the multi catalog conf
 Key: HUDI-878
 URL: https://issues.apache.org/jira/browse/HUDI-878
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Spark Integration
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-876) Restructure code/packages to implement hudi-common-sync

2020-05-10 Thread liwei (Jira)
liwei created HUDI-876:
--

 Summary: Restructure code/packages to implement hudi-common-sync
 Key: HUDI-876
 URL: https://issues.apache.org/jira/browse/HUDI-876
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-875) Introduce a new pom module named hudi-common-sync

2020-05-10 Thread liwei (Jira)
liwei created HUDI-875:
--

 Summary: Introduce a new pom module named hudi-common-sync
 Key: HUDI-875
 URL: https://issues.apache.org/jira/browse/HUDI-875
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1572: [HUDI-836] Implement datadog metrics reporter

2020-05-10 Thread GitBox


yanghua commented on a change in pull request #1572:
URL: https://github.com/apache/incubator-hudi/pull/1572#discussion_r422750808



##
File path: 
hudi-client/src/test/java/org/apache/hudi/metrics/datadog/TestDatadogHttpClient.java
##
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metrics.datadog;
+
+import org.apache.hudi.metrics.datadog.DatadogHttpClient.ApiSite;
+
+import org.apache.http.StatusLine;
+import org.apache.http.client.methods.CloseableHttpResponse;
+import org.apache.http.impl.client.CloseableHttpClient;
+import org.apache.log4j.AppenderSkeleton;
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.log4j.spi.LoggingEvent;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+import org.mockito.ArgumentCaptor;
+import org.mockito.Captor;
+import org.mockito.Mock;
+import org.mockito.junit.jupiter.MockitoExtension;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertThrows;
+import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.api.Assertions.fail;
+import static org.mockito.ArgumentMatchers.any;
+import static org.mockito.Mockito.verify;
+import static org.mockito.Mockito.when;
+
+@ExtendWith(MockitoExtension.class)
+public class TestDatadogHttpClient {
+
+  @Mock
+  AppenderSkeleton appender;
+
+  @Captor
+  ArgumentCaptor logCaptor;
+
+  @Mock
+  CloseableHttpClient httpClient;
+
+  @Mock
+  CloseableHttpResponse httpResponse;
+
+  @Mock
+  StatusLine statusLine;
+
+  private void mockResponse(int statusCode) {
+    when(statusLine.getStatusCode()).thenReturn(statusCode);
+    when(httpResponse.getStatusLine()).thenReturn(statusLine);
+    try {
+      when(httpClient.execute(any())).thenReturn(httpResponse);
+    } catch (IOException e) {
+      fail(e.getMessage(), e);
+    }
+  }
+
+  @Test
+  public void validateApiKey_shouldThrowException_whenRequestFailed() throws IOException {

Review comment:
   I guess it may be a common practice. I am not against this style; it looks 
good. But for now, it would be better to keep a unified style, right?
   
   Of course, you can start a DISCUSS thread on the dev ML.
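
   For context on the `mockResponse` helper quoted above, here is a self-contained sketch of the same mock-response pattern using only stock JUnit 5, Mockito, and Apache HttpClient 4.x APIs. The inline status check is a hypothetical stand-in for Hudi's actual `DatadogHttpClient#validateApiKey` logic, which is not shown here.

```java
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.mockito.ArgumentMatchers.any;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

import org.apache.http.StatusLine;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.CloseableHttpClient;
import org.junit.jupiter.api.Test;

// Standalone sketch; all names here are illustrative.
public class MockResponsePatternSketch {

  @Test
  public void request_shouldThrow_whenServerReturnsError() throws Exception {
    CloseableHttpClient httpClient = mock(CloseableHttpClient.class);
    CloseableHttpResponse httpResponse = mock(CloseableHttpResponse.class);
    StatusLine statusLine = mock(StatusLine.class);

    // Same wiring as the mockResponse(...) helper in the quoted diff.
    when(statusLine.getStatusCode()).thenReturn(500);
    when(httpResponse.getStatusLine()).thenReturn(statusLine);
    when(httpClient.execute(any(HttpUriRequest.class))).thenReturn(httpResponse);

    // Hypothetical caller that treats any non-2xx status as a failure,
    // standing in for the real client's validation logic.
    assertThrows(IllegalStateException.class, () -> {
      int code = httpClient.execute(new HttpGet("https://example.invalid/validate"))
          .getStatusLine().getStatusCode();
      if (code < 200 || code >= 300) {
        throw new IllegalStateException("validation failed with HTTP " + code);
      }
    });
  }
}
```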





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-600:
---

Assignee: Balaji Varadarajan

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ```
>  
> [~varadarb] any ideas about this ?
>  
> [~thesquelched] fyi



--
This message was sent 

[jira] [Commented] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103994#comment-17103994
 ] 

lamber-ken commented on HUDI-600:
-

[~vbalaji] good job! left one minor comment.

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ```
>  
> [~varadarb] any ideas about this ?
>  
> [~thesquelched] fyi



--
This message was 

[incubator-hudi] branch master updated: [HUDI-820] cleaner repair command should only inspect clean metadata files (#1542)

2020-05-10 Thread lamberken
This is an automated email from the ASF dual-hosted git repository.

lamberken pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d0e231  [HUDI-820] cleaner repair command should only inspect clean metadata files (#1542)
8d0e231 is described below

commit 8d0e23173b3f21d7cdb24942913eb1132a111a32
Author: Balaji Varadarajan 
AuthorDate: Sun May 10 18:25:54 2020 -0700

[HUDI-820] cleaner repair command should only inspect clean metadata files (#1542)
---
 .../apache/hudi/cli/commands/RepairsCommand.java| 21 +++--
 .../hudi/cli/commands/TestRepairsCommand.java   |  4 ++--
 .../apache/hudi/common/HoodieTestDataGenerator.java | 13 -
 3 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
index 0af9ff2..7b859c2 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
@@ -27,9 +27,11 @@ import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodiePartitionMetadata;
 import org.apache.hudi.common.table.HoodieTableConfig;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
-import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
 import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.exception.HoodieIOException;
 
+import org.apache.avro.AvroRuntimeException;
 import org.apache.hadoop.fs.Path;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.log4j.Logger;
@@ -171,14 +173,21 @@ public class RepairsCommand implements CommandMarker {
   public void removeCorruptedPendingCleanAction() {
 
     HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
-    HoodieActiveTimeline activeTimeline = HoodieCLI.getTableMetaClient().getActiveTimeline();
-
-    activeTimeline.filterInflightsAndRequested().getInstants().forEach(instant -> {
+    HoodieTimeline cleanerTimeline = HoodieCLI.getTableMetaClient().getActiveTimeline().getCleanerTimeline();
+    LOG.info("Inspecting pending clean metadata in timeline for corrupted files");
+    cleanerTimeline.filterInflightsAndRequested().getInstants().forEach(instant -> {
       try {
         CleanerUtils.getCleanerPlan(client, instant);
-      } catch (IOException e) {
-        LOG.warn("try to remove corrupted instant file: " + instant);
+      } catch (AvroRuntimeException e) {
+        LOG.warn("Corruption found. Trying to remove corrupted clean instant file: " + instant);
         FSUtils.deleteInstantFile(client.getFs(), client.getMetaPath(), instant);
+      } catch (IOException ioe) {
+        if (ioe.getMessage().contains("Not an Avro data file")) {
+          LOG.warn("Corruption found. Trying to remove corrupted clean instant file: " + instant);
+          FSUtils.deleteInstantFile(client.getFs(), client.getMetaPath(), instant);
+        } else {
+          throw new HoodieIOException(ioe.getMessage(), ioe);
+        }
       }
     });
   }
diff --git a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java
index 9fd44b4..452e249 100644
--- a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java
+++ b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java
@@ -188,8 +188,8 @@ public class TestRepairsCommand extends AbstractShellIntegrationTest {
     // Create four requested files
     for (int i = 100; i < 104; i++) {
       String timestamp = String.valueOf(i);
-      // Write corrupted requested Compaction
-      HoodieTestCommitMetadataGenerator.createCompactionRequestedFile(tablePath, timestamp, conf);
+      // Write corrupted requested Clean File
+      HoodieTestCommitMetadataGenerator.createEmptyCleanRequestedFile(tablePath, timestamp, conf);
     }
 
     // reload meta client
diff --git a/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java b/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
index 65b567c..9a55ade 100644
--- a/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
+++ b/hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
@@ -315,12 +315,23 @@ public class HoodieTestDataGenerator {
     });
   }
 
+  public static void createEmptyCleanRequestedFile(String basePath, String instantTime, Configuration configuration)
+      throws IOException {
+    Path commitFile = new Path(basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME + "/"
+        + HoodieTimeline.makeRequestedCleanerFileName(instantTime));
+

[GitHub] [incubator-hudi] lamber-ken commented on pull request #1542: [HUDI-820] cleaner repair command should only inspect clean metadata files

2020-05-10 Thread GitBox


lamber-ken commented on pull request #1542:
URL: https://github.com/apache/incubator-hudi/pull/1542#issuecomment-626420797


   Hi @bvaradar, the unit test 
`TestCleaner#testCleanPreviousCorruptedCleanFiles` already covers this case. IMO, 
`TestRepairsCommand.java` is redundant, WDYT?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1542: [HUDI-820] cleaner repair command should only inspect clean metadata files

2020-05-10 Thread GitBox


lamber-ken commented on a change in pull request #1542:
URL: https://github.com/apache/incubator-hudi/pull/1542#discussion_r422728172



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##
@@ -147,14 +149,16 @@ public String overwriteHoodieProperties(
   public void removeCorruptedPendingCleanAction() {
 
     HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
-    HoodieActiveTimeline activeTimeline = HoodieCLI.getTableMetaClient().getActiveTimeline();
-
-    activeTimeline.filterInflightsAndRequested().getInstants().forEach(instant -> {
+    HoodieTimeline cleanerTimeline = HoodieCLI.getTableMetaClient().getActiveTimeline().getCleanerTimeline();
+    LOG.info("Inspecting pending clean metadata in timeline for corrupted files");
+    cleanerTimeline.filterInflightsAndRequested().getInstants().forEach(instant -> {
       try {
         CleanerUtils.getCleanerPlan(client, instant);
-      } catch (IOException e) {
-        LOG.warn("try to remove corrupted instant file: " + instant);
+      } catch (AvroRuntimeException e) {

Review comment:
    





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io commented on pull request #1612: [HUDI-528] Handle empty commit in incremental pulling

2020-05-10 Thread GitBox


codecov-io commented on pull request #1612:
URL: https://github.com/apache/incubator-hudi/pull/1612#issuecomment-626417448


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=h1) 
Report
   > Merging 
[#1612](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/fa6aba751d8de16d9d109a8cfc21150b17b59cff=desc)
 will **increase** coverage by `0.02%`.
   > The diff coverage is `90.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1612/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1612      +/-   ##
   ============================================
   + Coverage     71.78%   71.81%   +0.02%     
     Complexity     1087     1087             
   ============================================
     Files           385      385             
     Lines         16575    16578       +3     
     Branches       1668     1669       +1     
   ============================================
   + Hits          11899    11906       +7     
   + Misses         3947     3944       -3     
   + Partials        729      728       -1     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1612/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==)
 | `72.30% <90.00%> (-0.28%)` | `0.00 <0.00> (ø)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1612/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `76.92% <0.00%> (+0.96%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1612/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=)
 | `55.15% <0.00%> (+1.81%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...in/scala/org/apache/hudi/AvroConversionUtils.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1612/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvQXZyb0NvbnZlcnNpb25VdGlscy5zY2FsYQ==)
 | `58.33% <0.00%> (+4.16%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=footer).
 Last update 
[fa6aba7...de5e4cd](https://codecov.io/gh/apache/incubator-hudi/pull/1612?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] umehrot2 commented on issue #1581: [SUPPORT] Hive Metastore not in sync with Hudi Dataset using DataSource API

2020-05-10 Thread GitBox


umehrot2 commented on issue #1581:
URL: https://github.com/apache/incubator-hudi/issues/1581#issuecomment-626417498


   @bvaradar yes, this has been a known issue for some time, and has been 
discussed previously on Slack as well. The Glue catalog does not currently support 
**cascade** for **alter table** statements. We discussed this internally 
before, and @bschell has started looking into this issue. The most probable 
solution as of now seems to be to run the alter command for each partition when 
using the Glue catalog. However, he is still investigating further. I noticed there 
was no JIRA for this issue, so I created one: 
https://jira.apache.org/jira/browse/HUDI-874.
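
   As a rough illustration of that per-partition workaround, here is a sketch over Hive JDBC. It assumes the catalog accepts partition-level `ADD COLUMNS`; the JDBC URL, table, column, and partition specs are made up.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

// Hedged sketch only: not the actual hudi-hive-sync implementation.
public class PerPartitionAlterSketch {

  public static void addColumn(String jdbcUrl, String table,
                               String columnDef, List<String> partitionSpecs) throws Exception {
    try (Connection conn = DriverManager.getConnection(jdbcUrl);
         Statement stmt = conn.createStatement()) {
      // Without CASCADE support, this only updates table-level metadata.
      stmt.execute("ALTER TABLE " + table + " ADD COLUMNS (" + columnDef + ")");
      // So repeat the change for every existing partition, e.g. "datestr='2020-05-10'".
      for (String spec : partitionSpecs) {
        stmt.execute("ALTER TABLE " + table + " PARTITION (" + spec + ") ADD COLUMNS (" + columnDef + ")");
      }
    }
  }
}
```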



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-874) Schema evolution does not work with AWS Glue catalog

2020-05-10 Thread Udit Mehrotra (Jira)
Udit Mehrotra created HUDI-874:
--

 Summary: Schema evolution does not work with AWS Glue catalog
 Key: HUDI-874
 URL: https://issues.apache.org/jira/browse/HUDI-874
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Hive Integration
Reporter: Udit Mehrotra


This issue has been discussed here 
[https://github.com/apache/incubator-hudi/issues/1581] and in other places as 
well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* 
statements. As a result, features like adding new columns to an existing table 
do not work with the Glue catalog.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-575:

Description: 
Currently, only inline compaction is supported for Structured Streaming writes. 

 

We need to 
 * Enable configuring async compaction for streaming writes 
 * Implement a parallel compaction process like we did for delta streamer

  was:Currently, only inline compaction is supported for Structured Streaming writes. 


> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Prasanna Rajaperumal
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for Structured Streaming writes. 
>  
> We need to 
>  * Enable configuring async compaction for streaming writes 
>  * Implement a parallel compaction process like we did for delta streamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-575:
---

Assignee: Prasanna Rajaperumal

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Prasanna Rajaperumal
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for Structured Streaming writes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-194) Support for writing Iceberg metadata on Hoodie RO tables

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-194:

Description: The basic idea here is to map Hudi WriteStatus objects into what 
Iceberg needs to maintain metadata. Additionally, we need to enhance the 
WriteStat to collect range/null stats information on columns and feed that into 
Iceberg as well.

> Support for writing Iceberg metadata on Hoodie RO tables
> 
>
> Key: HUDI-194
> URL: https://issues.apache.org/jira/browse/HUDI-194
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prasanna Rajaperumal
>Priority: Major
>
> The basic idea here is to map Hudi WriteStatus objects into what Iceberg needs to 
> maintain metadata. Additionally, we need to enhance the WriteStat to collect 
> range/null stats information on columns and feed that into Iceberg as well.
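
As a side note on the stats-collection part, below is a generic sketch of per-column range/null accumulation. All names are made up; this is neither Hudi's WriteStat nor Iceberg's metadata API.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the kind of per-column stats a writer could accumulate
// per file and later hand to a table format's metadata layer.
public class ColumnStatsSketch {

  static final class Stats {
    Comparable<Object> min;
    Comparable<Object> max;
    long nullCount;
  }

  private final Map<String, Stats> statsByColumn = new HashMap<>();

  @SuppressWarnings("unchecked")
  public void observe(String column, Object value) {
    Stats s = statsByColumn.computeIfAbsent(column, k -> new Stats());
    if (value == null) {
      s.nullCount++;
      return;
    }
    Comparable<Object> v = (Comparable<Object>) value;
    if (s.min == null || v.compareTo(s.min) < 0) {
      s.min = v;
    }
    if (s.max == null || v.compareTo(s.max) > 0) {
      s.max = v;
    }
  }

  public Map<String, Stats> snapshot() {
    return statsByColumn;
  }
}
```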



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-194) Support for writing Iceberg metadata on Hoodie RO tables

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-194:
---

Assignee: Prasanna Rajaperumal  (was: Balaji Varadarajan)

> Support for writing Iceberg metadata on Hoodie RO tables
> 
>
> Key: HUDI-194
> URL: https://issues.apache.org/jira/browse/HUDI-194
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prasanna Rajaperumal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-528:

Status: In Progress  (was: Open)

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}
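
A minimal sketch of the guard implied by this report, in plain Java with hypothetical types (Hudi's actual fix lives in `IncrementalRelation.scala`): skip commits whose write-stats map is empty before picking a file path to read the schema from, instead of calling `next()` on a possibly empty iterator.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in: only the shape of partitionToWriteStats matters here.
class CommitMetadataSketch {
  Map<String, List<String>> partitionToWriteStats = new HashMap<>();
}

public class SchemaFileResolverSketch {

  // Pick the first data-file path, skipping empty commits entirely.
  static Optional<String> firstDataFilePath(List<CommitMetadataSketch> commitsNewestFirst) {
    return commitsNewestFirst.stream()
        .filter(c -> !c.partitionToWriteStats.isEmpty())
        .flatMap(c -> c.partitionToWriteStats.values().stream().flatMap(List::stream))
        .findFirst();
  }
}
```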



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-528:

Labels: bug-bash-0.6.0 help-requested pull-request-available  (was: 
bug-bash-0.6.0 help-requested)

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] garyli1019 opened a new pull request #1612: HUDI-528 Handle empty commit in incremental pulling

2020-05-10 Thread GitBox


garyli1019 opened a new pull request #1612:
URL: https://github.com/apache/incubator-hudi/pull/1612


   ## What is the purpose of the pull request
   
   https://issues.apache.org/jira/browse/HUDI-528
   
   ## Brief change log
   
 - Avoid loading empty instant in `IncrementalRelation.scala`
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - Added an empty data frame test case in TestDataSource
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch master updated: [MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics (#1606)

2020-05-10 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f92b9fd  [MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics (#1606)
f92b9fd is described below

commit f92b9fdcc4d7c0dff9953e42205fca22472525c9
Author: vinoth chandar 
AuthorDate: Sun May 10 16:23:26 2020 -0700

[MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics (#1606)
---
 .../apache/hudi/metrics/TestHoodieJmxMetrics.java  |  5 ++--
 .../hudi/common/minicluster/HdfsTestService.java   | 19 
 .../hudi/common/testutils/NetworkTestUtils.java| 35 ++
 3 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
index 063f64e..7b63a30 100644
--- a/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
+++ b/hudi-client/src/test/java/org/apache/hudi/metrics/TestHoodieJmxMetrics.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.metrics;
 
+import org.apache.hudi.common.testutils.NetworkTestUtils;
 import org.apache.hudi.config.HoodieWriteConfig;
 
 import org.junit.jupiter.api.Test;
@@ -39,7 +40,7 @@ public class TestHoodieJmxMetrics {
     when(config.isMetricsOn()).thenReturn(true);
     when(config.getMetricsReporterType()).thenReturn(MetricsReporterType.JMX);
     when(config.getJmxHost()).thenReturn("localhost");
-    when(config.getJmxPort()).thenReturn("9889");
+    when(config.getJmxPort()).thenReturn(String.valueOf(NetworkTestUtils.nextFreePort()));
     new HoodieMetrics(config, "raw_table");
     registerGauge("jmx_metric1", 123L);
     assertEquals("123", Metrics.getInstance().getRegistry().getGauges()
@@ -51,7 +52,7 @@ public class TestHoodieJmxMetrics {
     when(config.isMetricsOn()).thenReturn(true);
     when(config.getMetricsReporterType()).thenReturn(MetricsReporterType.JMX);
     when(config.getJmxHost()).thenReturn("localhost");
-    when(config.getJmxPort()).thenReturn("1000-5000");
+    when(config.getJmxPort()).thenReturn(String.valueOf(NetworkTestUtils.nextFreePort()));
     new HoodieMetrics(config, "raw_table");
     registerGauge("jmx_metric2", 123L);
     assertEquals("123", Metrics.getInstance().getRegistry().getGauges()
diff --git a/hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java b/hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
index d331a17..00e6e3c 100644
--- a/hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
+++ b/hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
@@ -19,8 +19,8 @@
 package org.apache.hudi.common.minicluster;
 
 import org.apache.hudi.common.model.HoodieTestUtils;
+import org.apache.hudi.common.testutils.NetworkTestUtils;
 import org.apache.hudi.common.util.FileIOUtils;
-import org.apache.hudi.exception.HoodieIOException;
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
@@ -31,7 +31,6 @@ import org.apache.log4j.Logger;
 
 import java.io.File;
 import java.io.IOException;
-import java.net.ServerSocket;
 import java.nio.file.Files;
 import java.util.Objects;
 
@@ -61,14 +60,6 @@ public class HdfsTestService {
     return hadoopConf;
   }
 
-  private static int nextFreePort() {
-    try (ServerSocket socket = new ServerSocket(0)) {
-      return socket.getLocalPort();
-    } catch (IOException e) {
-      throw new HoodieIOException("Unable to find next free port", e);
-    }
-  }
-
   public MiniDFSCluster start(boolean format) throws IOException {
     Objects.requireNonNull(workDir, "The work dir must be set before starting cluster.");
     hadoopConf = HoodieTestUtils.getDefaultHadoopConf();
@@ -81,10 +72,10 @@ public class HdfsTestService {
       FileIOUtils.deleteDirectory(file);
     }
 
-    int namenodeRpcPort = nextFreePort();
-    int datanodePort = nextFreePort();
-    int datanodeIpcPort = nextFreePort();
-    int datanodeHttpPort = nextFreePort();
+    int namenodeRpcPort = NetworkTestUtils.nextFreePort();
+    int datanodePort = NetworkTestUtils.nextFreePort();
+    int datanodeIpcPort = NetworkTestUtils.nextFreePort();
+    int datanodeHttpPort = NetworkTestUtils.nextFreePort();
 
     // Configure and start the HDFS cluster
     // boolean format = shouldFormatDFSCluster(localDFSLocation, clean);
diff --git a/hudi-common/src/test/java/org/apache/hudi/common/testutils/NetworkTestUtils.java b/hudi-common/src/test/java/org/apache/hudi/common/testutils/NetworkTestUtils.java
new file mode 100644
index 000..1f99b0e
--- /dev/null
+++ b/hudi-common/src/test/java/org/apache/hudi/common/testutils/NetworkTestUtils.java
@@ -0,0 +1,35 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) 

[GitHub] [incubator-hudi] vinothchandar commented on pull request #1606: [MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics

2020-05-10 Thread GitBox


vinothchandar commented on pull request #1606:
URL: https://github.com/apache/incubator-hudi/pull/1606#issuecomment-626402897


   anyone can do a quick review? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1542: [HUDI-820] cleaner repair command should only inspect clean metadata files

2020-05-10 Thread GitBox


codecov-io edited a comment on pull request #1542:
URL: https://github.com/apache/incubator-hudi/pull/1542#issuecomment-626396467


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=h1) 
Report
   > Merging 
[#1542](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/fa6aba751d8de16d9d109a8cfc21150b17b59cff=desc)
 will **decrease** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1542/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #1542      +/-   ##
   ==========================================
   - Coverage   71.78%   71.77%   -0.02%     
     Complexity   1087     1087             
   ==========================================
     Files         385      385             
     Lines       16575    16575             
     Branches     1668     1668             
   ==========================================
   - Hits        11899    11897       -2     
   - Misses       3947     3949       +2     
     Partials      729      729             
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1542/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1542/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `76.92% <0.00%> (+0.96%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=footer).
 Last update 
[fa6aba7...2c72c50](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io commented on pull request #1542: [HUDI-820] cleaner repair command should only inspect clean metadata files

2020-05-10 Thread GitBox


codecov-io commented on pull request #1542:
URL: https://github.com/apache/incubator-hudi/pull/1542#issuecomment-626396467


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=h1) 
Report
   > Merging 
[#1542](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/fa6aba751d8de16d9d109a8cfc21150b17b59cff=desc)
 will **decrease** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1542/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1542      +/-   ##
   ============================================
   - Coverage     71.78%   71.77%   -0.02%
     Complexity     1087     1087
   ============================================
     Files           385      385
     Lines         16575    16575
     Branches       1668     1668
   ============================================
   - Hits          11899    11897       -2
   - Misses         3947     3949       +2
     Partials        729      729
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...ache/hudi/common/fs/inline/InMemoryFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1542/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9Jbk1lbW9yeUZpbGVTeXN0ZW0uamF2YQ==)
 | `79.31% <0.00%> (-10.35%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...e/hudi/common/table/log/HoodieLogFormatWriter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1542/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGb3JtYXRXcml0ZXIuamF2YQ==)
 | `76.92% <0.00%> (+0.96%)` | `0.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=footer).
 Last update 
[fa6aba7...2c72c50](https://codecov.io/gh/apache/incubator-hudi/pull/1542?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-10 Thread GitBox


garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-626385758


   Hello @bvaradar, I'd like to get your opinion on how to fix this issue, because changing the way we calculate the record size will impact many places in the codebase.
   The issue basically is:
   The `totalBytesWritten` in the metadata includes the bloom filter, so `totalBytesWritten/totalRecordsWritten` will be off when the number of records is small but the bloom filter is large.
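   A minimal sketch of the kind of adjustment under discussion, assuming hypothetical inputs (`bytesWritten`, `recordsWritten`, `bloomFilterBytes` below are illustrative parameters, not Hudi's actual metadata API):

```java
// Sketch only: estimate the average record size while excluding bloom filter
// overhead, so small commits with large bloom filters do not skew the estimate.
public final class RecordSizeEstimatorSketch {

  private static final long FALLBACK_RECORD_SIZE = 1024L; // used when stats are unusable

  public static long estimateAvgRecordSize(long bytesWritten, long recordsWritten, long bloomFilterBytes) {
    if (recordsWritten <= 0) {
      return FALLBACK_RECORD_SIZE;
    }
    long payloadBytes = Math.max(0L, bytesWritten - bloomFilterBytes);
    long estimate = payloadBytes / recordsWritten;
    return estimate > 0 ? estimate : FALLBACK_RECORD_SIZE;
  }
}
```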



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-528:
---

Assignee: Yanjia Gary Li

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-10 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Status: In Progress  (was: Open)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using the manual build master after 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with a parallelism of 10. 
> I am seeing a huge number of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code.
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103916#comment-17103916
 ] 

Balaji Varadarajan commented on HUDI-600:
-

[~thesquelched]: You should not see this issue on the latest master, as the Hudi writer 
automatically handles and ignores these corrupted files. Once 
[https://github.com/apache/incubator-hudi/pull/1542] lands, you can also 
remove these bad clean metadata files once and for all from your dataset using the 
repairs command.

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1542: [HUDI-820] cleaner repair command should only inspect clean metadata files

2020-05-10 Thread GitBox


bvaradar commented on a change in pull request #1542:
URL: https://github.com/apache/incubator-hudi/pull/1542#discussion_r422684688



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/RepairsCommand.java
##
@@ -147,14 +149,16 @@ public String overwriteHoodieProperties(
   public void removeCorruptedPendingCleanAction() {
 
 HoodieTableMetaClient client = HoodieCLI.getTableMetaClient();
-HoodieActiveTimeline activeTimeline = 
HoodieCLI.getTableMetaClient().getActiveTimeline();
-
-activeTimeline.filterInflightsAndRequested().getInstants().forEach(instant 
-> {
+HoodieTimeline cleanerTimeline = 
HoodieCLI.getTableMetaClient().getActiveTimeline().getCleanerTimeline();
+LOG.info("Inspecting pending clean metadata in timeline for corrupted 
files");
+
cleanerTimeline.filterInflightsAndRequested().getInstants().forEach(instant -> {
   try {
 CleanerUtils.getCleanerPlan(client, instant);
-  } catch (IOException e) {
-LOG.warn("try to remove corrupted instant file: " + instant);
+  } catch (AvroRuntimeException e) {

Review comment:
   @lamber-ken : Thanks. I have made changes to specifically look for this 
message in the exception to detect corruption. Please take a look. 
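   A minimal sketch of the message-based check being described, using the exception text quoted in HUDI-600 (the exact matching logic here is an assumption, not necessarily what the PR ends up with):

```java
// Sketch only: treat an AvroRuntimeException as the HUDI-600 corruption pattern
// when its message shows clean metadata where a cleaner plan was expected.
import org.apache.avro.AvroRuntimeException;

final class CleanerPlanCorruptionCheck {

  private CleanerPlanCorruptionCheck() {
  }

  static boolean isCorruptCleanerPlan(AvroRuntimeException e) {
    String msg = e.getMessage();
    return msg != null
        && msg.contains("Found org.apache.hudi.avro.model.HoodieCleanMetadata")
        && msg.contains("expecting org.apache.hudi.avro.model.HoodieCleanerPlan");
  }
}
```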





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-865) Improve Hive Syncing by directly translating avro schema to Hive types

2020-05-10 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103910#comment-17103910
 ] 

Balaji Varadarajan commented on HUDI-865:
-

This is more of a cleanup, to standardize Hive schema syncing. The extra 
hop is due to a recently made change. It also helps keep Hive schema syncing 
standardized in the future when we support other formats like ORC. I don't expect it 
to have a performance impact.

 

> Improve Hive Syncing by directly translating avro schema to Hive types
> --
>
> Key: HUDI-865
> URL: https://issues.apache.org/jira/browse/HUDI-865
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> With the current code in master and the proposed improvements in 
> [https://github.com/apache/incubator-hudi/pull/1559],
> Hive Sync integration would resort to the following translations for finding 
> table schema
>  Avro-Schema to Parquet-Schema to Hive Schema transformations
> We need to implement logic to skip the extra hop to parquet schema when 
> generating hive schema. 
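A minimal sketch of the direct Avro-to-Hive type translation the ticket asks for, covering only a few primitive types (the class and method names are assumptions, not the eventual Hudi code):

{code:java}
// Sketch only: map Avro primitive types straight to Hive column types,
// skipping the intermediate parquet schema hop described above.
import org.apache.avro.Schema;

public final class AvroToHiveTypeSketch {

  public static String toHiveType(Schema schema) {
    switch (schema.getType()) {
      case INT:     return "int";
      case LONG:    return "bigint";
      case FLOAT:   return "float";
      case DOUBLE:  return "double";
      case BOOLEAN: return "boolean";
      case STRING:  return "string";
      case BYTES:   return "binary";
      default:
        throw new IllegalArgumentException("Unhandled Avro type: " + schema.getType());
    }
  }
}
{code}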



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-873) kafka connector support hudi sink

2020-05-10 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-873:
--

Assignee: liwei

> kafka  connector support hudi sink
> --
>
> Key: HUDI-873
> URL: https://issues.apache.org/jira/browse/HUDI-873
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-873) kafka connector support hudi sink

2020-05-10 Thread liwei (Jira)
liwei created HUDI-873:
--

 Summary: kafka  connector support hudi sink
 Key: HUDI-873
 URL: https://issues.apache.org/jira/browse/HUDI-873
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: liwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-47) Revisit null checks in the Log Blocks, merge lazyreading with this null check #340

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-47:
---
Labels: help-requested  (was: )

> Revisit null checks in the Log Blocks, merge lazyreading with this null check 
> #340
> --
>
> Key: HUDI-47
> URL: https://issues.apache.org/jira/browse/HUDI-47
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Storage Management
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/340



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-44) Compaction must preserve commit timestamps of merged records #376

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-44:
---
Labels: help-requested  (was: )

> Compaction must preserve commit timestamps of merged records #376
> -
>
> Key: HUDI-44
> URL: https://issues.apache.org/jira/browse/HUDI-44
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/376



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-45) Refactor handleWrite() in HoodieMergeHandle to offload conversion and merging of records to reader #374

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-45:
---
Labels: help-requested  (was: )

> Refactor handleWrite() in HoodieMergeHandle to offload conversion and merging 
> of records to reader #374
> ---
>
> Key: HUDI-45
> URL: https://issues.apache.org/jira/browse/HUDI-45
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/374



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-43) Introduce a WriteContext abstraction to HoodieWriteClient #384

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-43?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-43:
---
Labels: help-requested  (was: )

> Introduce a WriteContext abstraction to HoodieWriteClient #384
> --
>
> Key: HUDI-43
> URL: https://issues.apache.org/jira/browse/HUDI-43
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> [https://github.com/uber/hudi/issues/384]
>  
> HoodieTable, WriteConfig and other classes passed between "client" and "io"  
> etc need to standardize on this 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-37) Persist the HoodieIndex type in the hoodie.properties file #409

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-37:
---
Labels: help-requested  (was: )

> Persist the HoodieIndex type in the hoodie.properties file #409
> ---
>
> Key: HUDI-37
> URL: https://issues.apache.org/jira/browse/HUDI-37
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Storage Management
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/409



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-239) Harden Hive based incremental pull on real-time view

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-239:

Labels: help-requested  (was: )

> Harden Hive based incremental pull on real-time view 
> -
>
> Key: HUDI-239
> URL: https://issues.apache.org/jira/browse/HUDI-239
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Hive Integration, Incremental Pull
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: help-requested
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-48) Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-48:
---
Labels: help-requested  (was: )

> Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339
> -
>
> Key: HUDI-48
> URL: https://issues.apache.org/jira/browse/HUDI-48
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Code Cleanup, Compaction, Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/339



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-872) Implement JMH benchmarks for all core classes

2020-05-10 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-872:
---

 Summary: Implement JMH benchmarks for all core classes 
 Key: HUDI-872
 URL: https://issues.apache.org/jira/browse/HUDI-872
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Performance
Reporter: Vinoth Chandar
Assignee: Nishith Agarwal


We need to invest in a micro benchmark suite that tracks a baseline and our classes' 
performance for all core parts:

- CompactedLogScanner
- All I/O Handles 
- Index Lookup
- Payloads 
- ExternalSpillableMap 

The first task is to populate this list by tracing through the write path end-to-end 
for all operations (bulk_insert, upsert, ...).
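A minimal JMH skeleton of the kind of micro benchmark proposed above (the benchmarked body is a placeholder; real benchmarks would wrap classes such as ExternalSpillableMap or the log scanner):

{code:java}
// Sketch only: JMH harness with a trivial baseline measurement.
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class PayloadBenchmarkSketch {

  private byte[] payload;

  @Setup
  public void setup() {
    payload = new byte[4096]; // stand-in for a real Hudi record payload
  }

  @Benchmark
  public void baselineCopy(Blackhole bh) {
    // Baseline to compare real payload handling against.
    bh.consume(payload.clone());
  }
}
{code}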



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-92) Include custom names for HUDI Spark DAG stages for easier understanding

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-92:
---
Labels: bug-bash-0.6.0 help-requested pull-request-available  (was: 
bug-bash-0.6.0 pull-request-available)

> Include custom names for HUDI Spark DAG stages for easier understanding
> -
>
> Key: HUDI-92
> URL: https://issues.apache.org/jira/browse/HUDI-92
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: newbie, Usability
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-64) Estimation of compression ratio & other dynamic storage knobs based on historical stats

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-64:
---
Labels: help-requested  (was: )

> Estimation of compression ratio & other dynamic storage knobs based on 
> historical stats
> ---
>
> Key: HUDI-64
> URL: https://issues.apache.org/jira/browse/HUDI-64
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> Something core to Hudi writing is using heuristics or runtime workload 
> statistics to optimize aspects of storage like file sizes, partitioning and 
> so on.  
> Below lists all such places. 
>  
>  # Compression ratio for parquet 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L46]
> . This is used by HoodieWrapperFileSystem to estimate the number of bytes it 
> has written for a given parquet file and to close the parquet file once the 
> configured size has been reached. At the DFSOutputStream level we only know the bytes written 
> before compression. Once enough data has been written, it should be possible 
> to replace this with a simple estimate of what the avg record size would be 
> (commit metadata would give you the size and number of records in each file)
>  # Very similar problem exists for log files 
> [https://github.com/apache/incubator-hudi/blob/a4f9d7575f39bb79089714049ffea12ba5f25ec8/hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java#L52]
>  We write data into logs in avro and can log updates to same record in 
> parquet multiple times. We need to estimate again how large the log file(s) 
> can grow to, and still we would be able to produce a parquet file of 
> configured size during compaction. (hope I conveyed this clearly)
>  # WorkloadProfile : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java]
>  caches the input records using Spark Caching and computes the shape of the 
> workload, i.e how many records per partition, how many inserts vs updates 
> etc. This is used by the Partitioner here 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L141]
>  for assigning records to a file group. This is the critical one to replace 
> for Flink support and probably the hardest, since we need to guess input, 
> which is not always possible? 
>  # Within partitioner, we already derive a simple average size per record 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java#L756]
>  from the last commit metadata alone. This can be generalized.  (default : 
> [https://github.com/apache/incubator-hudi/blob/b19bed442d84c1cb1e48d184c9554920735bcb6c/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java#L71])
>  
>  # 
> Our goal in this Jira is to see if we could derive this information in the 
> background purely using the commit metadata. Some parts of this are 
> open-ended. A good starting point would be to see what's feasible and estimate ROI 
> before actually implementing.
>  
>  
>  
>  
>  
>  
> Roughly along the lines of [https://github.com/uber/hudi/issues/270].
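A minimal sketch of the generalization hinted at in item 4 above: averaging record size over a window of recent commits instead of only the last one (the CommitStats type and its fields are illustrative placeholders, not Hudi's metadata classes):

{code:java}
// Sketch only: derive an average record size from recent commit stats.
import java.util.List;

public final class AvgRecordSizeSketch {

  public static final class CommitStats {
    final long bytesWritten;
    final long recordsWritten;

    public CommitStats(long bytesWritten, long recordsWritten) {
      this.bytesWritten = bytesWritten;
      this.recordsWritten = recordsWritten;
    }
  }

  public static long weightedAvgRecordSize(List<CommitStats> recentCommits, long fallback) {
    long totalBytes = 0L;
    long totalRecords = 0L;
    for (CommitStats stats : recentCommits) {
      totalBytes += stats.bytesWritten;
      totalRecords += stats.recordsWritten;
    }
    // Weighting by record count lets larger commits dominate the estimate.
    return totalRecords > 0 ? totalBytes / totalRecords : fallback;
  }
}
{code}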



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-86) Add indexing support to the log file format

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-86?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-86:
---
Labels: help-requested  (was: realtime-data-lakes)

> Add indexing support to the log file format
> ---
>
> Key: HUDI-86
> URL: https://issues.apache.org/jira/browse/HUDI-86
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Index, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.1
>
>
> https://github.com/apache/incubator-hudi/pull/519



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-52) Implement Savepoints for Merge On Read table #88

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-52?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-52:
---
Labels: help-requested  (was: )

> Implement Savepoints for Merge On Read table #88
> 
>
> Key: HUDI-52
> URL: https://issues.apache.org/jira/browse/HUDI-52
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> https://github.com/uber/hudi/issues/88



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-258) Hive Query engine not supporting join queries between RT and RO tables

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-258:

Labels: bug-bash-0.6.0 help-requested  (was: )

> Hive Query engine not supporting join queries between RT and RO tables
> --
>
> Key: HUDI-258
> URL: https://issues.apache.org/jira/browse/HUDI-258
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: bug-bash-0.6.0, help-requested
>
> Description : 
> [https://github.com/apache/incubator-hudi/issues/789#issuecomment-512740619]
>  
> Root Cause: Hive is tracking getSplits calls by dataset basePath and does not 
> take InputFormatClass into account. Hence getSplits() is called only once. In 
> the case of RO and RT tables, they both have the same dataset base-path but 
> differ in the InputFormatClass. Due to this, the Hive join query returns 
> weird results.
>  
> =
> The result of the demo is very strange
> (Step 6(a))
>  
> {{ select `_hoodie_commit_time`, symbol, ts, volume, open, close  from 
> stock_ticks_mor_rt where  symbol = 'GOOG';
>  select `_hoodie_commit_time`, symbol, ts, volume, open, close  from 
> stock_ticks_mor where  symbol = 'GOOG';}}
> return as demo
> BUT!
>  
> {{select a.key,a.ts, b.ts from stock_ticks_mor a join stock_ticks_mor_rt b  
> on a.key=b.key where a.ts != b.ts
> ...
> +--------+-------+-------+--+
> | a.key  | a.ts  | b.ts  |
> +--------+-------+-------+--+
> +--------+-------+-------+--+}}
>  
> {{0: jdbc:hive2://hiveserver:1> select a.key,a.ts,b.ts from 
> stock_ticks_mor_rt a join stock_ticks_mor b on a.key = b.key where a.key= 
> 'GOOG_2018-08-31 10';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Execution log at: 
> /tmp/root/root_20190718091316_ec40e8f2-be17-4450-bb75-8db9f4390041.log
> 2019-07-18 09:13:20 Starting to launch local task to process map join;  
> maximum memory = 477626368
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> 2019-07-18 09:13:21 Dump the side-table for tag: 0 with group count: 1 into 
> file: 
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-16_658_8306103829282410332-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile50--.hashtable
> 2019-07-18 09:13:21 Uploaded 1 File to: 
> file:/tmp/root/60ae1624-3514-4ddd-9bc1-5d2349d922d6/hive_2019-07-18_09-13-16_658_8306103829282410332-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile50--.hashtable
>  (317 bytes)
> 2019-07-18 09:13:21 End of local task; Time Taken: 1.688 sec.
> +---------------------+----------------------+----------------------+--+
> |        a.key        |         a.ts         |         b.ts         |
> +---------------------+----------------------+----------------------+--+
> | GOOG_2018-08-31 10  | 2018-08-31 10:29:00  | 2018-08-31 10:29:00  |
> +---------------------+----------------------+----------------------+--+
> 1 row selected (7.207 seconds)
> 0: jdbc:hive2://hiveserver:1> select a.key,a.ts,b.ts from stock_ticks_mor 
> a join stock_ticks_mor_rt b on a.key = b.key where a.key= 'GOOG_2018-08-31 
> 10';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/opt/hive/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/opt/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
> Execution log at: 
> /tmp/root/root_20190718091348_72a5fc30-fc04-41c1-b2e3-5f943e4d5c08.log
> 2019-07-18 09:13:51 Starting to launch 

[jira] [Updated] (HUDI-413) Use ColumnIndex in parquet to speed up scans

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-413:

Labels: help-requested  (was: )

> Use ColumnIndex in parquet to speed up scans
> 
>
> Key: HUDI-413
> URL: https://issues.apache.org/jira/browse/HUDI-413
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: help-requested
>
> [https://github.com/apache/parquet-format/blob/master/PageIndex.md]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-434) Design and develop HFile based Index using InlineFS

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-434:

Labels: help-requested  (was: )

> Design and develop HFile based Index using InlineFS
> ---
>
> Key: HUDI-434
> URL: https://issues.apache.org/jira/browse/HUDI-434
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: help-requested
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Labels:   (was: help-requested)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-433) Improve the way log block magic header is identified when a corrupt block is encountered #416

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-433.
---
Resolution: Fixed

> Improve the way log block magic header is identified when a corrupt block is 
> encountered #416
> -
>
> Key: HUDI-433
> URL: https://issues.apache.org/jira/browse/HUDI-433
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> h1. Improve the way log block magic header is identified when a corrupt block 
> is encountered #416



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-431) Design and develop parquet logging in Log file

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-431:

Labels: help-requested  (was: )

> Design and develop parquet logging in Log file
> --
>
> Key: HUDI-431
> URL: https://issues.apache.org/jira/browse/HUDI-431
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: help-requested
>
> Once Inline FS is available, enable parquet logging support with 
> HoodieLogFile. LogFile can expose a writer (essentially ParquetWriter) and 
> users can write records as though writing to parquet files. Similarly on the 
> read path, a reader (parquetReader) will be exposed which the user can use to 
> read data out of it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-528:

Labels: bug-bash-0.6.0 help-requested  (was: )

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-528:

Status: Open  (was: New)

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.<init>(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-600) Cleaner fails with AVRO exception when upgrading from 0.5.0 to master

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-600:

Labels: help-requested  (was: )

> Cleaner fails with AVRO exception when upgrading from 0.5.0 to master
> -
>
> Key: HUDI-600
> URL: https://issues.apache.org/jira/browse/HUDI-600
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Cleaner
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> ```
> org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
> at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
> at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
> at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
> at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
> at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
> at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
> at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at org.apache.hudi.HoodieCleanClient.runClean(HoodieCleanClient.java:144)
> at org.apache.hudi.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:89)
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at org.apache.hudi.HoodieCleanClient.clean(HoodieCleanClient.java:87)
> at org.apache.hudi.HoodieWriteClient.clean(HoodieWriteClient.java:837)
> at org.apache.hudi.HoodieWriteClient.postCommit(HoodieWriteClient.java:514)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:156)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:100)
> at 
> org.apache.hudi.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:91)
> at 
> org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:261)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:183)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ```
>  
> [~varadarb] any ideas about this ?
>  
> [~thesquelched] fyi



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-635:

Labels: help-requested  (was: )

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-677) Abstract/Refactor all transaction management logic into a set of classes

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-677:

Labels: help-requested help-wanted  (was: help-wanted)

> Abstract/Refactor all transaction management logic into a set of classes 
> -
>
> Key: HUDI-677
> URL: https://issues.apache.org/jira/browse/HUDI-677
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested, help-wanted
>
> Hudi's timeline management code sits in the HoodieActiveTimeline and 
> HoodieDefaultTimeline classes, taking actions through the four stages: 
> REQUESTED, INFLIGHT, COMPLETED, INVALID.
> For the sake of better readability and maintenance, we should look into 
> reimplementing these as a state machine. 
> Note that this is better done after organizing the action execution classes 
> (as in HUDI-756) in hudi-client.
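A minimal sketch of the state-machine shape suggested above, using the four stages listed in the description (the allowed transitions are assumptions for illustration, not a finalized design):

{code:java}
// Sketch only: model the instant lifecycle as an explicit state machine.
public enum InstantState {
  REQUESTED, INFLIGHT, COMPLETED, INVALID;

  public boolean canTransitionTo(InstantState next) {
    switch (this) {
      case REQUESTED:
        return next == INFLIGHT || next == INVALID;
      case INFLIGHT:
        return next == COMPLETED || next == INVALID;
      default:
        return false; // COMPLETED and INVALID are terminal
    }
  }
}
{code}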



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Labels: help-requested  (was: )

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
>
> We may have different combinations of base and log data 
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> reading/writing/compaction machinery should be solved 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-760:

Labels: bug-bash-0.6.0 help-requested help-wanted newbie,  (was: 
bug-bash-0.6.0 help-wanted newbie,)

> Remove Rolling Stat management from Hudi Writer
> ---
>
> Key: HUDI-760
> URL: https://issues.apache.org/jira/browse/HUDI-760
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: renyi.bao
>Priority: Major
>  Labels: bug-bash-0.6.0, help-requested, help-wanted, newbie,
> Fix For: 0.6.0
>
>
> Current implementation of rolling stat is not scalable. As Consolidated 
> Metadata will be implemented eventually, we can have one design to manage 
> file-level stats too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-818:

Labels: bug-bash-0.6.0 help-requested  (was: bug-bash-0.6.0)

> Optimize the default value of hoodie.memory.merge.max.size option
> -
>
> Key: HUDI-818
> URL: https://issues.apache.org/jira/browse/HUDI-818
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance
>Reporter: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, help-requested
> Fix For: 0.6.0
>
>
> The default value of the hoodie.memory.merge.max.size option does not meet 
> some users' performance requirements:
> [https://github.com/apache/incubator-hudi/issues/1491]
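For reference, the option can already be raised per write; a minimal sketch (the 2 GB value is purely illustrative, not a proposed default):

{code:java}
// Sketch only: override hoodie.memory.merge.max.size for a single Spark write.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public final class MergeMemoryConfigSketch {

  public static void writeWithLargerMergeBuffer(Dataset<Row> df, String basePath) {
    df.write()
        .format("hudi")
        .option("hoodie.memory.merge.max.size", String.valueOf(2L * 1024 * 1024 * 1024))
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}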



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-845:

Labels: help-requested  (was: )

> Allow parallel writing and move the pending rollback work into cleaner
> --
>
> Key: HUDI-845
> URL: https://issues.apache.org/jira/browse/HUDI-845
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Things to think about:
>  * Commit time has to be unique across writers.
>  * Parallel writers can finish commits out of order, i.e. c2 commits before c1.
>  * MOR log blocks fence uncommitted data.
>  * Cleaner should loudly complain if it cannot finish cleaning up partial 
> writes.
>  
> P.S.: think about what is left for the general case: log files may arrive in a 
> different order, and inserts may violate the uniqueness constraint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-854) Incremental Cleaning should not revert to brute force all-partition scanning in any cases

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-854:

Labels: help-requested  (was: )

> Incremental Cleaning should not revert to brute force all-partition scanning 
> in any cases
> -
>
> Key: HUDI-854
> URL: https://issues.apache.org/jira/browse/HUDI-854
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
>
> After [https://github.com/apache/incubator-hudi/pull/1576], incremental 
> cleaning would still resort to a full partition scan when no previous clean 
> operation was done on the dataset. This ticket is to design and implement a 
> safe solution that avoids full scanning in all cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Labels: help-requested  (was: )

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-284) Need Tests for Hudi handling of schema evolution

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-284:

Labels: help-requested  (was: )

> Need Tests for Hudi handling of schema evolution
> -
>
> Key: HUDI-284
> URL: https://issues.apache.org/jira/browse/HUDI-284
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Common Core, newbie, Testing
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
>
> Context in : 
> https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-857) Overhaul unit-tests for Cleaner and Rollbacks

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-857:

Labels: help-requested  (was: bug-bash-0.6.0)

> Overhaul unit-tests for Cleaner and Rollbacks
> -
>
> Key: HUDI-857
> URL: https://issues.apache.org/jira/browse/HUDI-857
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Unit tests for these components do not clearly test their functionality. 
> Instead, some of them seem to have been written simply to pass with the initial 
> code. We would need to overhaul these tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Labels:   (was: help-wanted)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be generalized to handle all of 
> these combinations; a rough sketch follows below.
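To make the proposal concrete, here is a minimal, hypothetical sketch of such an abstraction. These interfaces are illustrative only and are not existing Hudi classes; the point is that read/write/compact code targets one surface regardless of the base/log format combination.

{code:java}
// Hypothetical sketch only (not existing Hudi classes): one abstraction over a
// file group, independent of whether base/log files are parquet, avro or hfile.
import java.io.IOException;
import java.util.Iterator;

interface FileGroupReader<R> extends AutoCloseable {
  Iterator<R> read() throws IOException;          // merged view of base + log
}

interface FileGroupWriter<R> extends AutoCloseable {
  void write(R record) throws IOException;        // append to a base or log file
}

interface FileGroupCompactor<R> {
  // rewrite base + log into a new base file, whatever the underlying formats are
  void compact(FileGroupReader<R> reader, FileGroupWriter<R> writer) throws IOException;
}
{code}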



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-749) packaging/hudi-timeline-server-bundle./run_server.sh start error

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-749.
-
Resolution: Fixed

> packaging/hudi-timeline-server-bundle./run_server.sh start error
> 
>
> Key: HUDI-749
> URL: https://issues.apache.org/jira/browse/HUDI-749
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: leesf
>Assignee: Trevorzhang
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ./run_server.sh fails with an error on start.
> Changing
> HOODIE_JAR=`ls -c $DIR/target/hudi-timeline-server-bundle-*.jar | grep -v 
> test | head -1`
> to
> HOODIE_JAR=`ls -c $DIR/target/hudi-timeline-server-bundle-*.jar | grep -v 
> test | grep -v source | head -1`
> should fix the issue (the extra grep excludes the sources jar, so the runnable 
> bundle jar is picked up).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-749) packaging/hudi-timeline-server-bundle./run_server.sh start error

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-749:

Status: Open  (was: New)

> packaging/hudi-timeline-server-bundle./run_server.sh start error
> 
>
> Key: HUDI-749
> URL: https://issues.apache.org/jira/browse/HUDI-749
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Reporter: leesf
>Assignee: Trevorzhang
>Priority: Minor
>  Labels: newbie, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> ./run_server.sh fails with an error on start.
> Changing
> HOODIE_JAR=`ls -c $DIR/target/hudi-timeline-server-bundle-*.jar | grep -v 
> test | head -1`
> to
> HOODIE_JAR=`ls -c $DIR/target/hudi-timeline-server-bundle-*.jar | grep -v 
> test | grep -v source | head -1`
> should fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-558) Introduce ability to compress bloom filters while storing in parquet

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-558:

Fix Version/s: (was: 0.6.0)

> Introduce ability to compress bloom filters while storing in parquet
> 
>
> Key: HUDI-558
> URL: https://issues.apache.org/jira/browse/HUDI-558
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Balaji Varadarajan
>Assignee: liwei
>Priority: Blocker
>  Labels: help-wanted, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Based on the performance study at 
> [https://docs.google.com/spreadsheets/d/1KCmmdgaFTWBmpOk9trePdQ2m6wPVj2G328fTcRnQP1M/edit?usp=sharing]
>  we found that there is a benefit to compressing bloom filters when storing them 
> in parquet. As this is an experimental feature, it will need to be disabled by 
> default.
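As an illustration of the idea only (not the actual Hudi implementation), a bloom filter's serialized string could be gzip-compressed and base64-encoded before being written as a parquet footer property. A minimal sketch, assuming the filter has already been serialized to a string:

{code:java}
// Illustrative sketch, not the actual Hudi codec: gzip the serialized bloom
// filter and base64-encode it so it can still be stored as a string property.
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPOutputStream;

class BloomFilterCompressionSketch {
  static String compress(String serializedBloomFilter) throws Exception {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(serializedBloomFilter.getBytes(StandardCharsets.UTF_8));
    }
    return Base64.getEncoder().encodeToString(bos.toByteArray());
  }
}
{code}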



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-817) Wrong index filter condition check in HoodieGlobalBloomIndex

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-817.
---
Resolution: Fixed

> Wrong index filter condition check in HoodieGlobalBloomIndex
> 
>
> Key: HUDI-817
> URL: https://issues.apache.org/jira/browse/HUDI-817
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> In HoodieGlobalBloomIndex, the wrong condition is checked.
>  
> {code:java}
> IndexFileFilter indexFileFilter =config.getBloomIndexPruneByRanges() 
> ? new IntervalTreeBasedGlobalIndexFileFilter(partitionToFileIndexInfo)
> : new ListBasedGlobalIndexFileFilter(partitionToFileIndexInfo);
> {code}
>  Instead of config.getBloomIndexPruneByRanges(), it should be 
> config.useBloomIndexTreebasedFilter().
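For clarity, the corrected check described above would presumably read:

{code:java}
// Corrected form per the description: the choice between the tree-based and
// list-based filter should key off the tree-based-filter config.
IndexFileFilter indexFileFilter = config.useBloomIndexTreebasedFilter()
    ? new IntervalTreeBasedGlobalIndexFileFilter(partitionToFileIndexInfo)
    : new ListBasedGlobalIndexFileFilter(partitionToFileIndexInfo);
{code}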



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-817) Wrong index filter condition check in HoodieGlobalBloomIndex

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-817:

Status: Open  (was: New)

> Wrong index filter condition check in HoodieGlobalBloomIndex
> 
>
> Key: HUDI-817
> URL: https://issues.apache.org/jira/browse/HUDI-817
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Index
>Reporter: sivabalan narayanan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> In HoodieGlobalBloomIndex, the wrong condition is checked.
>  
> {code:java}
> IndexFileFilter indexFileFilter =config.getBloomIndexPruneByRanges() 
> ? new IntervalTreeBasedGlobalIndexFileFilter(partitionToFileIndexInfo)
> : new ListBasedGlobalIndexFileFilter(partitionToFileIndexInfo);
> {code}
>  Instead of config.getBloomIndexPruneByRanges(), it should be 
> config.useBloomIndexTreebasedFilter().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-854) Incremental Cleaning should not revert to brute force all-partition scanning in any cases

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-854:

Fix Version/s: (was: 0.6.0)

> Incremental Cleaning should not revert to brute force all-partition scanning 
> in any cases
> -
>
> Key: HUDI-854
> URL: https://issues.apache.org/jira/browse/HUDI-854
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Even after [https://github.com/apache/incubator-hudi/pull/1576], incremental 
> cleaning still resorts to a full partition scan when no previous clean 
> operation has been performed on the dataset. This ticket is to design and 
> implement a safe solution that avoids full scanning in all cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-865) Improve Hive Syncing by directly translating avro schema to Hive types

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-865:

Fix Version/s: (was: 0.6.0)

> Improve Hive Syncing by directly translating avro schema to Hive types
> --
>
> Key: HUDI-865
> URL: https://issues.apache.org/jira/browse/HUDI-865
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> With the current code in master and the improvements proposed in 
> [https://github.com/apache/incubator-hudi/pull/1559], 
> Hive Sync integration resorts to the following translation chain to derive the 
> table schema:
>  Avro schema -> Parquet schema -> Hive schema
> We need to implement logic that skips the extra hop through the parquet schema 
> when generating the hive schema; a rough sketch of a direct mapping follows below.
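A minimal sketch of what a direct Avro-to-Hive type mapping could look like. The helper below is hypothetical (not the actual Hudi code) and deliberately ignores unions and logical types such as decimal:

{code:java}
// Hypothetical helper: map an Avro schema straight to a Hive column type string,
// skipping the intermediate parquet schema. Unions/logical types are omitted.
import org.apache.avro.Schema;

public class AvroToHiveTypeSketch {
  static String toHiveType(Schema schema) {
    switch (schema.getType()) {
      case INT:     return "int";
      case LONG:    return "bigint";
      case FLOAT:   return "float";
      case DOUBLE:  return "double";
      case BOOLEAN: return "boolean";
      case STRING:  return "string";
      case BYTES:   return "binary";
      case ARRAY:   return "array<" + toHiveType(schema.getElementType()) + ">";
      case MAP:     return "map<string," + toHiveType(schema.getValueType()) + ">";
      case RECORD:
        StringBuilder sb = new StringBuilder("struct<");
        for (Schema.Field f : schema.getFields()) {
          if (sb.length() > "struct<".length()) {
            sb.append(',');
          }
          sb.append(f.name()).append(':').append(toHiveType(f.schema()));
        }
        return sb.append('>').toString();
      default:
        throw new IllegalArgumentException("Unsupported Avro type: " + schema.getType());
    }
  }
}
{code}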



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-865) Improve Hive Syncing by directly translating avro schema to Hive types

2020-05-10 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17103801#comment-17103801
 ] 

Vinoth Chandar commented on HUDI-865:
-

What's the goal here? Performance? 

> Improve Hive Syncing by directly translating avro schema to Hive types
> --
>
> Key: HUDI-865
> URL: https://issues.apache.org/jira/browse/HUDI-865
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
>
> With the current code in master and the improvements proposed in 
> [https://github.com/apache/incubator-hudi/pull/1559], 
> Hive Sync integration resorts to the following translation chain to derive the 
> table schema:
>  Avro schema -> Parquet schema -> Hive schema
> We need to implement logic that skips the extra hop through the parquet schema 
> when generating the hive schema. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-436) Integrate HFile and Compaction

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-436:

Fix Version/s: (was: 0.6.0)

> Integrate HFile and Compaction 
> ---
>
> Key: HUDI-436
> URL: https://issues.apache.org/jira/browse/HUDI-436
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-434) Design and develop HFile based Index using InlineFS

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-434:

Fix Version/s: (was: 0.6.0)

> Design and develop HFile based Index using InlineFS
> ---
>
> Key: HUDI-434
> URL: https://issues.apache.org/jira/browse/HUDI-434
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Fix Version/s: (was: 0.6.0)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be generalized to handle all of 
> these combinations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-303) Avro schema case sensitivity testing

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-303:

Labels: bug-bash-0.6.0  (was: )

> Avro schema case sensitivity testing
> 
>
> Key: HUDI-303
> URL: https://issues.apache.org/jira/browse/HUDI-303
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> As a fallout of [PR 956|https://github.com/apache/incubator-hudi/pull/956] we 
> would like to understand how Avro behaves with case-sensitive column names.
> A couple of action items:
>  * Test with field names that differ only in case.
>  * *AbstractRealtimeRecordReader* is one of the classes where we convert Avro 
> schema field names to lower case so they can be verified against column names 
> from Hive. We can consider removing the *lowercase* conversion there if we 
> verify it does not break anything.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-648) Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction writes

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-648:

Fix Version/s: (was: 0.6.0)

> Implement error log/table for Datasource/DeltaStreamer/WriteClient/Compaction 
> writes
> 
>
> Key: HUDI-648
> URL: https://issues.apache.org/jira/browse/HUDI-648
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>
> We would like a way to hand the erroring records from writing or compaction 
> back to the users, in a separate table or log. This needs to work generically 
> across all the different writer paths.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-408) [Umbrella] Refactor/Code clean up hoodie write client

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-408.
-
Resolution: Fixed

> [Umbrella] Refactor/Code clean up hoodie write client 
> --
>
> Key: HUDI-408
> URL: https://issues.apache.org/jira/browse/HUDI-408
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Nishith Agarwal
>Assignee: Vinoth Chandar
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-43) Introduce a WriteContext abstraction to HoodieWriteClient #384

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-43?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-43:
---
Parent: (was: HUDI-408)
Issue Type: Improvement  (was: Sub-task)

> Introduce a WriteContext abstraction to HoodieWriteClient #384
> --
>
> Key: HUDI-43
> URL: https://issues.apache.org/jira/browse/HUDI-43
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> [https://github.com/uber/hudi/issues/384]
>  
> HoodieTable, WriteConfig and other classes passed between the "client" and "io" 
> layers need to be standardized behind this abstraction; a hypothetical sketch 
> follows below.
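A minimal, hypothetical sketch of the idea: one context object carrying the table, config and instant time instead of passing them individually between the client and io layers. Names are illustrative, not existing Hudi classes.

{code:java}
// Illustrative only: bundle what is currently passed piecemeal into one context.
final class WriteContext<TableT, ConfigT> {
  final TableT table;        // e.g. a HoodieTable instance
  final ConfigT config;      // e.g. a HoodieWriteConfig instance
  final String instantTime;  // commit/instant time for this write

  WriteContext(TableT table, ConfigT config, String instantTime) {
    this.table = table;
    this.config = config;
    this.instantTime = instantTime;
  }
}
{code}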



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-677) Abstract/Refactor all transaction management logic into a set of classes

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-677:

Parent: (was: HUDI-408)
Issue Type: Improvement  (was: Sub-task)

> Abstract/Refactor all transaction management logic into a set of classes 
> -
>
> Key: HUDI-677
> URL: https://issues.apache.org/jira/browse/HUDI-677
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> Hudi's timeline management code sits in the HoodieActiveTimeline and 
> HoodieDefaultTimeline classes, taking actions through the four stages: 
> REQUESTED, INFLIGHT, COMPLETED, INVALID.
> For the sake of better readability and maintenance, we should look into 
> reimplementing these as a state machine; a rough sketch follows below.
> Note that this is better done after organizing the action execution classes 
> (as in HUDI-756) in hudi-client.
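A minimal sketch of what such a state machine could look like, using the four stages listed above. The enum and transition rules are illustrative, not an existing Hudi class.

{code:java}
// Illustrative state machine for instant states; only forward transitions and
// invalidation are allowed, everything else throws.
enum InstantState {
  REQUESTED, INFLIGHT, COMPLETED, INVALID;

  InstantState transitionTo(InstantState next) {
    boolean legal =
        (this == REQUESTED && (next == INFLIGHT || next == INVALID))
            || (this == INFLIGHT && (next == COMPLETED || next == INVALID));
    if (!legal) {
      throw new IllegalStateException("Illegal transition: " + this + " -> " + next);
    }
    return next;
  }
}
{code}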



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-435) Make async compaction/cleaning extensible to new usages

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-435.
---
Resolution: Won't Fix

Based on new thinking for RFC-15, we are going to leverage the hoodie table 
abstraction directly, so there is no need for async compaction for the index specifically.

> Make async compaction/cleaning extensible to new usages
> ---
>
> Key: HUDI-435
> URL: https://issues.apache.org/jira/browse/HUDI-435
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Compaction, Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> Once HFile based index is available, next step is to make compaction 
> extensible to be available for all components.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-677) Abstract/Refactor all transaction management logic into a set of classes

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-677:

Fix Version/s: (was: 0.6.0)

> Abstract/Refactor all transaction management logic into a set of classes 
> -
>
> Key: HUDI-677
> URL: https://issues.apache.org/jira/browse/HUDI-677
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
>
> Hudi's timeline management code sits in the HoodieActiveTimeline and 
> HoodieDefaultTimeline classes, taking actions through the four stages: 
> REQUESTED, INFLIGHT, COMPLETED, INVALID.
> For the sake of better readability and maintenance, we should look into 
> reimplementing these as a state machine. 
> Note that this is better done after organizing the action execution classes 
> (as in HUDI-756) in hudi-client.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Parent: (was: HUDI-408)
Issue Type: Improvement  (was: Sub-task)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: help-wanted
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be generalized to handle all of 
> these combinations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-408) [Umbrella] Refactor/Code clean up hoodie write client

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-408:

Fix Version/s: (was: 0.6.0)

> [Umbrella] Refactor/Code clean up hoodie write client 
> --
>
> Key: HUDI-408
> URL: https://issues.apache.org/jira/browse/HUDI-408
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Nishith Agarwal
>Assignee: Vinoth Chandar
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-431) Design and develop parquet logging in Log file

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-431:

Fix Version/s: (was: 0.6.0)

> Design and develop parquet logging in Log file
> --
>
> Key: HUDI-431
> URL: https://issues.apache.org/jira/browse/HUDI-431
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>
> Once Inline FS is available, enable parquet logging support with 
> HoodieLogFile. LogFile can expose a writer (essentially a ParquetWriter) so 
> users can write records as though writing to parquet files. Similarly, on the 
> read path, a reader (a ParquetReader) will be exposed which the user can use to 
> read the data back out; a hypothetical sketch of that surface follows below.
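A hypothetical sketch of the proposed surface (illustrative interfaces only, not existing Hudi classes): the log file exposes a parquet-style writer on the write path and an iterable reader on the read path.

{code:java}
// Illustrative only: parquet-style writer/reader exposed by an inlined log block.
import java.io.IOException;

interface InlineParquetLogWriter<R> extends AutoCloseable {
  void write(R record) throws IOException;   // append a record as if writing parquet
}

interface InlineParquetLogReader<R> extends AutoCloseable, Iterable<R> {
  // iterate records back out of the inlined parquet block
}
{code}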



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-338) Reduce Hoodie commit/instant time granularity to millis from secs

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-338:

Fix Version/s: 0.6.0

> Reduce Hoodie commit/instant time granularity to millis from secs
> -
>
> Key: HUDI-338
> URL: https://issues.apache.org/jira/browse/HUDI-338
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-733) presto query data error

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-733:

Labels: bug-bash-0.6.0  (was: )

> presto query data error
> ---
>
> Key: HUDI-733
> URL: https://issues.apache.org/jira/browse/HUDI-733
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
>Affects Versions: 0.5.1
>Reporter: jing
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: bug-bash-0.6.0
> Attachments: hive_table.png, parquet_context.png, parquet_schema.png, 
> presto_query_data.png
>
>
> We found a column-ordering issue in Hudi when importing data through the API 
> (using spark.read.json("filename") to read into a dataframe and then writing to 
> hudi). The original data is rowkey:1 dt:2 time:3.
> The values come back wrong when the data is queried through Presto (rowkey:2 
> dt:1 time:2), but correctly in Hive.
> After analysis: when dt is used as the partition column, its value is also 
> written into the parquet file (dt = xxx), while the partition column's value 
> should come from the hudi path. However, the Presto query appears to map result 
> columns to parquet columns positionally, one-to-one, without matching on column 
> names.
> Suggestions:
>  # Can the inputformat class ignore the partition column dt stored in parquet 
> and read its value from the path instead?
>  # Can hive data be synchronized without dt as a partition column? Consider 
> adding a column such as repl_dt as the partition column and keeping dt as an 
> ordinary field.
>  # Do not write the dt column to the parquet file.
>  # Write dt to the parquet file, but as the last column.
>  
> [~bhasudha]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-575:

Fix Version/s: 0.6.0

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for Structured Streaming 
> writes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-432) Benchmark HFile for scan vs seek

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-432.
-
Resolution: Fixed

> Benchmark HFile for scan vs seek
> 
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Performance, Storage Management
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
> Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, 
> Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 
> AM.png
>
>
> We want to benchmark HFile scan vs seek as we intend to use HFile for record 
> indexing. HFile will be used inline in the hudi log for index purposes. 
> So, as part of benchmarking, we want to see when scan outperforms seek. 
> This is our experiment setup.
> keysToRead = no of keys to be looked up. // differs for different exp runs 
> like 100k, 200k, 500k, 1M. 
> N = no of iterations
>  
> {code:java}
> 1M entries were written to a single HFile as key value pairs. 
> Also, stored the keys in a separate file(key_file).
> keyList = read all keys from key_file
> for N no of iterations
> {
> shuffle keyList 
> trim the list to keysToRead 
> start timer HFile 
> read benchmark(scan/seek) 
> end timer
> }
> found avg for all timers captured
> {code}
>  
>  
> Result:
> Scan outperforms seek somewhere around 350k to 400k lookups out of 1M 
> entries with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
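A rough Java rendering of the benchmark loop described above. The HFile read itself is abstracted behind a callback, since only the shuffle/trim/time/average structure is being illustrated; it assumes keysToRead is no larger than the key list.

{code:java}
// Sketch of the benchmark harness: shuffle keys, trim to the lookup count,
// time the scan/seek lookups, and average over the iterations.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;

class HFileLookupBenchmarkSketch {
  static double avgMillis(List<String> allKeys, int keysToRead, int iterations,
                          Consumer<List<String>> lookup) {
    long totalNanos = 0;
    for (int i = 0; i < iterations; i++) {
      List<String> keys = new ArrayList<>(allKeys);
      Collections.shuffle(keys);                          // shuffle keyList
      List<String> sample = keys.subList(0, keysToRead);  // trim to keysToRead
      long start = System.nanoTime();                     // start timer
      lookup.accept(sample);                              // scan or seek lookups
      totalNanos += System.nanoTime() - start;            // end timer
    }
    return totalNanos / 1_000_000.0 / iterations;         // average per iteration
  }
}
{code}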



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-863:

Labels: bug-bash-0.6.0 pull-request-available  (was: pull-request-available)

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Currently the avro schema gets passed to 
> AvroConversionHelper.createConverterToAvro, which recursively processes the 
> passed spark sql DataTypes to resolve structs, arrays, etc. The AvroSchema is 
> passed into the recursion without selecting the relevant field, and therefore 
> without the schema of that field. This leads to a null pointer exception when 
> decimal types are processed, because in that case the schema of the field is 
> retrieved by calling getField on the root schema, which is not defined when we 
> deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the avro schema and 
> derive the particular avro schema for the decimal converter creator case only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-10 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-863:

Fix Version/s: 0.6.0

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently the avro schema gets passed to 
> AvroConversionHelper.createConverterToAvro, which recursively processes the 
> passed spark sql DataTypes to resolve structs, arrays, etc. The AvroSchema is 
> passed into the recursion without selecting the relevant field, and therefore 
> without the schema of that field. This leads to a null pointer exception when 
> decimal types are processed, because in that case the schema of the field is 
> retrieved by calling getField on the root schema, which is not defined when we 
> deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the avro schema and 
> derive the particular avro schema for the decimal converter creator case only.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

