[jira] [Updated] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-02-18 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-409:
--
Issue Type: Improvement  (was: Bug)

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-02-18 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang reassigned HUDI-409:
-

Assignee: Nishith Agarwal

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.5.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] tooptoop4 opened a new issue #1342: [SUPPORT] do cow tables need to be converted when changing from hoodie to hudi?

2020-02-18 Thread GitBox
tooptoop4 opened a new issue #1342: [SUPPORT] do cow tables need to be 
converted when changing from hoodie to hudi?
URL: https://github.com/apache/incubator-hudi/issues/1342
 
 
   this mentions running a utility to convert MOR files:
   Migration Guide From com.uber.hoodie to org.apache.hudi - HUDI - Apache 
Software Foundation
   
   
   if i have hoodie COW files from 0.4.6, do i need to run the convert utility, or 
can hudi 0.5.1 read/write them naturally?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-408) [Umbrella] Refactor/Code clean up hoodie write client

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-408:
---
Fix Version/s: (was: 0.5.2)
   0.6.0

> [Umbrella] Refactor/Code clean up hoodie write client 
> --
>
> Key: HUDI-408
> URL: https://issues.apache.org/jira/browse/HUDI-408
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-581) NOTICE needs more work as it is missing content from included 3rd party ALv2 licensed NOTICE files

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-581:
---
Fix Version/s: (was: 0.6.0)
   0.5.2

> NOTICE needs more work as it is missing content from included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would be good to get it fixed before the next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
smarthi commented on a change in pull request #1341: Add exportToTable option 
to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r380993198
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/TempTableUtil.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.utils;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class TempTableUtil {
+  private static final Logger LOG = LogManager.getLogger(TempTableUtil.class);
+
+  private JavaSparkContext jsc;
 
 Review comment:
   Is this going to be coupled to Spark? It's best not to add more Spark-specific 
code given that there is work being undertaken to support other engines like 
Flink on Hudi. Can this be abstracted out to be engine-agnostic?
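
   For illustration, a hypothetical shape for such an engine-agnostic seam could 
look like the following (names are illustrative, not existing Hudi APIs); the CLI 
would code against the interface and Spark would provide one implementation:

```java
import java.util.List;

// Hypothetical engine-agnostic seam for the CLI temp-table feature.
interface TempViewProvider {
  // Create a queryable view from in-memory rows, independent of the engine.
  void createView(String viewName, List<String> headers, List<List<Comparable>> rows);

  // Run a SQL query against previously created views and print the result.
  void runQuery(String sql);

  // Drop a view created earlier.
  void deleteView(String viewName);
}
```

   TempTableUtil would then become one implementation (e.g. a hypothetical 
SparkTempViewProvider), and a Flink-backed implementation could be added later 
without touching the CLI commands.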


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #193

2020-02-18 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.29 KB...]
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] satishkotha commented on issue #1320: [HUDI-571] Add min/max headers on archived files

2020-02-18 Thread GitBox
satishkotha commented on issue #1320: [HUDI-571] Add min/max headers on 
archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#issuecomment-587992448
 
 
   @n3nash  please take one more look. might be easier to discuss this in 
person.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-18 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r381034791
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieLogBlock.java
 ##
 @@ -121,7 +121,7 @@ public long getLogBlockLength() {
    * new enums at the end.
    */
   public enum HeaderMetadataType {
-    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE
+    INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE, MIN_INSTANT_TIME, MAX_INSTANT_TIME
 
 Review comment:
   I tried doing this. There are a couple of problems:
   1) Java types - HoodieAvroDataBlock relies on the header key being of type 
HeaderMetadataType. A nested enum has a different Java type, so this requires a 
lot more refactoring. Enums do not support inheritance in Java (see the sketch 
below), so this likely involves converting the enum to strings/ordinals and a lot 
of testing to make sure we do not break backward compatibility.
   
   2) Enum nesting will end up creating a totally new set of ordinals, and it can 
be confusing to see which ordinals map to which values. 
   
   tl;dr: building a better abstraction seems like a lot more work. Let me know if 
we want to investigate this further. It would likely have to be done as a separate 
task, postponing/abandoning merging this until then.
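
   For reference, a rough Java illustration of the inheritance limitation in 1) 
(hypothetical names, not Hudi code):

```java
// Enums cannot extend other enums; a shared interface is the usual
// workaround, but it changes the key type HoodieAvroDataBlock expects,
// and each enum restarts its ordinals at 0.
interface MetadataType {}

enum BaseHeaderType implements MetadataType {
  INSTANT_TIME, TARGET_INSTANT_TIME, SCHEMA, COMMAND_BLOCK_TYPE  // ordinals 0..3
}

enum ArchiveHeaderType implements MetadataType {
  MIN_INSTANT_TIME, MAX_INSTANT_TIME  // ordinals restart at 0 here
}
```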


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-18 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r381030813
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieCommitArchiveLog.java
 ##
 @@ -268,6 +270,19 @@ public Path getArchiveFilePath() {
 return archiveFilePath;
   }
 
+  private void writeHeaderBlock(Schema wrapperSchema, List<HoodieInstant> instants) throws Exception {
+    if (!instants.isEmpty()) {
+      Collections.sort(instants, HoodieInstant.COMPARATOR);
+      HoodieInstant minInstant = instants.get(0);
+      HoodieInstant maxInstant = instants.get(instants.size() - 1);
+      Map<HeaderMetadataType, String> metadataMap = Maps.newHashMap();
+      metadataMap.put(HeaderMetadataType.SCHEMA, wrapperSchema.toString());
+      metadataMap.put(HeaderMetadataType.MIN_INSTANT_TIME, minInstant.getTimestamp());
+      metadataMap.put(HeaderMetadataType.MAX_INSTANT_TIME, maxInstant.getTimestamp());
+      this.writer.appendBlock(new HoodieAvroDataBlock(Collections.emptyList(), metadataMap));
+    }
+  }
+
   private void writeToFile(Schema wrapperSchema, List<IndexedRecord> records) throws Exception {
 
 Review comment:
   addressed. added a unit test for this scenario


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max headers on archived files

2020-02-18 Thread GitBox
satishkotha commented on a change in pull request #1320: [HUDI-571] Add min/max 
headers on archived files
URL: https://github.com/apache/incubator-hudi/pull/1320#discussion_r381030780
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieArchivedTimeline.java
 ##
 @@ -182,8 +183,11 @@ private String getMetadataKey(String action) {
       //read the avro blocks
       while (reader.hasNext()) {
         HoodieAvroDataBlock blk = (HoodieAvroDataBlock) reader.next();
-        // TODO If we can store additional metadata in datablock, we can skip parsing records
-        // (such as startTime, endTime of records in the block)
+        if (isDataOutOfRange(blk, filter)) {
 
 Review comment:
   addressed. added a unit test for this scenario
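
   For context, a minimal sketch of what such a range check could look like, 
using the MIN/MAX_INSTANT_TIME headers this PR adds (the method shape and the 
filter's startTs/endTs fields are assumptions, not the exact PR code):

```java
private boolean isDataOutOfRange(HoodieAvroDataBlock blk, TimeRangeFilter filter) {
  Map<HeaderMetadataType, String> header = blk.getLogBlockHeader();
  String minInstant = header.get(HeaderMetadataType.MIN_INSTANT_TIME);
  String maxInstant = header.get(HeaderMetadataType.MAX_INSTANT_TIME);
  // Older archive files carry no min/max headers; fall back to parsing records.
  if (filter == null || minInstant == null || maxInstant == null) {
    return false;
  }
  // Instant times are fixed-width timestamps, so lexicographic comparison works:
  // skip the block if it ends before the filter starts or starts after it ends.
  return maxInstant.compareTo(filter.startTs) < 0 || minInstant.compareTo(filter.endTs) > 0;
}
```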


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Labels: easyfix pull-request-available  (was: easyfix 
pull-request-available pull-requests-available)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-request-available
> Attachments: test_data.json, test_schema.avsc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, TimestampBasedKeyGenerator only supports 4 data types for the 
> partition key. They are  Double, Long, Float and String. However, if the 
> `avro.java.string` is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better if the support were generalized to include data types 
> that provide a method to convert them to String, such as `Utf8`, since all 
> these types implement the `CharSequence` interface.
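
A minimal sketch of the generalization (accessor and variable names are 
assumptions, not the exact patch): Utf8 and String both implement CharSequence, 
so a single check covers both before the existing type handling.

```java
Object partitionVal = genericRecord.get(partitionPathField); // hypothetical accessor
if (partitionVal instanceof CharSequence) {
  // Normalizes org.apache.avro.util.Utf8 (and any CharSequence) to java.lang.String.
  partitionVal = partitionVal.toString();
}
// ...existing Double/Long/Float/String handling continues as before...
```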



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
smarthi commented on a change in pull request #1341: Add exportToTable option 
to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r380993346
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/TempTableUtil.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.utils;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class TempTableUtil {
+  private static final Logger LOG = LogManager.getLogger(TempTableUtil.class);
+
+  private JavaSparkContext jsc;
+  private SQLContext sqlContext;
+
+  public TempTableUtil(String appName) {
+    try {
+      SparkConf sparkConf = new SparkConf().setAppName(appName)
+          .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").setMaster("local[8]");
+      jsc = new JavaSparkContext(sparkConf);
+      jsc.setLogLevel("ERROR");
+
+      sqlContext = new SQLContext(jsc);
+    } catch (Throwable ex) {
+      // log full stack trace and rethrow. Without this it is difficult to debug failures, if any
+      LOG.error("unable to initialize spark context ", ex);
+      throw new HoodieException(ex);
+    }
+  }
+
+  public void write(String tableName, List<String> headers, List<List<Comparable>> rows) {
+    try {
+      if (headers.isEmpty() || rows.isEmpty()) {
+        return;
+      }
+
+      if (rows.stream().filter(row -> row.size() != headers.size()).count() > 0) {
+        throw new HoodieException("Invalid row, does not match headers " + headers.size() + " " + rows.size());
+      }
+
+      // replace all whitespaces in headers to make it easy to write sql queries
+      List<String> headersNoSpaces = headers.stream().map(title -> title.replaceAll("\\s+", ""))
+          .collect(Collectors.toList());
+
+      // generate schema for table
+      StructType structType = new StructType();
+      for (int i = 0; i < headersNoSpaces.size(); i++) {
+        // try guessing data type from column data.
+        DataType headerDataType = getDataType(rows.get(0).get(i));
+        structType = structType.add(DataTypes.createStructField(headersNoSpaces.get(i), headerDataType, true));
+      }
+      List<Row> records = rows.stream().map(row -> RowFactory.create(row.toArray(new Comparable[row.size()])))
+          .collect(Collectors.toList());
+      Dataset<Row> dataset = this.sqlContext.createDataFrame(records, structType);
+      dataset.createOrReplaceTempView(tableName);
+      System.out.println("Wrote table view: " + tableName);
+    } catch (Throwable ex) {
+      // log full stack trace and rethrow. Without this it is difficult to debug failures, if any
+      LOG.error("unable to write ", ex);
+      throw new HoodieException(ex);
+    }
+  }
+
+  public void runQuery(String sqlText) {
+    try {
+      this.sqlContext.sql(sqlText).show(Integer.MAX_VALUE, false);
+    } catch (Throwable ex) {
+      // log full stack trace and rethrow. Without this it is difficult to debug failures, if any
+      LOG.error("unable to read ", ex);
+      throw new HoodieException(ex);
+    }
+  }
+
+  public void deleteTable(String tableName) {
+    try {
+      sqlContext.sql("DROP TABLE IF EXISTS " + tableName);
+    } catch (Throwable ex) {
+      // log full stack trace and rethrow. Without this it is difficult to debug failures, if any
+      LOG.error("unable to initialize spark context ", ex);
+      throw new HoodieException(ex);
+    }
+  }
+
+  private DataType getDataType(Comparable comparable) {
 
 Review comment:
   Please use Generic types if possible.
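
   A sketch of what that suggestion could look like (hypothetical, not the PR's 
code): bound the parameter with a type variable instead of the raw Comparable.

```java
private <T extends Comparable<T>> DataType getDataType(T value) {
  if (value instanceof Integer) {
    return DataTypes.IntegerType;
  } else if (value instanceof Long) {
    return DataTypes.LongType;
  } else if (value instanceof Double) {
    return DataTypes.DoubleType;
  }
  // Fall back to string for anything unrecognized.
  return DataTypes.StringType;
}
```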


This is an automated message from the Apache Git Service.
To respond to the 

[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
smarthi commented on a change in pull request #1341: Add exportToTable option 
to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r380993198
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/utils/TempTableUtil.java
 ##
 @@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.utils;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.RowFactory;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.types.DataType;
+import org.apache.spark.sql.types.DataTypes;
+import org.apache.spark.sql.types.StructType;
+
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class TempTableUtil {
+  private static final Logger LOG = LogManager.getLogger(TempTableUtil.class);
+
+  private JavaSparkContext jsc;
 
 Review comment:
   Is this going to be coupled to Spark? It's best not to add more Spark-specific 
code given that there is work to support other engines.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
smarthi commented on a change in pull request #1341: Add exportToTable option 
to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r380990171
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
 ##
 @@ -18,13 +18,16 @@
 
 package org.apache.hudi.cli;
 
+import com.google.common.base.Strings;
 
 Review comment:
   Please desist from using any Guava APIs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
smarthi commented on a change in pull request #1341: Add exportToTable option 
to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r380990440
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/HoodiePrintHelper.java
 ##
 @@ -57,11 +60,38 @@ public static String print(String[] header, String[][] rows) {
    */
   public static String print(TableHeader rowHeader, Map<String, Function<Object, String>> fieldNameToConverterMap,
       String sortByField, boolean isDescending, Integer limit, boolean headerOnly, List<Comparable[]> rows) {
+    return print(rowHeader, fieldNameToConverterMap, sortByField, isDescending, limit, headerOnly, rows, "");
+  }
+
+  /**
+   * Serialize Table to printable string and also export a temporary view to easily write sql queries.
+   *
+   * Ideally, exporting the view should live outside PrintHelper, but all commands use this, so this is an easy
+   * way to add support for all commands.
+   *
+   * @param rowHeader Row Header
+   * @param fieldNameToConverterMap Field Specific Converters
+   * @param sortByField Sorting field
+   * @param isDescending Order
+   * @param limit Limit
+   * @param headerOnly Headers only
+   * @param rows List of rows
+   * @param tempTableName table name to export
+   * @return Serialized form for printing
+   */
+  public static String print(TableHeader rowHeader, Map<String, Function<Object, String>> fieldNameToConverterMap,
+      String sortByField, boolean isDescending, Integer limit, boolean headerOnly, List<Comparable[]> rows,
+      String tempTableName) {
 
     if (headerOnly) {
       return HoodiePrintHelper.print(rowHeader);
     }
 
+    if (!Strings.isNullOrEmpty(tempTableName)) {
 
 Review comment:
   Replace this with StringUtils.isNullOrEmpty() pending PR# 1159


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on issue #954: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: table not found

2020-02-18 Thread GitBox
umehrot2 commented on issue #954:  
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: 
 table not found
URL: https://github.com/apache/incubator-hudi/issues/954#issuecomment-587942358
 
 
   @vinothchandar I am happy to take up this integration piece. I have been 
keeping close track of all the discussions around Hive/Glue catalog sync 
issues, and most of them have been around misconfigurations. One issue that is 
relevant is that schema evolution does not work against the Glue catalog, and I 
will create a JIRA for that.
   
   I also think we can add some questions related to glue in the FAQ regarding 
misconfigurations. I can add it there, but would like to know the process for 
doing that.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on issue #954: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: table not found

2020-02-18 Thread GitBox
umehrot2 commented on issue #954:  
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: 
 table not found
URL: https://github.com/apache/incubator-hudi/issues/954#issuecomment-587941167
 
 
   @jinshuangxian This particular issue has been fixed since our first release 
of Hudi on emr-5.28.0. So, you can use either `emr-5.28.0` or `emr-5.29.0` 
without this issue. Let me know if you are running into an actual issue.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] satishkotha opened a new pull request #1341: Add exportToTable option to CLI

2020-02-18 Thread GitBox
satishkotha opened a new pull request #1341: Add exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   The CLI shell is very restrictive and it's sometimes hard to filter specific 
rows. This adds the ability to export the results of a CLI command into a 
temporary table, so CLI users can write HiveQL queries to look for any specific 
information. 
   (This idea has been brought up by multiple folks on the team, thanks 
everyone for the great suggestions)
   
   ## Brief change log
   - Add 'exportToTableName' cli option for the commits command.
   - This creates a temp table in the Spark session. Users can write HiveQL queries 
against the table to filter desired rows.
   - Note: intentionally not using the Hoodie format for temp tables (to avoid 
versioning and any potential bugs in hoodie)
   
   Example usage:
   ->commits show archived --startTs "20200121224545" --endTs "20200224004414" --includeExtraMetadata true --exportToTableName satishkotha_debug
   Wrote table view: satishkotha_debug
   
   >temp_query --sql "select Instant, NumInserts, NumWrites from satishkotha_debug where FileId='ed33bd99-466f-4417-bd92-5d914fa58a8f' and Instant > '20200123211217' order by Instant"
   +--------------+----------+---------+
   |Instant       |NumInserts|NumWrites|
   +--------------+----------+---------+
   |20200123221012|0         |2418     |
   |20200123223835|0         |2418     |
   |20200123231230|0         |2418     |
   |20200123233911|0         |2418     |
   |20200124000848|3         |3        |
   |20200124004403|7         |10       |
   |20200124013616|1         |11       |
   |20200124020556|1         |12       |
   |20200124061752|1         |13       |
   +--------------+----------+---------+
   
   
   ## Verify this pull request
   
   Only changes CLI. can be verified by running CLI commands.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-18 Thread GitBox
ramachandranms commented on a change in pull request #1332: [HUDI -409] Match 
header and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r380939677
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
       return true;
     }
 
+    // check if the blocksize mentioned in the footer is the same as the header; by seeking back the length of a long
+    // the backward seek does not incur additional IO as {@link org.apache.hadoop.hdfs.DFSInputStream#seek()}
+    // only moves the index. actual IO happens on the next read operation
+    inputStream.seek(inputStream.getPos() - Long.BYTES);
 
 Review comment:
   verified that the tests will fail if the footer is evolved in any way. also 
added the following logs to better troubleshoot why a corrupted block was 
detected.
   
   ```
   4828 [main] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - 
Found corrupted block in file 
HoodieLogFile{pathStr='/var/folders/ch/42h2kyw10_l0509znbppmrsmgn/T/junit3752725921138110721/.test-fileid1_100.log.1_1-0-1',
 fileLen=0}. No magic hash found right after footer block size entry
   
   4700 [main] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - 
Found corrupted block in file 
HoodieLogFile{pathStr='/var/folders/ch/42h2kyw10_l0509znbppmrsmgn/T/junit8143862081174382297/.test-fileid1_100.log.1_1-0-1',
 fileLen=0}. Header block size(2135) did not match the footer block size(2235)
   
   4316 [main] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - 
Found corrupted block in file 
HoodieLogFile{pathStr='/var/folders/ch/42h2kyw10_l0509znbppmrsmgn/T/junit973614944657653272/.test-fileid1_100.log.1_1-0-1',
 fileLen=0} with block size(21350) running past EOF
   ```
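
   For readers following along, a minimal sketch of the footer check being 
discussed (variable names follow the quoted hunk; treat the rest as an 
assumption and see the PR for the actual code):

```java
// The writer records the block size twice: once in the header and once in the
// footer. Re-reading the footer copy and comparing it with the header copy
// catches blocks that were only partially written.
inputStream.seek(inputStream.getPos() - Long.BYTES); // cheap: only moves the read index
long footerBlockSize = inputStream.readLong();
if (footerBlockSize != blocksize) {
  LOG.info("Header block size(" + blocksize + ") did not match the footer block size(" + footerBlockSize + ")");
  return true; // treat as a corrupted block
}
```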


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1332: [HUDI -409] Match header and footer block length to improve corrupted block detection

2020-02-18 Thread GitBox
ramachandranms commented on a change in pull request #1332: [HUDI -409] Match 
header and footer block length to improve corrupted block detection
URL: https://github.com/apache/incubator-hudi/pull/1332#discussion_r380938785
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java
 ##
 @@ -239,6 +239,15 @@ private boolean isBlockCorrupt(int blocksize) throws IOException {
       return true;
     }
 
+    // check if the blocksize mentioned in the footer is the same as the header; by seeking back the length of a long
+    // the backward seek does not incur additional IO as {@link org.apache.hadoop.hdfs.DFSInputStream#seek()}
+    // only moves the index. actual IO happens on the next read operation
+    inputStream.seek(inputStream.getPos() - Long.BYTES);
 
 Review comment:
   added a comment to the write path


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to support embedding any file format as an InlineFile

2020-02-18 Thread GitBox
vinothchandar commented on issue #1176: [HUDI-430] Adding InlineFileSystem to 
support embedding any file format as an InlineFile
URL: https://github.com/apache/incubator-hudi/pull/1176#issuecomment-587730864
 
 
   @nsivabalan awesome. will do.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-02-18 Thread GitBox
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for 
Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-587712891
 
 
   >So we need to label each stage and they should show up correctly.
   
   if we could do that, and if you can post the `upsert()` dag for example, 
that would be great.. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-02-18 Thread GitBox
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for 
Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-587701211
 
 
   > 1. It seems like you are only covering cases where an RDD is getting 
created? Is it possible to change the job group between stages?
   
   The setJobGroup() description applies to the thread and is used until it is 
either updated or removed. So we need to label each stage, and they should show 
up correctly.
   
   We can label RDD creations as well as operations on them.
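
   As a concrete illustration of that labeling (stage names and the helper calls 
are illustrative, not the PR's exact code):

```java
// setJobGroup tags everything the current thread triggers until the group is
// changed again, so each logical phase gets its own label in the Spark UI.
jsc.setJobGroup("HoodieBloomIndex", "Looking up record locations");
JavaRDD<HoodieRecord> taggedRecords = lookupIndex(inputRecords); // hypothetical step

jsc.setJobGroup("HoodieWriteClient", "Writing tagged records");
writeRecords(taggedRecords); // hypothetical step
```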
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-617:

Labels: easyfix pull-request-available pull-requests-available  (was: 
easyfix pull-requests-available)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-request-available, pull-requests-available
> Attachments: test_data.json, test_schema.avsc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, TimestampBasedKeyGenerator only supports 4 data types for the 
> partition key. They are  Double, Long, Float and String. However, if the 
> `avro.java.string` is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better if the support were generalized to include data types 
> that provide a method to convert them to String, such as `Utf8`, since all 
> these types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated (8c6138c -> c2b08cd)

2020-02-18 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 8c6138c  [MINOR] Add javadoc to SchedulerConfGenerator and code clean 
(#1340)
 add c2b08cd  [HUDI-617] Add support for types implementing CharSequence 
(#1339)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/utilities/keygen/TimestampBasedKeyGenerator.java| 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1339: [HUDI-617] Add support for types implementing CharSequence

2020-02-18 Thread GitBox
vinothchandar merged pull request #1339: [HUDI-617] Add support for types 
implementing CharSequence
URL: https://github.com/apache/incubator-hudi/pull/1339
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (0049323 -> 8c6138c)

2020-02-18 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 0049323  [HUDI-108] Removing 2GB spark partition limitations in 
HoodieBloomIndex with spark 2.4.4 (#1315)
 add 8c6138c  [MINOR] Add javadoc to SchedulerConfGenerator and code clean 
(#1340)

No new revisions were added by this update.

Summary of changes:
 .../deltastreamer/SchedulerConfGenerator.java  | 36 +-
 1 file changed, 28 insertions(+), 8 deletions(-)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1340: [MINOR] Add javadoc to SchedulerConfGenerator and Code clean

2020-02-18 Thread GitBox
vinothchandar merged pull request #1340: [MINOR] Add javadoc to 
SchedulerConfGenerator and Code clean
URL: https://github.com/apache/incubator-hudi/pull/1340
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-108) Simplify HoodieBloomIndex without the need for 2GB limit handling

2020-02-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-108:

Labels: pull-request-available  (was: )

> Simplify HoodieBloomIndex without the need for 2GB limit handling
> -
>
> Key: HUDI-108
> URL: https://issues.apache.org/jira/browse/HUDI-108
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated (b8f9d0e -> 0049323)

2020-02-18 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from b8f9d0e  [HUDI-615]: Add some methods and test cases for StringUtils. 
(#1338)
 add 0049323  [HUDI-108] Removing 2GB spark partition limitations in 
HoodieBloomIndex with spark 2.4.4 (#1315)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/index/bloom/HoodieBloomIndex.java  | 72 --
 1 file changed, 13 insertions(+), 59 deletions(-)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1315: [HUDI-108] Removing 2GB spark partition limitations in HoodieBloomIndex with spark 2.4.4

2020-02-18 Thread GitBox
vinothchandar merged pull request #1315: [HUDI-108] Removing 2GB spark 
partition limitations in HoodieBloomIndex with spark 2.4.4
URL: https://github.com/apache/incubator-hudi/pull/1315
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1310: [HUDI-601] Improve unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, RealtimeCompactedReco

2020-02-18 Thread GitBox
prashantwason commented on a change in pull request #1310: [HUDI-601] Improve 
unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, 
RealtimeCompactedRecordReader
URL: https://github.com/apache/incubator-hudi/pull/1310#discussion_r380870271
 
 

 ##
 File path: 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieRealtimeRecordReader.java
 ##
 @@ -333,6 +337,8 @@ public void testUnMergedReader() throws Exception {
     assertEquals(numRecords, numRecordsAtCommit1);
     assertEquals(numRecords, numRecordsAtCommit2);
     assertEquals(2 * numRecords, seenKeys.size());
+    assertEquals(1.0, recordReader.getProgress(), 0.05);
 
 Review comment:
   Maybe add a comment here why the expected value is not strictly equal to 1.0


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1328: Hudi upsert hangs

2020-02-18 Thread GitBox
vinothchandar commented on issue #1328: Hudi upsert hangs
URL: https://github.com/apache/incubator-hudi/issues/1328#issuecomment-587663160
 
 
   Started on this.. Was trying to port to scala, since I am not super familiar 
with pySpark. Will resume today and circle back. :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #954: org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: table not found

2020-02-18 Thread GitBox
vinothchandar commented on issue #954:  
org.apache.hudi.org.apache.hadoop_hive.metastore.api.NoSuchObjectException: 
 table not found
URL: https://github.com/apache/incubator-hudi/issues/954#issuecomment-587661582
 
 
   @umehrot2 Looks like glue integration keeps coming up :).. Do you want to 
chime in here? Maybe we should also track some follow-ups from these issues 
(if any) towards the next release? Let me know if you are interested in driving 
this.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1310: [HUDI-601] Improve unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, RealtimeCompactedReco

2020-02-18 Thread GitBox
prashantwason commented on a change in pull request #1310: [HUDI-601] Improve 
unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, 
RealtimeCompactedRecordReader
URL: https://github.com/apache/incubator-hudi/pull/1310#discussion_r380870206
 
 

 ##
 File path: 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieRealtimeRecordReader.java
 ##
 @@ -251,6 +252,9 @@ private void testReader(boolean partitioned) throws Exception {
       key = recordReader.createKey();
       value = recordReader.createValue();
     }
+    recordReader.getPos();
+    assertEquals(1.0, recordReader.getProgress(), 0.05);
 
 Review comment:
   Maybe add a comment here why the expected value is not strictly equal to 1.0


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1334: [HUDI-612] Fix return no data when using incremental query

2020-02-18 Thread GitBox
vinothchandar commented on issue #1334: [HUDI-612] Fix return no data when 
using incremental query
URL: https://github.com/apache/incubator-hudi/pull/1334#issuecomment-587659536
 
 
   @lamber-ken there is `HoodieDataSourceHelpers`, which has some helpers to figure 
out the new commits.. what are you trying to accomplish? 
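
   For reference, a short sketch of those helpers (basePath and lastCheckpoint 
are placeholders; treat exact signatures as assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hudi.HoodieDataSourceHelpers;
import java.util.List;

FileSystem fs = FileSystem.get(new Configuration());
// Latest commit on the timeline, useful as the next incremental checkpoint.
String latest = HoodieDataSourceHelpers.latestCommit(fs, basePath);
// Commits that arrived after the last checkpoint, i.e. the new data to read incrementally.
List<String> newCommits = HoodieDataSourceHelpers.listCommitsSince(fs, basePath, lastCheckpoint);
```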


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1310: [HUDI-601] Improve unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, RealtimeCompactedReco

2020-02-18 Thread GitBox
prashantwason commented on a change in pull request #1310: [HUDI-601] Improve 
unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, 
RealtimeCompactedRecordReader
URL: https://github.com/apache/incubator-hudi/pull/1310#discussion_r380867747
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroWriteSupport.java
 ##
 @@ -0,0 +1,56 @@
+package org.apache.hudi.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroWriteSupport;
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.BloomFilterFactory;
+import org.apache.hudi.common.bloom.filter.BloomFilterTypeCode;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.HoodieAvroUtils;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+public class TestHoodieAvroWriteSupport {
+
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
+
+  @Test
+  public void testAddKey() throws IOException {
+    List<String> rowKeys = new ArrayList<>();
+    for (int i = 0; i < 1000; i++) {
+      rowKeys.add(UUID.randomUUID().toString());
+    }
+    String filePath = folder.getRoot() + "/test.parquet";
+    Schema schema = HoodieAvroUtils.getRecordKeySchema();
+    BloomFilter filter = BloomFilterFactory.createBloomFilter(
+        1000, 0.0001, 1,
+        BloomFilterTypeCode.SIMPLE.name());
+    HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(
+        new AvroSchemaConverter().convert(schema), schema, filter);
+    ParquetWriter writer = new ParquetWriter(new Path(filePath), writeSupport, CompressionCodecName.GZIP,
+        120 * 1024 * 1024, ParquetWriter.DEFAULT_PAGE_SIZE);
+    for (String rowKey : rowKeys) {
+      GenericRecord rec = new GenericData.Record(schema);
+      rec.put(HoodieRecord.RECORD_KEY_METADATA_FIELD, rowKey);
+      writer.write(rec);
+      writeSupport.add(rowKey);
 
 Review comment:
   You can also use HoodieParquetWriter instead of ParquetWriter which is how 
the writeSupport.add(rowKey) is called within HUDI code. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1310: [HUDI-601] Improve unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, RealtimeCompactedReco

2020-02-18 Thread GitBox
prashantwason commented on a change in pull request #1310: [HUDI-601] Improve 
unit test coverage for HoodieAvroWriteSupport, HoodieRealtimeRecordReader, 
RealtimeCompactedRecordReader
URL: https://github.com/apache/incubator-hudi/pull/1310#discussion_r380861934
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/avro/TestHoodieAvroWriteSupport.java
 ##
 @@ -0,0 +1,56 @@
+package org.apache.hudi.avro;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroWriteSupport;
+import org.apache.hudi.common.bloom.filter.BloomFilter;
+import org.apache.hudi.common.bloom.filter.BloomFilterFactory;
+import org.apache.hudi.common.bloom.filter.BloomFilterTypeCode;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.HoodieAvroUtils;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.hadoop.ParquetWriter;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+import org.junit.Assert;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.UUID;
+
+public class TestHoodieAvroWriteSupport {
+
+  @Rule
+  public TemporaryFolder folder = new TemporaryFolder();
+
+  @Test
+  public void testAddKey() throws IOException {
+    List<String> rowKeys = new ArrayList<>();
+    for (int i = 0; i < 1000; i++) {
+      rowKeys.add(UUID.randomUUID().toString());
+    }
+    String filePath = folder.getRoot() + "/test.parquet";
+    Schema schema = HoodieAvroUtils.getRecordKeySchema();
+    BloomFilter filter = BloomFilterFactory.createBloomFilter(
+        1000, 0.0001, 1,
+        BloomFilterTypeCode.SIMPLE.name());
+    HoodieAvroWriteSupport writeSupport = new HoodieAvroWriteSupport(
+        new AvroSchemaConverter().convert(schema), schema, filter);
+    ParquetWriter writer = new ParquetWriter(new Path(filePath), writeSupport, CompressionCodecName.GZIP,
+        120 * 1024 * 1024, ParquetWriter.DEFAULT_PAGE_SIZE);
+    for (String rowKey : rowKeys) {
+      GenericRecord rec = new GenericData.Record(schema);
+      rec.put(HoodieRecord.RECORD_KEY_METADATA_FIELD, rowKey);
+      writer.write(rec);
+      writeSupport.add(rowKey);
+    }
+    writer.close();
+  }
 
 Review comment:
   HoodieAvroWriteSupport is responsible for adding all the recordKeys to the 
bloom filter. This can be tested by reading the parquet file back to verify that 
all the expected recordKeys are present in the bloom filter. 
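
   A minimal sketch of that verification, assuming the filter is serialized 
under the "org.apache.hudi.bloomfilter" footer key (the key name and the factory 
method used here are assumptions):

```java
// Read the parquet footer back and probe the deserialized bloom filter with
// every key that was written; each probe must report a (possible) hit.
ParquetMetadata footer = ParquetFileReader.readFooter(new Configuration(), new Path(filePath));
String serializedFilter = footer.getFileMetaData().getKeyValueMetaData().get("org.apache.hudi.bloomfilter");
BloomFilter readFilter = BloomFilterFactory.fromString(serializedFilter, BloomFilterTypeCode.SIMPLE.name());
for (String rowKey : rowKeys) {
  Assert.assertTrue(readFilter.mightContain(rowKey));
}
```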


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1315: [HUDI-108] Removing 2GB spark partition limitations in HoodieBloomIndex with spark 2.4.4

2020-02-18 Thread GitBox
nsivabalan commented on issue #1315: [HUDI-108] Removing 2GB spark partition 
limitations in HoodieBloomIndex with spark 2.4.4
URL: https://github.com/apache/incubator-hudi/pull/1315#issuecomment-587608958
 
 
   @vinothchandar / @leesf : PR is good to merge. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-589) Fix references to Views in some of the pages. Replace with Query instead

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-589:
---
Fix Version/s: 0.5.2

> Fix references to Views in some of the pages. Replace with Query instead
> 
>
> Key: HUDI-589
> URL: https://issues.apache.org/jira/browse/HUDI-589
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> querying_data.html still has some references to 'views'. These need to be 
> replaced with 'queries'/'query types' as appropriate.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-604) Update docker page

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-604:
---
Fix Version/s: 0.5.2

> Update docker page
> --
>
> Key: HUDI-604
> URL: https://issues.apache.org/jira/browse/HUDI-604
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> 1. Change the one-line command to multiple lines
> 2. Unify the code indentation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-417) Refactor HoodieWriteClient so that commit logic can be shareable by both bootstrap and normal write operations

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-417.
--

> Refactor HoodieWriteClient so that commit logic can be shareable by both 
> bootstrap and normal write operations
> --
>
> Key: HUDI-417
> URL: https://issues.apache.org/jira/browse/HUDI-417
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> Basic code changes are present in the fork: 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]
>  
> The current implementation of HoodieBootstrapClient has duplicate code for 
> committing bootstrap. 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-client/src/main/java/org/apache/hudi/bootstrap/HoodieBootstrapClient.java]
>  
>  
> We can have an independent PR that moves this commit functionality 
> from HoodieWriteClient to a new base class, AbstractHoodieWriteClient, which 
> HoodieBootstrapClient can inherit.
>  
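
A hedged sketch of the proposed shape (HoodieWriteClient, HoodieBootstrapClient, and AbstractHoodieWriteClient are named in the issue; the method body and signature are illustrative):

```java
// Illustrative only: the shared commit plumbing moves up into a base class.
abstract class AbstractHoodieWriteClient {
  // Common commit logic, reused by both the normal write path and bootstrap.
  protected boolean commit(String instantTime) {
    // ... shared timeline handling, commit metadata, post-commit hooks ...
    return true;
  }
}

class HoodieWriteClient extends AbstractHoodieWriteClient {
  // Normal upsert/insert path; inherits commit() instead of duplicating it.
}

class HoodieBootstrapClient extends AbstractHoodieWriteClient {
  // Bootstrap path; the duplicated commit code goes away.
}
```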



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-417) Refactor HoodieWriteClient so that commit logic can be shareable by both bootstrap and normal write operations

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-417:
---
Fix Version/s: (was: 0.6.0)
   0.5.2

> Refactor HoodieWriteClient so that commit logic can be shareable by both 
> bootstrap and normal write operations
> --
>
> Key: HUDI-417
> URL: https://issues.apache.org/jira/browse/HUDI-417
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
>  
> Basic code changes are present in the fork: 
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap]
>  
> The current implementation of HoodieBootstrapClient has duplicate code for 
> committing bootstrap. 
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-client/src/main/java/org/apache/hudi/bootstrap/HoodieBootstrapClient.java]
>  
>  
> We can have an independent PR that moves this commit functionality 
> from HoodieWriteClient to a new base class, AbstractHoodieWriteClient, which 
> HoodieBootstrapClient can inherit.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-502) Provide a custom time zone definition for TimestampBasedKeyGenerator

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-502:
---
Fix Version/s: (was: 0.6.0)
   0.5.2

> Provide a custom time zone definition for TimestampBasedKeyGenerator
> 
>
> Key: HUDI-502
> URL: https://issues.apache.org/jira/browse/HUDI-502
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: zhang peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TimestampBasedKeyGenerator's default TimeZone is GMT, but that does not cover 
> every case. For example, I push Chinese time zone (GMT+8:00) style timestamp 
> data to Hudi, but the key generator creates GMT time zone data, which is 
> wrong; hence I am submitting this PR.
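
To illustrate the mismatch with plain JDK code (not Hudi code; the epoch value is arbitrary), the same instant formats eight hours apart depending on the formatter's zone:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class TimeZoneMismatchDemo {
  public static void main(String[] args) {
    long epochMs = 1576800000000L; // 2019-12-20T00:00:00Z, arbitrary fixed instant
    SimpleDateFormat gmtFmt = new SimpleDateFormat("yyyy/MM/dd HH:mm");
    gmtFmt.setTimeZone(TimeZone.getTimeZone("GMT"));
    SimpleDateFormat gmt8Fmt = new SimpleDateFormat("yyyy/MM/dd HH:mm");
    gmt8Fmt.setTimeZone(TimeZone.getTimeZone("GMT+8:00"));
    // The two outputs differ by eight hours, so a key generator pinned to GMT
    // puts GMT+8:00 data under the wrong partition path.
    System.out.println("GMT   : " + gmtFmt.format(new Date(epochMs))); // 2019/12/20 00:00
    System.out.println("GMT+8 : " + gmt8Fmt.format(new Date(epochMs))); // 2019/12/20 08:00
  }
}
```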



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-502) Provide a custom time zone definition for TimestampBasedKeyGenerator

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi closed HUDI-502.
--

> Provide a custom time zone definition for TimestampBasedKeyGenerator
> 
>
> Key: HUDI-502
> URL: https://issues.apache.org/jira/browse/HUDI-502
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: zhang peng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TimestampBasedKeyGenerator's default TimeZone is GMT, but that does not cover 
> every case. For example, I push Chinese time zone (GMT+8:00) style timestamp 
> data to Hudi, but the key generator creates GMT time zone data, which is 
> wrong; hence I am submitting this PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-580) Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh

2020-02-18 Thread Suneel Marthi (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039161#comment-17039161
 ] 

Suneel Marthi commented on HUDI-580:


The license header looks OK to me - I would check with Justin once again and 
mark this as Resolved.

> Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh
> ---
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Major
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> We should get this fixed before the next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] wangxianghu opened a new pull request #1340: [MINOR] Add javadoc to SchedulerConfGenerator and Code clean

2020-02-18 Thread GitBox
wangxianghu opened a new pull request #1340: [MINOR] Add javadoc to 
SchedulerConfGenerator and Code clean
URL: https://github.com/apache/incubator-hudi/pull/1340
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Add javadoc to SchedulerConfGenerator and Code clean*
   
   ## Brief change log
   
   *Add javadoc to SchedulerConfGenerator and Code clean*
 
   ## Verify this pull request
   
   This pull request is already covered by *TestSchedulerConfGenerator*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-615) Add Test cases for StringUtils.

2020-02-18 Thread Suneel Marthi (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated HUDI-615:
---
Status: Closed  (was: Patch Available)

> Add Test cases for StringUtils.
> ---
>
> Key: HUDI-615
> URL: https://issues.apache.org/jira/browse/HUDI-615
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Major
>  Labels: pull-request-available, test
> Fix For: 0.5.2
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Presently no tests exist for org.apache.hudi.common.util.StringUtils - add 
> tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1339: [HUDI-617] Add support for types implementing CharSequence

2020-02-18 Thread GitBox
pratyakshsharma commented on issue #1339: [HUDI-617] Add support for types 
implementing CharSequence
URL: https://github.com/apache/incubator-hudi/pull/1339#issuecomment-587413861
 
 
   LGTM


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Labels: easyfix pull-requests-available  (was: easyfix 
pull-request-available)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-requests-available
> Attachments: test_data.json, test_schema.avsc
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-617:

Labels: easyfix pull-request-available  (was: easyfix)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-request-available
> Attachments: test_data.json, test_schema.avsc
>
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] amitsingh-10 opened a new pull request #1339: [HUDI-617] Add support for types implementing CharSequence

2020-02-18 Thread GitBox
amitsingh-10 opened a new pull request #1339: [HUDI-617] Add support for types 
implementing CharSequence
URL: https://github.com/apache/incubator-hudi/pull/1339
 
 
   ## What is the purpose of the pull request
   Data types implementing CharSequence expose a #toString method which
   provides an easy way to convert them to String. For example,
   org.apache.avro.util.Utf8 is easily convertible into String via
   toString(). It is better to make the support more generic so a wider
   range of data types can serve as the partitionKey; see the sketch below.
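
   A minimal sketch of the idea (hypothetical method and class names; the actual change lives in TimestampBasedKeyGenerator):

   ```java
   public class PartitionValNormalizer {
     // Hypothetical sketch: widen the String-only handling to any CharSequence.
     static Object normalize(Object partitionVal) {
       // org.apache.avro.util.Utf8 (and any other CharSequence) converts cleanly
       // via toString(), so it no longer trips the unsupported-type check.
       if (partitionVal instanceof CharSequence) {
         return partitionVal.toString();
       }
       return partitionVal; // Double, Long, Float keep their existing branches
     }
   }
   ```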
   
   ## Brief change log
 - *Modified TimestampBasedKeyGenerator to support data types convertible 
to String*
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as 
*TestTimestampBasedKeyGenerator.java*.
   
   ## Committer checklist
   
   - [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add support for ingesting multiple kafka streams in a single DeltaStreamer deployment

2020-02-18 Thread GitBox
pratyakshsharma commented on a change in pull request #1150: [HUDI-288]: Add 
support for ingesting multiple kafka streams in a single DeltaStreamer 
deployment
URL: https://github.com/apache/incubator-hudi/pull/1150#discussion_r380555306
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/TableConfig.java
 ##
 @@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
+import com.fasterxml.jackson.annotation.JsonProperty;
+
+import java.util.Objects;
+
+/*
+Represents object with all the topic level overrides for multi table delta 
streamer execution
+ */
+@JsonIgnoreProperties(ignoreUnknown = true)
 
 Review comment:
   @bvaradar If we assume TableConfig is coming from DFSProperties (I am doing 
that change anyway :) ), then what extra benefit do we get by decoupling 
source and target configs? Implementation-wise, I will be merging the source 
and target configs into a single TypedProperties instance after reading them 
separately, since everywhere we pass only a single TypedProperties instance 
for reading the relevant configs (see the sketch below).
   
   If TableConfig is read as a DFSProperties instance (key-value pairs), then 
non-Kafka sources are handled automatically. I understand that maintaining 
source and target configs separately is cleaner, but it will be extra overhead 
for users to maintain two separate files.
   
   Please help me understand.
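
   A hedged sketch of the merge described above (the TypedProperties package is assumed here; since it extends java.util.Properties, putAll is enough):

```java
import org.apache.hudi.common.util.TypedProperties;

import java.util.Properties;

public class TableConfigMerger {
  // Merge separately-read source and target configs into the single
  // TypedProperties instance the rest of the DeltaStreamer code expects.
  static TypedProperties merge(Properties sourceProps, Properties targetProps) {
    TypedProperties merged = new TypedProperties();
    merged.putAll(sourceProps); // source-level keys first
    merged.putAll(targetProps); // table-level keys win on conflict
    return merged;
  }
}
```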


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038864#comment-17038864
 ] 

Amit Singh commented on HUDI-617:
-

I have attached the sample data and schema to replicate the issue.

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix
> Attachments: test_data.json, test_schema.avsc
>
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Attachment: test_schema.avsc
test_data.json

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix
> Attachments: test_data.json, test_schema.avsc
>
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Fix Version/s: (was: 0.5.1)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Fix Version/s: 0.5.1

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix
> Fix For: 0.5.1
>
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)
Amit Singh created HUDI-617:
---

 Summary: Add support for data types convertible to String in 
TimestampBasedKeyGenerator
 Key: HUDI-617
 URL: https://issues.apache.org/jira/browse/HUDI-617
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: Amit Singh


Currently, TimestampBasedKeyGenerator only supports four data types for the 
partition key: Double, Long, Float, and String. However, if `avro.java.string` 
is not specified in the schema provided, Hudi throws the 
error:
org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
partition field: org.apache.avro.util.Utf8
at 
org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
...
It would be better to generalise the support to include data types that 
provide a way to convert them to String, such as `Utf8`, since all such 
types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-18 Thread Amit Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Singh updated HUDI-617:

Description: 
Currently, TimestampBasedKeyGenerator only supports four data types for the 
partition key: Double, Long, Float, and String. However, if `avro.java.string` 
is not specified in the schema provided, Hudi throws the 
following error:


 org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
partition field: org.apache.avro.util.Utf8
 at 
org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
 at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)



 It would be better to generalise the support to include data types that 
provide a way to convert them to String, such as `Utf8`, since all such 
types implement the `CharSequence` interface.

  was:
Currently, TimestampBasedKeyGenerator only supports four data types for the 
partition key: Double, Long, Float, and String. However, if `avro.java.string` 
is not specified in the schema provided, Hudi throws the 
error:
org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
partition field: org.apache.avro.util.Utf8
at 
org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
...
It would be better to generalise the support to include data types that 
provide a way to convert them to String, such as `Utf8`, since all such 
types implement the `CharSequence` interface.


> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix
>
> Currently, TimestampBasedKeyGenerator only supports four data types for the 
> partition key: Double, Long, Float, and String. However, if `avro.java.string` 
> is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better to generalise the support to include data types that 
> provide a way to convert them to String, such as `Utf8`, since all such 
> types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)