[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-02-24 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-590725355
 
 
   @n3nash Can you verify all the test cases locally? I have verified 
them and everything is OK. Travis cannot finish the full verification.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383695243
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/CleanHelper.java
 ##
 @@ -59,17 +59,17 @@
  * 
  * TODO: Should all cleaning be done based on {@link HoodieCommitMetadata}
  */
-public class CleanExecutor> implements Serializable {
+public class CleanHelper> implements Serializable {
 
 Review comment:
   My personal feeling is that `XXXHelper` usually implies a utility class that 
contains some static helper methods. I would prefer the old name.




[GitHub] [incubator-hudi] satishkotha commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-24 Thread GitBox
satishkotha commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r383690354
 
 

 ##
 File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
 ##
 @@ -145,13 +148,16 @@ private String printCommitsWithMetadata(HoodieDefaultTimeline timeline,
 .addTableHeaderField("Total Rollback Blocks").addTableHeaderField("Total Log Records")
 .addTableHeaderField("Total Updated Records Compacted").addTableHeaderField("Total Write Bytes");
 
-return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, descending, limit, headerOnly, rows);
+return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, descending,
+limit, headerOnly, rows, tempTableName);
   }
 
   @CliCommand(value = "commits show", help = "Show the commits")
   public String showCommits(
   @CliOption(key = {"includeExtraMetadata"}, help = "Include extra metadata",
   unspecifiedDefaultValue = "false") final boolean includeExtraMetadata,
+  @CliOption(key = {"exportToTableName"}, mandatory = false, help = "hive table name to export",
 
 Review comment:
   Changed names to 'view' instead of 'table'. The initial idea was to create actual 
tables with all the metadata and register them; that's where the name came from. 
But I think exporting to an external table requires a lot more work with the way 
the CLI is set up, so I'm just doing views in the CLI for now. I can work on a 
tool outside the CLI to export metadata to another table.




[jira] [Updated] (HUDI-633) archival fails with large clean files

2020-02-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-633:

Labels: pull-request-available  (was: )

> archival fails with large clean files
> -
>
> Key: HUDI-633
> URL: https://issues.apache.org/jira/browse/HUDI-633
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: satish
>Assignee: satish
>Priority: Major
>  Labels: pull-request-available
>
> Caused by: java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:3236)
> at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
> at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
> at org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
> at org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
> at org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
> at org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
> at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:55)
> at org.apache.avro.io.Encoder.writeString(Encoder.java:121)
> at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213)
> at org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76)
> at org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
> at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> at org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:180)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
> at org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
> at org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
> at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)
> at com.uber.hoodie.common.table.log.block.HoodieAvroDataBlock.getContentBytes(HoodieAvroDataBlock.java:124)
> at com.uber.hoodie.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:126)
> at com.uber.hoodie.io.HoodieCommitArchiveLog.writeToFile(HoodieCommitArchiveLog.java:267)
> at com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:249)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] satishkotha opened a new pull request #1355: [HUDI-633] limit archive file block size by number of bytes

2020-02-24 Thread GitBox
satishkotha opened a new pull request #1355: [HUDI-633] limit archive file 
block size by number of bytes
URL: https://github.com/apache/incubator-hudi/pull/1355
 
 
   ## What is the purpose of the pull request
   
   With large clean files, the archival process results in an OOM. See HUDI-633. 
Limit the archive file block size by number of bytes.
   
   ## Brief change log
   
   - Add an option to limit the archival batch size by the number of bytes in a 
block, in addition to the maximum number of records allowed in a batch.
   - This does not prevent an OOM if a single record is larger than the JVM heap size.
   - Note that in the worst case, a single instant's details can take up an entire 
block. This likely has higher metadata overhead, but I think the marginal increase 
in storage is acceptable for metadata. 
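   The byte-plus-record limiting described in the change log can be sketched in plain Java. This is a hedged illustration under assumed names (`ByteLimitedBatcher`, `batch`), not the actual `HoodieCommitArchiveLog` implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: cap each archive block by a byte budget in addition
// to a maximum record count, so one oversized batch cannot be buffered whole.
public class ByteLimitedBatcher {
  static List<List<byte[]>> batch(List<byte[]> records, int maxRecords, long maxBytes) {
    List<List<byte[]>> blocks = new ArrayList<>();
    List<byte[]> current = new ArrayList<>();
    long currentBytes = 0;
    for (byte[] r : records) {
      // start a new block once either limit would be exceeded;
      // a single record larger than maxBytes still forms its own block
      if (!current.isEmpty() && (current.size() >= maxRecords || currentBytes + r.length > maxBytes)) {
        blocks.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(r);
      currentBytes += r.length;
    }
    if (!current.isEmpty()) {
      blocks.add(current);
    }
    return blocks;
  }

  public static void main(String[] args) {
    List<byte[]> records = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      records.add(new byte[10]); // 10 records of 10 bytes each
    }
    // a 25-byte budget fits 2 records per block -> 5 blocks
    System.out.println(batch(records, 100, 25).size()); // prints 5
  }
}
```

   As the PR notes, this only bounds how many records are buffered together; it cannot prevent an OOM when a single record exceeds the heap.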
   
   ## Verify this pull request
   
   This change adds tests and can be verified as follows:
   - Run TestHoodieCommitArchiveLog#testArchiveTableWithLargeCleanFiles
   - Verified that a large clean file that previously caused the OOM can be archived 
with the specified config.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[incubator-hudi] branch master updated: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo (#1352)

2020-02-24 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 83c8ad5  [HUDI-625] Fixing performance issues around DiskBasedMap & 
kryo (#1352)
83c8ad5 is described below

commit 83c8ad5a38e1b561d501d0dcbcbd39fb02638054
Author: lamber-ken 
AuthorDate: Tue Feb 25 14:40:37 2020 +0800

[HUDI-625] Fixing performance issues around DiskBasedMap & kryo (#1352)
---
 .../hudi/common/util/SerializationUtils.java   | 45 ++
 1 file changed, 3 insertions(+), 42 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java b/hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
index 9096080..9d075bb 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
@@ -21,18 +21,13 @@ package org.apache.hudi.common.util;
 import org.apache.hudi.exception.HoodieSerializationException;
 
 import com.esotericsoftware.kryo.Kryo;
-import com.esotericsoftware.kryo.Serializer;
 import com.esotericsoftware.kryo.io.Input;
 import com.esotericsoftware.kryo.io.Output;
-import com.esotericsoftware.kryo.serializers.FieldSerializer;
-import com.esotericsoftware.reflectasm.ConstructorAccess;
-import org.objenesis.instantiator.ObjectInstantiator;
+import org.objenesis.strategy.StdInstantiatorStrategy;
 
 import java.io.ByteArrayOutputStream;
 import java.io.IOException;
 import java.io.Serializable;
-import java.lang.reflect.Constructor;
-import java.lang.reflect.InvocationTargetException;
 
 /**
  * {@link SerializationUtils} class internally uses {@link Kryo} serializer for serializing / deserializing objects.
@@ -121,50 +116,16 @@ public class SerializationUtils {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works out.
-  kryo.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-    final Serializer serializer = super.newDefaultSerializer(type);
-    if (serializer instanceof FieldSerializer) {
-      final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-      fieldSerializer.setIgnoreSyntheticFields(true);
-    }
-    return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-    return () -> {
-      // First try reflectasm - it is fastest way to instantiate an object.
-      try {
-        final ConstructorAccess access = ConstructorAccess.get(type);
-        return access.newInstance();
-      } catch (Throwable t) {
-        // ignore this exception. We may want to try other way.
-      }
-      // fall back to java based instantiation.
-      try {
-        final Constructor constructor = type.getConstructor();
-        constructor.setAccessible(true);
-        return constructor.newInstance();
-      } catch (NoSuchMethodException | IllegalAccessException | InstantiationException
-          | InvocationTargetException e) {
-        // ignore this exception. we will fall back to default instantiation strategy.
-      }
-      return super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
-    };
-  }
-}
   }
 }



[GitHub] [incubator-hudi] vinothchandar merged pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar merged pull request #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352
 
 
   




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383682876
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works out.
-  kryo.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
 
 Review comment:
    




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken edited a comment on issue #1352: [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590702201
 
 
   > > > The (400 entries) upsert time improved from ~hours to ~5min.
   > 
   > @lamber-ken is this for each upsert or the entire sequence of 1 insert + 2 
upserts? I ask because I got ~2 minutes on my mac with the custom serializers
   
   The last upsert operation ranges from 2 min to 5 min; I reported the maximum cost time here.
   
   step1 insert fast
   step2 upsert 30 (1 ~ 3min)
   step3 upsert 40 (2 ~ 5min)




[GitHub] [incubator-hudi] lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590706623
 
 
   > classes extending BaseAvroPayload may be okay. but we allow users to 
implement their own payload classes.. if they don't have data converted as 
byte[], do we need to register them etc?
   
   It's the right way to register the class with kryo, but if the custom payload 
doesn't have a no-arg constructor, it will not work.
   
   Also, we already set the instantiator strategy; kryo's built-in serializers 
have high performance.
   ```
   kryo.setInstantiatorStrategy(new DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
   ```
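   The no-arg-constructor concern above can be demonstrated with JDK reflection alone. A minimal sketch; `Payload` and `hasNoArgConstructor` are hypothetical names, not Hudi or kryo code:

```java
// Shows why kryo needs a fallback instantiator strategy: reflection-based
// construction requires a no-arg constructor, which a user-defined payload
// class may not declare.
public class InstantiationDemo {
  static class Payload {
    final byte[] data;
    Payload(byte[] data) { this.data = data; } // only an args constructor
  }

  static boolean hasNoArgConstructor(Class<?> type) {
    try {
      type.getDeclaredConstructor(); // throws if no no-arg constructor exists
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(hasNoArgConstructor(Payload.class)); // prints false
    System.out.println(hasNoArgConstructor(String.class));  // prints true
  }
}
```

   `StdInstantiatorStrategy` (Objenesis) sidesteps this by allocating objects without invoking any constructor, which is why it is configured as the fallback.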
   
   




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1347: [HUDI-627] Aggregate 
code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383681014
 
 

 ##
 File path: pom.xml
 ##
 @@ -51,6 +51,7 @@
 packaging/hudi-timeline-server-bundle
 docker/hoodie/hadoop
 hudi-integ-test
+hudi-test-coverage-aggregator
 
 Review comment:
   is there a way to do this without adding this module?




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383679017
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works out.
-  kryo.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
 
 Review comment:
   Very right.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1353: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1353: [HUDI-632] Update 
documentation (docker_demo) to mention both commit and deltacommit files
URL: https://github.com/apache/incubator-hudi/pull/1353#discussion_r383679029
 
 

 ##
 File path: content/docs/docker_demo.html
 ##
 @@ -543,7 +543,7 @@ Step 2: Incrementally
 You can use HDFS web-browser to look at the tables
 http://namenode:50070/explorer.html#/user/hive/warehouse/stock_ticks_cow.
 
-You can explore the new partition folder created in the table along with a “deltacommit”
+You can explore the new partition folder created in the table along with a “commit”/“deltacommit”
 
 Review comment:
   @vikrantgoel can you please change the .md file? The HTML is generated off that. 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383677795
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works out.
-  kryo.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-    final Serializer serializer = super.newDefaultSerializer(type);
-    if (serializer instanceof FieldSerializer) {
-      final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-      fieldSerializer.setIgnoreSyntheticFields(true);
-    }
-    return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-    return () -> {
-      // First try reflectasm - it is fastest way to instantiate an object.
-      try {
-        final ConstructorAccess access = ConstructorAccess.get(type);
-        return access.newInstance();
-      } catch (Throwable t) {
-        // ignore this exception. We may want to try other way.
-      }
-      // fall back to java based instantiation.
-      try {
-        final Constructor constructor = type.getConstructor();
-        constructor.setAccessible(true);
-        return constructor.newInstance();
-      } catch (NoSuchMethodException | IllegalAccessException | InstantiationException
-          | InvocationTargetException e) {
-        // ignore this exception. we will fall back to default instantiation strategy.
-      }
-      return super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
 
 Review comment:
   @lamber-ken to Gary's point, we also need to consider this case:
   
   >What about the user-defined custom payload? Do we need to register it somewhere?
   
   Classes extending BaseAvroPayload may be okay, but we allow users to implement 
their own payload classes. If they don't have data converted as `byte[]`, do we 
need to register them, etc.? 
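   The fallback order in the deleted `newInstantiator` (reflectasm, then a plain constructor, then the configured strategy) can be approximated with just the JDK. A hedged sketch; the class names here are illustrative, not Hudi's:

```java
import java.lang.reflect.Constructor;

// Tries a declared no-arg constructor (made accessible), and signals the
// caller when a constructor-less strategy such as Objenesis would be needed.
public class FallbackInstantiator {
  static Object instantiate(Class<?> type) {
    try {
      Constructor<?> ctor = type.getDeclaredConstructor();
      ctor.setAccessible(true);
      return ctor.newInstance();
    } catch (ReflectiveOperationException e) {
      // kryo's StdInstantiatorStrategy would handle this case by
      // allocating the object without calling any constructor
      return null;
    }
  }

  static class WithDefault { private WithDefault() {} }
  static class WithoutDefault { WithoutDefault(int x) {} }

  public static void main(String[] args) {
    System.out.println(instantiate(WithDefault.class) != null);    // prints true
    System.out.println(instantiate(WithoutDefault.class) != null); // prints false
  }
}
```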




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383678384
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works out.
-  kryo.setInstantiatorStrategy(new org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
 
 Review comment:
   I guess this achieves the same purpose as the code in `newInstantiator` that 
you deleted below? 




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken edited a comment on issue #1352: [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590702201
 
 
   > > > The (400 entries) upsert time improved from ~hours to ~5min.
   > 
   > @lamber-ken is this for each upsert or the entire sequence of 1 insert + 2 
upserts? I ask because I got ~2 minutes on my mac with the custom serializers
   
   The last upsert operation ranges from 2 min to 5 min; I reported the maximum cost time here.
   
   step1 insert fast
   step2 upsert 3 (1 ~ 3min)
   step3 upsert 4 (2 ~ 5min)




[GitHub] [incubator-hudi] lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590702201
 
 
   > > > The (400 entries) upsert time improved from ~hours to ~5min.
   > 
   > @lamber-ken is this for each upsert or the entire sequence of 1 insert + 2 
upserts? I ask because I got ~2 minutes on my mac with the custom serializers
   
   Just one upsert operation, ranging from 2 min to 5 min; I reported the maximum cost time here.




[GitHub] [incubator-hudi] vinothchandar commented on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar commented on issue #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590700972
 
 
   >>The (400 entries) upsert time improved from ~hours to ~5min.
   
   @lamber-ken is this for each upsert or the entire sequence of 1 insert + 2 
upserts? I ask because I got ~2 minutes on my mac with the custom serializers 




[jira] [Updated] (HUDI-581) NOTICE need more work as it missing content form included 3rd party ALv2 licensed NOTICE files

2020-02-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-581:

Labels: pull-request-available  (was: )

> NOTICE need more work as it missing content form included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.





[GitHub] [incubator-hudi] smarthi opened a new pull request #1354: [WIP[HUDI-581] NOTICE need more work as it missing content form included 3rd party ALv2 licensed NOTICE files

2020-02-24 Thread GitBox
smarthi opened a new pull request #1354: [WIP[HUDI-581] NOTICE need more work 
as it missing content form included 3rd party ALv2 licensed NOTICE files
URL: https://github.com/apache/incubator-hudi/pull/1354
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Need to fix NOTICE file, else Justin's gonna smack ya.
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Created] (HUDI-634) Document breaking changes for 0.5.2 release

2020-02-24 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-634:
---

 Summary: Document breaking changes for 0.5.2 release
 Key: HUDI-634
 URL: https://issues.apache.org/jira/browse/HUDI-634
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Release  Administrative
Reporter: Vinoth Chandar
 Fix For: 0.5.2


* Write Client restructuring has moved classes around (HUDI-554) 





[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup 
package structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383637095
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java
 ##
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.client;
 
+import org.apache.hudi.common.HoodieClientTestHarness;
+import org.apache.hudi.WriteStatus;
 
 Review comment:
   >It is not used anywhere. 
   
   RDD-level APIs have return types like this, and projects like Uber's marmaray 
and Hudi's DeltaStreamer do use them. Since you feel strongly, let me move it. :) 




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383633637
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java
 ##
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.client;
 
+import org.apache.hudi.common.HoodieClientTestHarness;
+import org.apache.hudi.WriteStatus;
 
 Review comment:
   Actually, I agree that a context class like `SparkContext` can be placed under the top-level package. However, IMO, `WriteStatus` is not as important as `SparkContext`; it is not used anywhere. Hudi supports `write` and `read` operations, and we should keep them as a matching pair. 




[jira] [Commented] (HUDI-580) Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh

2020-02-24 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044067#comment-17044067
 ] 

vinoyang commented on HUDI-580:
---

[~lamber-ken] I have marked this issue as a blocker.  We need to fix it ASAP.

> Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh
> ---
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: compliance
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.





[jira] [Updated] (HUDI-580) Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh

2020-02-24 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-580:
--
Priority: Blocker  (was: Major)

> Incorrect license header in docker/hoodie/hadoop/base/entrypoint.sh
> ---
>
> Key: HUDI-580
> URL: https://issues.apache.org/jira/browse/HUDI-580
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie
>Reporter: leesf
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: compliance
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> Would get it fixed before next release.





[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590634434
 
 
   > response -> the StdInstantiatorStrategy will allow kryo to fall back to 
Java Serde, is that what we want ?
   
   Hi @n3nash, by default an instantiator that uses reflection is returned if the class has a zero-argument constructor; otherwise, an exception is thrown. If a `setInstantiatorStrategy(InstantiatorStrategy)` is set, it will be used instead of throwing the exception.
   
   Moreover, the key point is that the previous KryoBase in `org.apache.hudi.common.util.SerializationUtils` was used incorrectly.
   
   The fallback `super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();` will be invoked on every call:
   
   
![image](https://user-images.githubusercontent.com/20113411/75208580-1ae3f480-57b7-11ea-8f1e-3439f369d92b.png)
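   Kryo's default instantiation behavior described here (reflection when a zero-argument constructor exists, a fallback strategy such as `StdInstantiatorStrategy` otherwise) can be sketched with plain JDK reflection. This is an illustrative sketch, not Hudi or Kryo code; `newInstanceOrFallback` and `NoDefaultCtor` are made-up names:

```java
import java.lang.reflect.Constructor;
import java.util.function.Supplier;

public class InstantiatorSketch {

    // Mimics Kryo's default behavior: use the zero-arg constructor if present,
    // otherwise delegate to a fallback strategy (like StdInstantiatorStrategy).
    static <T> T newInstanceOrFallback(Class<T> type, Supplier<T> fallback) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor(); // zero-arg lookup
            ctor.setAccessible(true);
            return ctor.newInstance();
        } catch (NoSuchMethodException e) {
            // No zero-arg constructor: fall back instead of throwing.
            return fallback.get();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    static class NoDefaultCtor {
        final int value;
        NoDefaultCtor(int value) { this.value = value; }
    }

    public static void main(String[] args) {
        // String has a zero-arg constructor -> reflection path.
        String s = newInstanceOrFallback(String.class, () -> "fallback");
        // NoDefaultCtor does not -> fallback path.
        NoDefaultCtor n = newInstanceOrFallback(NoDefaultCtor.class, () -> new NoDefaultCtor(42));
        System.out.println(s.isEmpty() + " " + n.value); // -> true 42
    }
}
```

   Kryo's `DefaultInstantiatorStrategy` wraps roughly this kind of resolution, producing one instantiator per registered type rather than redoing the lookup on every call.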
   
   
   




[jira] [Created] (HUDI-633) archival fails with large clean files

2020-02-24 Thread satish (Jira)
satish created HUDI-633:
---

 Summary: archival fails with large clean files
 Key: HUDI-633
 URL: https://issues.apache.org/jira/browse/HUDI-633
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: satish
Assignee: satish


Caused by: java.lang.OutOfMemoryError: Java heap space

at java.util.Arrays.copyOf(Arrays.java:3236)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at 
org.apache.avro.io.BufferedBinaryEncoder$OutputStreamSink.innerWrite(BufferedBinaryEncoder.java:216)
at 
org.apache.avro.io.BufferedBinaryEncoder.flushBuffer(BufferedBinaryEncoder.java:93)
at 
org.apache.avro.io.BufferedBinaryEncoder.ensureBounds(BufferedBinaryEncoder.java:108)
at 
org.apache.avro.io.BufferedBinaryEncoder.writeFixed(BufferedBinaryEncoder.java:153)
at org.apache.avro.io.BinaryEncoder.writeString(BinaryEncoder.java:55)
at org.apache.avro.io.Encoder.writeString(Encoder.java:121)
at 
org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:213)
at 
org.apache.avro.generic.GenericDatumWriter.writeString(GenericDatumWriter.java:208)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:76)
at 
org.apache.avro.generic.GenericDatumWriter.writeArray(GenericDatumWriter.java:138)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:68)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
at 
org.apache.avro.generic.GenericDatumWriter.writeMap(GenericDatumWriter.java:180)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:69)
at 
org.apache.avro.generic.GenericDatumWriter.writeField(GenericDatumWriter.java:114)
at 
org.apache.avro.generic.GenericDatumWriter.writeRecord(GenericDatumWriter.java:104)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:66)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:58)

at 
com.uber.hoodie.common.table.log.block.HoodieAvroDataBlock.getContentBytes(HoodieAvroDataBlock.java:124)
at 
com.uber.hoodie.common.table.log.HoodieLogFormatWriter.appendBlock(HoodieLogFormatWriter.java:126)
at 
com.uber.hoodie.io.HoodieCommitArchiveLog.writeToFile(HoodieCommitArchiveLog.java:267)
at 
com.uber.hoodie.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:249)
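The `ByteArrayOutputStream.grow` frames at the top of the trace point at the failure mode: the archive log serializes the whole Avro data block into an in-memory buffer whose backing array roughly doubles on overflow, so a large clean plan can demand a contiguous heap array far larger than the data itself. A simplified, illustrative model of that doubling (the real `ByteArrayOutputStream` growth also jumps directly to the requested size for a single large write):

```java
public class GrowthSketch {
    // Returns the final capacity of a ByteArrayOutputStream-style buffer
    // (doubling on overflow) after `total` bytes have been written.
    static int finalCapacity(int initial, int total) {
        int cap = initial;
        while (cap < total) {
            cap = cap * 2; // Arrays.copyOf allocates a new, larger array each time
        }
        return cap;
    }

    public static void main(String[] args) {
        // Writing ~300 MB into a buffer that started at 32 bytes ends up with
        // a ~512 MB array, plus the previous array held live during each copy.
        System.out.println(finalCapacity(32, 300_000_000)); // -> 536870912
    }
}
```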





[GitHub] [incubator-hudi] lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590634434
 
 
   > response -> the StdInstantiatorStrategy will allow kryo to fall back to 
Java Serde, is that what we want ?
   
   Hi @n3nash, by default an instantiator that uses reflection is returned if the class has a zero-argument constructor; otherwise, an exception is thrown. If a `setInstantiatorStrategy(InstantiatorStrategy)` is set, it will be used instead of throwing the exception.




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383610437
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works 
out.
-  kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with 
libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-final Serializer serializer = super.newDefaultSerializer(type);
-if (serializer instanceof FieldSerializer) {
-  final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-  fieldSerializer.setIgnoreSyntheticFields(true);
-}
-return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-return () -> {
-  // First try reflectasm - it is fastest way to instantiate an object.
-  try {
-final ConstructorAccess access = ConstructorAccess.get(type);
-return access.newInstance();
-  } catch (Throwable t) {
-// ignore this exception. We may want to try other way.
-  }
-  // fall back to java based instantiation.
-  try {
-final Constructor constructor = type.getConstructor();
-constructor.setAccessible(true);
-return constructor.newInstance();
-  } catch (NoSuchMethodException | IllegalAccessException | 
InstantiationException
-  | InvocationTargetException e) {
-// ignore this exception. we will fall back to default 
instantiation strategy.
-  }
-  return 
super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
 
 Review comment:
   I get your point; these serializers are not necessary now. 
   1. Custom serializers are needed when there are special requirements for serialization and deserialization. 
   2. We use an `InstantiatorStrategy`, which is also efficient; see kryo's object-creation docs [1]: 
   ```
   kryo.setInstantiatorStrategy(new DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   ```
   3. If we did that now, it would affect many places.
   
   [1] https://github.com/EsotericSoftware/kryo#object-creation
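   The performance point behind this change (resolve the instantiator once per type instead of on every `newInstance()` call) can be illustrated with a plain-JDK sketch that caches the resolved constructor per class; `CachedInstantiator` is a made-up illustration, not Hudi or Kryo code:

```java
import java.lang.reflect.Constructor;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedInstantiator {
    // Resolving a constructor reflectively is the expensive step; caching it
    // per class is what an instantiator-per-type strategy buys you.
    private static final Map<Class<?>, Constructor<?>> CACHE = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    static <T> T create(Class<T> type) {
        Constructor<?> ctor = CACHE.computeIfAbsent(type, t -> {
            try {
                Constructor<?> c = t.getDeclaredConstructor();
                c.setAccessible(true);
                return c;
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException("no zero-arg constructor: " + t, e);
            }
        });
        try {
            return (T) ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder a = create(StringBuilder.class);
        StringBuilder b = create(StringBuilder.class); // cache hit: no re-resolution
        System.out.println(a != b && CACHE.size() == 1); // -> true
    }
}
```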




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383605610
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works 
out.
-  kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with 
libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-final Serializer serializer = super.newDefaultSerializer(type);
-if (serializer instanceof FieldSerializer) {
-  final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-  fieldSerializer.setIgnoreSyntheticFields(true);
-}
-return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-return () -> {
-  // First try reflectasm - it is fastest way to instantiate an object.
-  try {
-final ConstructorAccess access = ConstructorAccess.get(type);
-return access.newInstance();
-  } catch (Throwable t) {
-// ignore this exception. We may want to try other way.
-  }
-  // fall back to java based instantiation.
-  try {
-final Constructor constructor = type.getConstructor();
-constructor.setAccessible(true);
-return constructor.newInstance();
-  } catch (NoSuchMethodException | IllegalAccessException | 
InstantiationException
-  | InvocationTargetException e) {
-// ignore this exception. we will fall back to default 
instantiation strategy.
-  }
-  return 
super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
 
 Review comment:
   What I was curious about is what you mentioned in the ticket:
   ```
   kryo.register(HoodieKey.class, new HoodieKeySerializer());
   kryo.register(GenericData.Record.class, new GenericDataRecordSerializer());
   kryo.register(HoodieRecord.class, new HoodieRecordSerializer());
   kryo.register(HoodieRecordLocationSerializer.class, new 
HoodieRecordLocationSerializer());
   kryo.register(OverwriteWithLatestAvroPayload.class, new 
OverwriteWithLatestPayloadSerializer());
   ```
   Where is this done? 




[GitHub] [incubator-hudi] lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590626006
 
 
   > WOW! @lamber-ken Thanks for investigating this! Your finding make a lot of 
sense to me. This change definitely gonna boost the upsert performance a lot.
   
   Thanks, we are all having fun here : )  




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383601871
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works 
out.
-  kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with 
libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-final Serializer serializer = super.newDefaultSerializer(type);
-if (serializer instanceof FieldSerializer) {
-  final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-  fieldSerializer.setIgnoreSyntheticFields(true);
-}
-return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-return () -> {
-  // First try reflectasm - it is fastest way to instantiate an object.
-  try {
-final ConstructorAccess access = ConstructorAccess.get(type);
-return access.newInstance();
-  } catch (Throwable t) {
-// ignore this exception. We may want to try other way.
-  }
-  // fall back to java based instantiation.
-  try {
-final Constructor constructor = type.getConstructor();
-constructor.setAccessible(true);
-return constructor.newInstance();
-  } catch (NoSuchMethodException | IllegalAccessException | 
InstantiationException
-  | InvocationTargetException e) {
-// ignore this exception. we will fall back to default 
instantiation strategy.
-  }
-  return 
super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
 
 Review comment:
   Hi @garyli1019, Hudi puts `HoodieRecord` into kryo; the payload data inside has already been converted to `byte[]`, see `BaseAvroPayload#BaseAvroPayload`. So registering the payload class wouldn't change anything.
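   The `BaseAvroPayload` approach mentioned here, eagerly serializing the record to `byte[]` at construction so the wrapper is trivially serializable, can be sketched without Avro; `BytesPayload` is a made-up stand-in, not Hudi's class:

```java
import java.nio.charset.StandardCharsets;

public class BytesPayload {
    // Holds only the serialized form, so any framework (Kryo, Java serde)
    // can move it around without knowing how to serialize the original record.
    private final byte[] recordBytes;

    BytesPayload(String record) {
        // Eager conversion at construction time, mirroring BaseAvroPayload's approach.
        this.recordBytes = record.getBytes(StandardCharsets.UTF_8);
    }

    String decode() {
        return new String(recordBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        BytesPayload p = new BytesPayload("uuid-1,2020-02-24");
        System.out.println(p.decode()); // -> uuid-1,2020-02-24
    }
}
```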




[GitHub] [incubator-hudi] lamber-ken commented on issue #1353: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread GitBox
lamber-ken commented on issue #1353: [HUDI-632] Update documentation 
(docker_demo) to mention both commit and deltacommit files
URL: https://github.com/apache/incubator-hudi/pull/1353#issuecomment-590622878
 
 
   Hi @vikrantgoel, the *.html files are generated by Jekyll. The right way is to modify the *.md files:
   
https://github.com/apache/incubator-hudi/blob/asf-site/docs/_docs/0_4_docker_demo.md




[GitHub] [incubator-hudi] lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590618074
 
 
   Thanks @vinothchandar, I have opened a new PR: 
https://github.com/apache/incubator-hudi/pull/1352  : )




[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-24 Thread GitBox
prashantwason commented on a change in pull request #1341: [HUDI-626] Add 
exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#discussion_r383588912
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java
 ##
 @@ -145,13 +148,16 @@ private String 
printCommitsWithMetadata(HoodieDefaultTimeline timeline,
 .addTableHeaderField("Total Rollback 
Blocks").addTableHeaderField("Total Log Records")
 .addTableHeaderField("Total Updated Records 
Compacted").addTableHeaderField("Total Write Bytes");
 
-return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending, limit, headerOnly, rows);
+return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending,
+limit, headerOnly, rows, tempTableName);
   }
 
   @CliCommand(value = "commits show", help = "Show the commits")
   public String showCommits(
   @CliOption(key = {"includeExtraMetadata"}, help = "Include extra 
metadata",
   unspecifiedDefaultValue = "false") final boolean 
includeExtraMetadata,
+  @CliOption(key = {"exportToTableName"}, mandatory = false, help = "hive 
table name to export",
 
 Review comment:
   This is not a real Hive table, so maybe reword this for clarity. Also, "export" implies something persistent, while this is a temporary view. 
   
   Suggested help text: "Name of in-memory view to cache results".




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383579593
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works 
out.
-  kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with 
libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
 
 Review comment:
   The default is `FieldSerializer`, and the description of `setIgnoreSyntheticFields` is "Controls if synthetic fields are serialized. Default is true." So I think this override might be unnecessary. 




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
garyli1019 commented on a change in pull request #1352: [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#discussion_r383579400
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/SerializationUtils.java
 ##
 @@ -121,50 +116,16 @@ Object deserialize(byte[] objectData) {
 
 public Kryo newKryo() {
 
-  Kryo kryo = new KryoBase();
+  Kryo kryo = new Kryo();
   // ensure that kryo doesn't fail if classes are not registered with kryo.
   kryo.setRegistrationRequired(false);
   // This would be used for object initialization if nothing else works 
out.
-  kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
+  kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   // Handle cases where we may have an odd classloader setup like with 
libjars
   // for hadoop
   kryo.setClassLoader(Thread.currentThread().getContextClassLoader());
   return kryo;
 }
 
-private static class KryoBase extends Kryo {
-  @Override
-  protected Serializer newDefaultSerializer(Class type) {
-final Serializer serializer = super.newDefaultSerializer(type);
-if (serializer instanceof FieldSerializer) {
-  final FieldSerializer fieldSerializer = (FieldSerializer) serializer;
-  fieldSerializer.setIgnoreSyntheticFields(true);
-}
-return serializer;
-  }
-
-  @Override
-  protected ObjectInstantiator newInstantiator(Class type) {
-return () -> {
-  // First try reflectasm - it is fastest way to instantiate an object.
-  try {
-final ConstructorAccess access = ConstructorAccess.get(type);
-return access.newInstance();
-  } catch (Throwable t) {
-// ignore this exception. We may want to try other way.
-  }
-  // fall back to java based instantiation.
-  try {
-final Constructor constructor = type.getConstructor();
-constructor.setAccessible(true);
-return constructor.newInstance();
-  } catch (NoSuchMethodException | IllegalAccessException | 
InstantiationException
-  | InvocationTargetException e) {
-// ignore this exception. we will fall back to default 
instantiation strategy.
-  }
-  return 
super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();
 
 Review comment:
   I agree that `newInstance()` seems to create a new serializer instance every time the above conditions are not triggered. I also think Kryo should be faster than the Java default.
   What about user-defined custom payloads? Do we need to register them somewhere? 




[jira] [Updated] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-632:

Labels: pull-request-available  (was: )

> Update documentation (docker_demo) to mention both commit and deltacommit 
> files 
> 
>
> Key: HUDI-632
> URL: https://issues.apache.org/jira/browse/HUDI-632
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Vikrant Goel
>Priority: Minor
>  Labels: pull-request-available
>
> In the demo, we could have commit or deltacommit files created depending on 
> the type of table. Updating it will help avoid potential confusion.
> [https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic]





[GitHub] [incubator-hudi] vikrantgoel opened a new pull request #1353: [HUDI-632] Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread GitBox
vikrantgoel opened a new pull request #1353: [HUDI-632] Update documentation 
(docker_demo) to mention both commit and deltacommit files
URL: https://github.com/apache/incubator-hudi/pull/1353
 
 
   ## What is the purpose of the pull request
   
   JIRA: https://issues.apache.org/jira/browse/HUDI-632
   
   ## Brief change log
 - *Mentioned both commit and deltacommit files in docker_demo*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [x] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread Vikrant Goel (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikrant Goel updated HUDI-632:
--
Status: In Progress  (was: Open)

> Update documentation (docker_demo) to mention both commit and deltacommit 
> files 
> 
>
> Key: HUDI-632
> URL: https://issues.apache.org/jira/browse/HUDI-632
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Vikrant Goel
>Priority: Minor
>
> In the demo, we could have commit or deltacommit files created depending on 
> the type of table. Updating it will help avoid potential confusion.
> [https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic]





[jira] [Created] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread Vikrant Goel (Jira)
Vikrant Goel created HUDI-632:
-

 Summary: Update documentation (docker_demo) to mention both commit 
and deltacommit files 
 Key: HUDI-632
 URL: https://issues.apache.org/jira/browse/HUDI-632
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Docs
Reporter: Vikrant Goel


In the demo, we could have commit or deltacommit files created depending on the 
type of table. Updating it will help avoid potential confusion.

[https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic]





[jira] [Updated] (HUDI-632) Update documentation (docker_demo) to mention both commit and deltacommit files

2020-02-24 Thread Vikrant Goel (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vikrant Goel updated HUDI-632:
--
Status: Open  (was: New)

> Update documentation (docker_demo) to mention both commit and deltacommit 
> files 
> 
>
> Key: HUDI-632
> URL: https://issues.apache.org/jira/browse/HUDI-632
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Vikrant Goel
>Priority: Minor
>
> In the demo, we could have commit or deltacommit files created depending on 
> the type of table. Updating it will help avoid potential confusion.
> [https://hudi.incubator.apache.org/docs/docker_demo.html#step-2-incrementally-ingest-data-from-kafka-topic]





[GitHub] [incubator-hudi] codecov-io edited a comment on issue #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
codecov-io edited a comment on issue #1347: [HUDI-627] Aggregate code coverage 
and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#issuecomment-590583224
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=h1) 
Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@078d482`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `73.33%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1347/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master     #1347     +/-   ##
   =============================================
     Coverage          ?    66.99%
     Complexity        ?        97
   =============================================
     Files             ?       333
     Lines             ?     16234
     Branches          ?      1658
   =============================================
     Hits              ?     10876
     Misses            ?      4621
     Partials          ?       737
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1347/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=)
 | `93.18% <100%> (ø)` | `0 <0> (?)` | |
   | 
[...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1347/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==)
 | `72.58% <66.66%> (ø)` | `0 <0> (?)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=footer).
 Last update 
[078d482...f3c3492](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] codecov-io commented on issue #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
codecov-io commented on issue #1347: [HUDI-627] Aggregate code coverage and 
publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#issuecomment-590583224
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=h1) 
Report
   > :exclamation: No coverage uploaded for pull request base 
(`master@078d482`). [Click here to learn what that 
means](https://docs.codecov.io/docs/error-reference#section-missing-base-commit).
   > The diff coverage is `0%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1347/graphs/tree.svg?width=650=VTTXabwbs2=150=pr)](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##            master    #1347   +/-   ##
   =========================================
     Coverage         ?    0.56%
     Complexity       ?        2
   =========================================
     Files            ?      333
     Lines            ?    16234
     Branches         ?     1658
   =========================================
     Hits             ?       92
     Misses           ?    16139
     Partials         ?        3
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...main/scala/org/apache/hudi/DataSourceOptions.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1347/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvRGF0YVNvdXJjZU9wdGlvbnMuc2NhbGE=)
 | `67.04% <0%> (ø)` | `0 <0> (?)` | |
   | 
[...in/scala/org/apache/hudi/IncrementalRelation.scala](https://codecov.io/gh/apache/incubator-hudi/pull/1347/diff?src=pr=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSW5jcmVtZW50YWxSZWxhdGlvbi5zY2FsYQ==)
 | `0% <0%> (ø)` | `0 <0> (?)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=footer).
 Last update 
[078d482...f3c3492](https://codecov.io/gh/apache/incubator-hudi/pull/1347?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   




[jira] [Commented] (HUDI-614) .hoodie_partition_metadata created for non-partitioned table

2020-02-24 Thread Andrew Wong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043918#comment-17043918
 ] 

Andrew Wong commented on HUDI-614:
--

Update: This problem does not occur in the docker environment. In the docker 
demo env, I was able to create a non-partitioned table in Spark (saved to 
hdfs), use run_sync_tool.sh to sync it to hive, and then query it successfully 
from presto. (It still made the .hoodie_partition_metadata file though).

> .hoodie_partition_metadata created for non-partitioned table
> 
>
> Key: HUDI-614
> URL: https://issues.apache.org/jira/browse/HUDI-614
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Affects Versions: 0.5.0, 0.5.1
>Reporter: Andrew Wong
>Priority: Major
>
> Original issue: [https://github.com/apache/incubator-hudi/issues/1329]
> I made a non-partitioned Hudi table using Spark. I was able to query it with 
> Spark & Hive, but when I tried querying it with Presto, I received the error 
> {{Could not find partitionDepth in partition metafile}}.
> I attempted this task using emr-5.28.0 in AWS. I tried using the built-in 
> spark-shell with both Amazon's /usr/lib/hudi/hudi-spark-bundle.jar (following 
> [https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/)]
>  and the org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating jar 
> (following [https://hudi.apache.org/docs/quick-start-guide.html]).
> I used NonpartitionedKeyGenerator & NonPartitionedExtractor in my write 
> options, according to 
> [https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIuseDeltaStreamerorSparkDataSourceAPItowritetoaNon-partitionedHudidataset?].
>  You can see my code in the github issue linked above.
> In both cases I see the .hoodie_partition_metadata file was created in the 
> table path in S3. Querying the table worked in spark-shell & hive-cli, but 
> attempting to query the table in presto-cli resulted in the error, "Could not 
> find partitionDepth in partition metafile".
> Please look into the bug or check the documentation. If there is a problem 
> with the EMR install I can contact the AWS team responsible.
> cc: [~bhasudha]





[GitHub] [incubator-hudi] popart commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-24 Thread GitBox
popart commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned 
table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-590578619
 
 
   Update: This problem does not occur in the docker environment. In the docker 
demo env, I was able to create a non-partitioned table in Spark (saved to 
hdfs), use run_sync_tool.sh to sync it to hive, and then query it successfully 
from presto. (It still made the .hoodie_partition_metadata file though).




[GitHub] [incubator-hudi] adamjoneill commented on issue #1325: presto - querying nested object in parquet file created by hudi

2020-02-24 Thread GitBox
adamjoneill commented on issue #1325: presto - querying nested object in 
parquet file created by hudi
URL: https://github.com/apache/incubator-hudi/issues/1325#issuecomment-590553111
 
 
   @bhasudha thanks for the links. I think I'd need to learn how to debug the 
hudi application to get a better understanding of what's happening. Is this 
something you could point me in the direction of?




[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
ramachandranms commented on a change in pull request #1347: [HUDI-627] 
Aggregate code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383506197
 
 

 ##
 File path: scripts/upload_code_coverage.sh
 ##
 @@ -0,0 +1,28 @@
+#!/bin/bash
+
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+mode=$1
+
+if [ "$mode" = "unit" ];
 
 Review comment:
   made changes as you suggested.




[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
ramachandranms commented on a change in pull request #1347: [HUDI-627] 
Aggregate code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383500368
 
 

 ##
 File path: pom.xml
 ##
 @@ -278,8 +276,6 @@
 report
   
   
-
-
${project.build.directory}/coverage-reports/jacoco-ut.exec
 
 Review comment:
   1. This seems to be specific to how sonarqube works, and we don't use 
sonar/sonarqube for aggregation.
   2. [This](https://github.com/jacoco/jacoco/wiki/MavenMultiModule#usage) is 
the recommendation from the official jacoco documentation, which I have followed 
in this PR.
   
   I did try out the sonar documentation on a test project locally and couldn't 
make it work.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1151: [HUDI-476] Add hudi-examples module

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1151: [HUDI-476] Add 
hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r383495270
 
 

 ##
 File path: 
hudi-examples/src/main/java/org/apache/hudi/examples/deltastreamer/HoodieDeltaStreamerDfsSourceExample.java
 ##
 @@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.examples.deltastreamer;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.examples.common.HoodieExampleDataGenerator;
+import org.apache.hudi.examples.common.HoodieExampleSparkUtils;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
+import org.apache.hudi.utilities.sources.JsonDFSSource;
+import org.apache.hudi.utilities.transform.IdentityTransformer;
+
+import com.beust.jcommander.JCommander;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+
+/**
+ * Simple examples of #{@link HoodieDeltaStreamer} from #{@link JsonDFSSource}.
+ *
+ * To run this example, you should
+ *   1. prepare sample data as 
`hudi-examples/src/main/resources/dfs-delta-streamer`
+ *   2. For running in IDE, set VM options `-Dspark.master=local[2]`
+ *   3. For running in shell, using `spark-submit`
+ *
+ * Usage: HoodieDeltaStreamerDfsSourceExample \
+ *--target-base-path /tmp/hoodie/dfsdeltatable \
+ *--table-type MERGE_ON_READ \
+ *--target-table dfsdeltatable
+ *
+ */
+public class HoodieDeltaStreamerDfsSourceExample {
+
+  public static void main(String[] args) throws Exception {
+
+final HoodieDeltaStreamer.Config cfg = defaultDfsStreamerConfig();
 
 Review comment:
   I'll actually let you pick :)




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1151: [HUDI-476] Add hudi-examples module

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1151: [HUDI-476] Add 
hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r383479608
 
 

 ##
 File path: hudi-examples/pom.xml
 ##
 @@ -0,0 +1,206 @@
+
+
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+
+  <parent>
+    <artifactId>hudi</artifactId>
+    <groupId>org.apache.hudi</groupId>
+    <version>0.5.2-SNAPSHOT</version>
+  </parent>
+  <modelVersion>4.0.0</modelVersion>
+
+  <artifactId>hudi-examples</artifactId>
+  <packaging>jar</packaging>
+
+  <properties>
+    <main.basedir>${project.parent.basedir}</main.basedir>
+  </properties>
+
+  <build>
+    <resources>
+      <resource>
+        <directory>src/main/resources</directory>
+      </resource>
+    </resources>
+
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
 
 Review comment:
   This ties back to how we let users run the examples. Another way is to not 
have a fat jar here, but just have a `run_hudi_example.sh` script use the 
spark-bundle/utilities-bundle after ksql is built.
   
   This way, we don't have to also maintain this bundle separately. Users will 
be using the bundles under `packaging` in production anyway. So just reuse them?
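   The wrapper-script idea above could be sketched roughly as below. This is a 
minimal illustration only: the default class name comes from the example under 
review, but the bundle jar path and the way arguments are forwarded are 
assumptions, not the actual repository layout.

   ```shell
   #!/bin/bash
   # Hypothetical sketch of a run_hudi_example.sh wrapper that reuses the
   # utilities bundle under packaging/ instead of building a separate fat jar.
   # The bundle path below is an illustrative assumption.
   build_spark_submit_cmd() {
     example_class=$1
     shift
     bundle_jar=${BUNDLE_JAR:-packaging/hudi-utilities-bundle/target/hudi-utilities-bundle.jar}
     # Forward all remaining arguments to the example's main class
     echo "spark-submit --class ${example_class} ${bundle_jar} $*"
   }

   cmd=$(build_spark_submit_cmd \
     "org.apache.hudi.examples.deltastreamer.HoodieDeltaStreamerDfsSourceExample" \
     --target-base-path /tmp/hoodie/dfsdeltatable \
     --table-type MERGE_ON_READ \
     --target-table dfsdeltatable)
   echo "would run: ${cmd}"   # swap echo for eval to actually launch
   ```

   Building the command string first keeps the wrapper easy to dry-run; the 
same script could dispatch any example class without a per-example jar.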




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1151: [HUDI-476] Add hudi-examples module

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1151: [HUDI-476] Add 
hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#discussion_r383477929
 
 

 ##
 File path: 
hudi-examples/src/main/java/org/apache/hudi/examples/deltastreamer/HoodieDeltaStreamerDfsSourceExample.java
 ##
 @@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.examples.deltastreamer;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.examples.common.HoodieExampleDataGenerator;
+import org.apache.hudi.examples.common.HoodieExampleSparkUtils;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
+import org.apache.hudi.utilities.sources.JsonDFSSource;
+import org.apache.hudi.utilities.transform.IdentityTransformer;
+
+import com.beust.jcommander.JCommander;
+import org.apache.spark.SparkConf;
+import org.apache.spark.api.java.JavaSparkContext;
+
+
+/**
+ * Simple examples of #{@link HoodieDeltaStreamer} from #{@link JsonDFSSource}.
+ *
+ * To run this example, you should
+ *   1. prepare sample data as 
`hudi-examples/src/main/resources/dfs-delta-streamer`
+ *   2. For running in IDE, set VM options `-Dspark.master=local[2]`
+ *   3. For running in shell, using `spark-submit`
+ *
+ * Usage: HoodieDeltaStreamerDfsSourceExample \
 
 Review comment:
   A shell script to run any Example class is a better idea. I was just 
referring to having a command that users can copy-paste into a terminal and 
run.
   
   >>data prep part of the examples themselves and then also provide sane 
defaults for input/output paths
   
   Thoughts on this? 
   




[GitHub] [incubator-hudi] selvarajperiyasamy commented on issue #143: Tracking ticket for folks to be added to slack group

2020-02-24 Thread GitBox
selvarajperiyasamy commented on issue #143: Tracking ticket for folks to be 
added to slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-590527180
 
 
   Please add me selvaraj.periyasamy1...@gmail.com
   




[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate 
code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383480429
 
 

 ##
 File path: scripts/upload_code_coverage.sh
 ##
 @@ -0,0 +1,28 @@
+#!/bin/bash
+
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+mode=$1
+
+if [ "$mode" = "unit" ];
 
 Review comment:
   Integration tests do not generate a jacoco report (yet). This is not 
detected by codecov as an error, so it may be simpler to just add "bash <(curl 
-s https://codecov.io/bash)" directly to the travis.yml. Future changes to the 
integration tests won't require any changes here.
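   The mode switch being discussed could look roughly like the sketch below. 
The actual upload_code_coverage.sh body is truncated in the review excerpt, so 
the report directory and the uploader's `-s` search flag here are assumptions, 
not the PR's real contents.

   ```shell
   #!/bin/bash
   # Hypothetical sketch: pick a report location based on the test mode.
   build_report_flag() {
     mode=$1
     if [ "$mode" = "unit" ]; then
       # unit tests produce jacoco reports, so point the uploader at them
       echo "-s target/coverage-reports"
     else
       # integration tests do not generate jacoco reports (yet): nothing to search
       echo ""
     fi
   }

   flag=$(build_report_flag "$1")
   echo "codecov uploader would be invoked with: ${flag}"
   ```

   When no flag is produced, codecov's bash uploader would simply fall back to 
its default report discovery, which matches the point made above that a missing 
integration report is not treated as an error.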
   




[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate 
code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383483676
 
 

 ##
 File path: pom.xml
 ##
 @@ -278,8 +276,6 @@
 report
   
   
-
-
${project.build.directory}/coverage-reports/jacoco-ut.exec
 
 Review comment:
   As per the following link, you can have JaCoCo append coverage of all 
modules in the same file in the parent module target folder. We can change the 
dataFile to "${project.basedir}/../target/coverage-reports/jacoco-ut.exec".
   Might also have to change the prepare-agent destFile.
   
   
https://community.sonarsource.com/t/unit-tests-across-multi-module-projects/9662/3
   




[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate 
code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383483676
 
 

 ##
 File path: pom.xml
 ##
 @@ -278,8 +276,6 @@
 report
   
   
-
-
${project.build.directory}/coverage-reports/jacoco-ut.exec
 
 Review comment:
   As per the following link, you can have JaCoCo append coverage of all 
modules in the same file in the parent module target folder. We can change the 
dataFile to "${project.basedir}/../target/coverage-reports/jacoco-ut.exec".
   
   
https://community.sonarsource.com/t/unit-tests-across-multi-module-projects/9662/3
   




[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate code coverage and publish to codecov.io during CI

2020-02-24 Thread GitBox
prashantwason commented on a change in pull request #1347: [HUDI-627] Aggregate 
code coverage and publish to codecov.io during CI
URL: https://github.com/apache/incubator-hudi/pull/1347#discussion_r383480429
 
 

 ##
 File path: scripts/upload_code_coverage.sh
 ##
 @@ -0,0 +1,28 @@
+#!/bin/bash
+
+
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing, software
+#  distributed under the License is distributed on an "AS IS" BASIS,
+#  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#  See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+mode=$1
+
+if [ "$mode" = "unit" ];
 
 Review comment:
   Integration tests do not generate a jacoco report (yet). This is not 
detected by codecov as an error. 
   So it may be simpler to just add "bash <(curl -s https://codecov.io/bash)" 
directly to the travis.yml.




[GitHub] [incubator-hudi] garyli1019 commented on issue #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-24 Thread GitBox
garyli1019 commented on issue #1348: HUDI-597 Enable incremental pulling from 
defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#issuecomment-590518177
 
 
   @vinothchandar Thanks for reviewing! I will add an example to the webpage 
after 0.5.2 is released.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1352: [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352#issuecomment-590516727
 
 
   The (400 entries) upsert time improved from ~hours to ~5min.




[GitHub] [incubator-hudi] prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-02-24 Thread GitBox
prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for 
Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-590515904
 
 
   I don't have them yet. I can run any specific Hudi test already committed
   and quickly get the screenshot if that helps.
   
   On Mon, Feb 24, 2020 at 11:39 AM vinoth chandar wrote:
   
   > @prashantwason
   >
   > Wondering if you have some screenshots for the upsert dag.. (I can try
   > running the PR locally if not)
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub, or unsubscribe.
   




[GitHub] [incubator-hudi] vinothchandar merged pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-24 Thread GitBox
vinothchandar merged pull request #1348: HUDI-597 Enable incremental pulling 
from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348
 
 
   




[incubator-hudi] branch master updated (078d482 -> 4e7fcde)

2020-02-24 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 078d482  [HUDI-624]: Split some of the code from PR for HUDI-479 
(#1344)
 add 4e7fcde  [HUDI-597] Enable incremental pulling from defined partitions 
(#1348)

No new revisions were added by this update.

Summary of changes:
 .../main/scala/org/apache/hudi/DataSourceOptions.scala |  9 -
 .../scala/org/apache/hudi/IncrementalRelation.scala| 18 +++---
 hudi-spark/src/test/scala/TestDataSource.scala |  9 +
 3 files changed, 32 insertions(+), 4 deletions(-)



[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1352: [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken opened a new pull request #1352: [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1352
 
 
   ## What is the purpose of the pull request
   
   **ISSUE:** https://github.com/apache/incubator-hudi/issues/1328
   **JIRA:** https://issues.apache.org/jira/browse/HUDI-625
   
   **User report upsert hangs**
   **Analysis**
   Upsert (400 entries)
   ```
   WARN HoodieMergeHandle: 
   Number of entries in MemoryBasedMap => 150875 
   Total size in bytes of MemoryBasedMap => 83886580 
   Number of entries in DiskBasedMap => 3849125 
   Size of file spilled to disk => 1443046132
   ```
   Hang stack trace (DiskBasedMap#get)
   ```
   at org.apache.hudi.common.util.SerializationUtils$KryoInstantiator$KryoBase$$Lambda$265/1458915834.newInstance(Unknown Source)
   at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1139)
   at com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:562)
   at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:538)
   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:731)
   at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
   at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:543)
   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:813)
   at org.apache.hudi.common.util.SerializationUtils$KryoSerializerInstance.deserialize(SerializationUtils.java:112)
   at org.apache.hudi.common.util.SerializationUtils.deserialize(SerializationUtils.java:86)
   at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:217)
   at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:211)
   at org.apache.hudi.common.util.collection.DiskBasedMap.get(DiskBasedMap.java:207)
   at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:173)
   at org.apache.hudi.common.util.collection.ExternalSpillableMap.get(ExternalSpillableMap.java:55)
   at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:280)
   at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:434)
   at org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:424)
   at org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
   at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
   at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor$$Lambda$76/1412692041.call(Unknown Source)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
   ```
   Average time of DiskBasedMap#get
   ```
   $ monitor *DiskBasedMap get -c 12
   
   Affect(class-cnt:1 , method-cnt:4) cost in 221 ms.
   
    timestamp            class         method  total  success  fail  avg-rt(ms)  fail-rate
   ---------------------------------------------------------------------------------------
    2020-02-20 18:13:36  DiskBasedMap  get     5814   5814     0     6.12        0.00%
    2020-02-20 18:13:48  DiskBasedMap  get     9117   9117     0     3.89        0.00%
    2020-02-20 18:14:16  DiskBasedMap  get     8490   8490     0     4.10        0.00%
   
   ```
   
   Call time strace:
   ```
   `---[4.361707ms] org.apache.hudi.common.util.collection.DiskBasedMap:get()
       +---[0.001704ms] java.util.Map:get()
       `---[4.344261ms] org.apache.hudi.common.util.collection.DiskBasedMap:get()
           `---[4.328981ms] org.apache.hudi.common.util.collection.DiskBasedMap:get()
               +---[0.00122ms] org.apache.hudi.common.util.collection.DiskBasedMap:getRandomAccessFile()
               `---[4.313586ms] org.apache.hudi.common.util.collection.DiskBasedMap:get()
                   `---[4.283509ms] org.apache.hudi.common.util.collection.DiskBasedMap:get()
                       +---[0.001169ms] org.apache.hudi.common.util.collection.DiskBasedMap$ValueMetadata:getOffsetOfValue()
                       +---[7.1E-4ms] java.lang.Long:longValue()
                       +---[6.97E-4ms] 
   ```
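The profiled `DiskBasedMap#get` path above bottoms out in `getRandomAccessFile` / `getOffsetOfValue`: only (offset, length) metadata stays in memory, and every lookup seeks into the spill file and deserializes the value. A minimal self-contained sketch of that file-read half (class and method names here are illustrative, not Hudi's actual implementation, and the deserialization step is omitted):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SpillFileReadSketch {
    // Read `length` bytes starting at `offset` from the spill file, mirroring
    // how a disk-backed map materializes one value per lookup.
    static byte[] readValue(RandomAccessFile file, long offset, int length) throws IOException {
        byte[] buf = new byte[length];
        file.seek(offset);
        file.readFully(buf);
        return buf;
    }

    // Append a value to a fresh spill file, remember its (offset, length),
    // then read it back -- the core write/read cycle behind a get() call.
    static String roundTrip(String value) {
        try {
            File spill = File.createTempFile("spill", ".data");
            spill.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(spill, "rw")) {
                byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
                long offset = raf.length();
                raf.seek(offset);
                raf.write(bytes);
                return new String(readValue(raf, offset, bytes.length), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("record-1")); // prints record-1
    }
}
```

The file seek itself is cheap; as the trace shows, most of the ~4 ms per call is spent in the Kryo deserialization that follows the read.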

[GitHub] [incubator-hudi] vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.

2020-02-24 Thread GitBox
vinothchandar commented on issue #1289: [HUDI-92] Provide reasonable names for 
Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-590512271
 
 
   @prashantwason Wondering if you have some screenshots for the upsert DAG.. 
(I can try running the PR locally if not) 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by Date type columns

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1330: [HUDI-607] Fix to 
allow creation/syncing of Hive tables partitioned by Date type columns
URL: https://github.com/apache/incubator-hudi/pull/1330#discussion_r383472176
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -77,6 +80,11 @@ public static Object getNestedFieldVal(GenericRecord 
record, String fieldName, b
 
   // return, if last part of name
   if (i == parts.length - 1) {
+
+if (isLogicalTypeDate(valueNode, part)) {
 
 Review comment:
   Hmmm.. just feel that this method is now doing multiple things, and those 
get tricky over time.. Simple compromise: please pull this conversion code into 
its own helper method that we can invoke from here.. That way, over time, more 
conversions are just added to that helper, while this method just focuses on 
getting a nested value.
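The refactor being suggested — a dedicated conversion helper invoked from the nested-field lookup — could be sketched roughly like this. This is a hedged illustration: the class and method names are hypothetical, not what the PR actually adds; the only assumption it relies on is that Avro's `date` logical type encodes a value as an int counting days since the Unix epoch.

```java
import java.time.LocalDate;

public class DateFieldHelperSketch {
    // Hypothetical helper: Avro's `date` logical type stores an int of days
    // since 1970-01-01, so the conversion is a one-liner with java.time.
    static String epochDaysToDateString(int epochDays) {
        return LocalDate.ofEpochDay(epochDays).toString();
    }

    public static void main(String[] args) {
        // 18316 days after the epoch falls in February 2020.
        System.out.println(epochDaysToDateString(18316));
    }
}
```

Keeping the conversion in its own method means future logical types (timestamp-millis, decimal, ...) each get their own helper, while `getNestedFieldVal` stays focused on traversal.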




[jira] [Assigned] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-24 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-625:
---

Assignee: lamber-ken  (was: Vinoth Chandar)

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stack trace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590509005
 
 
   @lamber-ken Please assign HUDI-625 to yourself, when the Apache issues site 
comes back online.. I will take the `DiskBasedMap` related changes from here in 
a separate JIRA..
   
   @n3nash wdyt?  I think the per-record overhead of HoodieRecord need not be 
persisted to the DiskBasedMap, i.e. just have ``; the others we 
can dynamically add when reading the map.. 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
vinothchandar commented on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590508114
 
 
   @lamber-ken has a smaller fix that probably does not involve explicit 
serializers, but the way we implemented Kryo serialization is definitely 
problematic. 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup 
package structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383467914
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/rollback/RollbackExecutor.java
 ##
 @@ -69,7 +69,7 @@ public RollbackExecutor(HoodieTableMetaClient metaClient, 
HoodieWriteConfig conf
* Performs all rollback actions that we have collected in parallel.
*/
   public List performRollback(JavaSparkContext jsc, 
HoodieInstant instantToRollback,
-  List rollbackRequests) {
+   List 
rollbackRequests) {
 
 Review comment:
   yes.. will revert. don't know how it got changed




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup 
package structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383467693
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestUpdateSchemaEvolution.java
 ##
 @@ -16,10 +16,9 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.func;
+package org.apache.hudi;
 
 Review comment:
   I will move classes out to specific packages.. The test files I left there 
are kind of cross-functional across packages.. But agree, let me make one more 
pass




[GitHub] [incubator-hudi] lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590506849
 
 
   Hi @n3nash, we talked about this issue in 
[HUDI-625](https://issues.apache.org/jira/browse/HUDI-625).




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup 
package structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383467262
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java
 ##
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.client;
 
+import org.apache.hudi.common.HoodieClientTestHarness;
+import org.apache.hudi.WriteStatus;
 
 Review comment:
   Actually, top-level API classes, e.g. SparkContext, can be placed under the 
root package, right? That was my reasoning at least.. This is the object we 
return to the user, so it made sense to leave it at the top level. WDYT 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on issue #1346: [HUDI-554] Cleanup package structure in 
hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#issuecomment-590505752
 
 
   Will create a new ticket to accumulate the breaking changes for next 
releases.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
vinothchandar commented on a change in pull request #1346: [HUDI-554] Cleanup 
package structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383465753
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/CleanExecutor.java
 ##
 @@ -60,17 +59,17 @@
  * 
  * TODO: Should all cleaning be done based on {@link HoodieCommitMetadata}
  */
-public class HoodieCleanHelper> implements 
Serializable {
+public class CleanExecutor> implements 
Serializable {
 
 Review comment:
   Agree.. Renamed this to stay consistent with `RollbackExecutor`.. but maybe 
we should rename that to `RollbackHelper`. Will take a pass and rename both 
consistently.




[GitHub] [incubator-hudi] n3nash edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
n3nash edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590504340
 
 
   @vinothchandar this is interesting, why does Kryo need serializer 
implementations ? Ohh, just saw @lamber-ken's response -> the 
StdInstantiatorStrategy will allow kryo to fall back to Java Serde, is that 
what we want ?




[GitHub] [incubator-hudi] n3nash edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
n3nash edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590504340
 
 
   @vinothchandar this is interesting, why does Kryo need serializers ? Ohh, 
just saw @lamber-ken's response -> the StdInstantiatorStrategy will allow kryo 
to fall back to Java Serde, is that what we want ?




[GitHub] [incubator-hudi] n3nash commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-24 Thread GitBox
n3nash commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590504340
 
 
   @vinothchandar this is interesting, why does Kryo need serializers? I mean, 
it should be able to do this kind of serializing by itself? For Avro payloads 
of GenericData I understand, but I'm confused about others like HoodieKey etc..




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
n3nash commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383462046
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestUpdateSchemaEvolution.java
 ##
 @@ -16,10 +16,9 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.func;
+package org.apache.hudi;
 
 Review comment:
   +1




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-24 Thread GitBox
n3nash commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383461624
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/CleanExecutor.java
 ##
 @@ -60,17 +59,17 @@
  * 
  * TODO: Should all cleaning be done based on {@link HoodieCommitMetadata}
  */
-public class HoodieCleanHelper> implements 
Serializable {
+public class CleanExecutor> implements 
Serializable {
 
 Review comment:
   This looks like there should be an Executor<> interface if we are going to 
move to *Executor-style naming? Do you have plans to introduce one later?
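The shared contract the question alludes to could be as small as the following. This is entirely hypothetical — no such interface exists in the PR under review; it is only meant to show what pinning down the `*Executor` naming as an actual type might look like (take a list of requests, produce a list of results), with a toy implementation standing in for a real clean/rollback executor.

```java
import java.io.Serializable;
import java.util.List;
import java.util.stream.Collectors;

public class ExecutorNamingSketch {
    // Hypothetical: if Clean/Rollback both adopt the *Executor naming, a common
    // interface would make the shared shape explicit.
    interface HoodieExecutor<I, O> extends Serializable {
        List<O> execute(List<I> requests);
    }

    // Toy implementation just to demonstrate the contract in use.
    static class UppercaseExecutor implements HoodieExecutor<String, String> {
        public List<String> execute(List<String> requests) {
            return requests.stream().map(String::toUpperCase).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(new UppercaseExecutor().execute(List.of("clean", "rollback")));
    }
}
```

Whether such an abstraction pays off depends on how much logic the executors actually share; if they only share a name, a marker interface adds noise rather than structure.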




[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-24 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043773#comment-17043773
 ] 

lamber-ken commented on HUDI-625:
-

Thanks, if you don't mind, I think I'd like to drive it :D

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"val json_data = (1 to 400).map(i => 
> "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   
> save(config("target"))println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()}
>  records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stack trace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-24 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043755#comment-17043755
 ] 

Vinoth Chandar commented on HUDI-625:
-

> if we modify / add a field, we will rework these serializers. Kryo has done 
> this work inside. 

valid point. are you saying we just need to do a 1-line fix? do you want to 
open a PR with your suggested fix for the kryo? (please test the original 4M 
upsert problem from above and verify performance is good)..

Please spell out clearly how/if you want to drive this 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
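The split above can be sanity-checked with a back-of-the-envelope model (hypothetical; Hudi's ExternalSpillableMap updates its size estimate dynamically, so the real split differs somewhat): entries stay in memory until the budget divided by the estimated payload size is exhausted, and the rest spill to disk.

```java
// Back-of-the-envelope spill model (hypothetical, not Hudi's actual
// ExternalSpillableMap accounting).
public class SpillEstimate {
    static long inMemoryEntries(long totalEntries, long entryBytes, long budgetBytes) {
        return Math.min(totalEntries, budgetBytes / entryBytes);
    }

    public static void main(String[] args) {
        long total = 150_875L + 2_849_125L; // entry counts reported above
        long entryBytes = 760L;             // estimated payload size from the log
        long budget = 83_886_580L;          // MemoryBasedMap size in bytes
        long inMem = inMemoryEntries(total, entryBytes, budget);
        System.out.println("in memory: " + inMem + ", spilled: " + (total - inMem));
        // Same order of magnitude as the reported 150875 / 2849125 split.
    }
}
```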
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> 

[GitHub] [incubator-hudi] vinothchandar commented on issue #1329: [SUPPORT] Presto cannot query non-partitioned table

2020-02-24 Thread GitBox
vinothchandar commented on issue #1329: [SUPPORT] Presto cannot query 
non-partitioned table
URL: https://github.com/apache/incubator-hudi/issues/1329#issuecomment-590475157
 
 
   @bhasudha let's look at this more closely and confirm what's going on here. 
This stack trace indicates just ipf.getSplits() being called, and thus it's 
general code. We do have tests around querying non-partitioned tables, so we 
need to reproduce this in the docker setup or similar and go from there.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pushpavanthar commented on issue #143: Tracking ticket for folks to be added to slack group

2020-02-24 Thread GitBox
pushpavanthar commented on issue #143: Tracking ticket for folks to be added to 
slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-590229950
 
 
   please add me pushpavant...@gmail.com




[jira] [Commented] (HUDI-561) hudi partition path config

2020-02-24 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043207#comment-17043207
 ] 

Raymond Xu commented on HUDI-561:
-

[~liujinhui] I came across this ticket from a discussion of transformers, where 
I wanted to address a similar issue via a custom transformer. After seeing the 
key generator classes, I think this is more suitable for a custom key 
generator. In the case you described, simply extend 
https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java
 and transform the partition path accordingly in getKey().

Having date-time formats as config entries looks to me like going too far in 
helping out users.
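For concreteness, the transformation such a getKey() override could apply might look like this (a hypothetical sketch: the class name and format patterns are assumptions inferred from the example in this ticket, not Hudi's actual KeyGenerator API):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class PartitionPathReformat {
    // Input/output patterns are assumptions based on the example in the thread.
    private static final DateTimeFormatter SOURCE =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    private static final DateTimeFormatter TARGET =
        DateTimeFormatter.ofPattern("yyyy/MM/dd");

    // The reformatting a custom key generator's getKey() could apply to the
    // partition-path field before building the HoodieKey.
    static String reformat(String value) {
        return LocalDateTime.parse(value, SOURCE).format(TARGET);
    }

    public static void main(String[] args) {
        System.out.println(reformat("2019-12-20 10:15:30")); // prints 2019/12/20
    }
}
```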

 

> hudi partition path config
> --
>
> Key: HUDI-561
> URL: https://issues.apache.org/jira/browse/HUDI-561
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 24h
>  Time Spent: 10m
>  Remaining Estimate: 23h 50m
>
> The current Hudi partitioning follows 
> hoodie.datasource.write.partitionpath.field = keyname
> example:
> keyname 2019/12/20
> Usually the time format may be yyyy-MM-dd HH:mm:ss or similar, and
> yyyy-MM-dd HH:mm:ss cannot be partitioned correctly.
> So I want to add configuration:
> hoodie.datasource.write.partitionpath.source.format = yyyy-MM-dd HH:mm:ss
> hoodie.datasource.write.partitionpath.target.format = yyyy/MM/dd



--
This message was sent by Atlassian Jira
(v8.3.4#803005)