[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-23 Thread GitBox
yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383117457
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java
 ##
 @@ -16,8 +16,10 @@
  * limitations under the License.
  */
 
-package org.apache.hudi;
+package org.apache.hudi.client;
 
+import org.apache.hudi.common.HoodieClientTestHarness;
+import org.apache.hudi.WriteStatus;
 
 Review comment:
   ditto, can we place it into e.g. `org.apache.hudi.write.common.WriteStatus`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-23 Thread GitBox
yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383116984
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestUpdateSchemaEvolution.java
 ##
 @@ -16,10 +16,9 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.func;
+package org.apache.hudi;
 
 Review comment:
   IMO, `org.apache.hudi` is a very common (or rather generic) level. It would 
be better to host the specific classes in a more detailed package that 
reflects their location. WDYT?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package structure in hudi-client

2020-02-23 Thread GitBox
yanghua commented on a change in pull request #1346: [HUDI-554] Cleanup package 
structure in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1346#discussion_r383115191
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/rollback/RollbackExecutor.java
 ##
 @@ -69,7 +69,7 @@ public RollbackExecutor(HoodieTableMetaClient metaClient, 
HoodieWriteConfig conf
* Performs all rollback actions that we have collected in parallel.
*/
  public List<HoodieRollbackStat> performRollback(JavaSparkContext jsc, 
HoodieInstant instantToRollback,
-  List<RollbackRequest> rollbackRequests) {
+   List<RollbackRequest> 
rollbackRequests) {
 
 Review comment:
   Is the original indent the right one?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] apoorva007 commented on issue #143: Tracking ticket for folks to be added to slack group

2020-02-23 Thread GitBox
apoorva007 commented on issue #143: Tracking ticket for folks to be added to 
slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-590172680
 
 
   Please add me : apoorva.aggar...@grofers.com


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #198

2020-02-23 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.25 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.2-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.5.2-SNAPSHOT
[WARNING] 'artifactId' contains an 

[jira] [Updated] (HUDI-581) NOTICE needs more work as it is missing content from included 3rd party ALv2 licensed NOTICE files

2020-02-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-581:
--
Priority: Blocker  (was: Major)

> NOTICE needs more work as it is missing content from included 3rd party ALv2 
> licensed NOTICE files
> --
>
> Key: HUDI-581
> URL: https://issues.apache.org/jira/browse/HUDI-581
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Assignee: Suneel Marthi
>Priority: Blocker
> Fix For: 0.5.2
>
>
> Issues pointed out in general@incubator ML, more context here: 
> [https://lists.apache.org/thread.html/rd3f4a72d82a4a5a81b2c6bd71e1417054daa38637ce8e07901f26f04%40%3Cgeneral.incubator.apache.org%3E]
>  
> We should get it fixed before the next release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long-running tests for Hudi writing and querying end-to-end

2020-02-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-289:
--
Fix Version/s: (was: 0.5.2)
   0.6.0

> Implement a test suite to support long-running tests for Hudi writing and 
> querying end-to-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-to-end test which runs some workload for 
> a few hours at least, triggers various actions like commit, deltacommit, 
> rollback, and compaction, and ensures correctness of the code before every release.
> P.S: Learn from all the CSS issues managing compaction..
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-23 Thread GitBox
lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590142768
 
 
   This is a great start.  
   
   IMO, because we have already set the InstantiatorStrategy, we needn't 
register the classes again.
   `kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());`
   
   Following the kryo guide [1], we should also change this to:
   ```
   kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   ```
   
   So the changes below should be ok:
   
   - Change `Kryo kryo = new KryoBase();` to `Kryo kryo = new Kryo();`
   - `kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));`
   
   [1] https://github.com/EsotericSoftware/kryo#object-creation
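   
   For reference, a minimal, self-contained sketch of the suggested 
configuration (kryo 4.x API, as in the snippet above). The demo class and the 
`Point` type are illustrative assumptions, not Hudi's actual 
`SerializationUtils`:
   
   ```java
   import com.esotericsoftware.kryo.Kryo;
   import com.esotericsoftware.kryo.io.Input;
   import com.esotericsoftware.kryo.io.Output;
   import org.objenesis.strategy.StdInstantiatorStrategy;
   
   import java.io.ByteArrayOutputStream;
   
   public class KryoFallbackDemo {
     // A type without a no-arg constructor, to exercise the Objenesis fallback.
     static class Point {
       final int x, y;
       Point(int x, int y) { this.x = x; this.y = y; }
     }
   
     public static void main(String[] args) {
       Kryo kryo = new Kryo(); // plain Kryo, no KryoBase subclass needed
       // Try no-arg constructors first; fall back to Objenesis for types without one.
       kryo.setInstantiatorStrategy(
           new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
   
       ByteArrayOutputStream bytes = new ByteArrayOutputStream();
       try (Output out = new Output(bytes)) {
         kryo.writeClassAndObject(out, new Point(1, 2));
       }
       Point copy = (Point) kryo.readClassAndObject(new Input(bytes.toByteArray()));
       System.out.println(copy.x + "," + copy.y); // prints 1,2 with no registration
     }
   }
   ```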


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1342: [SUPPORT] do cow tables need to be converted when changing from hoodie to hudi?

2020-02-23 Thread GitBox
lamber-ken edited a comment on issue #1342: [SUPPORT] do cow tables need to be 
converted when changing from hoodie to hudi?
URL: https://github.com/apache/incubator-hudi/issues/1342#issuecomment-588567557
 
 
   hi @tooptoop4, I tested it in my local env. No need to run any convert 
utility.
   
   **Notice:**
   - Change `com.uber.hoodie` to `org.apache.hudi`
   - Better to upgrade Spark to version 2.4.4
   - Add `org.apache.spark:spark-avro_2.11:2.4.4` to the start command
   ```
   bin/spark-shell \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   
   Also, `0.4.6` to `0.5.1` is a big version change, so I suggest you test it 
with some old COW tables in your environment too. If you hit any issues, you 
can ask us at any time.
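   
   As a sanity check, here is a minimal sketch of reading such a table back 
after the rename; the local path and session settings are assumptions, and 
`hudi-spark-bundle` plus `spark-avro` must be on the classpath as in the 
command above:
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;
   
   public class HudiFormatRename {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hoodie-to-hudi")
           .master("local[2]")
           .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
           .getOrCreate();
   
       // Only the datasource name changed from "com.uber.hoodie" to "org.apache.hudi";
       // the table files on disk stay as they are (the path is a placeholder).
       Dataset<Row> df = spark.read()
           .format("org.apache.hudi")
           .load("file:///tmp/example_table/*");
       df.show(10, false);
       spark.stop();
     }
   }
   ```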


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-23 Thread GitBox
lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590142768
 
 
   This is a great start.  
   
   IMO, because we have already set the InstantiatorStrategy, we needn't 
register the classes again.
   `kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());`
   
   Following the kryo guide [1], we should also change this to:
   ```
   kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));
   ```
   
   So the changes below should be ok:
   
   - Change `Kryo kryo = new KryoBase();` to `Kryo kryo = new Kryo();`
   - `kryo.setInstantiatorStrategy(new Kryo.DefaultInstantiatorStrategy(new 
StdInstantiatorStrategy()));`
   
   [1] https://github.com/EsotericSoftware/kryo#object-creation


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 2:01 AM:
--

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way.

Because we have already set the InstantiatorStrategy, we needn't register the 
classes again.
{code:java}
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
{code}
Modifying just one line, as in the solution above, is enough.

 

Let's think about it: if we modify or add a field, we will have to rework 
these serializers. Kryo already does this work internally. WDYT :)
{code:java}
kryo.register(HoodieKey.class, new HoodieKeySerializer());
kryo.register(GenericData.Record.class, new GenericDataRecordSerializer());
kryo.register(HoodieRecord.class, new HoodieRecordSerializer());
kryo.register(HoodieRecordLocationSerializer.class, new 
HoodieRecordLocationSerializer());
kryo.register(OverwriteWithLatestAvroPayload.class, new 
OverwriteWithLatestPayloadSerializer());
{code}
 

The kryo guide also illustrates this point.

[https://github.com/EsotericSoftware/kryo#object-creation]

 

>> Tool

[https://github.com/alibaba/arthas]


was (Author: lamber-ken):
The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way.

Because we have already set the InstantiatorStrategy, we needn't register the 
classes again.
{code:java}
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
{code}
Modifying just one line, as in the solution above, is enough.

 

Let's think about it: if we modify or add a field, we will have to rework 
these serializers. Kryo already does this work internally. WDYT :)
{code:java}
kryo.register(HoodieKey.class, new HoodieKeySerializer());
kryo.register(GenericData.Record.class, new GenericDataRecordSerializer());
kryo.register(HoodieRecord.class, new HoodieRecordSerializer());
kryo.register(HoodieRecordLocationSerializer.class, new 
HoodieRecordLocationSerializer());
kryo.register(OverwriteWithLatestAvroPayload.class, new 
OverwriteWithLatestPayloadSerializer());
{code}
 

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = 

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 1:58 AM:
--

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way.

Because we have already set the InstantiatorStrategy, we needn't register the 
classes again.
{code:java}
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
{code}
Modifying just one line, as in the solution above, is enough.

 

Let's think about it: if we modify or add a field, we will have to rework 
these serializers. Kryo already does this work internally. WDYT :)
{code:java}
kryo.register(HoodieKey.class, new HoodieKeySerializer());
kryo.register(GenericData.Record.class, new GenericDataRecordSerializer());
kryo.register(HoodieRecord.class, new HoodieRecordSerializer());
kryo.register(HoodieRecordLocationSerializer.class, new 
HoodieRecordLocationSerializer());
kryo.register(OverwriteWithLatestAvroPayload.class, new 
OverwriteWithLatestPayloadSerializer());
{code}
 

>> Tool

[https://github.com/alibaba/arthas]


was (Author: lamber-ken):
The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way.

Because we have already set the InstantiatorStrategy, we needn't register the 
classes again.
{code:java}
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
{code}
Modifying just one line, as in the solution above, is enough.

 

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   

[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043016#comment-17043016
 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 2/24/20 1:40 AM:
--

I am still working through the Hudi code base, but I think one possible 
approach would work (a sketch follows below):
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, if needed, from the Confluent Schema Registry. 
 # Enhance AvroSource or a Source-derived class to record the Avro schema id 
used for serialization by the Confluent Schema Registry. When deserializing 
from Kafka, or for compaction, translate to the refreshed schema (shortcut if 
the schema ids match), snapshotted by the HoodieWriteHandle (or derived) class 
from the SchemaProvider.

Alternatively, snapshot the latest schema at the beginning of the DeltaSync 
loop, and use it across the insert/update/delete (by HudiWriter) and for 
compaction (by HoodieWriteHandle).
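
As a rough sketch of idea #1 (a provider that always re-fetches): the class 
name below is an illustrative assumption, not Hudi's actual SchemaProvider 
contract, and the endpoint follows the Confluent Schema Registry REST API.
{code:java}
import org.apache.avro.Schema;

import java.io.InputStream;
import java.net.URL;
import java.util.Scanner;

public class RefreshingSchemaProvider {
  private final String registryUrl; // e.g. http://registry:8081 (assumed)
  private final String subject;

  public RefreshingSchemaProvider(String registryUrl, String subject) {
    this.registryUrl = registryUrl;
    this.subject = subject;
  }

  // Re-fetches the latest registered schema on every call, so a caller such as
  // DeltaSync would pick up schema evolution instead of a one-time snapshot.
  public Schema getLatestSchema() throws Exception {
    URL url = new URL(registryUrl + "/subjects/" + subject + "/versions/latest/schema");
    try (InputStream in = url.openStream();
         Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
      return new Schema.Parser().parse(scanner.next());
    }
  }
}
{code}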


was (Author: yx3...@gmail.com):
I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, if needed, from the Confluent Schema Registry. 
 # Enhance AvroSource or a Source-derived class to record the Avro schema id 
used for serialization by the Confluent Schema Registry. When deserializing 
from Kafka, or for compaction, translate to the refreshed schema (shortcut if 
the schema ids match), snapshotted by the HoodieWriteHandle (or derived) class 
from the SchemaProvider.
 # A custom serializer for GenericRecord can be registered in Spark, to use 
the schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Priority: Major
>  Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates it to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-23 Thread GitBox
lamber-ken edited a comment on issue #1351: [WIP] [HUDI-625] Fixing performance 
issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590142768
 
 
   This is a great start.
   IMO, because we have already set the InstantiatorStrategy, we needn't 
register the classes again.
   `kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());`
   
   Changing `Kryo kryo = new KryoBase();` to `Kryo kryo = new Kryo();` will be ok. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-23 Thread GitBox
lamber-ken commented on issue #1351: [WIP] [HUDI-625] Fixing performance issues 
around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351#issuecomment-590142768
 
 
   Because we have already set the InstantiatorStrategy, we needn't register 
the classes again.
   `kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());`
   
   Changing `Kryo kryo = new KryoBase();` to `Kryo kryo = new Kryo();` will be ok. 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 1:32 AM:
--

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way.

Because we have already set the InstantiatorStrategy, we needn't register the 
classes again.
{code:java}
kryo.setInstantiatorStrategy(new 
org.objenesis.strategy.StdInstantiatorStrategy());
{code}
Modifying just one line, as in the solution above, is enough.

 

>> Tool

[https://github.com/alibaba/arthas]


was (Author: lamber-ken):
The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way. If we use the 
original "new Kryo()", we needn't register the classes.

Modifying just one line, as in the solution above, is enough.

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 1:29 AM:
--

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called for every entry; that is the wrong way. If we use the 
original "new Kryo()", we needn't register the classes.

Modifying just one line, as in the solution above, is enough.

>> Tool

[https://github.com/alibaba/arthas]


was (Author: lamber-ken):
The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called every time; that is the wrong way. If we use the original 
"new Kryo()", we needn't register the classes.

Modifying just one line, as in the solution above, is enough.

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, 

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 1:24 AM:
--

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called every time; that is the wrong way. If we use the original 
"new Kryo()", we needn't register the classes.

Modifying just one line, as in the solution above, is enough.

>> Tool

[https://github.com/alibaba/arthas]


was (Author: lamber-ken):
The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called every time; that is the wrong way.

If we use the original "new Kryo()", we needn't register the classes.

 

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043117#comment-17043117
 ] 

lamber-ken commented on HUDI-625:
-

The key issue is 
"super.getInstantiatorStrategy().newInstantiatorOf(type).newInstance();", 
which gets called every time; that is the wrong way.

If we use the original "new Kryo()", we needn't register the classes.

 

>> Tool

[https://github.com/alibaba/arthas]

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of 

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043115#comment-17043115
 ] 

Vinoth Chandar commented on HUDI-625:
-

I fixed that in my PR as well. Do you want to drive the kryo fixes, or shall I 
take it over? Let me know :)

btw, what tool is that? Looks cool to see the timing drill-downs like that. 


P.S: let's keep conversations here :) .. JIRA is the source of truth. 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap 

[GitHub] [incubator-hudi] satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI

2020-02-23 Thread GitBox
satishkotha commented on issue #1341: [HUDI-626] Add exportToTable option to CLI
URL: https://github.com/apache/incubator-hudi/pull/1341#issuecomment-590139655
 
 
   @smarthi could you review this when you get a chance?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-23 Thread GitBox
garyli1019 commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r383055959
 
 

 ##
 File path: hudi-spark/src/test/scala/TestDataSource.scala
 ##
 @@ -135,6 +136,14 @@ class TestDataSource extends AssertionsForJUnit {
 countsPerCommit = 
hoodieIncViewDF2.groupBy("_hoodie_commit_time").count().collect();
 assertEquals(1, countsPerCommit.length)
 assertEquals(commitInstantTime2, countsPerCommit(0).get(0))
+
+// pull the latest commit within certain partitions
+val hoodieIncViewDF3 = spark.read.format("org.apache.hudi")
+  .option(DataSourceReadOptions.QUERY_TYPE_OPT_KEY, 
DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL)
+  .option(DataSourceReadOptions.BEGIN_INSTANTTIME_OPT_KEY, 
commitInstantTime1)
+  .option(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "/2016/*/*/*")
 
 Review comment:
   Since we add `*` in front of it, it will still work without the leading 
`/`. This is just trying to be consistent with loading the full table. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-23 Thread GitBox
garyli1019 commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r383055617
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -84,7 +85,7 @@ class IncrementalRelation(val sqlContext: SQLContext,
 
   val filters = {
 if 
(optParams.contains(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY)) {
-  val filterStr = 
optParams.get(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY).getOrElse("")
+  val filterStr = 
optParams.getOrElse(DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY, "")
 
 Review comment:
   Good point, will do


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1348: HUDI-597 Enable incremental pulling from defined partitions

2020-02-23 Thread GitBox
garyli1019 commented on a change in pull request #1348: HUDI-597 Enable 
incremental pulling from defined partitions
URL: https://github.com/apache/incubator-hudi/pull/1348#discussion_r383055560
 
 

 ##
 File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
 ##
 @@ -100,17 +101,22 @@ class IncrementalRelation(val sqlContext: SQLContext,
 .get, classOf[HoodieCommitMetadata])
   fileIdToFullPath ++= metadata.getFileIdAndFullPaths(basePath).toMap
 }
+val pathGlobPattern = 
optParams.getOrElse(DataSourceReadOptions.INCR_PATH_GLOB_OPT_KEY, "")
+val filteredFullPath = if(!pathGlobPattern.equals("")) {
+  val globMatcher = new GlobPattern("*" + pathGlobPattern)
 
 Review comment:
   The path here is a full HDFS path, so we need the `*` to match the prefix. 
The benefit of including the `*` here is that the user gets a consistent 
interface: when loading the full table they do `.load(basePath + 
"/2016/*/*/*")`, and in incremental pulling the `String` the user defines 
stays the same. If we leave the `*` to the user, I think it might cause some 
confusion, and users would need to read this part of the code themselves to 
fully understand how things work here.  
   Yeah, I couldn't find any documentation either. The `GlobFilter` in the API 
list uses `GlobPattern` internally 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/GlobFilter.java#L67
 and the class is still around 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/GlobPattern.java
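   To make the behavior concrete, here is a minimal sketch of that matching 
logic (the paths and class name below are made up for illustration):

```java
import org.apache.hadoop.fs.GlobPattern;

public class IncrPathGlobSketch {
  public static void main(String[] args) {
    // The user-supplied glob is prefixed with "*" so it can match the full
    // HDFS paths that come out of the commit metadata.
    GlobPattern matcher = new GlobPattern("*" + "/2016/*/*/*");

    String fullPath = "hdfs://nn:8020/warehouse/hudi_table/2016/03/15/part-0001.parquet";
    System.out.println(matcher.matches(fullPath)); // true: the leading "*" absorbs the base path

    // It also works if the user omits the leading "/", since "*" matches it too.
    System.out.println(new GlobPattern("*" + "2016/*/*/*").matches(fullPath)); // true
  }
}
```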


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043098#comment-17043098
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 12:29 AM:
---

hi [~vinoth], I sent some messages to you in Slack yesterday; maybe those 
messages were lost, so I'm writing here again. :)
 The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
  
 The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.

*Analysis:* 
 !image-2020-02-24-08-15-48-615.png|width=800,height=672!
  

*Solution:*

!https://files.slack.com/files-pri/T4D7BR6T1-FUCRN0WN9/image.png|width=751,height=198!

!image-2020-02-24-08-17-33-739.png|width=790,height=243!
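For reference, a minimal Java sketch of the idea (the exact patch is in the 
screenshots above; the shape below is assumed):
{code:java}
import com.esotericsoftware.kryo.Kryo;
import org.objenesis.strategy.StdInstantiatorStrategy;

final class KryoFactorySketch {

  // Build a plain Kryo instance instead of the custom KryoBase, so per-class
  // instantiators are created once and cached, rather than the overridden
  // newInstantiator() being re-invoked for every deserialized entry.
  static Kryo newKryo() {
    Kryo kryo = new Kryo();
    // keep support for types without a no-arg constructor
    kryo.setInstantiatorStrategy(
        new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()));
    kryo.setRegistrationRequired(false);
    return kryo;
  }
}
{code}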
  
  


was (Author: lamber-ken):
hi [~vinoth], I sent some messages to you in Slack yesterday; maybe those 
messages were lost, so I'm writing here again. :)
 The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
  
 The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.

*Analysis:* 
 !image-2020-02-24-08-15-48-615.png|width=800,height=672!
  

*Solution:*

!https://files.slack.com/files-pri/T4D7BR6T1-FUCRN0WN9/image.png|width=698,height=184!


 !image-2020-02-24-08-17-33-739.png|width=790,height=243!
  
  

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043098#comment-17043098
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 12:23 AM:
---

hi [~vinoth], I sent some messages to you in Slack yesterday; maybe those 
messages were lost, so I'm writing here again. :)
 The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
  
 The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.

*Analysis:* 
 !image-2020-02-24-08-15-48-615.png|width=800,height=672!
  

*Solution:*

!https://files.slack.com/files-pri/T4D7BR6T1-FUCRN0WN9/image.png|width=698,height=184!


 !image-2020-02-24-08-17-33-739.png|width=790,height=243!
  
  


was (Author: lamber-ken):
hi [~vinoth], I sent some messages to you in Slack yesterday; maybe those 
messages were lost, so I'm writing here again. :)
 The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
  
 The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.
  
 !image-2020-02-24-08-15-48-615.png|width=800,height=672!
  
 !image-2020-02-24-08-17-33-739.png|width=790,height=243!
  
  

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   

[jira] [Comment Edited] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043098#comment-17043098
 ] 

lamber-ken edited comment on HUDI-625 at 2/24/20 12:18 AM:
---

hi [~vinoth], I sent some messages to you in Slack yesterday; maybe those 
messages were lost, so I'm writing here again. :)
 The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
  
 The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.
  
 !image-2020-02-24-08-15-48-615.png|width=800,height=672!
  
 !image-2020-02-24-08-17-33-739.png|width=790,height=243!
  
  


was (Author: lamber-ken):
hi [~vinoth], I sent some messages to you via Slack; maybe those messages were 
lost, so I'm writing here again. :)
The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
 
The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.
 
!image-2020-02-24-08-15-48-615.png|width=800,height=672!
 
!image-2020-02-24-08-17-33-739.png|width=790,height=243!
 
 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")

[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043098#comment-17043098
 ] 

lamber-ken commented on HUDI-625:
-

hi [~vinoth], I sent some messages to you via Slack; maybe those messages were 
lost, so I'm writing here again. :)
The overridden method 
"SerializationUtils.KryoInstantiator.KryoBase#newInstantiator" is called each 
time an entry is deserialized, whereas "Kryo#newInstantiator" should only be 
called when the type doesn't have a no-arg constructor.
 
The solution is to use new Kryo(); I have tested the snippet end to end in 
spark-shell. If you have time, you can apply the change and test it in your 
local env.
 
!image-2020-02-24-08-15-48-615.png|width=800,height=672!
 
!image-2020-02-24-08-17-33-739.png|width=790,height=243!
 
 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png, image-2020-02-24-08-17-33-739.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-625:

Attachment: image-2020-02-24-08-15-48-615.png

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png, 
> image-2020-02-24-08-15-48-615.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stacktrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> at 

[GitHub] [incubator-hudi] smarthi commented on a change in pull request #1350: [HUDI-629]: Replace Guava's Hashing with an equivalent in NumericUtils.java

2020-02-23 Thread GitBox
smarthi commented on a change in pull request #1350: [HUDI-629]: Replace 
Guava's Hashing with an equivalent in NumericUtils.java
URL: https://github.com/apache/incubator-hudi/pull/1350#discussion_r382958076
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/minicluster/HdfsTestService.java
 ##
 @@ -66,7 +66,7 @@ public Configuration getHadoopConf() {
   }
 
   public MiniDFSCluster start(boolean format) throws IOException {
-Preconditions.checkState(workDir != null, "The work dir must be set before 
starting cluster.");
+Objects.requireNonNull(workDir, "The work dir must be set before starting 
cluster.");
 
 Review comment:
   For this null check, it's unnecessary: there is a checkState in 
ValidationUtils for other boolean conditions. But for a null check, it's fine 
to use Objects.requireNonNull(). 
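   A tiny sketch contrasting the two checks (the helper below is illustrative, 
modeled on ValidationUtils):

```java
import java.util.Objects;

final class PreconditionSketch {

  // checkState-style helper for general boolean preconditions
  static void checkState(boolean condition, String message) {
    if (!condition) {
      throw new IllegalStateException(message);
    }
  }

  static void start(String workDir, boolean formatted) {
    // null checks read best with requireNonNull...
    Objects.requireNonNull(workDir, "The work dir must be set before starting cluster.");
    // ...while other boolean conditions go through checkState
    checkState(formatted, "The cluster must be formatted before starting.");
  }
}
```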


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043040#comment-17043040
 ] 

Vinoth Chandar commented on HUDI-625:
-

https://github.com/apache/incubator-hudi/pull/1351/files 

With these changes, I can confirm that writing finishes in a couple of 
minutes, even without bumping up the merge memory.

[~lamber-ken] time to split this into concrete JIRAs. Please let me know if 
you want to take up the kryo part of the implementation. I will time out in a 
day and grab both of them :) 

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in 

[jira] [Updated] (HUDI-625) Address performance concerns on DiskBasedMap.get() during upsert of thin records

2020-02-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-625:

Labels: pull-request-available  (was: )

> Address performance concerns on DiskBasedMap.get() during upsert of thin 
> records
> 
>
> Key: HUDI-625
> URL: https://issues.apache.org/jira/browse/HUDI-625
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-02-20-23-34-24-155.png, 
> image-2020-02-20-23-34-27-466.png, image-2020-02-21-15-35-56-637.png
>
>
> [https://github.com/apache/incubator-hudi/issues/1328]
>  
>  So what's going on here is that each entry (single data field) is estimated 
> to be around 500-750 bytes in memory and things spill a lot... 
> {code:java}
> 20/02/20 23:00:39 INFO ExternalSpillableMap: Estimated Payload size => 760 
> for 3675605,HoodieRecord{key=HoodieKey { recordKey=3675605 
> partitionPath=default}, currentLocation='HoodieRecordLocation 
> {instantTime=20200220225748, fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}', 
> newLocation='HoodieRecordLocation {instantTime=20200220225921, 
> fileId=499f8d2c-df6a-4275-9166-3de4ac91f3bf-0}'} {code}
>  
> {code:java}
> INFO HoodieMergeHandle: Number of entries in MemoryBasedMap => 150875
> Total size in bytes of MemoryBasedMap => 83886580
> Number of entries in DiskBasedMap => 2849125
> Size of file spilled to disk => 1067101739 {code}
> h2. Reproduce steps
>  
> {code:java}
> export SPARK_HOME=/home/dockeradmin/hudi/spark-2.4.4-bin-hadoop2.7
> ${SPARK_HOME}/bin/spark-shell \
> --executor-memory 6G \
> --packages 
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
> {code}
>  
> {code:java}
> val HUDI_FORMAT = "org.apache.hudi"
> val TABLE_NAME = "hoodie.table.name"
> val RECORDKEY_FIELD_OPT_KEY = "hoodie.datasource.write.recordkey.field"
> val PRECOMBINE_FIELD_OPT_KEY = "hoodie.datasource.write.precombine.field"
> val OPERATION_OPT_KEY = "hoodie.datasource.write.operation"
> val BULK_INSERT_OPERATION_OPT_VAL = "bulk_insert"
> val UPSERT_OPERATION_OPT_VAL = "upsert"
> val BULK_INSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism"
> val UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism"
> val config = Map(
> "table_name" -> "example_table",
> "target" -> "file:///tmp/example_table/",
> "primary_key" ->  "id",
> "sort_key" -> "id"
> )
> val readPath = config("target") + "/*"
> val json_data = (1 to 400).map(i => "{\"id\":" + i + "}")
> val jsonRDD = spark.sparkContext.parallelize(json_data, 2)
> val df1 = spark.read.json(jsonRDD)
> println(s"${df1.count()} records in source 1")
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, BULK_INSERT_OPERATION_OPT_VAL).
>   option(BULK_INSERT_PARALLELISM, 1).
>   mode("Overwrite").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> // Runs very slow
> df1.limit(300).write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> // Runs very slow
> df1.write.format(HUDI_FORMAT).
>   option(PRECOMBINE_FIELD_OPT_KEY, config("sort_key")).
>   option(RECORDKEY_FIELD_OPT_KEY, config("primary_key")).
>   option(TABLE_NAME, config("table_name")).
>   option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
>   option(UPSERT_PARALLELISM, 20).
>   mode("Append").
>   save(config("target"))
> println(s"${spark.read.format(HUDI_FORMAT).load(readPath).count()} records in Hudi table")
> {code}
>  
>  
>  
> h2. *Analysis*
> h3. *Upsert (400 entries)*
> {code:java}
> WARN HoodieMergeHandle: 
> Number of entries in MemoryBasedMap => 150875 
> Total size in bytes of MemoryBasedMap => 83886580 
> Number of entries in DiskBasedMap => 3849125 
> Size of file spilled to disk => 1443046132
> {code}
> h3. Hang stacktrace (DiskBasedMap#get)
>  
> {code:java}
> "pool-21-thread-2" Id=696 cpuUsage=98% RUNNABLE
> at java.util.zip.ZipFile.getEntry(Native Method)
> at java.util.zip.ZipFile.getEntry(ZipFile.java:310)
> -  locked 

[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1351: [WIP] [HUDI-625] Fixing performance issues around DiskBasedMap & kryo

2020-02-23 Thread GitBox
vinothchandar opened a new pull request #1351: [WIP] [HUDI-625] Fixing 
performance issues around DiskBasedMap & kryo
URL: https://github.com/apache/incubator-hudi/pull/1351
 
 
- This is a very rough cut of a few things I tried; just for sharing purposes
- Kryo needs serializers, and once we add them, ser/deser is fast and 
writing finishes 10-20x faster (see the sketch below)
- DiskBasedMap is tracking too many things redundantly and incurring that 
cost as well.
- TODO: Need to break the kryo and map fixes into different PRs
- TODO: For map entry thinning, need to handle compaction, fix code 
structure, tests
- TODO: For kryo, one more pass with a good understanding of the APIs, 
tests, null handling, cleanup
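
As one example of the kind of serializer meant above, a sketch for HoodieKey 
(the field layout is an assumption for illustration, not the actual patch):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.apache.hudi.common.model.HoodieKey;

// An explicit Kryo serializer so ser/deser avoids per-record reflection.
class HoodieKeySerializer extends Serializer<HoodieKey> {

  @Override
  public void write(Kryo kryo, Output output, HoodieKey key) {
    output.writeString(key.getRecordKey());
    output.writeString(key.getPartitionPath());
  }

  @Override
  public HoodieKey read(Kryo kryo, Input input, Class<HoodieKey> type) {
    return new HoodieKey(input.readString(), input.readString());
  }
}

// registration: kryo.register(HoodieKey.class, new HoodieKeySerializer());
```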
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-631) HoodieAvroUtils.rewrite does not handle schema change such as optional fields removal

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)
Yixue (Andrew) Zhu created HUDI-631:
---

 Summary: HoodieAvroUtils.rewrite does not handle schema change 
such as optional fields removal
 Key: HUDI-631
 URL: https://issues.apache.org/jira/browse/HUDI-631
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Common Core
Reporter: Yixue (Andrew) Zhu


The utility function 
[HoodieAvroUtils.rewrite|https://github.com/apache/incubator-hudi/blob/5b7bb142dc6712c41fd8ada208ab3186369431f9/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205]
 does not handle schema evolution such as the removal of optional fields (from 
the new schema).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043016#comment-17043016
 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 2/23/20 6:36 PM:
--

I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry (see the sketch below).
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized from Kafka, or for compaction, translate them to the 
refreshed schema (shortcut if the schema ids match), snapshotted by the 
HoodieWriteHandle (or derived) class from the SchemaProvider.
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.
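
A rough sketch of step 1, assuming a Confluent-style registry endpoint (the 
class, property, and package names below are illustrative assumptions, not 
existing Hudi API):
{code:java}
import java.io.IOException;
import java.net.URL;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.avro.Schema;
import org.apache.hudi.common.util.TypedProperties; // package per the Hudi version in use
import org.apache.hudi.utilities.schema.SchemaProvider;
import org.apache.spark.api.java.JavaSparkContext;

public class RefreshingSchemaRegistryProvider extends SchemaProvider {

  private final String registryUrl;

  public RefreshingSchemaRegistryProvider(TypedProperties props, JavaSparkContext jssc) {
    super(props, jssc);
    // e.g. http://registry:8081/subjects/my-topic-value/versions/latest
    this.registryUrl = props.getString("hoodie.deltastreamer.schemaprovider.registry.url");
  }

  @Override
  public Schema getSourceSchema() {
    try {
      // no caching: fetch the latest registered schema on every invocation,
      // so DeltaSync can pick up evolved schemas between sync rounds
      JsonNode response = new ObjectMapper().readTree(new URL(registryUrl));
      return new Schema.Parser().parse(response.get("schema").asText());
    } catch (IOException e) {
      throw new RuntimeException("Failed to refresh schema from " + registryUrl, e);
    }
  }
}
{code}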


was (Author: yx3...@gmail.com):
I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match), snapshotted by the HoodieWriteHandle (or 
derived) class from the SchemaProvider.
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Priority: Major
>  Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates it to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043016#comment-17043016
 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 2/23/20 6:35 PM:
--

I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match), snapshotted by the HoodieWriteHandle (or 
derived) class from the SchemaProvider.
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.


was (Author: yx3...@gmail.com):
I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match).
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Priority: Major
>  Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates it to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043016#comment-17043016
 ] 

Yixue (Andrew) Zhu edited comment on HUDI-603 at 2/23/20 6:33 PM:
--

I am still working through the Hudi code base, but I think one possible 
approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match).
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.


was (Author: yx3...@gmail.com):
I think one possible approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match).
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Priority: Major
>  Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates it to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-603) HoodieDeltaStreamer should periodically fetch table schema update

2020-02-23 Thread Yixue (Andrew) Zhu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043016#comment-17043016
 ] 

Yixue (Andrew) Zhu commented on HUDI-603:
-

I think one possible approach would work:
 # A SchemaProvider-derived class can be introduced to retrieve the latest 
schema, when needed, from the Confluent Schema Registry.
 # Enhance AvroSource, or a Source-derived class, to record the Avro schema id 
used for serialization, as used by the Confluent Schema Registry. When records 
are deserialized for compaction, translate them to the refreshed schema 
(shortcut if the schema ids match).
 # A custom serializer for GenericRecord can be registered in Spark to use the 
schema id.

> HoodieDeltaStreamer should periodically fetch table schema update
> -
>
> Key: HUDI-603
> URL: https://issues.apache.org/jira/browse/HUDI-603
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Yixue Zhu
>Priority: Major
>  Labels: evolution, schema
>
> HoodieDeltaStreamer creates a SchemaProvider instance and delegates it to 
> DeltaSync for periodic sync. However, the default implementation of 
> SchemaProvider does not refresh the schema, which can change due to schema 
> evolution. DeltaSync snapshots the schema when it creates the writeClient, 
> using the SchemaProvider instance or picking it up from the source, and the 
> schema for the writeClient is not refreshed during the sync loop.
> I think this needs to be addressed to support schema evolution fully.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-23 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-617.

Fix Version/s: 0.5.2
   Resolution: Fixed

Fixed via master: c2b08cdfc9b762801a63fee988f1c24cc17df4ce

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-request-available
> Fix For: 0.5.2
>
> Attachments: test_data.json, test_schema.avsc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, TimestampBasedKeyGenerator only supports 4 data types for the 
> partition key. They are Double, Long, Float and String. However, if the 
> `avro.java.string` is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better if the support were generalised to include data types 
> that provide a method to convert them to String, such as `Utf8`, since all 
> these types implement the `CharSequence` interface.
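
For illustration, a minimal sketch of the kind of generalisation described 
above (the actual fix is in the commit referenced in the resolution; the names 
below are assumed):
{code:java}
import org.apache.avro.util.Utf8;

final class PartitionValueSketch {

  static long toUnixTime(Object partitionVal) {
    if (partitionVal instanceof Number) {
      return ((Number) partitionVal).longValue();
    }
    if (partitionVal instanceof CharSequence) { // covers String as well as Avro's Utf8
      return Long.parseLong(partitionVal.toString());
    }
    throw new IllegalArgumentException(
        "Unexpected type for partition field: " + partitionVal.getClass().getName());
  }

  public static void main(String[] args) {
    System.out.println(toUnixTime(new Utf8("1582329600"))); // works for Avro's Utf8
  }
}
{code}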



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-617) Add support for data types convertible to String in TimestampBasedKeyGenerator

2020-02-23 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf updated HUDI-617:
---
Status: Open  (was: New)

> Add support for data types convertible to String in TimestampBasedKeyGenerator
> --
>
> Key: HUDI-617
> URL: https://issues.apache.org/jira/browse/HUDI-617
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Amit Singh
>Priority: Minor
>  Labels: easyfix, pull-request-available
> Attachments: test_data.json, test_schema.avsc
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, TimestampBasedKeyGenerator only supports 4 data types for the 
> partition key. They are Double, Long, Float and String. However, if the 
> `avro.java.string` is not specified in the schema provided, Hudi throws the 
> following error:
>  org.apache.hudi.exception.HoodieNotSupportedException: Unexpected type for 
> partition field: org.apache.avro.util.Utf8
>  at 
> org.apache.hudi.utilities.keygen.TimestampBasedKeyGenerator.getKey(TimestampBasedKeyGenerator.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$f92c188c$1(DeltaSync.java:338)
> 
>  It would be better if the support were generalised to include data types 
> that provide a method to convert them to String, such as `Utf8`, since all 
> these types implement the `CharSequence` interface.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)