[ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046166#comment-17046166
 ] 

lamber-ken commented on HUDI-635:
---------------------------------

hi [~vinoth], here are my initial thoughts; they may not be correct.

Can we try to replace RDD<HoodieRecord> with Dataset<Row>? RDDs don't have a 
built-in optimization engine, so when working with structured data they cannot 
take advantage of Spark's advanced optimizers.
 * When upserting data to Hudi, records are converted to Avro; these many 
conversion operations may cost extra time
 * DataFrame supports adding columns, e.g. the additional metadata columns 
(_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key)

Also, we can expose Row to users (instead of GenericRecord) in the payload; 
users can then use methods like getString, getDate, etc., which are more 
friendly.

If GenericRecord is used, users need to care about the schema themselves and 
convert the data from raw objects.
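To sketch the difference in user-facing ergonomics, here is a minimal, 
self-contained illustration. The two classes below are hypothetical stand-ins, 
not Hudi or Spark code; they only mimic the shape of the real APIs 
(org.apache.avro.generic.GenericRecord's untyped get() versus 
org.apache.spark.sql.Row's typed getters):

```java
import java.util.HashMap;
import java.util.Map;

public class PayloadAccessSketch {

    // GenericRecord-style stand-in: untyped get(), so the caller must
    // know the schema and cast each field manually.
    static class AvroLikeRecord {
        private final Map<String, Object> fields = new HashMap<>();
        void put(String name, Object value) { fields.put(name, value); }
        Object get(String name) { return fields.get(name); }
    }

    // Row-style stand-in: typed accessors by position, no caller-side casts.
    static class RowLikeRecord {
        private final Object[] values;
        RowLikeRecord(Object... values) { this.values = values; }
        String getString(int i) { return (String) values[i]; }
        long getLong(int i) { return (Long) values[i]; }
    }

    public static void main(String[] args) {
        AvroLikeRecord avro = new AvroLikeRecord();
        avro.put("_hoodie_record_key", "key-1");
        avro.put("ts", 1234L);
        // Caller must remember field names and cast the raw Objects.
        String key = (String) avro.get("_hoodie_record_key");
        long ts = (Long) avro.get("ts");

        RowLikeRecord row = new RowLikeRecord("key-1", 1234L);
        // Typed getters read more naturally and avoid the casts.
        System.out.println(key.equals(row.getString(0)) && ts == row.getLong(1));
    }
}
```

Both reads return the same data; the Row-style accessors just push the 
schema/casting burden out of user code.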

WDYT? :)

 

> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>
>                 Key: HUDI-635
>                 URL: https://issues.apache.org/jira/browse/HUDI-635
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
