[
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046166#comment-17046166
]
lamber-ken commented on HUDI-635:
---------------------------------
hi [~vinoth], here are my initial thoughts; they may not be correct.
Can we try to replace RDD<HoodieRecord> with Dataset<Row>? RDDs don't have a
built-in optimization engine, so when working with structured data they cannot
take advantage of Spark's advanced optimizers (e.g. the Catalyst optimizer).
* When upserting data to Hudi, records are converted to Avro; these many
conversion operations may cost more time.
* DataFrame supports adding columns, e.g. the additional metadata columns
(_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key).
Also, we can expose Row to users (instead of GenericRecord) in the payload;
users can then call methods like getString, getDate, etc., which are friendlier.
With GenericRecord, users need to know the schema and convert the data from
bytes themselves.
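To illustrate the payload API difference, here is a minimal Scala sketch
(method names and the field used are hypothetical, for illustration only;
assumes Spark's Row and Avro's GenericRecord on the classpath):

```scala
import org.apache.spark.sql.Row
import org.apache.avro.generic.GenericRecord

object PayloadAccessSketch {
  // With Dataset[Row], a payload can read fields with typed accessors by name:
  def keyFromRow(row: Row): String =
    row.getString(row.fieldIndex("_hoodie_record_key"))

  // With GenericRecord, get() returns Object (often an Avro Utf8), so the
  // caller must know the schema and convert the value themselves:
  def keyFromAvro(record: GenericRecord): String =
    record.get("_hoodie_record_key").toString
}
```

The Row version is schema-aware at the engine level, while the Avro version
pushes schema handling onto user code.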
WDYT? :)
> MergeHandle's DiskBasedMap entries can be thinner
> -------------------------------------------------
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Performance, Writer Core
> Reporter: Vinoth Chandar
> Assignee: Vinoth Chandar
> Priority: Major
>
> Instead of <Key, HoodieRecord>, we can just track <Key, Payload> ... Helps
> with use-cases like HUDI-625
--
This message was sent by Atlassian Jira
(v8.3.4#803005)