Alexey Kudinkin created HUDI-4249:
-------------------------------------

             Summary: Fix in-memory HoodieData implementations to operate lazily
                 Key: HUDI-4249
                 URL: https://issues.apache.org/jira/browse/HUDI-4249
             Project: Apache Hudi
          Issue Type: Bug
            Reporter: Alexey Kudinkin
            Assignee: Alexey Kudinkin
             Fix For: 0.12.0


Currently, both `HoodieListData` and `HoodieMapPairData` operate eagerly on 
their payloads, meaning that each transformation is applied immediately. 

This has the following performance drawbacks:
 # It always executes the full transformation regardless of whether the whole 
sequence is actually required, potentially wasting quite a bit of compute.
 # It can also cause OOMs if the sequence is larger than available memory 
(where the caller might be relying on the assumption that it is performing 
stream processing).

 

Instead, both should be rebased to hold `Stream`s internally and provide 
semantics close to those of Spark's RDD container.
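
To illustrate the difference, here is a minimal sketch using only 
`java.util.stream`. The classes `EagerListData` and `LazyListData` and their 
methods `map` and `take` are hypothetical stand-ins, not Hudi's actual 
`HoodieData` API. The lazy variant holds a `Supplier<Stream<T>>`, which is an 
assumption made here because a Java `Stream` can only be consumed once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

public class LazyDataSketch {

  // Hypothetical eager container: map() transforms every element right away.
  static final class EagerListData<T> {
    private final List<T> items;

    EagerListData(List<T> items) {
      this.items = items;
    }

    <R> EagerListData<R> map(Function<T, R> fn) {
      List<R> out = new ArrayList<>(items.size());
      for (T item : items) {
        out.add(fn.apply(item)); // applied to all elements immediately
      }
      return new EagerListData<>(out);
    }

    List<T> take(int n) {
      return new ArrayList<>(items.subList(0, Math.min(n, items.size())));
    }
  }

  // Hypothetical lazy container: map() only records the transformation.
  // A Supplier is held because a Stream itself is single-use.
  static final class LazyListData<T> {
    private final Supplier<Stream<T>> source;

    LazyListData(Supplier<Stream<T>> source) {
      this.source = source;
    }

    static <T> LazyListData<T> of(List<T> items) {
      return new LazyListData<>(items::stream);
    }

    <R> LazyListData<R> map(Function<T, R> fn) {
      return new LazyListData<>(() -> source.get().map(fn)); // nothing runs yet
    }

    // Terminal operation: only as many elements as requested are pulled.
    List<T> take(int n) {
      return source.get().limit(n).collect(Collectors.toList());
    }
  }

  // Counts how often the mapping function runs when taking n elements eagerly.
  public static int eagerInvocations(List<Integer> input, int n) {
    AtomicInteger calls = new AtomicInteger();
    new EagerListData<>(input).map(x -> {
      calls.incrementAndGet();
      return x * 2;
    }).take(n);
    return calls.get();
  }

  // Same pipeline on the lazy container.
  public static int lazyInvocations(List<Integer> input, int n) {
    AtomicInteger calls = new AtomicInteger();
    LazyListData.of(input).map(x -> {
      calls.incrementAndGet();
      return x * 2;
    }).take(n);
    return calls.get();
  }

  public static void main(String[] args) {
    List<Integer> input =
        IntStream.rangeClosed(1, 1000).boxed().collect(Collectors.toList());
    System.out.println("eager map calls: " + eagerInvocations(input, 3));
    System.out.println("lazy map calls:  " + lazyInvocations(input, 3));
  }
}
```

With a sequential stream, the eager variant invokes the mapping function for 
all 1000 elements, while the lazy variant invokes it only for the 3 elements 
actually consumed. The eager variant also materializes a full intermediate 
list per transformation, which is the memory concern described above.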



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
