GitHub user the-other-tim-brown edited a comment on the discussion: RFC-80: 
Design proposal discussions

Reader path:

Baseline (No split column groups):
Pros:
- All data is in a base file + log files and can be easily read
- Can easily prune the files that must be read

Cons:
- Since each file potentially holds fewer keys, more files must be opened even 
when a query reads only a subset of columns.

Proposal A:
Pros:
- If we can maintain consistent ordering between the column groups, we can open 
multiple iterators and just iterate through them and join the values to compute 
the final row.
- We can more easily prune files that need to be read since they are grouped by 
the keys.
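
A minimal sketch of the iterator-based stitching described in the first point, 
assuming every column group yields (key, columns) pairs in the same key order 
and contains every key. The function name and the tuple layout are hypothetical 
and are not taken from the RFC or from Hudi's reader code.

```python
from typing import Dict, Iterable, Iterator, Tuple

Row = Dict[str, object]


def stitch_sorted_column_groups(
    *groups: Iterable[Tuple[str, Row]]
) -> Iterator[Tuple[str, Row]]:
    """Join column groups that all yield (key, columns) in the same key order."""
    iterators = [iter(group) for group in groups]
    # Advance all iterators in lockstep; no buffering or hash join is needed
    # as long as the ordering assumption holds.
    for entries in zip(*iterators):
        keys = {key for key, _ in entries}
        if len(keys) != 1:
            # The ordering assumption is violated; a real reader would have
            # to fall back to a join or to buffering.
            raise ValueError(f"column groups are not aligned: {sorted(keys)}")
        merged: Row = {}
        for _, columns in entries:
            merged.update(columns)
        yield entries[0][0], merged
```

Only one entry per column group is held in memory at a time, which is what 
makes consistent ordering attractive for this proposal.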

Cons:
- Column group files can end up small, and many small files can hurt read 
performance.
- If the ordering of keys is not consistent between the files, we will need to 
do a join on the rows or buffer some of the files in memory to compute the 
final rows.
- If event time ordering is used and the ordering field is not in the column 
group that is read, then we will potentially need to read the value from the 
other file group to properly determine the final row when merging log files.
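
The event time concern in the last point can be sketched as follows. This is an 
illustration under assumed data shapes, not Hudi's merge logic: `queried` holds 
(key, commit, value) entries from the column group being read, in commit order, 
and `ordering` maps (key, commit) to the ordering value stored in a different 
column group. Both names and layouts are hypothetical.

```python
from typing import Dict, List, Tuple


def merge_by_event_time(
    queried: List[Tuple[str, str, object]],
    ordering: Dict[Tuple[str, str], int],
) -> Dict[str, object]:
    """Keep, per key, the value with the highest ordering value.

    Ties are resolved in favor of the entry that appears later in `queried`.
    """
    latest: Dict[str, Tuple[int, object]] = {}
    for key, commit, value in queried:
        # This lookup stands in for the extra read from the other column
        # group: the queried group alone cannot tell which value wins.
        event_time = ordering[(key, commit)]
        current = latest.get(key)
        if current is None or event_time >= current[0]:
            latest[key] = (event_time, value)
    return {key: value for key, (_, value) in latest.items()}
```

In the usage below, the later commit carries an older event time, so merging by 
commit order alone would pick the wrong value.

```python
queried = [("k1", "c1", "late-event"), ("k1", "c2", "early-event")]
ordering = {("k1", "c1"): 200, ("k1", "c2"): 100}
result = merge_by_event_time(queried, ordering)  # keeps "late-event" for k1
```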

Proposal B:
Pros:
- Well-sized files lead to better read performance for individual files.

Cons:
- Since row keys are now split amongst various file groups, the rows must be 
computed by doing a join between the column groups. 
- If a filter is specified on a field in one column group, we will not be able 
to easily prune the candidate files from the other file groups, leading to more 
IO for a given query.
- For incremental queries, if the commit time is only reflected in the updated 
column groups, then we may not be able to effectively filter out files, since 
we can only know when the row was updated after joining all the column groups.
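
The incremental query concern can be illustrated with a small sketch. It 
assumes each column group maps a row key to a (commit_time, columns) pair and 
that a row's last update is the maximum commit time across its column groups. 
The function name, the integer commit times, and the dictionary layout are 
hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple

ColumnGroup = Dict[str, Tuple[int, Dict[str, object]]]


def incremental_rows(
    column_groups: List[ColumnGroup], since: int
) -> Dict[str, Dict[str, object]]:
    """Return full rows whose latest commit time across all groups is > since."""
    joined: Dict[str, Tuple[int, Dict[str, object]]] = {}
    for group in column_groups:
        for key, (commit_time, columns) in group.items():
            prev_commit, row = joined.get(key, (commit_time, {}))
            # The row-level commit time is only known once every column
            # group has contributed its part.
            joined[key] = (max(prev_commit, commit_time), {**row, **columns})
    return {key: row for key, (commit, row) in joined.items() if commit > since}
```

In the usage below, the first column group has no commit newer than 20, so a 
commit-time filter on that group alone would skip it, yet its columns are still 
needed to return the full row for `k1`.

```python
cg1 = {"k1": (10, {"a": 1}), "k2": (10, {"a": 2})}
cg2 = {"k1": (30, {"b": "x"}), "k2": (10, {"b": "y"})}
changed = incremental_rows([cg1, cg2], since=20)
```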

GitHub link: 
https://github.com/apache/hudi/discussions/14062#discussioncomment-14630183
