[jira] [Commented] (KUDU-2243) CFile Reader improvements

Dan Burkert (JIRA) Thu, 25 Jan 2018 11:21:11 -0800

    [ 
https://issues.apache.org/jira/browse/KUDU-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339682#comment-16339682
 ]


Dan Burkert commented on KUDU-2243:
-----------------------------------

We discussed renaming these concepts on Slack and came to the following 
conclusions:

CFile should be renamed to Chunk:
 * The name CFile suggests a file, but in reality a CFile maps 1:1 to an 
fs::BlockManager block.
 * CFile maps very closely to Parquet's column chunk abstraction ('Column 
chunk: A chunk of the data for a particular column')
 * We'd therefore have column chunks, ad-hoc index chunks, bloom chunks, and 
delta chunks

cfile block/cblock should be renamed to page:
 * As the unit of encoding and compression, and the smallest indivisible 
on-disk container, it maps very well to the classical database concept of a 
page.
 * It maps well to Parquet's concept of a page ('Page: Column chunks are 
divided up into pages. A page is conceptually an indivisible unit (in terms of 
compression and encoding). There can be multiple page types which is 
interleaved in a column chunk.')

The current fs block manager block abstraction will remain, and the term 
'block' will refer unambiguously to it.

> CFile Reader improvements
> -------------------------
>
>                 Key: KUDU-2243
>                 URL: https://issues.apache.org/jira/browse/KUDU-2243
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile
>    Affects Versions: 1.6.0
>            Reporter: Dan Burkert
>            Priority: Major
>
> I've done a pretty thorough review of all the CFile reader code over the last 
> few days in order to make a targeted bug fix, and I've got some ideas for how 
> we can simplify it.  I'd like to get others' thoughts.
> * To reduce confusion between CFile data blocks and FS manager blocks, I 
> think we should change all references in code and docs of CFile data blocks 
> to 'cblock'.
> * Much of the complexity of the CFileIterator is due to its complex public 
> API, which requires separate {{Seek(idx) -> Prepare(nrows) -> Scan(output 
> buf, predicates)}} calls.  Additionally, the Prepare step can materialize 
> many blocks, which then need to be put in a queue. I think all of this could 
> be simplified by changing the API to be {{Seek(idx) -> Scan(nrows, output 
> buf, predicates)}}, and have the CFile iterator only cache the 
> most-recently-materialized block (instead of the queue). For really big scan 
> batches, this will change the internal scan/materialize pattern from 
> materializing all cblocks up front and then copying, to interleaving the 
> materializing and copying of cblocks.  Since cblocks are usually much bigger 
> (256 KiB) than scan batches (100 cells), I think it won't actually lead to 
> measurably different behavior.
> * {{QueueCurrentDataBlock}} and {{ReadCurrentDataBlock}} should drop 
> {{Current}}.
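
The proposed {{Seek(idx) -> Scan(nrows, output buf, predicates)}} shape could be sketched roughly as below. This is a minimal illustration, not Kudu's actual CFileIterator: the class name SketchCFileIterator, its members, and the use of plain int32_t vectors as stand-ins for encoded cblocks are all assumptions made for the example, and predicate evaluation is left out. It only shows how a scan can materialize one cblock at a time and keep just the most recent one cached, instead of preparing a queue of blocks up front.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: each inner vector stands in for one encoded cblock.
class SketchCFileIterator {
 public:
  explicit SketchCFileIterator(std::vector<std::vector<int32_t>> blocks)
      : blocks_(std::move(blocks)) {
    // Record the row ordinal at which each block starts.
    for (const auto& b : blocks_) {
      starts_.push_back(total_rows_);
      total_rows_ += b.size();
    }
  }

  // Positions the iterator at row ordinal idx. Returns false (and leaves the
  // position unchanged) if idx is past the end.
  bool Seek(size_t idx) {
    if (idx >= total_rows_) return false;
    next_row_ = idx;
    return true;
  }

  // Appends up to nrows values to *out and returns how many were copied.
  // Materializing and copying are interleaved: a block is decoded only when
  // the scan position leaves the cached one.
  size_t Scan(size_t nrows, std::vector<int32_t>* out) {
    size_t copied = 0;
    while (copied < nrows && next_row_ < total_rows_) {
      if (!has_cached_ || next_row_ < cached_first_row_ ||
          next_row_ >= cached_first_row_ + cached_.size()) {
        Materialize(FindBlock(next_row_));
      }
      const size_t offset = next_row_ - cached_first_row_;
      const size_t n = std::min(nrows - copied, cached_.size() - offset);
      const auto first = cached_.begin() + static_cast<std::ptrdiff_t>(offset);
      out->insert(out->end(), first, first + static_cast<std::ptrdiff_t>(n));
      copied += n;
      next_row_ += n;
    }
    return copied;
  }

  // Number of blocks decoded so far (exposed only to observe the caching).
  size_t materialize_count() const { return materialize_count_; }

 private:
  // Returns the index of the block containing row; row must be in range.
  size_t FindBlock(size_t row) const {
    for (size_t i = 0; i < blocks_.size(); ++i) {
      if (row < starts_[i] + blocks_[i].size()) return i;
    }
    return blocks_.size();
  }

  // Stand-in for decoding and decompressing a cblock: replaces the single
  // cached block with a copy of block i.
  void Materialize(size_t i) {
    cached_ = blocks_[i];
    cached_first_row_ = starts_[i];
    has_cached_ = true;
    ++materialize_count_;
  }

  std::vector<std::vector<int32_t>> blocks_;
  std::vector<size_t> starts_;
  size_t total_rows_ = 0;
  size_t next_row_ = 0;
  std::vector<int32_t> cached_;
  size_t cached_first_row_ = 0;
  bool has_cached_ = false;
  size_t materialize_count_ = 0;
};
```

In this sketch a caller would Seek once and then call Scan repeatedly with its batch size. When the batch is much smaller than a block, most Scan calls are served from the cached block and do not materialize anything new.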



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
