[jira] [Commented] (KUDU-2243) CFile Reader improvements

Todd Lipcon (JIRA) Mon, 18 Dec 2017 19:06:06 -0800

    [ 
https://issues.apache.org/jira/browse/KUDU-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296086#comment-16296086
 ]


Todd Lipcon commented on KUDU-2243:
-----------------------------------

bq. To reduce confusion between CFile data blocks and FS manager blocks, I 
think we should change all references in code and docs of CFile data blocks to 
'cblock'.

I'm on board with that but then I think we should also rename the internal 
"blocks" of data to "pages" -- ie have PageDecoder, PageEncoder, etc, to make a 
more clear distinction between a "column block" and a page of data within that 
column block.

bq. Much of the complexity of the CFileIterator is due to it's complex public 
API, which requires separate Seek(idx) -> Prepare(nrows) -> Scan(output buf, 
predicates) call

I seem to recall that this design was meant to allow the output of the Scan to 
reference memory held by the prepared blocks. ie in the case of a PLAIN-encoded 
string column, it could just have pointers directly to the data in the block, 
rather than copying them into any output arena.  The data returned by Scan 
would be guaranteed to be valid only until the next 'Prepare' call at which 
point you might drop the memory from the previous batch.

I don't recall at the moment whether we actually make use of this optimization, 
though. The docs for 'Scan' seem to indicate it is not.

> CFile Reader improvements
> -------------------------
>
>                 Key: KUDU-2243
>                 URL: https://issues.apache.org/jira/browse/KUDU-2243
>             Project: Kudu
>          Issue Type: Improvement
>          Components: cfile
>    Affects Versions: 1.6.0
>            Reporter: Dan Burkert
>
> I've done a pretty thorough review of all the CFile reader code over the last 
> few days in order to make a targeted bug fix, and I've got some ideas for how 
> we can simplify it.  I'd like to get others thoughts.
> * To reduce confusion between CFile data blocks and FS manager blocks, I 
> think we should change all references in code and docs of CFile data blocks 
> to 'cblock'.
> * Much of the complexity of the CFileIterator is due to it's complex public 
> API, which requires separate {{Seek(idx) -> Prepare(nrows) -> Scan(output 
> buf, predicates)}} calls.  Additionally, the Prepare step can materialize 
> many blocks, which then need to be put in a queue. I think all of this could 
> be simplified by changing the API to be {{Seek(idx) -> Scan(nrows, output 
> buf, predicates)}}, and have the CFile iterator only cache the 
> most-recently-materialized block (instead of the queue). For really big scan 
> batches, this will change the internal scan/materialize pattern from 
> materializing all cblocks up front then copying, to materializing and copying 
> of cblocks being interleaved.  Since in most cases cblocks are usually much 
> bigger (256kib) than scan batches (100 cells), I think it won't actually lead 
> to measurably different behavior.
> * {{QueueCurrentDataBlock}} and {{ReadCurrentDataBlock}} should drop 
> {{Current}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (KUDU-2243) CFile Reader improvements

Reply via email to