Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5304: reduce transfer of Parquet decompression buffers
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6876/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

Line 1029: Status BaseScalarColumnReader::ReadDataPage() {
> Not necessarily for this patch, but we could also consider setting stream_-
We already do that in HdfsParquetScanner::InitColumns() if the file is 
compressed or the column is non-varlen. It doesn't do it if the page is 
dictionary-encoded - that would be trickier since that can change between data 
pages.


Line 1037:     decompressed_data_pool_->FreeAll();
> Why not Reset()?
You mean Clear()? I don't want to make that change - it's really difficult to 
reason about memory consumption with Clear() in general because 
arbitrarily-sized chunks can just accumulate in the pool. E.g. 
Allocate(MAX_CHUNK_SIZE), Clear(), Allocate(MAX_CHUNK_SIZE + 1), Clear(), 
Allocate(MAX_CHUNK_SIZE + 2), etc is pathologically bad because the chunks 
never get recycled.


http://gerrit.cloudera.org:8080/#/c/6876/2/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 472:   /// * Dictionary-compressed pages, where the var-len data lives in 
the dictionary.
> May help to add a remark that the dictionary lives in the dictionary_pool.
Done


Line 474:   bool CurrentPageContainsTupleData() {
> Also not true if pages are compressed.
I'm not sure that I understand - that only affects whether the data is in the 
I/O buffer or the decompressed_data_pool_. This function doesn't claim or need 
to distinguish between those cases.

If the page is uncompressed then the var-len data would be in the I/O buffer 
(and the decompressed_data_pool_ transfer is a no-op), while if the page is 
compressed the var-len data is in the decompressed_data_pool_ and we need to 
transfer from decompressed_data_pool_. In that case we set contains_tuple_data 
= false on the stream so the I/O buffers aren't transferred.


-- 
To view, visit http://gerrit.cloudera.org:8080/6876
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to