Tim Armstrong has posted comments on this change. Change subject: IMPALA-5304: reduce transfer of Parquet decompression buffers ......................................................................
Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/6876/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 1029: Status BaseScalarColumnReader::ReadDataPage() { > Not necessarily for this patch, but we could also consider setting stream_- We already do that in HdfsParquetScanner::InitColumns() if the file is compressed or the column is non-varlen. It doesn't do it if the page is dictionary-encoded - that would be trickier since that can change between data pages. Line 1037: decompressed_data_pool_->FreeAll(); > Why not Reset()? You mean Clear()? I don't want to make that change - it's really difficult to reason about memory consumption with Clear() in general because arbitrarily-sized chunks can just accumulate in the pool. E.g. Allocate(MAX_CHUNK_SIZE), Clear(), Allocate(MAX_CHUNK_SIZE + 1), Clear(), Allocate(MAX_CHUNK_SIZE + 2), etc is pathologically bad because the chunks never get recycled. http://gerrit.cloudera.org:8080/#/c/6876/2/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 472: /// * Dictionary-compressed pages, where the var-len data lives in the dictionary. > May help to add a remark that the dictionary lives in the dictionary_pool. Done Line 474: bool CurrentPageContainsTupleData() { > Also not true if pages are compressed. I'm not sure that I understand - that only affects whether the data is in the I/O buffer or the decompressed_data_pool_. This function doesn't claim or need to distinguish between those cases. If the page is uncompressed then the var-len data would be in the I/O buffer (and the decompressed_data_pool_ transfer is a no-op), while if the page is compressed the var-len data is in the decompressed_data_pool_ and we need to transfer from decompressed_data_pool_. In that case we set contains_tuple_data = false on the stream so the I/O buffers aren't transferred. -- To view, visit http://gerrit.cloudera.org:8080/6876 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
