[Impala-CR](cdh5-trunk) IMPALA-3376: Extra definition level when writing Parquet files

Matthew Jacobs (Code Review) Tue, 19 Jul 2016 15:43:55 -0700

Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................



Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 381: Encoding may fail for several reasons - because the current page is not big enough,
             :     // because we've encoded the maximum number of unique dictionary values and need to
             :     // switch to plain encoding, etc. so we may need to try again more than once.
I haven't spent much time looking through the rest of the table-writer code, 
so this could be a non-issue, but I'm a bit worried that a subtle bug in 
EncodeValue/FinalizeCurrentPage/NewPage could lead to an infinite loop here, 
perhaps in corner cases with weird data. Is there a clear set of state 
transitions? This relies on EncodeValue() behaving properly, and it is hard to 
read this code and understand why it is _obviously correct_. I don't think your 
code increases the risk of issues, but it's worth thinking about any DCHECKs 
that could help.
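
One way to make the termination argument checkable is to bound the number of 
retries and assert if the bound is exceeded. The following is a minimal sketch 
of that idea only: ToyColumnWriter, its members, and kMaxAttempts are 
hypothetical stand-ins rather than the actual table-writer classes, and a plain 
assert stands in for DCHECK.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a column writer; not the real Impala class.
struct ToyColumnWriter {
  size_t page_capacity;        // number of values that fit in one page
  size_t values_in_page = 0;   // values encoded into the current page
  size_t pages_finalized = 0;  // pages closed so far

  explicit ToyColumnWriter(size_t capacity) : page_capacity(capacity) {}

  // Returns false when the current page is full, mimicking an encode failure.
  bool EncodeValue(int /*value*/) {
    if (values_in_page == page_capacity) return false;
    ++values_in_page;
    return true;
  }

  // Closes the current page and starts an empty one.
  void FinalizeCurrentPage() {
    ++pages_finalized;
    values_in_page = 0;
  }

  // Appends one value, retrying after each recovery step. The attempt bound
  // turns a would-be infinite loop into an assertion failure.
  bool AppendValue(int value) {
    constexpr int kMaxAttempts = 3;  // assumed bound, chosen for illustration
    for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
      if (EncodeValue(value)) return true;
      FinalizeCurrentPage();
    }
    assert(false && "EncodeValue() kept failing after recovery");
    return false;
  }
};
```

In the real writer the bound would follow from the state transitions (for 
example: at most one switch to plain encoding plus one new page per value), 
which is the invariant a DCHECK could document.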


http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

PS5, Line 133: We i
Remove "We".


PS5, Line 146: with our RLE scheme it is not possible to determine how many values
             : //     were actually written if the final run is a literal run, only if the final run is
             : //     a repeated run.
Why can't we determine how many values were written in a literal run?
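
For context, a possible explanation, assuming the writer follows the Parquet 
format's RLE/bit-packing hybrid encoding: a literal (bit-packed) run header 
stores a count of 8-value groups, so a final group may be padded, while a 
repeated run header stores the exact run length. The sketch below interprets an 
already varint-decoded run header under that assumption; RunInfo and 
DescribeRun are illustrative names, not functions from the patch.

```cpp
#include <cassert>
#include <cstdint>

struct RunInfo {
  bool is_literal;      // true for a bit-packed (literal) run
  uint32_t num_values;  // value slots the header accounts for
};

// Interprets a run header that has already been varint-decoded.
// Low bit 1: literal run, the remaining bits count groups of 8 values.
// Low bit 0: repeated run, the remaining bits are the exact run length.
RunInfo DescribeRun(uint32_t header) {
  RunInfo info;
  info.is_literal = (header & 1) != 0;
  const uint32_t count = header >> 1;
  info.num_values = info.is_literal ? count * 8 : count;
  return info;
}
```

Under this scheme a literal run holding five real values and one holding eight 
produce the same header, so only an external value count can tell them apart.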


PS5, Line 149: CheckDataPage
I think the decompression is getting confusing because it is mixed in with the 
memory management. How about splitting the decompression out into a separate fn 
that takes both the compressed data buffer and a buffer already allocated by 
the caller (which should be of size header.uncompressed_page_size)? Then the fn 
that actually does the work of checking a data page can just take a 
const uint8_t* to uncompressed data.
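
A rough sketch of that split, under stated assumptions: ToyPageHeader, 
DecompressPage, and CheckPage are hypothetical names, the "codec" is a toy that 
doubles every byte in place of a real decompressor, and this CheckDataPage only 
counts non-zero bytes as a placeholder for the real checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page header carrying only the fields this sketch needs.
struct ToyPageHeader {
  bool is_compressed;
  size_t compressed_page_size;
  size_t uncompressed_page_size;
};

// Toy "codec": every input byte expands to two copies. It stands in for a
// real decompressor and writes only into the caller-provided buffer.
bool DecompressPage(const uint8_t* compressed, size_t compressed_len,
                    uint8_t* out, size_t out_len) {
  if (out_len != compressed_len * 2) return false;
  for (size_t i = 0; i < compressed_len; ++i) {
    out[2 * i] = compressed[i];
    out[2 * i + 1] = compressed[i];
  }
  return true;
}

// Placeholder check: only ever sees uncompressed bytes and owns no memory.
size_t CheckDataPage(const uint8_t* data, size_t len) {
  size_t nonzero = 0;
  for (size_t i = 0; i < len; ++i) nonzero += (data[i] != 0);
  return nonzero;
}

// The caller owns the buffer, so it outlives both calls below.
size_t CheckPage(const ToyPageHeader& header, const uint8_t* page) {
  assert(page != nullptr);
  if (!header.is_compressed) {
    return CheckDataPage(page, header.uncompressed_page_size);
  }
  std::vector<uint8_t> buffer(header.uncompressed_page_size);
  if (!DecompressPage(page, header.compressed_page_size, buffer.data(),
                      buffer.size())) {
    return 0;
  }
  return CheckDataPage(buffer.data(), buffer.size());
}
```

With this shape no function returns a pointer into a buffer it allocated 
itself, so ownership is visible at the call site.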


PS5, Line 149: uint8_t* data
Please have the comment mention that data is decompressed if the header 
indicates it is compressed, and that this is an in/out parameter that will 
return the uncompressed data.


PS5, Line 150: std::vector<uint8_t> decompressed_buffer;
Why is this stack allocated? Doesn't it go out of scope when this fn returns, 
even though you return the pointer?


PS5, Line 171: *reinterpret_cast<int*>(data);
Can you add one sentence about the data layout, or point to somewhere that 
describes it?
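
For illustration, assuming the cast reads the 4-byte little-endian length that 
Parquet v1 data pages place before RLE-encoded levels (an assumption about the 
patch, not something stated in this review), the layout could be documented 
roughly as below. PageBodyView, SplitPageBody, and ReadLittleEndian32 are 
hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Decodes a 4-byte little-endian unsigned integer without relying on host
// byte order or pointer alignment.
uint32_t ReadLittleEndian32(const uint8_t* data) {
  return static_cast<uint32_t>(data[0]) |
         (static_cast<uint32_t>(data[1]) << 8) |
         (static_cast<uint32_t>(data[2]) << 16) |
         (static_cast<uint32_t>(data[3]) << 24);
}

// Hypothetical view of a page body laid out as:
//   [4-byte length N][N bytes of RLE-encoded levels][encoded values]
struct PageBodyView {
  const uint8_t* levels;  // start of the RLE-encoded levels
  uint32_t levels_bytes;  // N, taken from the length prefix
  const uint8_t* values;  // first byte after the levels
};

PageBodyView SplitPageBody(const uint8_t* data) {
  assert(data != nullptr);
  PageBodyView view;
  view.levels_bytes = ReadLittleEndian32(data);
  view.levels = data + 4;
  view.values = view.levels + view.levels_bytes;
  return view;
}
```

Assembling the bytes explicitly also sidesteps the alignment and byte-order 
assumptions that a reinterpret_cast on a uint8_t* carries.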


PS5, Line 174:  
nit: extra space


-- 
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
