Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................

Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

PS5, Line 381: Encoding may fail for several reasons - because the current page is not big enough,
             : // because we've encoded the maximum number of unique dictionary values and need to
             : // switch to plain encoding, etc. so we may need to try again more than once.
I haven't spent a ton of time looking through all the table-writer code, so this could be a non-issue, but I'm a bit worried that a subtle bug in EncodeValue/FinalizeCurrentPage/NewPage could lead to infinite loops here, perhaps in corner cases with weird data. Is there a clear set of state transitions? This relies on EncodeValue() behaving properly, and it is hard to read this code and understand why it is _obviously correct_. I don't think your code increases the risk of issues, but it's worth thinking about any DCHECKs that could help.


http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

PS5, Line 133: We i
Remove "We".


PS5, Line 146: with our RLE scheme it is not possible to determine how many values
             : // were actually written if the final run is a literal run, only if the final run is
             : // a repeated run.
Why can't we determine how many values were written in a literal run?


PS5, Line 149: CheckDataPage
I think the decompression is getting confusing with the memory management. How about splitting the decompression out into a separate fn that takes both the compressed data buffer and a buffer already allocated by the caller (which should be of size header.uncompressed_page_size)? Then the fn that actually does the work of checking a data page can just take a const uint8_t* to uncompressed data.

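
Roughly the shape I have in mind, as an untested sketch rather than a drop-in change. Everything here is a stand-in: `PageHeaderSketch` only mimics the few header fields involved, `DecompressPageSketch` uses a toy (count, byte) codec instead of the real decompressor, and `CheckDataPageSketch` just reads a 4-byte little-endian prefix as an example of work done on uncompressed data.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the page header; only the fields used below.
struct PageHeaderSketch {
  bool is_compressed;
  int32_t compressed_page_size;
  int32_t uncompressed_page_size;
};

// Decompresses 'compressed' into 'out', which the caller has already allocated
// with header.uncompressed_page_size bytes. The toy codec here treats the input
// as (count, byte) pairs; a real codec call would go in its place.
bool DecompressPageSketch(const PageHeaderSketch& header,
                          const uint8_t* compressed, uint8_t* out) {
  int32_t written = 0;
  for (int32_t i = 0; i + 1 < header.compressed_page_size; i += 2) {
    const int32_t count = compressed[i];
    if (written + count > header.uncompressed_page_size) return false;
    std::memset(out + written, compressed[i + 1], count);
    written += count;
  }
  return written == header.uncompressed_page_size;
}

// Does the actual checking. Only ever sees uncompressed data and owns nothing.
// As an illustration it returns a 4-byte little-endian prefix, or -1 if the
// page is too small to contain one.
int32_t CheckDataPageSketch(const PageHeaderSketch& header, const uint8_t* data) {
  if (header.uncompressed_page_size < 4) return -1;
  const uint32_t v = static_cast<uint32_t>(data[0]) |
                     (static_cast<uint32_t>(data[1]) << 8) |
                     (static_cast<uint32_t>(data[2]) << 16) |
                     (static_cast<uint32_t>(data[3]) << 24);
  return static_cast<int32_t>(v);
}

// Caller: owns the decompression buffer, so no pointer outlives its storage.
int32_t ProcessPageSketch(const PageHeaderSketch& header, const uint8_t* page) {
  assert(page != nullptr);
  std::vector<uint8_t> buffer;  // lives until after the last use of 'data'
  const uint8_t* data = page;
  if (header.is_compressed) {
    buffer.resize(header.uncompressed_page_size);
    if (!DecompressPageSketch(header, page, buffer.data())) return -1;
    data = buffer.data();
  }
  return CheckDataPageSketch(header, data);
}
```

With this split the checking fn has no in/out parameter, and the lifetime of the uncompressed bytes is visible at the single call site.
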

PS5, Line 149: uint8_t* data
Please have the comment mention that data is decompressed if the header indicates it is compressed, and that this is an in/out parameter that will return the uncompressed data.


PS5, Line 150: std::vector<uint8_t> decompressed_buffer;
Why is this stack allocated? Doesn't it go out of scope when this fn returns, even though you return a pointer into it?


PS5, Line 171: *reinterpret_cast<int*>(data);
Can you add one sentence about the data layout, or point to somewhere that describes it?


PS5, Line 174:
nit: extra space


--
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 5
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes
