Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-3376: Extra definition level when writing Parquet files ......................................................................
Patch Set 6: (12 comments) http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 380: // because we've encoded the maximum number of unique dictionary values and need to > We're up to 4 levels on nesting in this function, which seems a bit high to Done PS5, Line 381: itch to plain encoding, etc. so we may need to try again more than once. : // TODO: Have a clearer set of state transitions here, to make it easier to see that : // this won't loop forever. > I haven't spent a ton of time looking through all the table-writer code, so As discussed, I definitely agree but this is a tricky thing to fix so I'll leave a todo. Line 415: // Now that the value has been successfully written, write the definition level. > Can you add a short comment to explain the DCHECK? Is the buffer_full() che Done http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc File be/src/util/parquet-reader.cc: PS5, Line 133: Inhe > Remove we Done PS5, Line 146: with our RLE scheme it is not possible to determine how many values : // were actually written if the final run is a literal run, only if the final run is : // a repeated run > We can't we determine how many values were written in a literal run? Only for a trailing literal run, as we always encode them in groups of 8, for byte alignment reasons. This isn't an issue for the test I wrote, as all of the def levels are the same (i.e. the table doesn't have any NULL values), so they will always be encoding as a single repeated run. PS5, Line 150: uint8_t* data = page; > why is this stack allocated? Isn't this out of scope why this fn returns bu For this (and the above comments): data isn't an out parameter - the function changes where the 'data' parameter points, but not what it points at, so the 'data' variable in the main loop is unaffected. In fact, we can't make data an out parameter (eg. to split this into two functions as suggested above), since the main loop relies on data continuing to point to the original buffer to iterate through the rest of the pages. (of course, we could still split the function by creating a new pointer to pass in as an output parameter, but things start getting more complicated because then the main loop has to determine if it needs to allocate/deallocate memory for decompression) I understand the confusion, though, and I'll update it to be more clear. PS5, Line 171: definition_level_encoding == p > Can you add 1 sentence about the data layout or point to somewhere that doe Done PS5, Line 172: th > Make it int32_t here and the line above to be explicit. Done PS5, Line 174: v > nit extra space Done http://gerrit.cloudera.org:8080/#/c/3556/5/tests/query_test/test_writers.py File tests/query_test/test_writers.py: > I'm not sure that it makes sense to have this as a separate test file. We a Done Line 25 > Is this necessary? I think we should be able to do this for all supported f Done Line 37 > This seems like a fairly narrowly targeted test (just a col of integers). M Done -- To view, visit http://gerrit.cloudera.org:8080/3556 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9 Gerrit-PatchSet: 6 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Matthew Jacobs <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
