Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-3376: Extra definition level when writing Parquet files
......................................................................


Patch Set 6:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 380:   // because we've encoded the maximum number of unique dictionary 
values and need to
> We're up to 4 levels on nesting in this function, which seems a bit high to
Done


PS5, Line 381: itch to plain encoding, etc. so we may need to try again more 
than once.
             :   // TODO: Have a clearer set of state transitions here, to make 
it easier to see that
             :   // this won't loop forever.
> I haven't spent a ton of time looking through all the table-writer code, so
As discussed, I definitely agree but this is a tricky thing to fix so I'll 
leave a todo.


Line 415:   // Now that the value has been successfully written, write the 
definition level.
> Can you add a short comment to explain the DCHECK? Is the buffer_full() che
Done


http://gerrit.cloudera.org:8080/#/c/3556/5/be/src/util/parquet-reader.cc
File be/src/util/parquet-reader.cc:

PS5, Line 133: Inhe
> Remove we
Done


PS5, Line 146: with our RLE scheme it is not possible to determine how many 
values
             : //     were actually written if the final run is a literal run, 
only if the final run is
             : //     a repeated run 
> We can't we determine how many values were written in a literal run?
Only for a trailing literal run, as we always encode them in groups of 8, for 
byte alignment reasons.

This isn't an issue for the test I wrote, as all of the def levels are the same 
(i.e. the table doesn't have any NULL values), so they will always be encoding 
as a single repeated run.


PS5, Line 150: uint8_t* data = page;
> why is this stack allocated? Isn't this out of scope why this fn returns bu
For this (and the above comments): data isn't an out parameter - the function 
changes where the 'data' parameter points, but not what it points at, so the 
'data' variable in the main loop is unaffected.

In fact, we can't make data an out parameter (eg. to split this into two 
functions as suggested above), since the main loop relies on data continuing to 
point to the original buffer to iterate through the rest of the pages.

(of course, we could still split the function by creating a new pointer to pass 
in as an output parameter, but things start getting more complicated because 
then the main loop has to determine if it needs to allocate/deallocate memory 
for decompression)

I understand the confusion, though, and I'll update it to be more clear.


PS5, Line 171: definition_level_encoding == p
> Can you add 1 sentence about the data layout or point to somewhere that doe
Done


PS5, Line 172:  th
> Make it int32_t here and the line above to be explicit.
Done


PS5, Line 174: v
> nit extra space
Done


http://gerrit.cloudera.org:8080/#/c/3556/5/tests/query_test/test_writers.py
File tests/query_test/test_writers.py:

> I'm not sure that it makes sense to have this as a separate test file. We a
Done


Line 25
> Is this necessary? I think we should be able to do this for all supported f
Done


Line 37
> This seems like a fairly narrowly targeted test (just a col of integers). M
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3556
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Matthew Jacobs <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to