Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9693 )

Change subject: IMPALA-5842: Write page index in Parquet files
......................................................................


Patch Set 2:

(20 comments)

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@300
PS2, Line 300:   std::vector<bool> null_pages_;
> Just saw Tim's comment. I think we discussed this a while ago in the Parque
OK, I'll implement this part after that.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@739
PS2, Line 739: null_pages_
> Sry for the confusion. The spec says that this must be set if and only if a
Oh, sure.


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1195
PS2, Line 1195:   // If the file contains more than one row_group, don't write 
the index.
> I missed this earlier - this seems weird. Do we ever write more than one ro
AFAIK we always write only one row group. I just copy-pasted this part from 
Pooja's work. Should I substitute it to a DCHECK maybe?


http://gerrit.cloudera.org:8080/#/c/9693/2/be/src/exec/hdfs-parquet-table-writer.cc@1224
PS2, Line 1224: <
> That sounds good to me.
Great, I implemented it that way.


http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/9693/3/be/src/exec/hdfs-parquet-table-writer.cc@1211
PS3, Line 1211:     // Let's see if there's an ordering on min/max values.
> After moving column.min_values_ above it is invalid here. It also leaves th
We are using columns_ in the next section to write the offset index, so can't 
reset them here, but at the end of the function.

I only call BaseColumnWriter::Reset() on them, because unique_ptr::reset 
generates segfaults later.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@55
PS3, Line 55:
> We should assert that the schema is flat, e.g. by comparing lengths of row_
Added an assert.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@63
PS3, Line 63:
> You should not need "is not None" here and elsewhere. The parentheses also
I removed it from couple of places.
Kept them in asserts, because I think it is more readable that way.
Also kept for "if max_value is not None" and such, because max_value can be set 
to 0.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@84
PS3, Line 84:
> This looks complicated enough to warrant a namedtuple, maybe even a class.
I chose namedtuple


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@87
PS3, Line 87:
> ...row_groups... (plural)
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@90
PS3, Line 90:
> this comment seems wrong
Removed the last sentence.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@103
PS3, Line 103:
> nit: All comments should follow PEP8 class comment format.
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@118
PS3, Line 118:
> long line
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@122
PS3, Line 122:
> It seems that this method is only called once, and without skip_col_idxs. R
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@147
PS3, Line 147:
> Impala currently does not write these, add an assertion
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@164
PS3, Line 164:
> What happens if it is None?
Added some extra asserts about it. If either stats.min_value or stats.max_value 
is None, the other must be None as well. And all pages need to be null.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@165
PS3, Line 165:
> Don't overwrite the variable
Done


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@205
PS3, Line 205:
> copy-paste error?
yup, done.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/query_test/test_parquet_page_index.py@230
PS3, Line 230:
> We also should have a test with a file that has 4K of 0xFF in it, just to m
Do you know a quick way of generate a parquet file like that?

I added the other tests.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py
File tests/util/get_parquet_index.py:

PS3:
> Maybe we should merge this file with get_parquet_metadata.py (and clean it
Moved the contents to get_parquet_metadata.


http://gerrit.cloudera.org:8080/#/c/9693/3/tests/util/get_parquet_index.py@29
PS3, Line 29:
> typo
Done



--
To view, visit http://gerrit.cloudera.org:8080/9693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icbacf7fe3b7672e3ce719261ecef445b16f8dec9
Gerrit-Change-Number: 9693
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 23 Mar 2018 22:27:10 +0000
Gerrit-HasComments: Yes

Reply via email to