Lars Volker has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 2:

(15 comments)

Thanks for the review, and apologies for the long delay. I ran into several 
issues when comparing statistics with Hive, until I found PARQUET-251, which 
seems to cause invalid stats to be written by Hive.

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
> Do we have end-to-end testing that the column stats are correct? We really 
There is a simple test in test_insert_parquet.py but it only checks rowgroup 
statistics, not page statistics. I think it will be easier to write more 
extensive tests once the read path has been implemented.


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90:   BaseColumnWriter(HdfsParquetTableWriter* parent, ExprContext* 
expr_ctx,
> I feel like this file is getting pretty large. Can we pull these classes ou
Done


Line 97:       num_values_(0),
> I think we call this Merge() in other places like aggregate functions.
Done


Line 101:       def_levels_(NULL),
> I think the memory management here needs a bit more explanation. I.e. when 
Done


Line 206:     // levels data and the encoded values.  If compression is 
enabled, this is the
> Do we know what ordering the parquet specification defines for each type? T
I added a comment to the class header. To me it looks safe to rely on the 
overloaded operators. Should we try and explicitly stamp out the template class 
only for the supported types to prevent new types from being added without 
manual interference? Are there any particular classes that you think may break 
in the future?


Line 224:   scoped_ptr<Codec> compressor_;
> I guess this is the explanation of Copy() I was looking for. I think we nee
Done


Line 288:         new ColumnStats<T>(parent_->per_file_mem_pool_.get(), 
encoded_value_size_));
> This can potentially use a lot of memory since the old strings are never fr
Done


PS1, Line 349: value to h
> Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMem
Done


Line 485: 
> Can't we initialise most of these using the initialiser list? We should do 
I moved the initialization to Reset() to pass the correct memory pool, similar 
to dict_encoder_.


Line 598: 
> Can't we initialise most of these using the initialiser list? We should do 
I made the *_stats_ inline members, so their default initialization is 
sufficient. The *_stats_base members are stored in the base class. I'm 
reluctant to move them into the c'tor, since for all other column writers they 
will not be initialized at construction time, but when calling Reset().


Line 628:   // TODO: copy repetition data when we support nested types.
> Do we need scoped_ptr? Can we just store these inline?
Done


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214: 
> Extra line
Done


Line 281:     'hive_skip_col_idx' are excluded from the verification of the 
expected values.
> I think we need to find a way to test timestamp.
Done


Line 293: 
> We should also test the types that aren't in alltypes - char, varchar, deci
Done


Line 299:         (qualified_table_name, source_table, num_lines)
> Does this exercise the string copying code? If the table is too small it se
I added a test using tpch_parquet.customer and manually verified that the 
resulting .parq file has multiple pages per column, thus exercising the string 
copy code. In hindsight it should already be sufficient to use a file with more 
multiple row batches, which functional.alltypes (7300 rows) should provide. I 
still think it makes sense to have a test with multiple pages per column so I'd 
opt to keep the new one, too.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to