Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer
......................................................................


Patch Set 1:

(15 comments)

I think the overall approach makes sense. I had some higher-level comments - I 
didn't do a full final pass since there will probably be some churn from this 
round of comments.

http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG
Commit Message:

Line 9: TODO Check whether we can use python bindings for parquet-cpp or pyarrow
Do we have end-to-end testing that the column stats are correct? We really 
don't want to turn this on before it's totally right, since if we write out bad 
data, then the files can exist for a long time.

Until we are totally confident that we're writing out the correct stats, we 
should turn this off by default and hide it behind an experimental flag.


http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 90: class HdfsParquetTableWriter::ColumnStatsBase {
I feel like this file is getting pretty large. Can we pull these classes out 
into a separate header?

Also, do the classes need to be nested inside HdfsParquetTableWriter?


Line 97:   virtual void Extend(const ColumnStatsBase* other) = 0;
I think we call this Merge() in other places like aggregate functions.


Line 101:   virtual void CopyValues(MemPool* mem_pool){};
I think the memory management here needs a bit more explanation. I.e. when it 
is ok for the object to reference memory that is not from 'mem_pool'.


Line 206:       min_value_ = min(min_value_, v);
Do we know what ordering the parquet specification defines for each type? The 
non-obvious things for me are:

* How are NULL values handled/ordered?
* How are strings ordered?
* Is it always going to be the same as our own overloaded operators? This makes 
me uneasy since if someone adds a new data type they probably won't think to 
look at the Parquet spec before implementing the overloaded operator. I would  
prefer this to be more explicit even if it results in more code.

We should probably explain this in the class header. And also probably not rely 
on the overloaded operators.


Line 224: // StringValues need to be copied at the end of processing a row 
batch, since the batch
I guess this is the explanation of Copy() I was looking for. I think we need to 
document it in the base class since it's in the base class API.


Line 288:     char* new_ptr = 
reinterpret_cast<char*>(mem_pool->Allocate(value->len));
This can potentially use a lot of memory since the old strings are never freed. 
I think using StringBuffer would avoid that problem.


PS1, Line 349: CopyValues
Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMemory, 
since we don't always copy the value, e.g. if it's an integer? Not sure what a 
good name would be.


Line 485:     page_stats_.reset(new ColumnStats<T>(encoded_value_size_));
Can't we initialise most of these using the initialiser list? We should do that 
for consistency.


Line 598:     page_stats_.reset(new ColumnStats<bool>());
Can't we initialise most of these using the initialiser list? We should do that 
for consistency.


Line 628:   scoped_ptr<ColumnStats<bool>> page_stats_;
Do we need scoped_ptr? Can we just store these inline?


http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

Line 214: 
Extra line


Line 281:       # Timestamps are written differently by Hive and Impala so we 
don't validate them
I think we need to find a way to test timestamp.


Line 293:         "functional.alltypes" % qualified_table_name)
We should also test the types that aren't in alltypes - char, varchar, decimal.


Line 299:     # clause ensures that the query is executed on the coordinator, 
resulting in a single
Does this exercise the string copying code? If the table is too small it seems 
like it may not.


-- 
To view, visit http://gerrit.cloudera.org:8080/5611
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to