Tim Armstrong has posted comments on this change. Change subject: IMPALA-3909: Populate min/max statistics in Parquet writer ......................................................................
Patch Set 1: (15 comments) I think the overall approach makes sense. I had some higher-level comments - I didn't do a full final pass since there will probably be some churn from this round of comments. http://gerrit.cloudera.org:8080/#/c/5611/1//COMMIT_MSG Commit Message: Line 9: TODO Check whether we can use python bindings for parquet-cpp or pyarrow Do we have end-to-end testing that the column stats are correct? We really don't want to turn this on before it's totally right, since if we write out bad data, then the files can exist for a long time. Until we are totally confident that we're writing out the correct stats, we should turn this off by default and hide it behind an experimental flag. http://gerrit.cloudera.org:8080/#/c/5611/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: Line 90: class HdfsParquetTableWriter::ColumnStatsBase { I feel like this file is getting pretty large. Can we pull these classes out into a separate header? Also, do the classes need to be nested inside HdfsParquetTableWriter? Line 97: virtual void Extend(const ColumnStatsBase* other) = 0; I think we call this Merge() in other places like aggregate functions. Line 101: virtual void CopyValues(MemPool* mem_pool){}; I think the memory management here needs a bit more explanation. I.e. when it is ok for the object to reference memory that is not from 'mem_pool'. Line 206: min_value_ = min(min_value_, v); Do we know what ordering the parquet specification defines for each type? The non-obvious things for me are: * How are NULL values handled/ordered? * How are strings ordered? * Is it always going to be the same as our own overloaded operators? This makes me uneasy since if someone adds a new data type they probably won't think to look at the Parquet spec before implementing the overloaded operator. I would prefer this to be more explicit even if it results in more code. We should probably explain this in the class header. And also probably not rely on the overloaded operators. Line 224: // StringValues need to be copied at the end of processing a row batch, since the batch I guess this is the explanation of Copy() I was looking for. I think we need to document it in the base class since it's in the base class API. Line 288: char* new_ptr = reinterpret_cast<char*>(mem_pool->Allocate(value->len)); This can potentially use a lot of memory since the old strings are never freed. I think using StringBuffer would avoid that problem. PS1, Line 349: CopyValues Maybe this shouldn't be CopyValues, rather CopyMemory? or CopyReferencedMemory, since we don't always copy the value, e.g. if it's an integer? Not sure what a good name would be. Line 485: page_stats_.reset(new ColumnStats<T>(encoded_value_size_)); Can't we initialise most of these using the initialiser list? We should do that for consistency. Line 598: page_stats_.reset(new ColumnStats<bool>()); Can't we initialise most of these using the initialiser list? We should do that for consistency. Line 628: scoped_ptr<ColumnStats<bool>> page_stats_; Do we need scoped_ptr? Can we just store these inline? http://gerrit.cloudera.org:8080/#/c/5611/1/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: Line 214: Extra line Line 281: # Timestamps are written differently by Hive and Impala so we don't validate them I think we need to find a way to test timestamp. Line 293: "functional.alltypes" % qualified_table_name) We should also test the types that aren't in alltypes - char, varchar, decimal. Line 299: # clause ensures that the query is executed on the coordinator, resulting in a single Does this exercise the string copying code? If the table is too small it seems like it may not. -- To view, visit http://gerrit.cloudera.org:8080/5611 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I8368ee58daa50c07a3b8ef65be70203eb941f619 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-HasComments: Yes
