Lars Volker has posted comments on this change. Change subject: IMPALA-3376: Extra definition level when writing Parquet files ......................................................................
Patch Set 1: (19 comments) http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS1, Line 382: true can this run more than twice? Is there a way to write it without a loop? It looks like only the first 4 lines would be repeated. Line 395: page_size_ = bytes_needed; shouldn't this line be below the check? Otherwise you change page_size_ (which looks like a side effect given it's a class member) and then decide to abort. Line 412: DCHECK(def_levels_->Put(value != NULL)); DCHECK is a no-op in release builds, so this would not get executed. That's why it was bool ret = Put(); DCHECK(ret);. http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/parquet-reader.cc File be/src/util/parquet-reader.cc: Line 132: // We inherit from RleDecoder to get access to repeat_count_, which single line? Line 139: uint32_t repeat_count() { return repeat_count_; } make const Line 143: // - Compressed pages can be uncompressed. What does this mean? Would it be an error or is it allowed? PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number of : // def levels - with our RLE scheme it is not possible to determine how many values : // were actually written if the final run is a literal run, only if the final run is : // a repeated run. Is there a way to fix this? PS1, Line 149: PageHeader header const PageHeader& ? If you need the copy, making it explicitly in the code could be more clear. PS1, Line 154: malloc I cannot see where this gets free'd. Does it leak? PS1, Line 175: i++ nit: ++i PS1, Line 181: ++count Isn't count == i? http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: Line 165: bool BufferFull() { return buffer_full_; } Usually we seem to name these simple getters like the member, i.e. "buffer_full()", also it can be const. http://gerrit.cloudera.org:8080/#/c/3556/1/tests/query_test/test_writers.py File tests/query_test/test_writers.py: Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved. Wouldn't this need to be the ASF header? Line 22: def test_hdfs_parquet_table_writer(self, vector): this test might need an annotation to not be run in parallel (@pytest.mark.execute_serially), or multiple parallel runs would collide on the same table name. PS1, Line 24: """ I think these should be in an own line for multiline docstrings. PS1, Line 30: copyToLocal Where does this copy to? If it's the cwd you might want to create a temporary folder instead. PS1, Line 34: impalad_basedir + '/util/parquet-reader' use os.path.join() PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f) os.path.join() Line 39: check_call(['rm', '-r', 'hdfs_parquet_table_writer']) use shutils.rmtree -- To view, visit http://gerrit.cloudera.org:8080/3556 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-HasComments: Yes
