Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-3376: Extra definition level when writing Parquet files ......................................................................
Patch Set 1: (19 comments) http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/exec/hdfs-parquet-table-writer.cc File be/src/exec/hdfs-parquet-table-writer.cc: PS1, Line 382: true > can this run more than twice? Is there a way to write it without a loop? It Yes, there are multiple independent reasons why EncodeValue may fail and (though highly unlikely) its possible the loop will run more than once. I've updated the comment to reflect this. Line 395: page_size_ = bytes_needed; > shouldn't this line be below the check? Otherwise you change page_size_ (wh Good point. This isn't really new code, this whole block was just copy and pasted for this review, so its interesting this wasn't a problem before. I don't see any reason why it shouldn't be safe to make this change, though, so I'll do it. Line 412: DCHECK(def_levels_->Put(value != NULL)); > DCHECK is a no-op in release builds, so this would not get executed. That's Done http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/parquet-reader.cc File be/src/util/parquet-reader.cc: Line 132: // We inherit from RleDecoder to get access to repeat_count_, which > single line? Done Line 139: uint32_t repeat_count() { return repeat_count_; } > make const Done Line 143: // - Compressed pages can be uncompressed. > What does this mean? Would it be an error or is it allowed? Done PS1, Line 145: Note that this will not catch every instance of Impala writing the wrong number of : // def levels - with our RLE scheme it is not possible to determine how many values : // were actually written if the final run is a literal run, only if the final run is : // a repeated run. > Is there a way to fix this? Not without reducing the efficiency of our rle encoder (this behavior has to do with byte alignment), which doesn't seem worth it just to allow for this testing. PS1, Line 149: PageHeader header > const PageHeader& ? If you need the copy, making it explicitly in the code Done PS1, Line 154: malloc > I cannot see where this gets free'd. Does it leak? Done PS1, Line 175: i++ > nit: ++i Done PS1, Line 181: ++count > Isn't count == i? Done http://gerrit.cloudera.org:8080/#/c/3556/1/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: Line 165: bool BufferFull() { return buffer_full_; } > Usually we seem to name these simple getters like the member, i.e. "buffer_ Done http://gerrit.cloudera.org:8080/#/c/3556/1/tests/query_test/test_writers.py File tests/query_test/test_writers.py: Line 1: # Copyright (c) 2016 Cloudera, Inc. All rights reserved. > Wouldn't this need to be the ASF header? Done Line 22: def test_hdfs_parquet_table_writer(self, vector): > this test might need an annotation to not be run in parallel (@pytest.mark. We want to be able to run in parallel, so I'll update it to use unique_database PS1, Line 24: """ > I think these should be in an own line for multiline docstrings. We're inconsistent about this - a quick grep through the python tests looks like we put it on its own line less than half of the time. I don't have an opinion either way, but it would be nice if we had a python style guide to follow. PS1, Line 30: copyToLocal > Where does this copy to? If it's the cwd you might want to create a tempora Done PS1, Line 34: impalad_basedir + '/util/parquet-reader' > use os.path.join() Done PS1, Line 35: 'hdfs_parquet_table_writer/' + str(f) > os.path.join() Done Line 39: check_call(['rm', '-r', 'hdfs_parquet_table_writer']) > use shutils.rmtree Done -- To view, visit http://gerrit.cloudera.org:8080/3556 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2cafd7ef6b607ce6f815072b8af7395a892704d9 Gerrit-PatchSet: 1 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-HasComments: Yes
