Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8319 )
Change subject: IMPALA-4123: Columnar decoding in Parquet ...................................................................... Patch Set 14: (5 comments) http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@162 PS14, Line 162: int32_t ParquetLevelDecoder::NextRepeatedRunLength() { There was a subtle bug here discovered by the fuzz test. When max_level_ is 0, we don't actually initialise rle_decoder_, which means that the NextNumRepeats() check below is not valid and can hit a DCHECK. CacheNextBatch() handles the max_level_ == 0 case specially and we need to do the same here. I added some DCHECKs, variable initialisation and other defensiveness in the process of debugging how NextRepeatedRunLength() got called when rle_decoder_ wasn't initialised. I think they're worth keeping in. http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@597 PS14, Line 597: // If the file is corrupt, we may have more cached def levels than values in the page. > Shouldn't we raise a warning when the file is corrupt? I think I misdiagnosed a fuzz test failure and this is the wrong fix (and unnecessary). It looks like the CacheHasNext() check should ensure that the below loop doesn't execute more than num_buffered_values_ times since the cache is populated with at most num_buffered_values_. I also don't think there's a way to easily cross-check the def levels and the number of values in the page metadata - the only way we know the exact number of def levels is the page metadata. We can end up with "extra" literal values at the end of the encoded def levels since the literal run length is always a multiple of 8. http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc File be/src/runtime/tuple.cc: http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc@210 PS14, Line 210: local_offset > nit: it seems that 'local_offset' is unnecessary and we could just use 'off I did this so that the compiler could hoist reads of 'offset' out of the loop below. Otherwise it thinks 'tuple_mem' and 'offset' might alias and the loop below ends up with unnecessary loads of 'offset' in it. The simpler solution is to pass the argument by value. I did this and left a comment explaining why it's significant. http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h File be/src/util/rle-encoding.h: http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h@132 PS14, Line 132: int64_t dict_len, OutType* values, int64_t stride) WARN_UNUSED_RESULT; > nit: you could use a default argument instead of having two functions, i.e. Good point. http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql@717 PS14, Line 717: INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} SELECT c.* FROM functional_parquet.complextypestbl c join functional.alltypes sort by id; > Do we need the same insert stmt after LOAD and after DEPENDENT_LOAD_HIVE? Yeah, actually LOAD is never executed since we don't create a text version of the table. -- To view, visit http://gerrit.cloudera.org:8080/8319 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef Gerrit-Change-Number: 8319 Gerrit-PatchSet: 14 Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 08 Nov 2018 01:45:51 +0000 Gerrit-HasComments: Yes