Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8319 )

Change subject: IMPALA-4123: Columnar decoding in Parquet
......................................................................


Patch Set 14:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@162
PS14, Line 162: int32_t ParquetLevelDecoder::NextRepeatedRunLength() {
There was a subtle bug here discovered by the fuzz test. When max_level_ is 0, 
we don't actually initialise rle_decoder_, which means that the 
NextNumRepeats() check below is not valid and can hit a DCHECK.

CacheNextBatch() handles the max_level_ == 0 case specially and we need to do 
the same here.

I added some DCHECKs, variable initialisation and other defensiveness in the 
process of debugging how NextRepeatedRunLength() got called when rle_decoder_ 
wasn't initialised. I think they're worth keeping in.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/exec/parquet-column-readers.cc@597
PS14, Line 597:   // If the file is corrupt, we may have more cached def levels 
than values in the page.
> Shouldn't we raise a warning when the file is corrupt?
I think I misdiagnosed a fuzz test failure and this is the wrong fix (and 
unnecessary).

It looks like the CacheHasNext() check should ensure that the below loop 
doesn't execute more than num_buffered_values_ times since the cache is 
populated with at most num_buffered_values_.

I also don't think there's a way to easily cross-check the def levels and the 
number of values in the page metadata - the only way we know the exact number 
of def levels is the page metadata. We can end up with "extra" literal values 
at the end of the encoded def levels since the literal run length is always a 
multiple of 8.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/runtime/tuple.cc@210
PS14, Line 210: local_offset
> nit: it seems that 'local_offset' is unnecessary and we could just use 'off
I did this so that the compiler could hoist reads of 'offset' out of the loop 
below. Otherwise it thinks 'tuple_mem' and 'offset' might alias and the loop 
below ends up with unnecessary loads of 'offset' in it.

The simpler solution is to pass the argument by value. I did this and left a 
comment explaining why it's significant.


http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8319/14/be/src/util/rle-encoding.h@132
PS14, Line 132:       int64_t dict_len, OutType* values, int64_t stride) 
WARN_UNUSED_RESULT;
> nit: you could use a default argument instead of having two functions, i.e.
Good point.


http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8319/14/testdata/datasets/functional/functional_schema_template.sql@717
PS14, Line 717: INSERT OVERWRITE TABLE {db_name}{db_suffix}.{table_name} SELECT 
c.* FROM functional_parquet.complextypestbl c join functional.alltypes sort by 
id;
> Do we need the same insert stmt after LOAD and after DEPENDENT_LOAD_HIVE?
Yeah, actually LOAD is never executed since we don't create a text version of 
the table.



--
To view, visit http://gerrit.cloudera.org:8080/8319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8c03006981c46ef0dae30602f2b73c253d9b49ef
Gerrit-Change-Number: 8319
Gerrit-PatchSet: 14
Gerrit-Owner: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Nov 2018 01:45:51 +0000
Gerrit-HasComments: Yes

Reply via email to