[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Tim Armstrong (Code Review) Wed, 11 Oct 2017 17:52:22 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/7822 )


Change subject: IMPALA-2494: Support for byte array encoded decimals in Parquet 
scanner
......................................................................


Patch Set 4:

(12 comments)

Did a first pass over it - thought I should flush out comments since you've 
been waiting quite a while for feedback here.

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-column-readers.cc@1277
PS4, Line 1277:         switch (node.element->type) {
This case is only going to get bigger with the follow on patches - it would be 
best to make it a separate function.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@213
PS4, Line 213:   template <typename LOGICAL_TYPE, parquet::Type::type 
PHYSICAL_TYPE>
Logical type and physical type I think aren't quite right, since e.g. 
"StringValue" isn't a logical type. It's really decoding/encoding between 
Parquet physical types and Impala's internal representation.

Maybe INTERNAL_TYPE and PARQUET_PHYSICAL_TYPE? Or PARQUET_TYPE seems ok since 
that's what parquet calls it.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-common.h@376
PS4, Line 376: inline int ParquetPlainEncoder::Decode<Decimal4Value, 
parquet::Type::BYTE_ARRAY>(
I think you can avoid stamping this out if you leave it still parameterized by 
the internal Impala type, since the logic is the same in all cases. There may 
be an opportunity to reduce the redundancy above too, since there are functions 
above with identical bodies.

I tried this out on a dummy program and it looks like it's possible:

  #include <stdio.h>

  template <typename T, typename S>
  int foo(T* a, S* b) {
    return 0;
  }

  template <>
  int foo(int* a, int* b) { return 1; }

  template <typename T>
  int foo(T* a, double* b) { return 2; }

  int main(int argc, char**argv) {
    int a, b, c;
    double x, y, z;
    printf("%d\n", foo(&a, &b));
    printf("%d\n", foo(&a, &x));
  }


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc
File be/src/exec/parquet-metadata-utils.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@66
PS4, Line 66: bool IsSupportedPhysicalType(PrimitiveType impala_type,
let's make it static - we don't want to export the symbol to be linked with 
code outside this file.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@70
PS4, Line 70: (
unnecessary parens


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@173
PS4, Line 173:     // TODO: Is this check too strict?
I don't see why this shouldn't match the file metadata, this seems valid to me.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@182
PS4, Line 182:  parquet::SchemaElement*
Why not pass by const ref like above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-metadata-utils.cc@209
PS4, Line 209:         stringstream ss;
Can you convert to using Substitute() while we're here? We've been very 
gradually converting to using that for cases like this.


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc
File be/src/exec/parquet-plain-test.cc:

http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@176
PS4, Line 176:   TestType<Decimal4Value, 
parquet::Type::BYTE_ARRAY>(Decimal4Value(test_val),
It would be nice to combine the FIXED_LEN_BYTE_ARRAY and BYTE_ARRAY tests. 
Mostly the tests are the same except for the expected size, right? Could we 
make them table-driven so that we test all the same cases but just have 
different expected output?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@193
PS4, Line 193: int32_t
Any reason to use sizeof(int..) instead of sizeof(Decimal*Value) like above? Or 
if this is the better way to express it, should we change it above?


http://gerrit.cloudera.org:8080/#/c/7822/4/be/src/exec/parquet-plain-test.cc@212
PS4, Line 212:   for (int i = 1; i <=16; ++i) {
nit: missing space


http://gerrit.cloudera.org:8080/#/c/7822/4/testdata/data/binary_decimal_dictionary.parquet
File testdata/data/binary_decimal_dictionary.parquet:

PS4:
Can you update the README to describe how these files were generated.



--
To view, visit http://gerrit.cloudera.org:8080/7822
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c0e881045109f337fecba53fec21f9cfb9e619e
Gerrit-Change-Number: 7822
Gerrit-PatchSet: 4
Gerrit-Owner: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Bikramjeet Vig <bikramjeet....@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhe...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mjac...@apache.org>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 12 Oct 2017 00:52:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-2494: Support for byte array encoded decimals in Parquet scanner

Reply via email to