[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Csaba Ringhofer (Code Review) Tue, 30 Apr 2019 09:09:01 -0700

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/13189 )


Change subject: IMPALA-7370: DATE: Read/Write to parquet.
......................................................................


Patch Set 2:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG@12
PS2, Line 12: Parquet uses DATE logical type for dates. DATE logical type 
annotates
            : an INT32 that stores the number of days from the Unix epoch, 1 
January
            : 1970.
Gregorian stuff + Hive could be mentioned here, and I would also prefer to have 
a test file written by Hive and a test that demonstrates this issue. I do not 
want to add special handling in this patch (and it can turn out that this 
doesn't cause problems for users), but a follow up Jira could be created for 
possible solutions (e.g. using flags + info about the writer to adjust old 
dates).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc@957
PS2, Line 957:   // The range was already checked during the int32_t->DateValue 
conversion, which
             :   // sets the date to invalid if it was out of range.
optional:
This effectively means that the value is checked twice. It is possible that 
doing memcpy to the slot regardless of validity and checking it afterwards 
would be a bit faster.

This doesn't have to be done in this patch, we can dig deeper into performance 
later. I have some ideas for making the whole validation stuff faster, e.g. 
check for the existence of any invalid values in the first pass and copy them 
unchanged, and do a second pass that sets invalid ones to NULL only if needed.

(on a second thought, the performance of plain decoding DATEs may not be too 
important, as the file needs to cover a very large range to not fit into the 
dictionary).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h@171
PS2, Line 171:   DateValue* result = reinterpret_cast<DateValue*>(slot);
             :   int size = buffer.size();
             :   const uint8_t* data = reinterpret_cast<const 
uint8_t*>(buffer.data());
             :   if (ParquetPlainEncoder::Decode<DateValue, 
parquet::Type::INT32>(data, data + size,
             :       size, result) == -1) {
             :     return false;
             :   }
This looks very similar to line 148-155. It would be nice to extract it into a 
template function (if it is straightforward, maybe I missed something).


http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h@370
PS2, Line 370: i
nit: "i" suggests a loop variable to me, so I would prefer a name like tmp or 
val (the latter is used in the next function EncodeToInt32())


http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py@408
PS2, Line 408:   ("PARQUET_DATE_OUT_OF_RANGE", 134,
             :    "Parquet file '$0' column '$1' contains an out of range date. 
"
             :    "The valid date range is 0000-01-01..9999-12-31."),
Can you create file with a corrupt value to cover related paths?


http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341
PS2, Line 1341: ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET
A function like "IsTypeSupported()" or "GetSupportedType()" could be added to 
fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java

This would make it easier to add new file format + SQL type pairs, as we 
wouldn't need to hunt down every related part in the code.


http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@236
PS2, Line 236: # Note: dictionary filtering currently does not work on dates
IMPALA-4994 could be mentioned here, it would make it easier to cleanup related 
comments once it is implemented.


http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@667
PS2, Line 667:     values.
nit: Date could be mentioned in this comment + the reason for handling it 
separately from other types.



--
To view, visit http://gerrit.cloudera.org:8080/13189
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814
Gerrit-Change-Number: 13189
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Apr 2019 16:01:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7370: DATE: Read/Write to parquet.

Reply via email to