Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/13189 )
Change subject: IMPALA-7370: DATE: Read/Write to parquet. ...................................................................... Patch Set 2: (8 comments) http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/13189/2//COMMIT_MSG@12 PS2, Line 12: Parquet uses DATE logical type for dates. DATE logical type annotates : an INT32 that stores the number of days from the Unix epoch, 1 January : 1970. Gregorian stuff + Hive could be mentioned here, and I would also prefer to have a test file written by Hive and a test that demonstrates this issue. I do not want to add special handling in this patch (and it can turn out that this doesn't cause problems for users), but a follow up Jira could be created for possible solutions (e.g. using flags + info about the writer to adjust old dates). http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-readers.cc@957 PS2, Line 957: // The range was already checked during the int32_t->DateValue conversion, which : // sets the date to invalid if it was out of range. optional: This effectively means that the value is checked twice. It is possible that doing memcpy to the slot regardless of validity and checking it afterwards would be a bit faster. This doesn't have to be done in this patch, we can dig deeper into performance later. I have some ideas for making the whole validation stuff faster, e.g. check for the existence of any invalid values in the first pass and copy them unchanged, and do a second pass that sets invalid ones to NULL only if needed. (on a second thought, the performance of plain decoding DATEs may not be too important, as the file needs to cover a very large range to not fit into the dictionary). http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h File be/src/exec/parquet/parquet-column-stats.inline.h: http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-column-stats.inline.h@171 PS2, Line 171: DateValue* result = reinterpret_cast<DateValue*>(slot); : int size = buffer.size(); : const uint8_t* data = reinterpret_cast<const uint8_t*>(buffer.data()); : if (ParquetPlainEncoder::Decode<DateValue, parquet::Type::INT32>(data, data + size, : size, result) == -1) { : return false; : } This looks very similar to line 148-155. It would be nice to extract it into a template function (if it is straightforward, maybe I missed something). http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/13189/2/be/src/exec/parquet/parquet-common.h@370 PS2, Line 370: i nit: "i" suggests a loop variable to me, so I would prefer a name like tmp or val (the latter is used in the next function EncodeToInt32()) http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: http://gerrit.cloudera.org:8080/#/c/13189/2/common/thrift/generate_error_codes.py@408 PS2, Line 408: ("PARQUET_DATE_OUT_OF_RANGE", 134, : "Parquet file '$0' column '$1' contains an out of range date. " : "The valid date range is 0000-01-01..9999-12-31."), Can you create file with a corrupt value to cover related paths? http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: http://gerrit.cloudera.org:8080/#/c/13189/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java@1341 PS2, Line 1341: ff != HdfsFileFormat.TEXT && ff != HdfsFileFormat.PARQUET A function like "IsTypeSupported()" or "GetSupportedType()" could be added to fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java This would make it easier to add new file format + SQL type pairs, as we wouldn't need to hunt down every related part in the code. http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/13189/2/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@236 PS2, Line 236: # Note: dictionary filtering currently does not work on dates IMPALA-4994 could be mentioned here, it would make it easier to cleanup related comments once it is implemented. http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py File tests/query_test/test_insert_parquet.py: http://gerrit.cloudera.org:8080/#/c/13189/2/tests/query_test/test_insert_parquet.py@667 PS2, Line 667: values. nit: Date could be mentioned in this comment + the reason for handling it separately from other types. -- To view, visit http://gerrit.cloudera.org:8080/13189 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I67da03754531660bc8de3b6935580d46deae1814 Gerrit-Change-Number: 13189 Gerrit-PatchSet: 2 Gerrit-Owner: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Attila Jeges <atti...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Gabor Kaszab <gaborkas...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Tue, 30 Apr 2019 16:01:24 +0000 Gerrit-HasComments: Yes