Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12163 )
Change subject: IMPALA-7087: Read Parquet decimal columns with lower precision/scale ...................................................................... Patch Set 1: (7 comments) http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12163/1//COMMIT_MSG@25 PS1, Line 25: If the underlying Parquet files contain rows that cannot be converted to : the higher scale without overflow, an error is thrown and the query : fails. This is different from other engines such as Hive which set the : row to NULL and allow the query to run to completion without error. I do not know the use case, so I may be wrong, but I dislike the idea of failing the query if the Parquet file is not corrupt and only the data is "too large". If a file with such data is added to a table, the table can become unusable and Impala offers no tool to get rid of the problematic file. Would it make sense to be stricter during scale + precision checking to exclude the possibility of overflow? If I understand correctly, overflow can only occur if the scale in the file is lower then the expected scale: e.g value 2 in decimal (1,0) would overflow when converted to decimal(1, 1). If precision_file - scale_file <= precision_expected - scale_expected, then the conversion should happen without overflow, shouldn't it? http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@243 PS1, Line 243: // TODO: we could allow a mismatch and do a conversion at this step. This TODO could be probably updated. http://gerrit.cloudera.org:8080/#/c/12163/1/be/src/exec/parquet/parquet-metadata-utils.cc@255 PS1, Line 255: // TODO: we could allow a mismatch and do a conversion at this step. This TODO could be probably updated. http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test: http://gerrit.cloudera.org:8080/#/c/12163/1/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test@85 PS1, Line 85: ==== Please add some tests with 'where col=const' clauses to exercise the dictionary and stat filtering code paths. Dictionary filtering is disabled if 'needs_conversion_' is true, but it may be enabled in the future if conversion will be moved to dictionary construction. I think that stat filtering won't work correctly with the current implementation - the stat value will be interpreted as if it had the same precision/scale as the SQL column, which can lead to incorrectly dropping row groups. I had to deal with somewhat similar problems in https://gerrit.cloudera.org/#/c/11057/ http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@842 PS1, Line 842: decimal_stored_as_int32.parquet I do not see this file in the patch and it is not mentioned in the README. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@848 PS1, Line 848: decimal_stored_as_int64.parquet I do not see this file in the patch and it is not mentioned in the README. http://gerrit.cloudera.org:8080/#/c/12163/1/tests/query_test/test_scanners.py@856 PS1, Line 856: binary_decimal_no_dictionary.parquet I do not see this file in the patch and it is not mentioned in the README. -- To view, visit http://gerrit.cloudera.org:8080/12163 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iafc8efd12379a39756e3e70f022a81a636dadb61 Gerrit-Change-Number: 12163 Gerrit-PatchSet: 1 Gerrit-Owner: Sahil Takiar <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Comment-Date: Mon, 07 Jan 2019 15:14:25 +0000 Gerrit-HasComments: Yes
