Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/22922 )
Change subject: IMPALA-13625: Allow reading Parquet int32/int64 as decimal without logical types ...................................................................... Patch Set 10: (4 comments) Thanks for working on this! http://gerrit.cloudera.org:8080/#/c/22922/10/be/src/exec/parquet/parquet-data-converter.h File be/src/exec/parquet/parquet-data-converter.h: http://gerrit.cloudera.org:8080/#/c/22922/10/be/src/exec/parquet/parquet-data-converter.h@94 PS10, Line 94: GetPrecision() <= 0 This change might not needed if we change GetPrecision(), see comment in the .cc file. http://gerrit.cloudera.org:8080/#/c/22922/10/be/src/exec/parquet/parquet-metadata-utils.cc File be/src/exec/parquet/parquet-metadata-utils.cc: http://gerrit.cloudera.org:8080/#/c/22922/10/be/src/exec/parquet/parquet-metadata-utils.cc@377 PS10, Line 377: GetPrecision(schema_element); The code could be greatly simplified if we changed GetPrecision() to return ColumnType::MAX_DECIMAL4_PRECISION+1 and ColumnType::MAX_DECIMAL8_PRECISION+1 when the schema_element.type is INT32 or INT64. The downside is that we would lose the check for precision - scale < File precision - File scale. I think it is a bug that we don't check it right now, as for possible overflows we should raise an error and not return NULLs, see "Overflow behavior" at https://impala.apache.org/docs/build/html/topics/impala_decimal.html Though adding this check might work current workloads.. Maybe we could just add the check when the precision is not explicitly set, so we don't break existing workloads. http://gerrit.cloudera.org:8080/#/c/22922/10/be/src/exec/parquet/parquet-metadata-utils.cc@423 PS10, Line 423: if (!is_converted_type_decimal) { : // TODO: is this validation useful? It is not required at all to read the data and : // might only serve to reject otherwise perfectly readable files. : ErrorMsg msg(TErrorCode::PARQUET_BAD_CONVERTED_TYPE, filename, : schema_element.name); : RETURN_IF_ERROR(state->LogOrReturnError(msg)); I think we shouldn't report an error from now on if the file type is INT32/INT64. http://gerrit.cloudera.org:8080/#/c/22922/10/testdata/workloads/functional-query/queries/QueryTest/parquet-type-widening.test File testdata/workloads/functional-query/queries/QueryTest/parquet-type-widening.test: http://gerrit.cloudera.org:8080/#/c/22922/10/testdata/workloads/functional-query/queries/QueryTest/parquet-type-widening.test@11 PS10, Line 11: # IMPALA-13625: Allow reading Parquet int32/int64 as decimal without logical types Please add tests for INT_MIN, INT_MAX, and INT64_MIN, INT64_MAX. -- To view, visit http://gerrit.cloudera.org:8080/22922 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I56006eb3cca28c81ec8467d77b35005fbf669680 Gerrit-Change-Number: 22922 Gerrit-PatchSet: 10 Gerrit-Owner: Daniel Vanko <[email protected]> Gerrit-Reviewer: Daniel Vanko <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 19 Jun 2025 14:08:01 +0000 Gerrit-HasComments: Yes
