Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/22289 )
Change subject: IMPALA-12927: Support specifying format for reading JSON BINARY columns ...................................................................... Patch Set 1: (4 comments) Thanks for working on this! Added a few comments, with one affecting the basic design of the patch. Feel free to ask for pointers about how to implement these changes. http://gerrit.cloudera.org:8080/#/c/22289/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22289/1//COMMIT_MSG@22 PS1, Line 22: perty unset or : incorrectly set, and will provide an error message. It could be useful to add a flag that acts as a default for this property if it is not set (empty by default). While there is already an easy workaround for this case (set the table property) this may need some workloads to be updated so a global flag could be a quicker fix. http://gerrit.cloudera.org:8080/#/c/22289/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/22289/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@572 PS1, Line 572: slotDesc.setType(Type.STRING); This doesn't look correct to me because even if it works for JSON, theoretically a table can have partitions with different file formats. An example for such a table https://github.com/apache/impala/blob/5f4321373a3404867e170f7381db04a843aa7310/testdata/datasets/functional/functional_schema_template.sql#L979 In this case changing the column type may break partitions with other file formats. I am also not sure whether changing the column type desc_ is safe in general (at them moment I don't have an example on what could it break). I think that it is necessary to send this information to the backend and decide there whether to base64 decode or not. http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test File testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test: http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test@3 PS1, Line 3: lter table binary_tbl unset tblproperties This modifies a shared table and can leave it in a dirty state if the test fails. I would prefer to create a temporary table or use different tables created during dataload. http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test@12 PS1, Line 12: No valid table properties 'json.binary.format' (base64 or rawstring) provided for scanning binary column of json table '$DATABASE.binary_tbl' Can you add a test for the case when there is no valid json.binary.format, but no binary column is used in the query? My assumption is that we should return without an error in this case. -- To view, visit http://gerrit.cloudera.org:8080/22289 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Idf61fa3afc0f33caa63fbc05393e975733165e82 Gerrit-Change-Number: 22289 Gerrit-PatchSet: 1 Gerrit-Owner: Zihao Ye <eyiz...@163.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Comment-Date: Mon, 13 Jan 2025 14:26:07 +0000 Gerrit-HasComments: Yes