Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22289 )

Change subject: IMPALA-12927: Support specifying format for reading JSON BINARY 
columns
......................................................................


Patch Set 1:

(4 comments)

Thanks for working on this!
Added a few comments, with one affecting the basic design of the patch. Feel 
free to ask for pointers about how to implement these changes.

http://gerrit.cloudera.org:8080/#/c/22289/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22289/1//COMMIT_MSG@22
PS1, Line 22: perty unset or
            : incorrectly set, and will provide an error message.
It could be useful to add a flag that acts as a default for this property if it 
is not set (empty by default). While there is already an easy workaround for 
this case (set the table property) this may need some workloads to be updated 
so a global flag could be a quicker fix.


http://gerrit.cloudera.org:8080/#/c/22289/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/22289/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@572
PS1, Line 572:           slotDesc.setType(Type.STRING);
This doesn't look correct to me because even if it works for JSON, 
theoretically a table can have partitions with different file formats. An 
example for such a table 
https://github.com/apache/impala/blob/5f4321373a3404867e170f7381db04a843aa7310/testdata/datasets/functional/functional_schema_template.sql#L979

In this case changing the column type may break partitions with other file 
formats. I am also not sure whether changing the column type desc_ is safe in 
general (at them moment I don't have an example on what could it break).

I think that it is necessary to send this information to the backend and decide 
there whether to base64 decode or not.


http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test
File 
testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test:

http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test@3
PS1, Line 3: lter table binary_tbl unset tblproperties
This modifies a shared table and can leave it in a dirty state if the test 
fails. I would prefer to create a temporary table or use different tables 
created during dataload.


http://gerrit.cloudera.org:8080/#/c/22289/1/testdata/workloads/functional-query/queries/QueryTest/json-binary-format.test@12
PS1, Line 12: No valid table properties 'json.binary.format' (base64 or 
rawstring) provided for scanning binary column of json table 
'$DATABASE.binary_tbl'
Can you add a test for the case when there is no valid json.binary.format, but 
no binary column is used in the query? My assumption is that we should return 
without an error in this case.



--
To view, visit http://gerrit.cloudera.org:8080/22289
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Idf61fa3afc0f33caa63fbc05393e975733165e82
Gerrit-Change-Number: 22289
Gerrit-PatchSet: 1
Gerrit-Owner: Zihao Ye <eyiz...@163.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Comment-Date: Mon, 13 Jan 2025 14:26:07 +0000
Gerrit-HasComments: Yes

Reply via email to