Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )
Change subject: IMPALA-10798: Prototype a simple JSON File reader ...................................................................... Patch Set 22: (9 comments) The patch is looking good! I mostly look into the error handling in this round. Not sure if we have tests about exceeding mem_limit. If not, we can add a test to write a json file with huge strings and read it in a query with a low mem_limit that will be exceeded. http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h File be/src/exec/json-parser.h: http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@37 PS22, Line 37: Can we move this file into be/src/exec/json if it's only used by hdfs-json-scanner? Also do you plan to move the implementation codes into a json-parser.cc file? http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@59 PS22, Line 59: /// means return true on success. nit: "The following functions materialize output tuples. Functions with void return type must succeed. Functions with bool return type return true on succeed, and return false to stop parsing the whole scan range." http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json-parser.h@213 PS22, Line 213: be nit: "been" http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h File be/src/exec/json/hdfs-json-scanner.h: http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.h@47 PS22, Line 47: /// exactly one scanner. Let's also mention the error handling, i.e. how different kinds of errors are handled. http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc File be/src/exec/json/hdfs-json-scanner.cc: http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@188 PS22, Line 188: << desc->col_pos() - scan_node_->num_partition_keys() It'd be more helpful to print the column name and table name, e.g. Error converting column 'key1' of table my_tbl to bigint http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@211 PS22, Line 211: return Status::OK(); It seems we should return a non-ok status for unrecoverable situations, e.g. running out of memory. http://gerrit.cloudera.org:8080/#/c/19699/22/be/src/exec/json/hdfs-json-scanner.cc@300 PS22, Line 300: if (LIKELY(text_converter_->WriteSlot(slot_desc, &aux_type, tuple_, data, len, true, A follow-up question for https://gerrit.cloudera.org/c/19699/18/be/src/exec/json-parser.h#303 This can return false in copying strings when we run out of memory: https://github.com/apache/impala/blob/af3f56e6d1605a56f7bd02b0af35be980a7e4c63/be/src/exec/text-converter.inline.h#L96 It seems we will return true in HandleConvertError() and let RapidJSON continue parsing. Can we stop it and report the mem issue? Or did I miss something? http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json File testdata/data/json_test/malformed.json: http://gerrit.cloudera.org:8080/#/c/19699/22/testdata/data/json_test/malformed.json@2 PS22, Line 2: {"bool_col":False,"int_col":1,"float_col":0.1,"string_col":abc123} Can we also add these cases? { } [ ] ( ) {"string_col":"abc123"} ["string_col", "abc123"] http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json File testdata/data/json_test/multiline.json: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4 PS21, Line 4: 1234 : 567 > > Just curious, what the behavior if this is parsed as a numeric Ack -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 22 Gerrit-Owner: Zihao Ye <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Mon, 24 Jul 2023 11:32:52 +0000 Gerrit-HasComments: Yes
