Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )
Change subject: IMPALA-10798: Prototype a simple JSON File reader ...................................................................... Patch Set 21: (13 comments) http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22 PS21, Line 22: converting and materializing the Parser's parsing results into RowBatch. Could you mention that numeric values are parsed from strings using the same functionality of text-scanners? This is an advantage of using RapidJSON directly. So we don't need to worry about inconsistency in converting decimals (e.g. losing precisions). http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25 PS21, Line 25: , nit: period "." http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831 PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) { I think don't need this for JSON tables. We can make sure 'skip_header_line_count' is always 0 for JSON scan nodes. It comes from FE: https://github.com/apache/impala/blob/97e44c11923f3d28e08aba1b5dd66b8a35465deb/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java#L290 http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h File be/src/exec/json-parser.h: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188 PS21, Line 188: kParseStopWhenDoneFlag Does this mean only parsing the first json object in each line and skip the others (if there are)? http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189 PS21, Line 189: reader_.Parse<parse_flags>(stream_, *this); Could you add a comment above this for the parsing mechanism? E.g. Reads characters from the stream, and publishes events to this handler (JsonParser). http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h File be/src/exec/json-parser.h: http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169 PS18, Line 169: while (!stream_.Eos()) { > > What about kParseInsituFlag? Can we use it in our scenario? I see. Thanks for the explanation! http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303 PS18, Line 303: current_field_idx_ = -1; What's the behavior if we return false here? Will RapidJSON skip the whole json object or just skip this value? http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h File be/src/exec/json/hdfs-json-scanner.h: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39 PS21, Line 39: This nit: "this" http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc File be/src/exec/json/hdfs-json-scanner.cc: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288 PS21, Line 288: if (LIKELY(text_converter_->WriteSlot(slot_desc, &aux_type, tuple_, data, len, true, : false, current_pool_))) return true; nit: please write this in multi-lines (with brackets) http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json File testdata/data/json_test/malformed.json: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1 PS21, Line 1: {"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"} Could you add a line that misses the right bracket, and one more line with duplicated keys? http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json File testdata/data/json_test/multiline.json: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1 PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"} Could you add a line of two json objects? http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4 PS21, Line 4: 1234 : 567 Just curious, what the behavior if this is parsed as a numeric column, e.g. bigint? http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388 PS21, Line 1388: LOAD DATA LOCAL INPATH '{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE {db_name}{db_suffix}.{table_name}; Can we add a similar table for json and test the overflow behaviors? -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 21 Gerrit-Owner: Zihao Ye <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Wed, 19 Jul 2023 12:02:44 +0000 Gerrit-HasComments: Yes
