Zihao Ye has posted comments on this change. ( http://gerrit.cloudera.org:8080/19699 )
Change subject: IMPALA-10798: Prototype a simple JSON File reader ...................................................................... Patch Set 21: (12 comments) http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22 PS21, Line 22: converting and materializing the Parser's parsing results into RowBatch. > Could you mention that numeric values are parsed from strings using the sam Done http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25 PS21, Line 25: , > nit: period "." Done http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831 PS21, Line 831: if (scan_node_->skip_header_line_count() > 1) { > I think don't need this for JSON tables. We can make sure 'skip_header_line Done, this part is indeed unnecessary, I have copied the rest of the code into HandleConvertError directly. http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h File be/src/exec/json-parser.h: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188 PS21, Line 188: kParseStopWhenDoneFlag > Does this mean only parsing the first json object in each line and > skip the others (if there are)? It's not like that, if this flag is not added, the parser will check if the stream has ended after parsing an object, and if it hasn't, it will report an error "kParseErrorDocumentRootNotSingular". The purpose of this flag is to skip this check and allow the parser to parse multiple objects in a stream without reporting this error. http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189 PS21, Line 189: reader_.Parse<parse_flags>(stream_, *this); > Could you add a comment above this for the parsing mechanism? E.g. Done http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h File be/src/exec/json-parser.h: http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303 PS18, Line 303: current_field_idx_ = -1; > What's the behavior if we return false here? Will RapidJSON skip > the whole json object or just skip this value? This part will return false only when HandleConvertError returns false (i.e., must have abort_on_error is true). In such cases, the query will be aborted. If abort_on_error is false, this part will never return false. Even if this slot has a convert error, we just set it to null and return true, see HandleConvertError function. http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h File be/src/exec/json/hdfs-json-scanner.h: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39 PS21, Line 39: This > nit: "this" Done http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc File be/src/exec/json/hdfs-json-scanner.cc: http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288 PS21, Line 288: if (LIKELY(text_converter_->WriteSlot(slot_desc, &aux_type, tuple_, data, len, true, : false, current_pool_))) return true; > nit: please write this in multi-lines (with brackets) Done http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json File testdata/data/json_test/malformed.json: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1 PS21, Line 1: {"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"} > Could you add a line that misses the right bracket, and one more line with Done http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json File testdata/data/json_test/multiline.json: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1 PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"} > Could you add a line of two json objects? Done http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4 PS21, Line 4: 1234 : 567 > Just curious, what the behavior if this is parsed as a numeric > column, e.g. bigint? We will get a bigint 1234 and two error reporting, one is Missing '}' after '1234', another is Invalid value '567}'. Because when parsing numbers, the parser considers the number to end when it encounters a newline character. http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388 PS21, Line 1388: LOAD DATA LOCAL INPATH '{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE {db_name}{db_suffix}.{table_name}; > Can we add a similar table for json and test the overflow behaviors? Done -- To view, visit http://gerrit.cloudera.org:8080/19699 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569 Gerrit-Change-Number: 19699 Gerrit-PatchSet: 21 Gerrit-Owner: Zihao Ye <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Zihao Ye <[email protected]> Gerrit-Comment-Date: Fri, 21 Jul 2023 07:06:46 +0000 Gerrit-HasComments: Yes
