Zihao Ye has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
......................................................................


Patch Set 21:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into 
RowBatch.
> Could you mention that numeric values are parsed from strings using the sam
Done


http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
> nit: period "."
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831:     if (scan_node_->skip_header_line_count() > 1) {
> I think don't need this for JSON tables. We can make sure 'skip_header_line
Done, this part is indeed unnecessary, I have copied the rest of the code into 
HandleConvertError directly.


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
> Does this mean only parsing the first json object in each line and
 > skip the others (if there are)?

It's not like that, if this flag is not added, the parser will check if the 
stream has ended after parsing an object, and if it hasn't, it will report an 
error "kParseErrorDocumentRootNotSingular". The purpose of this flag is to skip 
this check and allow the parser to parse multiple objects in a stream without 
reporting this error.


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189:       reader_.Parse<parse_flags>(stream_, *this);
> Could you add a comment above this for the parsing mechanism? E.g.
Done


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303:       current_field_idx_ = -1;
> What's the behavior if we return false here? Will RapidJSON skip
 > the whole json object or just skip this value?

This part will return false only when HandleConvertError returns false (i.e., 
must have abort_on_error is true). In such cases, the query will be aborted. If 
abort_on_error is false, this part will never return false. Even if this slot 
has a convert error, we just set it to null and return true, see 
HandleConvertError function.


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
> nit: "this"
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288:   if (LIKELY(text_converter_->WriteSlot(slot_desc, &aux_type, 
tuple_, data, len, true,
              :       false, current_pool_))) return true;
> nit: please write this in multi-lines (with brackets)
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: 
{"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
> Could you add a line that misses the right bracket, and one more line with
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
> Could you add a line of two json objects?
Done


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234
            : 567
> Just curious, what the behavior if this is parsed as a numeric
 > column, e.g. bigint?

We will get a bigint 1234 and two error reporting, one is Missing '}' after 
'1234', another is Invalid value '567}'. Because when parsing numbers, the 
parser considers the number to end when it encounters a newline character.


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH 
'{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE 
{db_name}{db_suffix}.{table_name};
> Can we add a similar table for json and test the overflow behaviors?
Done



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Fri, 21 Jul 2023 07:06:46 +0000
Gerrit-HasComments: Yes

Reply via email to