Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/19699 )

Change subject: IMPALA-10798: Prototype a simple JSON File reader
......................................................................


Patch Set 21:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@22
PS21, Line 22: converting and materializing the Parser's parsing results into 
RowBatch.
Could you mention that numeric values are parsed from strings using the same 
functionality of text-scanners? This is an advantage of using RapidJSON 
directly. So we don't need to worry about inconsistency in converting decimals 
(e.g. losing precisions).


http://gerrit.cloudera.org:8080/#/c/19699/21//COMMIT_MSG@25
PS21, Line 25: ,
nit: period "."


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/hdfs-scanner.cc@831
PS21, Line 831:     if (scan_node_->skip_header_line_count() > 1) {
I think don't need this for JSON tables. We can make sure 
'skip_header_line_count' is always 0 for JSON scan nodes. It comes from FE:
https://github.com/apache/impala/blob/97e44c11923f3d28e08aba1b5dd66b8a35465deb/fe/src/main/java/org/apache/impala/catalog/FeFsTable.java#L290


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@188
PS21, Line 188: kParseStopWhenDoneFlag
Does this mean only parsing the first json object in each line and skip the 
others (if there are)?


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json-parser.h@189
PS21, Line 189:       reader_.Parse<parse_flags>(stream_, *this);
Could you add a comment above this for the parsing mechanism? E.g.

 Reads characters from the stream, and publishes events to this handler 
(JsonParser).


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h
File be/src/exec/json-parser.h:

http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@169
PS18, Line 169:     while (!stream_.Eos()) {
> > What about kParseInsituFlag? Can we use it in our scenario?
I see. Thanks for the explanation!


http://gerrit.cloudera.org:8080/#/c/19699/18/be/src/exec/json-parser.h@303
PS18, Line 303:       current_field_idx_ = -1;
What's the behavior if we return false here? Will RapidJSON skip the whole json 
object or just skip this value?


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h
File be/src/exec/json/hdfs-json-scanner.h:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.h@39
PS21, Line 39: This
nit: "this"


http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc
File be/src/exec/json/hdfs-json-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/19699/21/be/src/exec/json/hdfs-json-scanner.cc@288
PS21, Line 288:   if (LIKELY(text_converter_->WriteSlot(slot_desc, &aux_type, 
tuple_, data, len, true,
              :       false, current_pool_))) return true;
nit: please write this in multi-lines (with brackets)


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json
File testdata/data/json_test/malformed.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/malformed.json@1
PS21, Line 1: 
{"bool_col":true,"int_col":0,"float_col":"abc","string_col":"abc123"}
Could you add a line that misses the right bracket, and one more line with 
duplicated keys?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json
File testdata/data/json_test/multiline.json:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@1
PS21, Line 1: {"Id": 1, "Key": "normal object", "Value": "abcdefg"}
Could you add a line of two json objects?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/data/json_test/multiline.json@4
PS21, Line 4: 1234
            : 567
Just curious, what the behavior if this is parsed as a numeric column, e.g. 
bigint?


http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19699/21/testdata/datasets/functional/functional_schema_template.sql@1388
PS21, Line 1388: LOAD DATA LOCAL INPATH 
'{impala_home}/testdata/data/overflow.txt' OVERWRITE INTO TABLE 
{db_name}{db_suffix}.{table_name};
Can we add a similar table for json and test the overflow behaviors?



--
To view, visit http://gerrit.cloudera.org:8080/19699
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I31309cb8f2d04722a0508b3f9b8f1532ad49a569
Gerrit-Change-Number: 19699
Gerrit-PatchSet: 21
Gerrit-Owner: Zihao Ye <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Zihao Ye <[email protected]>
Gerrit-Comment-Date: Wed, 19 Jul 2023 12:02:44 +0000
Gerrit-HasComments: Yes

Reply via email to