[ 
https://issues.apache.org/jira/browse/DRILL-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115064#comment-16115064
 ] 

Jinfeng Ni commented on DRILL-5464:
-----------------------------------

I ran the above query with the patch for DRILL-5546, the umbrella JIRA for 
schema change issues related to NULL datasets.  The query finished successfully.

{code}
 select stars, count(*) as cnt from dfs.tmp.yelp group by stars;
+--------+---------+
| stars  |   cnt   |
+--------+---------+
| 2      | 102737  |
| 1      | 110772  |
| 4      | 342143  |
| 5      | 406045  |
| 3      | 163761  |
+--------+---------+
{code} 

Physical plan for the query:
{code}
00-00    Screen
00-01      Project(stars=[$0], cnt=[$1])
00-02        UnionExchange
01-01          HashAgg(group=[{0}], cnt=[$SUM0($1)])
01-02            Project(stars=[$0], cnt=[$1])
01-03              HashToRandomExchange(dist0=[[$0]])
02-01                UnorderedMuxExchange
03-01                  Project(stars=[$0], cnt=[$1], E_X_P_R_H_A_S_H_F_I_E_L_D=[hash32AsDouble($0, 1301011)])
03-02                    HashAgg(group=[{0}], cnt=[COUNT()])
03-03                      Scan(groupscan=[EasyGroupScan [selectionRoot=file:/tmp/yelp, numFiles=2, columns=[`stars`], files=[file:/tmp/yelp/empty.json, file:/tmp/yelp/yelp_academic_dataset_review.json]]])
{code}

> Fix JSON reader when it deals with empty file
> ---------------------------------------------
>
>                 Key: DRILL-5464
>                 URL: https://issues.apache.org/jira/browse/DRILL-5464
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>
> An empty JSON file is one that contains no JSON objects.  If we query an 
> empty JSON file and ask it to return column 'A', Drill's JSON record reader 
> returns a batch with 0 rows and adds column 'A' as a nullable int column. A 
> better name for such columns might be phantom columns, since the record 
> reader has no knowledge of the column schema and the nullable int type is 
> just a guess. 
> However, that behavior can introduce many issues. Consider a directory 
> containing multiple JSON files, at least one of which is empty.  If the 
> reader over the empty file returns column 'A' as a nullable int column while 
> the other JSON files contain a real, typed column 'A', the query can hit 
> several issues, including 1) SchemaChangeException, 2) failure in certain 
> operators that do not detect the schema change, or 3) incorrect query 
> results, since the run-time code is generated for the phantom column type, 
> not the real type.
> For instance, the following query against the yelp JSON file runs successfully.
> {code}
> select count(*), stars  from 
> dfs.`/tmp/yelp/yelp_academic_dataset_review.json` group by stars;
> {code}
> If an empty JSON file is added to the directory, the query fails with the 
> following error (which falls into the 2nd category: PartitionSender did not 
> detect the schema change properly).  
> {code}
> select count(*), stars  from dfs.`/tmp/yelp` group by stars;
> Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.  
> Expected vector class of org.apache.drill.exec.vector.NullableIntVector but 
> was holding vector class org.apache.drill.exec.vector.NullableBigIntVector, 
> field= stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)]
> {code}
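
The phantom column conflict described in the issue can be illustrated with a small toy model. This is only a sketch in Python, not Drill code: the names Column, read_schema and merge are hypothetical, and the skip_phantom option shows one possible way to avoid the conflict, which is not necessarily what the DRILL-5546 patch does.

```python
from dataclasses import dataclass


# Hypothetical toy model for illustration; none of these names are Drill APIs.
@dataclass(frozen=True)
class Column:
    name: str
    type: str              # e.g. "INT" or "BIGINT"
    phantom: bool = False  # True when the type was guessed, not read from data


def read_schema(records, column):
    """Mimic a reader asked for `column`.

    With no records the reader has nothing to infer from, so it guesses a
    nullable INT (a phantom column). Otherwise this sketch simply assumes
    integer data and reports BIGINT.
    """
    if not records:
        return Column(column, "INT", phantom=True)
    return Column(column, "BIGINT")


def merge(schemas, skip_phantom):
    """Return the single agreed column, or raise on a type conflict."""
    if skip_phantom:
        # Ignore guessed columns whenever at least one real column exists.
        real = [s for s in schemas if not s.phantom]
        schemas = real or schemas
    types = {s.type for s in schemas}
    if len(types) > 1:
        raise TypeError(f"schema change: {sorted(types)}")
    return schemas[0]


# Stand-ins for empty.json and yelp_academic_dataset_review.json.
empty = read_schema([], "stars")
review = read_schema([{"stars": 4}, {"stars": 5}], "stars")
```

Calling merge([empty, review], skip_phantom=False) raises a TypeError, which mirrors the INT vs. BIGINT mismatch in the error above. With skip_phantom=True the guessed column is dropped and the real BIGINT column is kept.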



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
