[ https://issues.apache.org/jira/browse/DRILL-5464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016484#comment-16016484 ]
Jinfeng Ni commented on DRILL-5464:
-----------------------------------

The above failure is actually intermittent; it depends on a race between the minor fragment scanning the empty file and the minor fragment scanning the non-empty file. If the former completes and arrives at HashAgg earlier than the latter, the error occurs. Otherwise, the query runs successfully.

{code}
select count(*), stars from dfs.`/drill/testdata/schema/yelpEmpty` group by stars;
+---------+--------+
| EXPR$0  | stars  |
+---------+--------+
| 406045  | 5      |
| 342143  | 4      |
| 110772  | 1      |
| 163761  | 3      |
| 102737  | 2      |
+---------+--------+
{code}

> Fix JSON reader when it deals with empty file
> ---------------------------------------------
>
>                 Key: DRILL-5464
>                 URL: https://issues.apache.org/jira/browse/DRILL-5464
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>
> An empty json file is one without any json object. If we query an empty
> json file asking it to return column 'A', Drill's JSON record reader would
> return a batch with 0 rows, and put column 'A' as a nullable int column. A
> better name for such a column might be "phantom column", as the record reader
> does not have any knowledge of the column schema, and the nullable int column
> is just a guessed schema.
> However, that processing could introduce many issues. Consider a
> directory consisting of multiple json files, at least one of which is
> empty. If column 'A' is returned as a nullable-int column from the reader over
> the empty file, while the other json files contain a real typed column 'A',
> the query could hit many issues, including 1) SchemaChangeException,
> 2) failure in certain operators which do not detect SchemaChange, or 3)
> incorrect query results, since the run-time code is generated over a phantom
> column type, not a real type.
> For instance, the following query against the yelp json file runs successfully.
> {code}
> select count(*), stars from
> dfs.`/tmp/yelp/yelp_academic_dataset_review.json` group by stars;
> {code}
> If an empty json file is added to the directory, the query would fail with
> the following error (which falls into the 2nd category: PartitionSender did
> not detect the schema change properly).
> {code}
> select count(*), stars from dfs.`/tmp/yelp` group by stars;
> Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.
> Expected vector class of org.apache.drill.exec.vector.NullableIntVector but
> was holding vector class org.apache.drill.exec.vector.NullableBigIntVector,
> field= stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)]
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
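
The order dependence described in the comment can be illustrated with a small toy model. This is only a sketch and not Drill code: `Batch`, `FirstBatchWinsOperator`, `run`, the file labels, and the row count of 3 are all hypothetical. It assumes a downstream operator that fixes its expected column type from the first batch it receives and skips later zero-row batches; that skipping rule is an assumption chosen to reproduce the reported behaviour, not a confirmed description of HashAgg or PartitionSender.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Batch:
    source: str        # hypothetical label for the file a fragment scanned
    column_type: str   # type the reader reported for column "stars"
    row_count: int


class FirstBatchWinsOperator:
    """Toy operator: the first batch to arrive decides the expected type."""

    def __init__(self) -> None:
        self.expected_type: Optional[str] = None
        self.rows_consumed = 0

    def consume(self, batch: Batch) -> None:
        if self.expected_type is None:
            # The first arrival fixes the type, even if it carries no rows.
            self.expected_type = batch.column_type
        elif batch.row_count == 0:
            # Assumption: a later empty batch is skipped without a type check.
            return
        elif batch.column_type != self.expected_type:
            raise TypeError(
                f"expected {self.expected_type} but was holding {batch.column_type}"
            )
        self.rows_consumed += batch.row_count


def run(arrival_order: List[Batch]) -> str:
    """Feed batches in the given arrival order and report the outcome."""
    op = FirstBatchWinsOperator()
    try:
        for batch in arrival_order:
            op.consume(batch)
    except TypeError as exc:
        return f"failed: {exc}"
    return f"ok: {op.rows_consumed} rows"


# The empty file yields a guessed ("phantom") nullable int column;
# the non-empty file yields the real nullable bigint column.
EMPTY = Batch("empty.json", "NULLABLE INT", 0)
NON_EMPTY = Batch("reviews.json", "NULLABLE BIGINT", 3)

if __name__ == "__main__":
    print(run([EMPTY, NON_EMPTY]))   # empty-file fragment wins the race
    print(run([NON_EMPTY, EMPTY]))   # non-empty-file fragment wins the race
```

In this model the query fails only when the empty-file batch arrives first, which matches the intermittent behaviour reported above; the same two inputs succeed when the arrival order is reversed.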