paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366673114

##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
##########
@@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
     } else {
-      return 0; // returns 0 if the column doesn't exist in the table.
+      return Statistic.NO_COLUMN_STATS;

Review comment:
I'm more confused. If this is a structured (complex) column, then it can have nested columns. The nested columns don't add information about this column. (Knowing the number of values in an array of maps does not tell us the cardinality of the map.) Again, if the Map is at the top level, then the value count is the row count.

If this stat is NDV (number of distinct values), then we don't know the NDV if we don't have metadata. I'd even argue that NDV makes no sense for a complex column; it only makes sense for the members of the column.

Now, back to Arina's point. The info here tells us something about scans. If I ask only for column `x`, and the table does not contain column `x`, then I don't need to scan at all; I can just return *n* copies of NULL. (Most query engines would fail the query because the column is undefined. Drill will run the query and return nulls.) However, in practice, the only way to know the correct value of *n* is to do the scan, since stats can be out of date.

Still, I don't get why we need *column* value counts. If we do a scan, we want the table row count; we don't care about the column value count.

So, I wonder if there is some additional problem here where our use of stats needs some adjusting. If we want to estimate the row count after filtering (that is, the row count seen by, say, a join or sort), then we need the NDV, which we can estimate only if we have stats. Otherwise, we should fall back on heuristic selectivity values.
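
To make the last point concrete, here is a minimal sketch of the fallback I have in mind. This is not Drill code: the class name, the method name, and the 0.15 default selectivity are all made up for illustration, and a real planner would pick its own heuristics.

```java
import java.util.OptionalLong;

/** Hypothetical sketch; none of these names come from the Drill code base. */
public class RowCountEstimate {

  // Assumed placeholder: fraction of rows kept by an equality filter when no stats exist.
  public static final double DEFAULT_EQ_SELECTIVITY = 0.15;

  /**
   * Estimates the rows that survive a filter of the form `col = constant`.
   * With a known NDV, assume values are uniformly distributed: rowCount / ndv.
   * Without stats, fall back on the heuristic selectivity.
   */
  public static double estimateEqualityFilterRows(long tableRowCount, OptionalLong ndv) {
    if (ndv.isPresent() && ndv.getAsLong() > 0) {
      return (double) tableRowCount / ndv.getAsLong();
    }
    return tableRowCount * DEFAULT_EQ_SELECTIVITY;
  }

  public static void main(String[] args) {
    // Stats available: 1,000,000 rows and 50 distinct values.
    System.out.println(estimateEqualityFilterRows(1_000_000L, OptionalLong.of(50)));
    // No stats: the estimate comes from the heuristic alone.
    System.out.println(estimateEqualityFilterRows(1_000_000L, OptionalLong.empty()));
  }
}
```

Note that the estimate depends only on the table row count and the NDV; a per-column value count never enters into it, which is the gist of my question above.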

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services