[GitHub] [drill] paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect count() returned for complex types in parquet

GitBox Tue, 14 Jan 2020 19:04:29 -0800

paul-rogers commented on a change in pull request #1955: DRILL-7491: Incorrect 
count() returned for complex types in parquet
URL: https://github.com/apache/drill/pull/1955#discussion_r366673114


 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractGroupScanWithMetadata.java
 ##########
 @@ -180,7 +180,7 @@ public long getColumnValueCount(SchemaPath column) {
     } else if (nonInterestingColStats != null) {
       tableRowCount = TableStatisticsKind.ROW_COUNT.getValue(getNonInterestingColumnsMetadata());
     } else {
-      return 0; // returns 0 if the column doesn't exist in the table.
+      return Statistic.NO_COLUMN_STATS;
 
 Review comment:
   I'm more confused. If this is a structured (complex) column, then it can 
have nested columns. The nested columns don't add information about this 
column. (Knowing the number of values in an array of maps does not tell us the 
cardinality of the map.) Again, if the Map is at the top level, then the value 
count is the row count. If this stat is NDV (number of distinct values), then 
we can't know it without metadata. I'd even argue that NDV makes no sense for 
a complex column; it only makes sense for the members of the column.
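
   To make the distinction concrete, here is a minimal sketch of a value-count lookup that reports "unknown" instead of 0. Everything in it is hypothetical: the class, the method names, the two lookup collections, and the `-1` sentinel are assumptions for illustration, not Drill's actual API or the real value of `Statistic.NO_COLUMN_STATS`.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names are Drill's actual API.
class ValueCountSketch {
  // Assumed sentinel meaning "no statistics available". The real value of
  // Statistic.NO_COLUMN_STATS is not shown in this diff.
  static final long NO_COLUMN_STATS = -1L;

  /**
   * Returns the number of values for a column, or NO_COLUMN_STATS if unknown.
   *
   * @param nonNullCounts per-column non-null counts taken from metadata
   * @param topLevelMaps  names of columns known to be top-level maps
   */
  static long valueCount(String column, long tableRowCount,
                         Map<String, Long> nonNullCounts,
                         Set<String> topLevelMaps) {
    if (topLevelMaps.contains(column)) {
      // A top-level map has exactly one value per row.
      return tableRowCount;
    }
    Long counted = nonNullCounts.get(column);
    // Without metadata, report "unknown" rather than claiming 0 values.
    return counted != null ? counted : NO_COLUMN_STATS;
  }

  // A caller may answer count(col) from stats only when the count is known.
  static boolean canAnswerCountFromStats(long valueCount) {
    return valueCount != NO_COLUMN_STATS;
  }
}
```

   With this shape, a caller that sees the sentinel has to fall back to a real scan instead of returning 0 for `count(col)`.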
   
   Now, back to Arina's point. The info here tells us something about scans. If 
I ask only for column `x`, and the table does not contain column `x`, then I 
don't need to scan at all; I can just return *n* copies of NULL. (Most 
query engines would fail the query because the column is undefined. Drill will 
run the query and return nulls.) However, in practice, the only way to know the 
correct value of *n* is to do the scan (stats can be out of date).
   
   Still, I don't get why we need *column* value counts. If we do a scan, we 
want the table row count; we don't care about the column value count.
   
   So, I wonder if there is some additional problem here where our use of stats 
needs some adjusting.
   
   If we want to estimate the row count after filtering (that is, the row count 
seen by, say, a join or sort), then we need the NDV, which we can estimate only 
if we have stats; otherwise we should fall back on heuristic selectivity values.
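
   A rough sketch of that fallback for an equality filter, assuming values are uniformly distributed across the distinct values. The class name, the `-1` sentinel, and the `0.15` default selectivity are placeholders chosen for illustration, not values taken from Drill.

```java
// Hypothetical sketch; names and constants are illustrative assumptions.
class FilterRowEstimateSketch {
  // Assumed sentinel for "no statistics available".
  static final long NO_COLUMN_STATS = -1L;
  // Placeholder heuristic selectivity for an equality predicate.
  static final double DEFAULT_EQUALITY_SELECTIVITY = 0.15;

  /**
   * Estimates the rows that survive "col = constant".
   * With a known NDV, a uniform distribution gives rowCount / ndv.
   * Without one, fall back to the heuristic selectivity.
   */
  static double estimateRowsAfterEquality(long tableRowCount, long ndv) {
    if (ndv == NO_COLUMN_STATS || ndv <= 0) {
      return tableRowCount * DEFAULT_EQUALITY_SELECTIVITY;
    }
    return (double) tableRowCount / ndv;
  }
}
```

   For example, 1000 rows with an NDV of 50 gives an estimate of 20 rows, while the same table without stats is estimated from the default selectivity alone.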

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
