[ https://issues.apache.org/jira/browse/PHOENIX-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230500#comment-14230500 ]
James Taylor commented on PHOENIX-1453:
---------------------------------------

That's true, [~lhofhansl]. When we do a raw scan like this, don't we know that each List<Cell> will be a new row?
{code}
innerScanner.nextRaw(results);
{code}
The other entry point is in our StatisticsScanner, in these two methods:
{code}
public boolean next(List<Cell> result) throws IOException {
public boolean next(List<Cell> result, int limit) throws IOException {
{code}
If we changed StatisticsCollector.updateStatistic(KeyValue kv) to take a List<Cell> instead, would we no longer need to do the key comparison, and could we just increment a counter?

Having the row count is useful, but not essential. More important are the equal-width guideposts, since these indicate how much data will be scanned. The row count would be used to control the optimization we do for a LIMIT query. We currently estimate the row size based on the schema, multiply by the LIMIT, and if the result is estimated to be less than one region's worth of data, we run the query serially (see ScanPlan.isSerial()). Having the row count would let us estimate the average row size more accurately. Maybe there's a better way to do that? Or maybe it's fine as-is, since it's kind of squishy already. I'd guess that we'd use the row count for other optimizations down the road, but I'm not positive.

If we think it's important to have the row count and need to do the key comparison, it'd be good to get a realistic measure of the overhead of doing the key comparison.

> Collect row counts per region in stats table
> --------------------------------------------
>
>                 Key: PHOENIX-1453
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-1453
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: James Taylor
>            Assignee: ramkrishna.s.vasudevan
>         Attachments: Phoenix-1453.patch, Phoenix-1453_1.patch, Phoenix-1453_2.patch, Phoenix-1453_3.patch
>
>
> We currently collect guideposts per equal chunk, but we should also capture row counts. Should we have a parallel array with the guideposts that count rows per guidepost, or is it enough to have a per region count?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
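
The two row-counting options raised in the comment, and the LIMIT heuristic it describes, can be illustrated with a minimal, dependency-free sketch. Nothing here is Phoenix or HBase code: `RowCountSketch`, its nested `Cell` class, and all method names are hypothetical stand-ins, and the per-batch variant simply assumes what the comment asks about, namely that each non-empty list handed to the collector holds the cells of exactly one new row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the Phoenix StatisticsCollector.
final class RowCountSketch {

    // Stand-in for an HBase Cell: only the row key and a byte size matter here.
    static final class Cell {
        final byte[] rowKey;
        final int sizeInBytes;

        Cell(String rowKey, int sizeInBytes) {
            this.rowKey = rowKey.getBytes(StandardCharsets.UTF_8);
            this.sizeInBytes = sizeInBytes;
        }
    }

    private long rowCount;
    private long byteCount;
    private byte[] lastRowKey; // used only by the per-cell variant

    // Per-cell variant: a row boundary is detected by comparing each
    // cell's row key with the previous one.
    void updatePerCell(Cell cell) {
        if (lastRowKey == null || !Arrays.equals(lastRowKey, cell.rowKey)) {
            rowCount++;
            lastRowKey = cell.rowKey;
        }
        byteCount += cell.sizeInBytes;
    }

    // Per-batch variant: ASSUMES every non-empty list is one new row,
    // so no key comparison is needed, only a counter increment.
    void updatePerBatch(List<Cell> cells) {
        if (cells.isEmpty()) {
            return;
        }
        rowCount++;
        for (Cell cell : cells) {
            byteCount += cell.sizeInBytes;
        }
    }

    long rowCount() {
        return rowCount;
    }

    // Average row size from collected stats; 0 when no rows were seen.
    long averageRowSize() {
        return rowCount == 0 ? 0 : byteCount / rowCount;
    }

    // LIMIT heuristic as described in the comment: run serially when the
    // estimated bytes to scan are less than one region's worth.
    // Overflow is ignored in this sketch.
    static boolean looksSerial(long avgRowSize, long limit, long regionSizeBytes) {
        return avgRowSize * limit < regionSizeBytes;
    }

    public static void main(String[] args) {
        List<Cell> rowA = Arrays.asList(new Cell("a", 10), new Cell("a", 30));
        List<Cell> rowB = Arrays.asList(new Cell("b", 20));

        RowCountSketch perBatch = new RowCountSketch();
        perBatch.updatePerBatch(rowA);
        perBatch.updatePerBatch(rowB);

        System.out.println("rows=" + perBatch.rowCount()
                + " avgRowSize=" + perBatch.averageRowSize());
    }
}
```

Both variants yield the same count whenever the one-row-per-list assumption holds; the per-batch variant just skips the byte-array comparison on every cell. Whether that saving is measurable is exactly the overhead question the comment leaves open.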