[ https://issues.apache.org/jira/browse/PHOENIX-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054914#comment-14054914 ]
Gabriel Reid commented on PHOENIX-539:
--------------------------------------

{quote}The lease timeout is a different issue, I believe. It's caused primarily if you're doing a group by or order by on too big a chunk of data. The client in that case doesn't hear back from the server for a long time b/c it's busy trying to sort/group. I believe the best solution for that is to improve the parallelization such that smaller chunks are operated on so that the client always hears back before the timeout occurs.{quote}

The issue I had in mind with the potential lease timeout was that there could be too much time between accesses to each scanner if you're doing some kind of processing on the records while iterating over the ResultSet (rather than simply streaming the rows). For example, consider code like this:

{code}
ResultSet rs = stmt.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    // Do something that takes a few milliseconds
    doSomethingExpensive(rs.getInt(1));
}
{code}

If each scanner is buffering 1000 rows at a time, and there are 10 parallel scanners, then 10,000 rows are consumed between successive visits to any one scanner, so (assuming a 60-second scanner lease timeout) the {{doSomethingExpensive()}} method can't take more than 6 milliseconds per call. Six milliseconds is obviously a long time, but if the number of scanners or the size of the buffer increases by an order of magnitude, this budget will drop by an order of magnitude.

This might not be something that we need to worry about -- it was actually my assumption that something like this was part of the reason that the whole spooling thing was done in the first place.

About the GROUP BY not using the ChunkedResultIterator, I believe this is already the case. I'm pretty sure that the only case where the ChunkedResultIterator can be used is via a ScanPlan, and (if I'm not mistaken) because GROUP BYs are executed via an AggregatePlan, I think it's ok there. In any case, all the integration tests pass with the current patch. If you know of any situations where this might not be the case (i.e. GROUP BY not using an AggregatePlan), let me know and I'll add some tests for that.

I'm going to be (mostly) offline for the coming 7 days -- do you think it's worth committing this now, or better to wait and consider going for the approach that [~lhofhansl] outlined? In any case, if/when I commit this I'll certainly add the JIRA ticket for not clearing out the hash cache so that this could work for hash joins too.

> Implement parallel scanner that does not spool to disk
> ------------------------------------------------------
>
> Key: PHOENIX-539
> URL: https://issues.apache.org/jira/browse/PHOENIX-539
> Project: Phoenix
> Issue Type: Task
> Reporter: James Taylor
> Assignee: larsh
> Attachments: PHOENIX-539.1.patch, PHOENIX-539.patch
>
>
> In scenarios where a LIMIT is not present on a non aggregate query that will
> return a lot of results, Phoenix spools the results to disk. This is less
> than ideal in these situations. @larsh has created a very good and relatively
> simple implementation that is queue based to replace this.
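
For reference, the per-call budget discussed in the comment can be worked out with a small sketch like the one below. It assumes a 60-second scanner lease timeout (the comment does not state the value, but it is consistent with the 6 ms figure) and assumes the client drains each scanner's buffer round-robin, so a given scanner is revisited only after all the other buffers have been consumed. The class and method names are hypothetical and are not part of Phoenix.

```java
public class LeaseBudget {

    /**
     * Hypothetical helper: the maximum time that can be spent on each row
     * before a scanner goes unvisited for longer than its lease timeout.
     */
    static double perRowBudgetMillis(long leaseTimeoutMillis,
                                     int rowsPerScanner,
                                     int parallelScanners) {
        // Rows consumed between two successive visits to the same scanner.
        long rowsBetweenVisits = (long) rowsPerScanner * parallelScanners;
        return (double) leaseTimeoutMillis / rowsBetweenVisits;
    }

    public static void main(String[] args) {
        long leaseTimeoutMillis = 60_000L; // assumed lease timeout

        // 1000 buffered rows x 10 scanners: prints 6.0
        System.out.println(perRowBudgetMillis(leaseTimeoutMillis, 1000, 10));

        // Buffer size increased by an order of magnitude: prints 0.6
        System.out.println(perRowBudgetMillis(leaseTimeoutMillis, 10_000, 10));
    }
}
```

The second call illustrates the point about scaling: ten times the buffer size (or ten times the scanners) leaves one tenth of the time per row.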