[jira] [Commented] (CASSANDRA-2062) Better control of iterator consumption

Jonathan Ellis (JIRA) Mon, 13 Jun 2011 19:19:52 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13048954#comment-13048954 ]


Jonathan Ellis commented on CASSANDRA-2062:
-------------------------------------------

It looks like the core problem is that CollatingIterator calls hasNext() on a 
child iterator immediately after pulling the least value off of it (because it 
does not require that its inputs be unique?).

So I think this is fixable by creating a PQ-using UniquesCollatingIterator, for 
instance (better name as an exercise for the reader :).
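
To make the idea concrete, here is a rough sketch of a priority-queue merge 
that does not touch a child iterator again until the caller asks for the next 
value. It is not from either attached patch: the class name 
DeferredMergeIterator and its Candidate helper are made up for illustration, 
each input is assumed to be sorted by the given comparator, and nothing here 
enforces or relies on uniqueness.

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

/** Hypothetical sketch: merges sorted iterators, advancing a source only when needed. */
class DeferredMergeIterator<T> implements Iterator<T> {
    /** Pairs a source iterator with the item most recently pulled from it. */
    private static final class Candidate<T> {
        final Iterator<T> source;
        T item;

        Candidate(Iterator<T> source) { this.source = source; }

        /** Pulls the next item from the source; returns false if it is exhausted. */
        boolean advance() {
            if (!source.hasNext()) return false;
            item = source.next();
            return true;
        }
    }

    private final PriorityQueue<Candidate<T>> queue;
    // Source whose item was handed out last; it is advanced lazily.
    private Candidate<T> pending;

    public DeferredMergeIterator(List<? extends Iterator<T>> sources, Comparator<? super T> cmp) {
        queue = new PriorityQueue<>((a, b) -> cmp.compare(a.item, b.item));
        // Prime each source once; no item is outstanding yet at this point.
        for (Iterator<T> s : sources) {
            Candidate<T> c = new Candidate<>(s);
            if (c.advance()) queue.add(c);
        }
    }

    /** Advances the source of the previously returned item, if any. */
    private void advancePending() {
        if (pending != null) {
            if (pending.advance()) queue.add(pending);
            pending = null;
        }
    }

    @Override public boolean hasNext() {
        advancePending();
        return !queue.isEmpty();
    }

    @Override public T next() {
        advancePending();
        Candidate<T> least = queue.poll();
        if (least == null) throw new NoSuchElementException();
        pending = least; // do NOT call least.source.hasNext() here
        return least.item;
    }
}
```

The point of the pending field is that the child's hasNext() only runs once 
the caller comes back for another value, by which time the previous item has 
presumably been consumed. Each step also costs one queue operation rather than 
a scan over all inputs.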

But, the MergingIterator approach with the reduce incorporated has a kind of 
elegance to it as well. The reduce logic is certainly simpler in MI. So I'm 
fine w/ replacing CI+RI with MI if that's the approach you want, but I'd like 
to make it a clean replace -- we still have the RI use in 
collectCollatedColumns, as you noted above, as well as LazyColumnIterator.

It looks to me like the main obstacle to using MI there is making MI.Reducer 
support customizable isEqual?
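
Something along these lines is what I have in mind for a customizable 
isEqual. This is only a sketch and does not match the Reducer in the attached 
patches: SketchReducer, ReducingSketch and reduceAll are invented names, and 
the input is assumed to be already merged into sorted order.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical reducer: folds a run of "equal" inputs into one output. */
abstract class SketchReducer<In, Out> {
    /** Decides whether two adjacent inputs belong to the same run; override to customize. */
    public boolean isEqual(In a, In b) { return a.equals(b); }

    /** Accumulates one input into the current run. */
    public abstract void reduce(In current);

    /** Returns the result for the current run and resets state for the next run. */
    public abstract Out getReduced();
}

final class ReducingSketch {
    private ReducingSketch() {}

    /** Collapses runs of equal items from an already-sorted iterator. */
    public static <In, Out> List<Out> reduceAll(Iterator<In> sorted, SketchReducer<In, Out> reducer) {
        List<Out> out = new ArrayList<>();
        In previous = null;
        boolean runOpen = false;
        while (sorted.hasNext()) {
            In item = sorted.next();
            // A change of "key" (as defined by the reducer) closes the current run.
            if (runOpen && !reducer.isEqual(previous, item)) {
                out.add(reducer.getReduced());
            }
            reducer.reduce(item);
            previous = item;
            runOpen = true;
        }
        if (runOpen) out.add(reducer.getReduced());
        return out;
    }
}
```

A caller that wants to group on something other than equals() (a column name 
while ignoring the timestamp, say) would override isEqual and leave the rest 
alone.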

> Better control of iterator consumption
> --------------------------------------
>
>                 Key: CASSANDRA-2062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2062
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: 
> 0001-CASSANDRA-2062-0001-Improved-iterator-for-merging-sort.txt, 
> 0002-CASSANDRA-2062-0002-Port-all-collating-consumers-to-Me.txt
>
>
> The core reason for this ticket is to gain control over the consumption of 
> the lazy nested iterators in the read path.
> {quote}We survive now because we write the size of the row at the front of 
> the row (via some serious acrobatics at write time), which gives us hasNext() 
> for rows for free. But it became apparent while working on the block-based 
> format that hasNext() will not be cheap unless the current item has been 
> consumed. "Consumption" of the row is easy, and blocks will be framed so that 
> they can be very easily skipped, but you don't want to have to seek to the 
> end of the row to answer hasNext, and then seek back to the beginning to 
> consume the row, which is what CollatingIterator would have forced us to 
> do.{quote}
> While we're at it, we can also improve efficiency: for {{M}} iterators 
> containing {{N}} total items, commons.collections.CollatingIterator performs 
> an {{O(M*N)}} merge, and calls hasNext multiple times per returned value. We 
> can do better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
