[ 
https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Turner updated ACCUMULO-775:
----------------------------------

    Assignee:     (was: Keith Turner)

> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
>                 Key: ACCUMULO-775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-775
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Christopher Tubbs
>            Priority: Minor
>              Labels: iterator, scan, seek
>
> At present, seeking is a very expensive operation. Yet it is very common, 
> especially in filtering/consuming/skipping iterators, to seek to the next 
> possible match (perhaps in the next row, when matching a column family with 
> a regular expression) rather than continue iterating. A common solution to 
> the problem of whether to scan or seek is to keep scanning up to some 
> threshold (~10-20 entries), hoping to just "run into" the next possible 
> match, rather than waste resources seeking directly to it.
> This pattern can be rolled into the lower-level iterator, so that iterators 
> on top don't have to do this. They can seek, and the underlying source 
> iterator can simply consume the next X entries when it makes sense, rather 
> than waste resources seeking.
> I could be wrong (please comment and correct me below if I am), but I imagine 
> that this would make the most sense when the data being sought is in the 
> current compressed block of the underlying file, especially if it lies 
> forward of the current pointer. A better seek method should be able to tell 
> where it currently is, and whether the requested data is within reach, 
> without doing all the expensive operations to re-seek to the same compressed 
> block that is already loaded, reload it, decompress it, and scan to the 
> requested starting point.
> Having such an optimization would eliminate the need for users to calibrate 
> their own scan vs. seek threshold based on guessing whether their data is in 
> the current block or another one, while still giving them the same 
> performance benefit.
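
The idea in the description can be illustrated with a minimal sketch. This is not Accumulo code: `SortedSource`, `scan_limit`, and `expensive_seeks` are hypothetical names invented for illustration, the source is just an in-memory sorted list, and the threshold of 10 is a guess taken from the "~10-20 entries" range mentioned above. Accumulo's real iterator interface seeks to a range rather than a single key, and a real implementation would presumably check block boundaries instead of counting entries.

```python
import bisect


class SortedSource:
    """Hypothetical stand-in for a low-level sorted key source."""

    def __init__(self, keys, scan_limit=10):
        self.keys = sorted(keys)
        self.pos = 0
        self.scan_limit = scan_limit  # the "X entries" threshold (assumed value)
        self.expensive_seeks = 0      # counts full re-seeks, for illustration only

    def has_top(self):
        return self.pos < len(self.keys)

    def top(self):
        return self.keys[self.pos]

    def next(self):
        self.pos += 1

    def _full_seek(self, target):
        # Models the costly path: locate the block, load, decompress, scan.
        self.expensive_seeks += 1
        self.pos = bisect.bisect_left(self.keys, target)

    def seek(self, target):
        # Forward seek: first try to "run into" the target by consuming
        # at most scan_limit entries from the current position.
        if self.has_top() and target >= self.top():
            for _ in range(self.scan_limit):
                if not self.has_top() or self.top() >= target:
                    return
                self.next()
            if not self.has_top() or self.top() >= target:
                return
        # Backward seek, or target too far ahead: pay for a full seek.
        self._full_seek(target)
```

With this shape, an iterator stacked on top simply calls `seek()` and never needs its own scan-versus-seek heuristic: a nearby forward target is reached by consuming entries, while a distant or backward target falls through to the full seek.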



--
This message was sent by Atlassian JIRA
(v6.2#6252)
