[ https://issues.apache.org/jira/browse/ACCUMULO-775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Keith Turner updated ACCUMULO-775:
----------------------------------
Assignee: (was: Keith Turner)
> Optimize iterator seek() method when seeking forward
> ----------------------------------------------------
>
> Key: ACCUMULO-775
> URL: https://issues.apache.org/jira/browse/ACCUMULO-775
> Project: Accumulo
> Issue Type: Improvement
> Components: tserver
> Reporter: Christopher Tubbs
> Priority: Minor
> Labels: iterator, scan, seek
>
> At present, seeking is a very expensive operation. Yet it is very common,
> especially when writing filtering/consuming/skipping iterators that seek
> to the next possible match (perhaps in the next row, when matching a column
> family with a regular expression) rather than continuing to iterate. A
> common solution to the problem of whether to scan or seek is to continue to
> scan for some threshold (~10-20 entries), hoping to just "run into" the next
> possible match, rather than waste resources seeking directly to it.
>
> This pattern can be rolled into the lower-level iterator, so that iterators
> on top don't have to do this. They can seek, and the underlying source
> iterator can simply consume the next X entries when it makes sense, rather
> than waste resources seeking.
>
> I could be wrong (please comment and correct me below if I am), but I imagine
> that this would make the most sense when the data being sought is in the
> current compressed block of the underlying file, especially if it lies
> forward of the current pointer. A better seek method should be able to tell
> where it currently is, and whether the requested data is within reach,
> without doing all the expensive operations to re-seek to the same compressed
> block that is already loaded, reload it, decompress it, and scan to the
> requested starting point.
>
> Such an optimization would eliminate the need for users to calibrate their
> own scan vs. seek heuristic by guessing whether their data is in the current
> block or another one, while still giving them the same performance benefit.
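
For illustration only, the scan-before-seek pattern described above might be
sketched roughly as follows. This is not Accumulo code and does not use the
SortedKeyValueIterator API: SortedSource, LazySeekSource, maxScan, and the
string keys are all hypothetical stand-ins, and the "expensive" seek is
modelled as a simple binary search over an in-memory list of unique keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a sorted source; NOT an Accumulo class. */
final class SortedSource {
    private final List<String> keys; // assumed sorted ascending, unique
    private int pos = 0;
    int seekCount = 0; // number of "expensive" seeks performed
    int nextCount = 0; // number of cheap next() steps performed

    SortedSource(List<String> sortedKeys) {
        this.keys = new ArrayList<>(sortedKeys);
    }

    boolean hasTop() { return pos < keys.size(); }

    String top() { return keys.get(pos); }

    void next() { pos++; nextCount++; }

    /** Models the expensive path: position at the first key >= target. */
    void seek(String target) {
        seekCount++;
        int idx = Collections.binarySearch(keys, target);
        pos = idx >= 0 ? idx : -idx - 1;
    }
}

/** Hypothetical wrapper that scans a few entries before really seeking. */
final class LazySeekSource {
    private final SortedSource source;
    private final int maxScan; // assumed threshold, e.g. 10-20 entries

    LazySeekSource(SortedSource source, int maxScan) {
        this.source = source;
        this.maxScan = maxScan;
    }

    void seek(String target) {
        // Only a forward seek can be satisfied by scanning ahead.
        if (source.hasTop() && source.top().compareTo(target) <= 0) {
            for (int i = 0; i < maxScan && source.hasTop(); i++) {
                if (source.top().compareTo(target) >= 0) {
                    return; // ran into the target while scanning
                }
                source.next();
            }
            // Either ran off the end (every key < target) or just reached it.
            if (!source.hasTop() || source.top().compareTo(target) >= 0) {
                return;
            }
        }
        // Backward seek, or target farther than maxScan entries away.
        source.seek(target);
    }
}

final class ScanOrSeekDemo {
    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            keys.add(String.format("k%02d", i));
        }
        SortedSource src = new SortedSource(keys);
        LazySeekSource lazy = new LazySeekSource(src, 10);

        lazy.seek("k05"); // near: reached by scanning
        lazy.seek("k50"); // far: scan gives up, falls back to a real seek
        lazy.seek("k10"); // backward: always a real seek
        System.out.println("top=" + src.top()
                + " seeks=" + src.seekCount + " nexts=" + src.nextCount);
    }
}
```

In this sketch the caller always calls seek() and the wrapper decides whether
scanning is cheaper, which is the division of labour the description asks
for. A fixed entry count is only one possible heuristic; the description
suggests that a real implementation could instead check whether the target
falls inside the compressed block that is already loaded.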