Hi Accumulo wizards,

TL;DR - this is a question about custom iterators and saving state (or seeking 
backwards) in order to filter / mask data during major compaction.

For a project I'm working on, we would like to be able to use one entry to 
filter other entries in the same row. (I will call the first entry the 
'filtering key.') To do this, we would ensure that this 'filtering key' 
lexicographically precedes the other entries it would be used on.

There is, of course, a snag with this idea. The iterator could simply read the 
entry, save it in memory, and use it for subsequent filtering, were it not for 
the fact that the iterator stack can be dropped and re-initialized at any point 
in the row, including at cf's/cq's that are already past the 'filtering key.' 
Our understanding is that the tserver can (and does!) tear down and 
re-initialize the iterator stack at any point. When this happens, the tserver 
will "seek(...)" the newly re-initialized iterator stack back to the same 
row/cf/cq that the previous incarnation of the stack was on when it was torn 
down.
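
To make the failure mode concrete, here is a minimal plain-Java simulation. It 
does not use the Accumulo API at all: the class, the method names, and the 
"!filter" column name are hypothetical. A row is modeled as a sorted map from 
column to value, and a teardown followed by a seek is modeled as a fresh filter 
instance that starts reading from a later column.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the problem; none of this is Accumulo API.
class StatefulFilterSketch {
    // Hypothetical column name, chosen so it sorts before all data columns.
    static final String FILTER_COLUMN = "!filter";

    // In-memory state; it is lost whenever a new instance is created.
    private String maskedValue = null;

    // Returns the columns that survive filtering, reading from startColumn
    // (inclusive). The filtering key itself is not returned in this sketch.
    List<String> scanFrom(TreeMap<String, String> row, String startColumn) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : row.tailMap(startColumn, true).entrySet()) {
            if (e.getKey().equals(FILTER_COLUMN)) {
                maskedValue = e.getValue(); // remember what to mask
                continue;
            }
            if (maskedValue != null && maskedValue.equals(e.getValue())) {
                continue; // entry is masked by the filtering key
            }
            kept.add(e.getKey());
        }
        return kept;
    }

    static TreeMap<String, String> sampleRow() {
        TreeMap<String, String> row = new TreeMap<>();
        row.put(FILTER_COLUMN, "secret");
        row.put("a", "public");
        row.put("b", "secret");
        row.put("c", "secret");
        return row;
    }

    public static void main(String[] args) {
        TreeMap<String, String> row = sampleRow();
        // Uninterrupted pass: the filtering key is seen first. Prints [a]
        System.out.println(new StatefulFilterSketch().scanFrom(row, FILTER_COLUMN));
        // Fresh instance seeked to "c": the state is gone. Prints [c]
        System.out.println(new StatefulFilterSketch().scanFrom(row, "c"));
    }
}
```

In the second call, column "c" should have been masked but is returned, because 
the new instance never saw the filtering key.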

When this teardown/re-init happens, the tserver doesn't call deepCopy(...) on 
the iterator stack; it just calls init(...). (At least, this is our experience 
in Accumulo 1.6.2.) For this reason, it is seen as a risky proposition to try 
to keep state in the iterators. (Josh Elser acknowledges this in his 
presentation on designing and testing custom iterators for Accumulo, 
https://www.slideshare.net/je2451/designing-and-testing-accumulo-iterators).

Nevertheless, for the scan scope, I believe we can use WholeRowIterator to 
ensure that we don't ever return data for a row until we've read the entire 
row, thus avoiding the need to keep state in the iterators. (If the iterator 
stack gets re-initialized, we should start over from the beginning of the row.)
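
The whole-row idea can be modeled the same way. The sketch below is again plain 
Java rather than Accumulo code (WholeRowSketch, filterRow, scanFrom and the 
"!filter" column are made-up names): each row is filtered in a single call, so 
no state needs to survive between calls, and a restarted scan can only resume 
at a row boundary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of whole-row processing; none of this is Accumulo API.
class WholeRowSketch {
    static final String FILTER_COLUMN = "!filter"; // hypothetical column name

    // Filters one complete row. No state outlives this call.
    static List<String> filterRow(TreeMap<String, String> columns) {
        String masked = columns.get(FILTER_COLUMN); // null if the row has no filtering key
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : columns.entrySet()) {
            if (e.getKey().equals(FILTER_COLUMN)) continue; // do not return the key itself
            if (e.getValue().equals(masked)) continue;      // masked entry
            kept.add(e.getKey());
        }
        return kept;
    }

    // Scans rows from startRow (inclusive); a restart always begins a row afresh.
    static Map<String, List<String>> scanFrom(
            TreeMap<String, TreeMap<String, String>> table, String startRow) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, TreeMap<String, String>> r
                : table.tailMap(startRow, true).entrySet()) {
            out.put(r.getKey(), filterRow(r.getValue()));
        }
        return out;
    }

    static TreeMap<String, TreeMap<String, String>> sampleTable() {
        TreeMap<String, String> row1 = new TreeMap<>();
        row1.put(FILTER_COLUMN, "secret");
        row1.put("a", "public");
        row1.put("b", "secret");
        TreeMap<String, String> row2 = new TreeMap<>();
        row2.put("a", "secret"); // row2 has no filtering key, so nothing is masked
        row2.put("b", "x");
        TreeMap<String, TreeMap<String, String>> table = new TreeMap<>();
        table.put("row1", row1);
        table.put("row2", row2);
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scanFrom(sampleTable(), "row1")); // full scan
        System.out.println(scanFrom(sampleTable(), "row2")); // resumed at a row boundary
    }
}
```

The cost in this model is that a whole row must be held in memory at once, 
which is part of what question (4) below is asking about.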

Our problem comes when we want to use this filter in the majc (major 
compaction) scope to actually filter the masked data out of the system 
entirely. In this case, the WholeRowIterator approach wouldn't seem to be 
usable (because, as far as we can tell, Accumulo only allows us to set filters, 
not arbitrary iterators, at compaction time).

Here are our questions:

(1) Has Accumulo's behavior when tearing down and re-initializing an iterator 
stack changed between 1.6.2 and the latest version? (I.e. is deepCopy now 
called?)

(2) Are there any other ways in which storing state across iterator stack 
teardowns has been made any easier?

(3) If not, are there any other tricks/hacks which we might consider using 
(albeit with caution) to store state or otherwise accomplish this? (Options 
we've mused about include figuring out another way for the iterators to store 
state beyond themselves -- can iterators write to the IteratorEnvironment to 
influence future iterator instantiations? -- and/or allowing the iterators to 
seek backwards to get the 'filtering key' they need.)

(4) Also: any downsides to using the WholeRowIterator we should keep in mind?
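
To illustrate the "seek backwards" option from (3), here is a rough plain-Java 
model of the intent. It is not Accumulo code, and ReseekFilterSketch, scanFrom 
and the "!filter" column are hypothetical names. Wherever the scan resumes, the 
filter first looks up the filtering key at the start of the row and only then 
reads forward from the requested column. Whether an iterator can do this 
lookup safely against a real source is precisely what we are asking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of "re-read the filtering key on every seek"; not Accumulo API.
class ReseekFilterSketch {
    static final String FILTER_COLUMN = "!filter"; // hypothetical column name

    // Reads from startColumn (inclusive) but always consults the row start first,
    // so the result does not depend on any state from a previous instance.
    static List<String> scanFrom(TreeMap<String, String> row, String startColumn) {
        String masked = row.get(FILTER_COLUMN); // the "backwards seek"
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> e : row.tailMap(startColumn, true).entrySet()) {
            if (e.getKey().equals(FILTER_COLUMN)) continue; // do not return the key itself
            if (e.getValue().equals(masked)) continue;      // masked entry
            kept.add(e.getKey());
        }
        return kept;
    }

    static TreeMap<String, String> sampleRow() {
        TreeMap<String, String> row = new TreeMap<>();
        row.put(FILTER_COLUMN, "secret");
        row.put("a", "public");
        row.put("b", "secret");
        row.put("c", "secret");
        row.put("d", "public");
        return row;
    }

    public static void main(String[] args) {
        System.out.println(scanFrom(sampleRow(), FILTER_COLUMN)); // prints [a, d]
        System.out.println(scanFrom(sampleRow(), "c"));           // prints [d]
    }
}
```

Unlike a filter that keeps the key only in memory, resuming at "c" here still 
masks "c", at the price of one extra lookup per seek.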

Thanks in advance,

Jonathan
