Hi, Christopher, This is how I see it working, let me know if I'm missing something. The query is, "tell me about prop_b for all rows where prop_a < 9"
1. I set up a query from (-inf,+inf) on my partitioned index table with my PartitionedRangeSearchIterator in scan scope. I configure the PRSI with some encoded representation of my query. 2. The PRSI seeks to (p0, i, (prop_a)) 3. For each key less than (p0, i, (prop_a, 9)), - Extract the row_id from the last field in the CQ tuple - Seek to (p0, d, (row_id, prop_b)) - If prop_b exists for row_id, emit (p0, d, (row_id, prop_b)), <value of prop_b> [repeat for p1, p2 etc... all partitions available to iterator] It's the seek from the index entries to the data entries that could cause the data entries to be emitted out of order. The index entries are ordered by prop_a, the data entries are ordered by row id, the 2 orders will hardly ever correspond. -Russ On Thu, Apr 2, 2015 at 1:18 PM, Christopher <[email protected]> wrote: > I'm not sure I understand how they'd be out of order in the iterator > if they aren't out of order in the underlying source. > How would your iterator return: > ((p0, d, (r15, prop_b)), "just testing"), > ((p0, d, (r8, prop_b)), "hello, world") > > when the underlying data is: > p0 | d | (r8, prop_b) | hello, world > p0 | d | (r15, prop_b) | just testing > > ? > > Why would it reorder the existing entries? > > -- > Christopher L Tubbs II > http://gravatar.com/ctubbsii > > > On Thu, Apr 2, 2015 at 1:52 PM, Russ Weeks <[email protected]> > wrote: > > Thanks for your response, Christopher. > > > > Yes, I see what you mean by promoting the CQ to the CF. I thought that > > would simplify things but if not I could definitely return (k,v) pairs > like, > > > > ((p0, d, (r15, prop_b)), "just testing"), > > ((p0, d, (r8, prop_b)), "hello, world") > > > > (leaving the timestamps and visibilities out for clarity, and assuming > that > > r15 and r8 are encoded such that their lexical order matches their > numeric > > order) > > > > Leaving the existing schema intact is not a problem for me but it doesn't > > get around the fact that the (k,v) pairs would be returned out-of-order > by > > the iterator. I guess another option would be to embed the "data" kv > pairs > > into the "index" kv pairs somehow, such as: > > > > ((p0, i, ((prop_a, 7, r15), prop_b)), "just testing") > > ((p0, i, ((prop_a, 8, r8), prop_b)), "hello, world") > > > > I'm not too keen on that solution but if you're telling me that I > shouldn't > > rely on scan-time iterators being able to emit data out of order... I > think > > it's the least bad option. > > > > Regards, > > -Russ > > > > On Thu, Apr 2, 2015 at 12:04 AM, Christopher <[email protected]> > wrote: > > > >> So in your example, you actually did return then in order (lexically, > not > >> numerically), but I grok the idea that they might not be. > >> > >> The problem is that your transformation promotes a portion of the cq to > the > >> cf. That's fine if what your iterator is returning includes only that > from > >> a single cf (day, the 'data' cf). But otherwise, you could get > duplicates > >> or out of order results, which can mess up the client's expectations > when > >> retrieving batches from the servers. It could work in some limited > cases, > >> but I'd avoid it. > >> > >> Instead, why not preserve order by preserving the existing schema, and > just > >> ignore the unused cf in the client? > >> > >> On Thu, Apr 2, 2015, 00:28 Russ Weeks <[email protected]> wrote: > >> > >> > Thanks, Christopher. It's nice to hear an unambiguous point of view :) > >> > > >> > Do you see any alternative way of implementing a range scan on a > >> > partitioned index? The problem does not exist for exact-match scans > >> because > >> > the row ID in the index entry CQ provides the correct ordering. > >> > > >> > Thanks, > >> > -Russ > >> > > >> > On Wed, Apr 1, 2015 at 9:11 PM, Christopher <[email protected]> > wrote: > >> > > >> > > You should definitely not rely on this behavior. It goes against > best > >> > > practices and is prone to error. It is not recommended. > >> > > > >> > > On Wed, Apr 1, 2015, 20:03 Russ Weeks <[email protected]> > >> wrote: > >> > > > >> > > > A wonderful property of scan-time iterators is that they can emit > row > >> > IDs > >> > > > in arbitrary order. Before I go off and build an index that > relies on > >> > > this > >> > > > behaviour, I'd like to get a sense of how likely it is to exist in > >> > future > >> > > > versions of Accumulo. > >> > > > > >> > > > I'd like to build an index like this (hopefully the ascii comes > >> > through, > >> > > if > >> > > > not check here < > >> https://gist.github.com/anonymous/1a64114da4b68a2ec822 > >> > > >): > >> > > > > >> > > > > >> > > > row | cf | cq | val > >> > > > ------------------------------------------------- > >> > > > p0 | i | (prop_a, 7, r15) | 1 > >> > > > p0 | i | (prop_a, 8, r8) | 1 > >> > > > p0 | i | (prop_a, 9, r19) | 1 > >> > > > [...snip...] > >> > > > p0 | d | (r8, prop_a) | 8 > >> > > > p0 | d | (r8, prop_b) | hello, world > >> > > > p0 | d | (r15, prop_a) | 7 > >> > > > p0 | d | (r15, prop_b) | just testing > >> > > > p0 | d | (r19, prop_a) | 9 > >> > > > p0 | d | (r19, prop_b) | something else > >> > > > > >> > > > Which is a pretty conventional partitioned index. I'd like to be > able > >> > to > >> > > > issue a query like, "Tell me about prop_b for all documents where > >> > prop_a > >> > > < > >> > > > 9" but I'm pretty sure that the only way this could work at scale > is > >> if > >> > > > it's OK for the iterator to return (p0, r15, prop_b, "just > testing") > >> > > > followed by (p0, r8, prop_b, "hello, world"). > >> > > > > >> > > > This works today - if you folks see any flaws in my reasoning > please > >> > let > >> > > me > >> > > > know - my question is, do you see this as functionality that > should > >> be > >> > > > preserved in the future? > >> > > > > >> > > > Thanks, > >> > > > -Russ > >> > > > > >> > > > >> > > >> >
