This may be better on the dev list. I have a couple of questions about what's going on please....
On Tue, Jun 12, 2012 at 2:54 PM, Stephen Allen <[email protected]> wrote: > ARQ currently materializes the RHS into an ArrayList which it then > uses for a linear search on each LHS iteration. It's materialized? Is there a streaming option if the RHS gets too big? If not streaming, then perhaps a store-to-disk option? Better yet, can the result of an index lookup be represented as a QueryIterator? (I realize that last question may be hopelessly naive, but I'm very new to this codebase). > An easyish > performance improvement would be to replace that with a HashSet > containing bindings made up of the shared variables. It would be helpful at this point to know all the variables that can appear in either side. However, that's not possible here, is it? It looks like that information is in bindingLeft and the iterator for Bindings in tableRight, and you'd have to iterate exhaustively to find them all. Or is there some other way to get this information? (The query structure has this, but I can't see that it's readily accessible from here). Unless I'm mistaken, if I can't get hold of the variables that may appear in the bindings, then a simple HashSet approach won't be adequate. > A more advanced change would be do what you've done and not > materialize the RHS when it has a fast index search, taking scope > differences and shared variables into account. Does scope matter for Minus? The iterator for the right hand side takes a null context. (That's one of the main differences from NOT EXISTS, isn't it?) > Although as you > mention this becomes essentially the same query optimization decision > as hash vs nested loop join at this point. That's the tradeoff between working with what indexes you have, and building a new one for the current situation. Of course, either approach will be better than the current code. Thinking out loud... I haven't tried it myself for this, but there's also a hybrid approach where you use the existing index, and cache in a hash as you go. That'd help for data that you revisit, but it may be that most queries won't revisit things that get subtracted very often. Paul
