[ 
https://issues.apache.org/jira/browse/JENA-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112436#comment-13112436
 ] 

Andy Seaborne commented on JENA-121:
------------------------------------

Bindings are central to query evaluation; I'd want a great deal of caution 
before changing them.  This equates to building experimental systems off-trunk 
and being very sure that that changes are for the better for the wide range of 
uses that get made of ARQ.  Improvements to one set of users may be 
disadvantageous to another set (it would't be the first time :-().

Binding parents save memory and time in creation.  They may introduce a lookup 
cost though so if that can be reduced it would be good (I say "may" because 
there is a theoretical cost but is it visible?)

They save memory because one parent can be shared by several query solutions as 
the answers are built up.  By pointing to the parent, from an earlier stage in 
the execution, no copy is done and the same variable bindings are shared.  With 
sorting, instead of needing to handle all the data for all the results, common 
parts are shared.  There can be a lot of rows all sharing the same parent.

They save time by avoiding copying down the elements from a binding earlier in 
the execution.

This is working together with the substitution merge style execution 
(effectively and index join) that ARQ uses.  Only Sorting and grouping cause an 
accumulation of bindings, other queries have one or two results in flight at 
any one time.

A copy-down scheme would be expensive because elements may be copied more then 
once. A binding found early in some execution will be copied into an 
intermediate result, then further copied into further intermediate results.  

(This is like the Java ArrayList where to increase it's size a new version of 
1.5 times the size is allocated, and the old copied into the new; but that 
means the first few elements are copied over and over again, every time the 
array is expanded.  It becomes worse than O(N).)



(2) only addresses the iteration case.  It might be nice to have in the 
interface but not as the implementation; the implementation is adding a new 
class and the associated overhead.

I don't see how (2a) fits with the current codebase.

In (2b) a more incremental approach might be to cache the calculated variable 
list (and invalided if updated but currently does any important iteration 
happen in core execution, or just modifiers?

For (3), I agree a better implementation and avoid a hash map would be good.  
It's a local change to an implementation, not an interface system wide change.


Proposal:

(A) Remove .add(?,?) and .addAll(?) from the binding interface.

I have done this change on a copy of the code to check it all works but I ran 
out of time to tidy up all the details.  I'll report back when I've finished.

(B) Better implementation where BindingMap is used.

The most important inner loop place is 
QueryIterTriplePattern.TripleMapper.mapper.  There are quite a few uses in 
property functions but TripleMapper.mapper is the performance critical point.  

We have 
  BindingFactory.create{1,2,3}(...)
These create bindings that have a fixed one, two or three slots bindings.  

Writing out the cases on TripleMapper.mapper to have 1/2/3 bindings will be 
tedious code but it's just code and the saving avoiding general BindingMap is 
well worth having.

The trick to coding will be to avoid intermediate object creation.  Stack 
variables and testing for nulls is messy (even ugly) but fast.

(C) Change Binding.create() to return BindingMap.  It's used in some places 
(e.g. parsed result sets).

BindingFactory.create1([parent,] Var, Node) already exists as 
BindingFactory.binding([parent,] Var, Node)






> Improvements to Bindings
> ------------------------
>
>                 Key: JENA-121
>                 URL: https://issues.apache.org/jira/browse/JENA-121
>             Project: Jena
>          Issue Type: Improvement
>          Components: ARQ
>            Reporter: Stephen Allen
>            Priority: Minor
>
> The Binding interface is a key object for query execution.  It has some 
> issues such that it may be a good idea to think about tweaking the design a 
> bit.  Some thoughts:
> 1) Bindings should be immutable (in the strong Java sense: 
> http://www.ibm.com/developerworks/java/library/j-jtp02183/index.html)
> 2) Add a BindingPair class that represents a Var/Node value (could be called 
> something else, BindingValue?)
> 2a) Binding constructor/factory method takes an Iterable<BindingPair> to 
> initialize it
> 2b) Binding can now implement Iterable<BindingPair> which would be more 
> efficient than iterating over the variables then looking up each node
> 3) An implementation that has better memory usage than BindingMap (a HashMap 
> may be overkill here, if we can use the iterator from 2b in more places)
> 4) An implementation that copies parent BindingPairs instead of maintaining a 
> reference.  If the parent bindings are not held onto by themselves after 
> being incorporated into a child, we can save memory by copying and letting 
> the parent be GCed (indeed in the common join case, this appears to be what 
> happens).  We would also get speed benefits from storing the BindingPairs in 
> a single data structure, making iterating and looking up values faster.  
> Additionally, more Binding objects die young instead of being held as part of 
> a higher level algebra collection (like sort or distinct), which can help 
> with GC overhead.
> 5) Expose an iterator of BindingPairs ordered by variable.  This is needed 
> for BindingComparator, and may be an option for Algebra.merge()/compatible() 
> if we eliminate fast get(Var) lookups of nodes (as a consequence of 3).  The 
> ordering could be determined at construction or be initialized lazily.
> 6) Method for estimating memory size for the binding.  Would be very useful 
> for setting threshold policies for DataBags.  Although this may be tough to 
> do, especially if Nodes are shared between bindings.
> Some of these points need some more investigation, and some good profiling to 
> ensure that they are beneficial, especially 3, 4, and 5.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to