Another thought... I would love to setup some performance tests to validate these ideas as well.
Alex On Fri, Mar 28, 2008 at 10:53 AM, Alex Karasulu <[EMAIL PROTECTED]> wrote: > > > On Fri, Mar 28, 2008 at 10:25 AM, Alex Karasulu <[EMAIL PROTECTED]> > wrote: > > > On Fri, Mar 28, 2008 at 7:06 AM, Emmanuel Lecharny <[EMAIL PROTECTED]> > > wrote: > > > > > Considering we are searching in the US, the scope could be > > > c=US,dc=acme,dc=com, we will limit the search in this branch. This is > > > done by adding internally a new assertion in the filter : > > > > > > (&(ou=java)(ou=engineering)(scope.subtree="c=US,dc=acme,dc=com")) > > > > > > <note> > > > This is not exactly how it is done in the server, it's just a 10K feet > > > presentation of the principles > > > </note> > > > > > > > You're on target yeah we add this node to the AST. BTW with alias > > dereferencing while searching the number of search bases expand in the scope > > node as we encounter aliases which "expand" the search space. Still we add > > one scope node but that scope node performs multiple assertions. > > > > Now, the search engine do a computation on this filter to determinate > > > which index should be used. Suppose that the ou=java links to 10000 > > > persons, and that the ou=engineering links to 1000 persons, we will > > > use > > > the ou=engineering attribute to limit the number of candidate to bring > > > back from the backend, and check with the two other assertions to > > > eliminate the bad candidates. We do that by enumerating the number of > > > elements associated with eac assertion : > > > > > > (&(10000:ou=java)(1000:ou=engineering)(MAX_INT:scope.subtree= > > > "c=US,dc=acme,dc=com")) > > > > > > > Yep this "annotation" of the modified filter AST is done by the > > optimizer. And BTW we'll be adding a new index which will give us an exact > > count for the scope node so we don't have to use MAX_INT anymore. > > > > Currently, the scope assertion is just used to eliminate the candidates > > > which are found by the search engine, we don't 'count' the number of > > > candidates under a position on the tree (so the MAX_INT value). In our > > > sample, if we suppose that the scope is containing 100 entries, we > > > immediatly see that we are not using the correct assertion to bring > > > back > > > the entries from the backend. > > > > > > Right this will change. I will split the current hierarchy index (a > > misnomer btw since it's really the parent-child index) into a onelevel and > > sublevel index. > > > > Moreover if we apply the scope to the 'ou' index, we may see that we > > > have only 10 java coders under the c=US,dc=acme,dc=com branch, instead > > > of 10000 (not likely, but who knows ? :) > > > > > > So the idea we had was to use the hierarchy index to add some > > > information about the other index. Here, we may have the number of > > > java > > > coder in the c=US,dc=acme,dc=com branch stored under the hierarchy > > > index > > > for the key 'c=US,dc=acme,dc=com'. We store all the counst for each > > > index into this Hierarcy index. > > > As it will be modified when we modify the other index, this might cost > > > a > > > bit on modifications, deletions and additions, but this will have a > > > huge > > > impact on the search requests, which is what we target. > > > > > > In our example, if we suppose that US has only 10 java coders, the > > > annotated filter will be : > > > > > > (&(10:ou=java)(200:ou=engineering)(500:scope.subtree= > > > "c=US,dc=acme,dc=com")) > > > > > > assuming that we also have 200 engineering in the branch, and that the > > > scope will bring 500 entries. > > > > > > Now, we will just bring back 10 candidates from the backend instead of > > > 1000. > > > > > > This is really a gross example, but I think it gives the idea. > > > > > > Is it totally insane, or does it makes some sense ? > > > > > > > Let me rephrase in my own words to see if I understand the idea. You're > > suggesting the use of a kind of a scope based LUT or separate index used to > > return candidates matched by an assertion with a specific attribute and > > value limited by scope. > > > > Effectively we're going to have this (I think - please correct me) with > > the two separate system indices for hierarchical relationships instead of > > just having one parent-child index as we do now. If you look at the > > documentation here you'll see a description of the onelevel and sublevel > > indices: > > > > > > http://cwiki.apache.org/confluence/display/DIRxSRVx11/Structure+and+Organization > > > > ======== From Doco ======= > > > > "The one level index maps parent entry identifiers from the master to > > the identifiers of their children. This index is used to facilitate various > > name space operations like renaming, and moving branches which impacts > > children. It is also used by the search engine to conduct one level search > > requests." > > > > This index maps ancestor entry identifiers from the master to the > > identifiers of all their descendants including immediate children. It does > > not map the descendants of the context entry (at the root) of the partition. > > This is just unnecessary since all other entries in the partition satisfy > > this condition. If this is something we desire to enumerate we can get a > > reverse Cursor on the ndn index and advance past the first entry, to start > > enumerating all the descendant identifiers of the context entry. > > > > This index plays a critical role in subtree search. It allows the search > > optimizer to detect a count and annotate the modified search filter for > > subtree scope nodes. It also allows a Cursor to enumerate the set of > > descendants associated with an entry. Without this index, the search engine > > must resort to expensive DN based operations for subtree scope constraint > > checking." > > ======== From Doco ======= > > > > Now with these two indices as opposed to the one we had we can better > > constrain subtree search. The subtree search will limit the candidates as > > will the assertions in the filter as it was given by the user. > > > > But I think you mean more and I think I am catching on to it. You want > > to constrain individual assertions off an index with scope at each level > > rather than at just the top perhaps. I have not thought about this at all > > so this is a good time to explore it. Here's what the search engine does > > to introduce scope into the AST: > > > > > > http://cwiki.apache.org/confluence/display/DIRxSRVx11/SearchEngine+and+Optimizer > > > > So what would taking the conjunction ('AND' or intersection) of the > > scope node with each leaf assertion do to influence the search plan and cost > > of searching instead of taking the conjunction only at the top? > > > > I guess it depends on the kind of logic in the filter. If we have a > > disjunction (OR) then applying the scope constraint to all child > > sub-expressions could seriously reduce the IO/processing of search > > operations while doing the same work. Take this example: > > > > (| (ou=engineering)[100] (c=US)[1000] (l=Jacksonville)[10]) > > > > => 1110 worst case for entire filter > > > > If under scope constraints these numbers are reduced by a factor of 10: > > > > (| (ou=engineering)[10] (c=US)[100] (l=Jacksonville)[1]) > > > > => 111 worst case for entire filter > > > > More thoughts ... > > So if 1000 entries are in scope. Then the filter above with the first set > of scan counts would result in a Cursor over the subtree index (presuming > subtree search scope parameter value). This is because 1000 < 1110. This > is what happens when the scope node is AND'ed up at the top instead of being > pushed deeper into the AST in multiple places. > > If we push the scope node down to constrain each assertion in 3 places as > well as at the top then the Cursor will still be built on the scope node at > the top most conjunction. This is because and'ing the scope node in lower > layers still results in the same scan counts. For example > > (& (ou=engineering)[100] (scope)[1000] ) > > still has a total scan count of 100. So the filter AST with scope > constraints applied all over will still build the candidate generating > cursor on the topmost conjunction's scope. It will require 1000 iterations > with 1000*6 lookups now. Without pushing the scope into leaves we get 1000 > iterations with 1000*3 lookups. > > So this is not so much a great way to implement your idea. > > Alex >
