[ https://issues.apache.org/jira/browse/JCR-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546483 ]

Ard Schrijvers commented on JCR-1213:
-------------------------------------

I have tested all patches. All three seem to have similar performance 
improvements (compared to the former code they are all *very* much faster in 
the cached version). At the moment I am reshuffling the indexes to have more 
parent lookups in other indexes, and then I will really see whether all have 
the same performance. It might boil down to which patch has the nicest 
solution. Mine with the WeakReferences is obviously outdated and can be 
discarded. 

I think you have to decide which one you want to choose. I think Christoph's 
patch uses a little more memory because it keeps MultiIndexReaderDoc objects in 
memory, but, instead of keeping the MultiIndexReaderDoc in memory, you could 
store the int and long in the DocUUID (just like we store msb and lsb instead 
of a UUID instance). Though, we are talking only about the object overhead (so 
1 million MultiIndexReaderDoc objects would imply some 12 MB extra memory, not 
really shocking).
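The msb/lsb trick mentioned above can be sketched as follows. This is a minimal illustration, not Jackrabbit's actual code: the class name PrimitiveCacheEntry and the docNumber/readerVersion fields are hypothetical stand-ins for the int and long from MultiIndexReaderDoc; only the UUID-to-two-longs decomposition mirrors what Jackrabbit really does.

```java
import java.util.UUID;

// Sketch (hypothetical field layout): instead of caching a separate
// MultiIndexReaderDoc object per entry, the primitive values can be
// stored directly in the cache entry, just like a UUID can be stored
// as two longs (msb/lsb) instead of a UUID instance.
public class PrimitiveCacheEntry {
    private final long uuidMsb;
    private final long uuidLsb;
    private final int docNumber;      // hypothetical: the int from MultiIndexReaderDoc
    private final long readerVersion; // hypothetical: the long from MultiIndexReaderDoc

    public PrimitiveCacheEntry(UUID uuid, int docNumber, long readerVersion) {
        this.uuidMsb = uuid.getMostSignificantBits();
        this.uuidLsb = uuid.getLeastSignificantBits();
        this.docNumber = docNumber;
        this.readerVersion = readerVersion;
    }

    // the two longs reconstruct the original UUID on demand
    public UUID uuid() {
        return new UUID(uuidMsb, uuidLsb);
    }

    public int docNumber() { return docNumber; }
    public long readerVersion() { return readerVersion; }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        PrimitiveCacheEntry e = new PrimitiveCacheEntry(id, 42, 7L);
        System.out.println(e.uuid().equals(id)); // prints "true"
        System.out.println(e.docNumber());       // prints "42"
    }
}
```

This trades one object header and reference per entry for three extra primitive fields, which is where the per-entry overhead savings would come from.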


But still... during the test, having 1,200,000 nodes in the repository, I 
realized we are still doing something 'irrational'. It won't be easy to 
implement, I think, because it also depends on whether people have 
implemented an AccessManager, but consider the following test:

Query q = qm.createQuery("stuff//[EMAIL PROTECTED]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

Since my "stuff//[EMAIL PROTECTED]" query gives me 1,200,000 hits, I think it 
makes perfect sense to users that, even with our patches and a working cache, 
retrieving them all would be slow. But if I set the limit to 1 or 10, I would 
expect good performance (certainly when no AccessManager has been 
implemented).

But, if I set the limit to 1, why would we have to check all 1,200,000 parents 
to see whether the path is correct? 

If I get sorted hits from Lucene, I would want to start with the first one and 
check its parent, then the second, etc., until I have a hit that is correct 
according to its path. If I have a limit of 10, we would need to get 10 
successes. Obviously, in the worst-case scenario, we would still have to check 
every hit for its parents, but this would be rather exceptional, I think.
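The limit-aware checking described above could look roughly like this. A minimal sketch, not Jackrabbit's actual API: the names firstMatching, sortedHits, and parentPathOk are all hypothetical; the point is only that the expensive parent check stops as soon as 'limit' hits have passed.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch of the lazy-filter idea: walk the hits in scored order and
// run the (expensive) parent-path check only until 'limit' hits have
// passed. In the worst case every hit is still checked, but with a
// small limit most queries stop almost immediately.
public class LazyLimitFilter {

    static List<Integer> firstMatching(Iterator<Integer> sortedHits,
                                       IntPredicate parentPathOk,
                                       int limit) {
        List<Integer> result = new ArrayList<>();
        while (sortedHits.hasNext() && result.size() < limit) {
            int doc = sortedHits.next();
            if (parentPathOk.test(doc)) { // stands in for the getParent() walk
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 1,000 hits; only even doc numbers have a matching parent path
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < 1000; i++) hits.add(i);
        int[] checks = {0};
        IntPredicate pathOk = doc -> { checks[0]++; return doc % 2 == 0; };

        List<Integer> top = firstMatching(hits.iterator(), pathOk, 10);
        System.out.println(top.size()); // prints "10"
        System.out.println(checks[0]);  // prints "19": far fewer than 1000 checks
    }
}
```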

Of course, when people have a custom AccessManager impl, you only know after 
the access manager whether a hit was a real hit. But when having 

Query q = qm.createQuery("stuff//[EMAIL PROTECTED]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

and I have > 1,000,000 hits, I have to wait a few seconds even in the cached 
version, but changing "stuff//[EMAIL PROTECTED]" into "//[EMAIL PROTECTED]" 
reduces it to a couple of ms; that does not make sense. 

I think we should consider whether we could implement the 
DescendantSelfAxisQuery or ChildAxisQuery as some sort of lazy filter. In the 
end, when users also want the total number of hits for 
"stuff//[EMAIL PROTECTED]", we are obviously still facing a slow query. WDYT? 
This might belong to a new JIRA issue, though, or to the existing one about 
DescendantSelfAxisQuery and ChildAxisQuery performance.
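The segment-keyed caching fix described in the issue below can be sketched like this. All class names here (SubReader, CombinedReader, SegmentKeyedCache) are hypothetical stand-ins, not Jackrabbit's actual types; the sketch only shows why keying the parent cache on the long-lived sub-reader segments, instead of on the constantly recreated combined reader, lets entries survive.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch (hypothetical types): the parent cache is keyed on the stable
// sub-reader segments rather than on the CombinedIndexReader wrapper,
// which is recreated on every index update.
public class SegmentKeyedCache {

    // one segment reader; lives as long as its index segment
    static class SubReader {}

    // recreated all the time; only aggregates the stable sub-readers
    static class CombinedReader {
        final SubReader[] subReaders;
        CombinedReader(SubReader[] subReaders) { this.subReaders = subReaders; }
    }

    // weak keys: entries vanish only when a *segment* is discarded,
    // not when a wrapper around it is
    private final Map<SubReader, Map<Integer, Integer>> parentCache =
            new WeakHashMap<>();

    int getParent(SubReader segment, int doc) {
        Map<Integer, Integer> perSegment =
                parentCache.computeIfAbsent(segment, s -> new HashMap<>());
        // stand-in for the expensive parent lookup; result is cached
        // per segment, so any wrapper over the same segment hits it
        return perSegment.computeIfAbsent(doc, d -> d - 1);
    }

    public static void main(String[] args) {
        SubReader[] segments = { new SubReader(), new SubReader() };
        SegmentKeyedCache cache = new SegmentKeyedCache();

        CombinedReader first = new CombinedReader(segments);
        int p1 = cache.getParent(first.subReaders[0], 5);

        // a brand-new combined reader over the same segments still hits
        // the cache, because the key is the segment, not the wrapper
        CombinedReader second = new CombinedReader(segments);
        int p2 = cache.getParent(second.subReaders[0], 5);
        System.out.println(p1 == p2); // prints "true"
    }
}
```

With the cache keyed on the combined reader instead, every recreation of that wrapper would orphan the whole cache and allow gc() to discard it, which is exactly the problem this issue describes.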


> UUIDDocId cache does not work properly because of weakReferences in 
> combination with new instance for combined indexreader 
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JCR-1213
>                 URL: https://issues.apache.org/jira/browse/JCR-1213
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3.3
>            Reporter: Ard Schrijvers
>             Fix For: 1.4
>
>         Attachments:  JCR-1213-ckiehl.txt, JCR-1213-mreutegg.patch, 
> JCR-1213.patch, JCR1213Test.java
>
>
> Queries that use ChildAxisQuery or DescendantSelfAxisQuery make use of 
> getParent() functions to know whether the parents are correct and whether 
> the result is allowed. getParent() is called recursively for every hit, and 
> can become very expensive. Hence, in DocId.UUIDDocId, the parents are cached. 
> Currently, DocId.UUIDDocId's are cached by having a WeakReference to the 
> CombinedIndexReader, but this CombinedIndexReader is recreated all the time, 
> implying that a gc() is allowed to remove the 'expensive' cache.
> A much better solution is to not have a weakReference to the 
> CombinedIndexReader, but to a reference of each index reader segment. This 
> means that in getParent(int n) in SearchIndex the return statement 
> return id.getDocumentNumber(this) needs to be replaced by 
> return id.getDocumentNumber(subReaders[i]); and something similar in 
> CachingMultiReader. 
> That is all. Obviously, when a node/property is added/removed/changed, some 
> parts of the cached DocId.UUIDDocId will be invalid, but mainly small indexes 
> are updated frequently, which obviously are less expensive to recompute.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
