[
https://issues.apache.org/jira/browse/JCR-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546483
]
Ard Schrijvers commented on JCR-1213:
-------------------------------------
I have tested all patches. All three patches seem to have similar performance
improvements (compared to the former code they are all *very* much faster in
the cached version). ATM I am reshuffling the indexes to have more parent
lookups in other indexes, and then I will really see if all have the same
performance. It might boil down to which patch has the nicest solution. Mine,
with the WeakReferences, is obviously outdated and can be discarded.
I think you have to decide which one you want to choose. I think Christoph's
patch uses a little more memory because it keeps MultiIndexReaderDoc objects in
memory, but, instead of keeping the MultiIndexReaderDoc in memory, you could
store the int and long in the DocUUID (just like we store the msb and lsb
instead of a UUID instance). Though, we are talking only about the object
overhead (so 1 million MultiIndexReaderDoc objects would imply some 12 MB of
extra memory, which is not really shocking).
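A minimal sketch of that primitive-field idea, with all class and field names
purely illustrative (the real Jackrabbit DocId.UUIDDocId differs):

```java
// Sketch: cache the resolved document number as primitive fields instead of a
// separate MultiIndexReaderDoc object, analogous to storing a UUID as two
// longs (msb/lsb) rather than a java.util.UUID instance. All names here are
// hypothetical, not Jackrabbit's actual API.
final class UUIDDocId {
    private final long msb;            // most significant bits of the UUID
    private final long lsb;            // least significant bits of the UUID
    private int cachedDocNumber = -1;  // doc number within one index segment
    private long cachedReaderId = -1;  // identity of the segment reader

    UUIDDocId(long msb, long lsb) {
        this.msb = msb;
        this.lsb = lsb;
    }

    // remember the lookup result for the given segment reader
    void cache(int docNumber, long readerId) {
        this.cachedDocNumber = docNumber;
        this.cachedReaderId = readerId;
    }

    // return the cached doc number, or -1 if it was cached for another reader
    int getCachedDocNumber(long readerId) {
        return readerId == cachedReaderId ? cachedDocNumber : -1;
    }
}
```

The two extra primitives live inside the object that is cached anyway, so the
per-entry overhead of a separate wrapper object disappears.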
But still...during the test, having 1.200.000 nodes in the repository, I
realized we are still doing something 'irrational'. It won't be easy to
implement, I think, because it also depends on whether people have
implemented an AccessManager, but if I have the following test:
Query q = qm.createQuery("stuff//[EMAIL PROTECTED]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}
Since my "stuff//[EMAIL PROTECTED]" query gives me 1.200.000 hits, it makes
perfect sense to users, I think, that even with our patches and a working
cache, retrieving them all would be slow. But if I set the limit to 1 or 10, I
would expect good performance (certainly when you have not implemented any
AccessManager).
But if I set the limit to 1, why would we have to check all 1.200.000 parents
for whether the path is correct?
If I get sorted hits from Lucene, I would want to start with the first one,
check its parent, then the second, etc., until I have a hit that is correct
according to its path. If I have a limit of 10, we would need 10 successes.
Obviously, in the worst-case scenario, we would still have to check every hit
for its parents, but this would be rather exceptional, I think.
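The early-termination idea above could look roughly like this (LazyLimitFilter
and its signature are hypothetical, not existing Jackrabbit code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Sketch: walk the hits in sorted order and stop as soon as 'limit' hits have
// passed the parent/path check, instead of checking every hit up front.
final class LazyLimitFilter {
    static List<Integer> filter(int[] sortedHits, IntPredicate pathMatches, int limit) {
        List<Integer> result = new ArrayList<>();
        for (int hit : sortedHits) {
            if (pathMatches.test(hit)) {    // e.g. the recursive getParent() check
                result.add(hit);
                if (result.size() >= limit) {
                    break;                  // early exit: no further parent lookups
                }
            }
        }
        return result;
    }
}
```

In the worst case (no hit matches, or the matches come last) this still checks
every parent, but with a small limit and many matching hits it only touches a
handful of documents.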
Of course, when people have a custom AccessManager implementation, you only
know after the access manager has run whether a hit was a real hit. But when
having
Query q = qm.createQuery("stuff//[EMAIL PROTECTED]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}
and I have > 1.000.000 hits, and I have to wait a few seconds even in the
cached version, while changing "stuff//[EMAIL PROTECTED]" into
"//[EMAIL PROTECTED]" reduces it to a couple of ms, that does not make sense.
I think we should consider whether we could implement the
DescendantSelfAxisQuery or ChildAxisQuery as some sort of lazy filter. In the
end, when users also want the total number of hits for
"stuff//[EMAIL PROTECTED]", we are obviously still facing a slow query. WDOT?
This, though, might belong in a new JIRA issue, or in the existing one about
DescendantSelfAxisQuery and ChildAxisQuery performance.
> UUIDDocId cache does not work properly because of weakReferences in
> combination with new instance for combined indexreader
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: JCR-1213
> URL: https://issues.apache.org/jira/browse/JCR-1213
> Project: Jackrabbit
> Issue Type: Improvement
> Components: query
> Affects Versions: 1.3.3
> Reporter: Ard Schrijvers
> Fix For: 1.4
>
> Attachments: JCR-1213-ckiehl.txt, JCR-1213-mreutegg.patch,
> JCR-1213.patch, JCR1213Test.java
>
>
> Queries that use ChildAxisQuery or DescendantSelfAxisQuery make use of
> getParent() functions to know whether the parents are correct and if the
> result is allowed. The getParent() is called recursively for every hit, and
> can become very expensive. Hence, in DocId.UUIDDocId, the parents are cached.
> Currently, DocId.UUIDDocId's are cached by having a WeakReference to the
> CombinedIndexReader, but this CombinedIndexReader is recreated all the time,
> implying that a gc() is allowed to remove the 'expensive' cache.
> A much better solution is to not have a weakReference to the
> CombinedIndexReader, but a reference to each index reader segment. This
> means that in getParent(int n) in SearchIndex, the return statement
> return id.getDocumentNumber(this) needs to be replaced by return
> id.getDocumentNumber(subReaders[i]); and something similar in
> CachingMultiReader.
> That is all. Obviously, when a node/property is added/removed/changed, some
> parts of the cached DocId.UUIDDocId will be invalid, but mainly small indexes
> are updated frequently, which obviously are less expensive to recompute.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.