You could look at the createIndexes() method in this class:
https://github.com/Clay-Ferguson/meta64/blob/master/src/main/java/com/meta64/mobile/repo/OakRepository.java
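If you are on Oak, a property index is just a node under /oak:index; for a
property like dvt:referenceId it would look roughly like the sketch below
(the index node name 'dvtReferenceId' is only an example, not something
from your setup):

```
/oak:index/dvtReferenceId
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type           = "property"
  - propertyNames  = [ "dvt:referenceId" ]
  - reindex        = true
```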
That class creates some "property indexes" and some "fulltext indexes".
It's best to have property indexes on the things you sort by (order by) or
do exact-match searches for (A=B), and a fulltext index for substring
searches. Also, if you can use ISDESCENDANTNODE to narrow your search, I
think that may help, but it may not be possible in your case.

Also check that if you run just the jcr:contains() all by itself, the
search is super fast even with 10 million nodes, to verify that Lucene is
using an index on the 'content' property. I don't know why you needed that
'order by', because I thought score was always the default ordering, and
I'm not expert enough to know what "//element(" does, but here's an example
of some JCR_SQL2 that I think works (although not tested at the scale of
millions):
https://github.com/Clay-Ferguson/meta64/blob/master/src/main/java/com/meta64/mobile/service/NodeSearchService.java
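As a concrete (untested) sketch, your XPath query could be expressed in
JCR-SQL2 along these lines. The selector names 'doc' and 'content' are
arbitrary, and in real code you should escape or bind the user-supplied
values instead of concatenating them:

```java
public class QueryExample {

    // Builds a JCR-SQL2 query roughly equivalent to the XPath query in
    // the thread: exact match on dvt:referenceId, plus a full-text match
    // on the jcr:content child node, ordered by relevance score.
    static String buildQuery(String referenceId, String text) {
        return "SELECT doc.* FROM [dvt:document] AS doc "
             + "INNER JOIN [nt:base] AS content ON ISCHILDNODE(content, doc) "
             + "WHERE doc.[dvt:referenceId] = '" + referenceId + "' "
             + "AND NAME(content) = 'jcr:content' "
             + "AND CONTAINS(content.*, '" + text + "') "
             + "ORDER BY SCORE(doc) DESC";
    }

    public static void main(String[] args) {
        // Using the values from your mail as an example.
        System.out.println(buildQuery("protid:123", "tirol"));
    }
}
```

You would run the resulting string via
QueryManager.createQuery(stmt, Query.JCR_SQL2) as usual.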
Best regards,
Clay Ferguson
[email protected]
On Mon, Jan 2, 2017 at 6:27 AM, KÖLL Claus <[email protected]> wrote:
> Hi !
>
> First of all .. a happy new year to the whole community!
>
> We have, in one of our workspaces, a lot of office documents (about 10
> million) which are full-text indexed.
> At the moment we have searches that take really long: 1 to 3 minutes...
>
> The following XPath query is executed:
>
> "//element(*, dvt:document)[@dvt:referenceId = 'protid:123' and
> jcr:contains(jcr:content, 'tirol')] order by jcr:score()";
>
> Every node has a field called referenceId. In total, there are not many
> documents that have this value.
> So the search "//element(*, dvt:document)[@dvt:referenceId = 'protid:123']
> order by jcr:score()";
> is quite fast.
>
> But in combination with the fulltext search on the content, it is really
> slow.
> The word 'tirol' appears in a great many documents...
>
> So I have tried setting a limit of 100 on the query [query.setLimit(100)],
> but this has not made the search much faster.
> Digging through the code, I found that the limit hint is handled in the
> TopFieldCollector, but the whole time is spent before the collector skips
> the results.
>
> I see a lot of info logs like this
> INFO 2017-01-02 13:13:10,610 org.apache.jackrabbit.core.
> query.lucene.DocNumberCache.size=107757/100000000, #accesses=428591,
> #hits=0, #misses=428591, cacheRatio=0%
>
> I think the BooleanScorer of the fulltext query is evaluated against all
> of its hits, and therefore the docid-to-nodeid cache gets filled.
>
> Is it possible to pass the limit hint to the fulltext scorer? Or maybe I
> am misunderstanding it; could somebody give me some hints on how to make
> the search faster?
>
> thanks
> claus
>