RE: improving the scalability in searching

Ard Schrijvers Tue, 14 Aug 2007 03:05:32 -0700

Hello,

> 
> I agree with you that the current implementation is not optimized for
> queries that check the existence of a property. Your proposed solution
> seems reasonable, I would implement it the same way. There's just one
> minor obstacle, how do we implement this change in a backward compatible
> way? an existing index without this additional field should still work.


Apart from a possible solution: is it the policy that upgrading to the latest 
Jackrabbit version must always be possible without having to re-index? 
Is it not an option to issue some kind of warning that re-indexing is needed 
when moving to version x?

My experience with other repositories (Slide) and a custom Lucene indexing 
layer on top of them handling all searches is that, for efficient querying, I 
quite frequently had to change some indexing settings, which implied 
re-indexing the entire repository. IMO, when you need a performant search 
implementation, you need to be able to tune the parts you index, and you need 
to be able to query on these. I think it should be possible to index a single 
property in different, customizable ways. Might this be an option for the 
indexingConfiguration?

For example: each article (node) has an author property, and I have 10,000,000 
nodes. Now I want to see the number of documents for each author whose name 
starts with an "S". The only way to query this efficiently, AFAICS, is to 
query an indexed field that holds the starting letter of the author. This could 
be done by configuring in the indexing configuration that the author name 
should also be indexed in a separate property, for example with a configured 
analyzer that uses the EdgeNGramTokenizer from Lucene to index the first 
letter only. For example, something like:

<global-index-rules>
    <property name="author">
        <copyField dest="author-starting-letter"
                   analyzer="mypackage.FirstLetterAnalyzer"/>
    </property>
    <property name="publishdate">
        <copyField dest="publishdate-weeknumber"
                   analyzer="mypackage.DateWeeknumberAnalyzer"/>
    </property>
</global-index-rules>

where, for example, publishdate-weeknumber holds the week number of a date 
(useful if you need fast searching for all articles published in week X, but 
the week number is not a property of the document).
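
To make the idea concrete, here is a minimal sketch of the values the two 
extra fields would hold. It deliberately uses only the JDK and no Jackrabbit 
or Lucene API: the class DerivedFields and its methods are hypothetical names 
invented for illustration, and lower-casing the letter and using ISO-8601 week 
numbering are assumptions, not something the configuration above prescribes.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.Locale;

// Hypothetical helper, not part of Jackrabbit or Lucene: it only computes
// the values that the two extra index fields would hold.
final class DerivedFields {

    // Value for "author-starting-letter": first character, lower-cased
    // (lower-casing is an assumption so that "S" and "s" share one term).
    static String firstLetter(String author) {
        String trimmed = author.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.substring(0, 1).toLowerCase(Locale.ROOT);
    }

    // Value for "publishdate-weeknumber": ISO-8601 week of the year
    // (assumed numbering scheme; other week definitions exist).
    static int weekNumber(LocalDate publishDate) {
        return publishDate.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }
}
```

With values like these stored as their own terms at index time, "authors 
starting with S" or "published in week X" becomes a single exact term lookup 
instead of a scan over all author names or dates.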

But this would obviously complicate the indexing configuration quite a bit, 
and you might need to query on "virtual" properties not defined in .cnd files, 
which probably is not possible (though I do not yet know enough about that 
part... is this possible with the org.apache.jackrabbit.core.virtual package?)

Bottom line: before thinking about the best way to improve querying for nodes 
with existing properties, is it allowed for a new Jackrabbit release to force 
people to re-index? IMO, it is quite a limitation if this is never allowed 
(AFAIK, a Lucene index might also become corrupted, or get out of sync in 
clustered environments, which makes a re-index necessary anyway).

Regards Ard

> 
> regards
>   marcel
>
