RE: Re: improving the scalability in searching

Ard Schrijvers Mon, 20 Aug 2007 04:58:40 -0700

> Christoph Kiehl wrote: 
> I'm a bit indifferent about 1) because I think the change is not
> fundamentally enough to justify a new QueryHandler class. Do you have any
> other plans with the new QueryHandler implementation? If I were to implement
> a SQL based QueryHandler solution I would create a new QueryHandler
> implementation, but not for a small change like that.


Well, I do have some other changes in mind, but I might be misreading the big
picture. I have been looking through the indexing code, and I cannot understand
why all properties are indexed in the same Lucene field, '_:PROPERTIES'.
AFAICS, it complicates queries. Is the reason for this somewhere in
'ChildAxisQuery', 'DerefQuery', 'ParentAxisQuery' or some other class? (I
haven't looked at these classes yet, so I do not know how they work.)

To me, a separate Lucene Field for *every* unique property name seems a much
more natural fit for a Lucene index. Indexing a property modificationDate would
then not result in the Lucene Field:

<_:PROPERTIES> 1:modificationDate?ms27115hc

but 

<1:modificationDate> ms27115hc
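
To make the difference concrete, here is a rough sketch of the two layouts. It
is a toy model built on plain Java maps, not Jackrabbit or Lucene code: the
class and method names are made up for illustration, and the '?' separator is
simply copied from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of the two index layouts (hypothetical names, no Lucene involved).
final class FieldLayoutSketch {
    // Assumption: the separator is shown as '?' in the mail above.
    static final char SEP = '?';

    // Shared layout: every property becomes a term "name?value" in one field.
    static Map<String, Set<String>> sharedField(Map<String, String> props) {
        Map<String, Set<String>> doc = new TreeMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            doc.computeIfAbsent("_:PROPERTIES", k -> new TreeSet<>())
               .add(e.getKey() + SEP + e.getValue());
        }
        return doc;
    }

    // Proposed layout: the property name is the field, the value is the term.
    static Map<String, Set<String>> fieldPerProperty(Map<String, String> props) {
        Map<String, Set<String>> doc = new TreeMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            doc.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
               .add(e.getValue());
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("1:modificationDate", "ms27115hc");
        System.out.println(sharedField(props));
        System.out.println(fieldPerProperty(props));
    }
}
```

In the shared layout every lookup has to carry the property name inside the
term; in the per-property layout the field name already does that job.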

This is IMO a much clearer way to index. I think it makes classes like
SharedFieldSortComparator redundant, because we can use the standard Lucene
sort. It seems to me that this sort is more efficient than the current JR one.
Although I did not investigate it, I know that the longer the field values you
sort on in Lucene, the higher the memory consumption. Certainly when sorting
large result sets, a string prefix like '1:modificationDate?' can make a
difference of *many* MBs in memory. OTOH, perhaps the SharedFieldSortComparator
takes care of this in JR; I am not sure.
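
A back-of-the-envelope estimate of that prefix overhead, as a sketch only. It
assumes the sort keeps one separate string per document and that each character
costs 2 bytes (Java's UTF-16 chars); the document count of one million is a
made-up figure, and whether JR actually holds the prefix in memory is exactly
the open question above.

```java
// Rough estimate of the extra memory a per-value prefix costs while sorting.
// All inputs are assumptions for illustration, not measurements.
final class PrefixOverheadSketch {
    // Bytes spent on the prefix alone, if every sorted value repeats it.
    static long prefixBytes(String prefix, long docCount, int bytesPerChar) {
        return (long) prefix.length() * bytesPerChar * docCount;
    }

    public static void main(String[] args) {
        String prefix = "1:modificationDate?"; // 19 characters
        long docs = 1_000_000L;                // hypothetical result set size
        long bytes = prefixBytes(prefix, docs, 2);
        // 19 chars * 2 bytes * 1,000,000 docs = 38,000,000 bytes
        System.out.println(bytes + " bytes, about "
                + bytes / (1024 * 1024) + " MiB");
    }
}
```

Under these assumptions the prefix alone accounts for roughly 36 MiB per
sorted property, before counting the values themselves.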

Furthermore, indexing properties in Lucene under their own property name makes
you more flexible in implementing new kinds of searches. For example: give me
all distinct 'authors' and count how many articles each author has, i.e.
faceted browsing. Faceted browsing is much harder with the current indexing
strategy.
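
A minimal sketch of why a field per property helps here. It models an inverted
index as field -> term -> document ids using plain Java collections; the class,
the 'author' field and the sample values are hypothetical, and this is not how
Lucene stores its postings internally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index with one term dictionary per field (hypothetical names).
final class FacetCountSketch {
    // field name -> term -> ids of the documents containing that term
    private final Map<String, SortedMap<String, Set<Integer>>> index = new HashMap<>();

    void add(int docId, String field, String value) {
        index.computeIfAbsent(field, f -> new TreeMap<>())
             .computeIfAbsent(value, v -> new TreeSet<>())
             .add(docId);
    }

    // Facet counts: walk the terms of one field and report how many
    // documents each term occurs in. No other field is touched.
    SortedMap<String, Integer> facetCounts(String field) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        SortedMap<String, Set<Integer>> terms =
                index.getOrDefault(field, new TreeMap<>());
        for (Map.Entry<String, Set<Integer>> e : terms.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        FacetCountSketch idx = new FacetCountSketch();
        idx.add(1, "author", "alice");
        idx.add(2, "author", "bob");
        idx.add(3, "author", "alice");
        idx.add(3, "title", "some article");
        System.out.println(idx.facetCounts("author"));
    }
}
```

With a single shared field, the same count would first have to pick the
'author?' terms out of the terms of all other properties.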

As a possible add-on to the indexing configuration class (I need to know what
you people think about it, and whether it can be JSR 170/283 compliant), I have
been thinking about enriching the index via the indexing configuration with
'virtual properties'. By the way, I am not sure what
org.apache.jackrabbit.core.virtual does; I haven't looked at it. Perhaps it
coincides with my ideas, but somebody else might know. Suppose I have a
property with a Calendar date, and in the frontend I want to be able to search
for articles in week X. I do not want to store week X as a property, because it
is an implicit part of the date I already have. I would like to define in the
indexing configuration that myproperty also needs to be indexed as, for
example, myproperty_weeknr (and specify an analyzer that does this for you),
and that I can query on this one. I would do the same with the first letter of
each author, to efficiently query all authors starting with an "a". Could this
be implemented according to the JSR spec, or is this really not compatible?
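
The 'virtual property' idea could look roughly like this at indexing time. This
is only a sketch under assumptions: it uses java.time.LocalDate rather than a
JCR Calendar value, it picks the ISO week number, and apart from the
myproperty_weeknr naming from the paragraph above, every name here (the class,
the methods, the _firstletter suffix) is invented for illustration. It does not
touch the Jackrabbit indexing configuration.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Derives extra index-only fields from a stored property (hypothetical names).
final class VirtualPropertySketch {
    // Index the date itself plus a derived "<name>_weeknr" field.
    static Map<String, String> dateFields(String name, LocalDate date) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(name, date.toString());
        // Assumption: "week X" means the ISO-8601 week of the week-based year.
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        fields.put(name + "_weeknr", String.valueOf(week));
        return fields;
    }

    // Index the author plus a derived, lower-cased first-letter field.
    static Map<String, String> authorFields(String name, String author) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(name, author);
        if (!author.isEmpty()) {
            fields.put(name + "_firstletter",
                    author.substring(0, 1).toLowerCase(Locale.ROOT));
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(dateFields("myproperty", LocalDate.of(2007, 8, 20)));
        System.out.println(authorFields("author", "Ard"));
    }
}
```

The stored content keeps only the date and the author; the week number and the
first letter exist in the index alone, which is what makes the compliance
question above relevant.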

So, WDOT about indexing properties in separate Lucene Fields, and about
possibly indexing more information for one property? My experience with Lucene
is that indexing tactically eases querying a lot and gains you lots of
performance. So, if you agree with these changes, which I can try to build into
Jackrabbit, then I think they might justify a new QueryHandler class, built
alongside the old one. WDOT?

Regards Ard
