Implementing CMS search function using Lucene

Илья Казначеев Thu, 03 Apr 2008 05:24:14 -0700

Hello.

We're designing a CMS in Java, and I'm trying to implement a site search
function using Lucene.


The basic concept is this:
- The site features numerous objects that we'd like to put into the index:
pages, various text blocks on those pages, descriptions and keyword lists of
those pages, static bits of HTML, goods sections with goods inside them, etc.
- There would be a search form that would be used occasionally by site
visitors.
- Visitors are highly unlikely to use advanced queries. I assume 95% of queries
would be either a few keywords or a phrase to search for. We have to find the
best matches for such queries.
- The one thing I want to introduce is "phrase in quotes" to search for an
exact phrase. Also, most of our sites are in Russian, so some support for
Russian morphology, even if rudimentary, is a plus.

I've dug into the examples and have the following set of questions:
- Our objects are fairly structured, so I would like to introduce a lot of
fields, something like five different ones for each object type.
But, as far as I can see, every Query searches only one field.
This is certainly bad, because users surely want to search *all* the fields at
once. They aren't going to bother with queries.
Maybe I can add a query over every field and join them with 'or', but
wouldn't that be too slow? I don't want a search to take more than half a
second on a reasonably sized index. Also, I don't want to hard-code the exact
list of fields, since I might add more as I develop the system. Is this doable,
and would it work? Or will I have to stuff all the text content of an object
into one blob field and query that? Which way is more reasonable?
- Our objects have a hierarchy, e.g., blocks belong to a page. Is there a
way to make Lucene handle the parent-child relation, somehow summing the hits
in all children to find the best-matching parent? I assume not; in that case,
is there a way for me to go through the list of matching documents and reduce
it by 'adding' blocks' scores to find the best-matching page?
- Is there a way to set weights for different fields? Let's say content has
a weight of 1, title has a weight of 5 and picture caption has a weight
of 0.5. If not, can I do that by hand?
- Is there something to support Russian morphology (roughly, "the last n
letters of a word might change, and we should match all forms") for either
the indexer or the searcher? Maybe "inexact match", QueryParser's ~ operator,
would be enough? I heard the Nutch project has something like that, but I
wonder whether I would be able to reuse parts of Nutch; I surely can't use
Nutch as a whole.
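
To make the "by hand" fallback from the parent-child and field-weight
questions concrete, here is a minimal sketch in plain Java. It does not use
any Lucene API: the Hit class, the page ids, the field names and the weights
are all hypothetical, and the raw scores are assumed to come from whatever the
search returns for each matching block.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageScoreRollup {

    /** One matching block: its parent page, the field that matched, its raw score. */
    public static final class Hit {
        final String pageId;
        final String field;
        final double score;

        public Hit(String pageId, String field, double score) {
            this.pageId = pageId;
            this.field = field;
            this.score = score;
        }
    }

    /** Sums weighted block scores per page and returns the pages best-first. */
    public static Map<String, Double> rankPages(List<Hit> hits,
                                                Map<String, Double> fieldWeights) {
        Map<String, Double> totals = new HashMap<>();
        for (Hit hit : hits) {
            // Fields without an explicit weight count with weight 1.
            double weight = fieldWeights.getOrDefault(hit.field, 1.0);
            totals.merge(hit.pageId, hit.score * weight, Double::sum);
        }

        // Order pages by total score, highest first.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : sorted) {
            ranked.put(entry.getKey(), entry.getValue());
        }
        return ranked;
    }

    public static void main(String[] args) {
        // Hypothetical weights, mirroring the 1 / 5 / 0.5 example above.
        Map<String, Double> weights = new HashMap<>();
        weights.put("content", 1.0);
        weights.put("title", 5.0);
        weights.put("caption", 0.5);

        // Hypothetical hits: two blocks on page-1, one title match on page-2.
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("page-1", "content", 0.8));
        hits.add(new Hit("page-1", "caption", 0.4));
        hits.add(new Hit("page-2", "title", 0.3));

        System.out.println(rankPages(hits, weights));
    }
}
```

In this sketch a single title match on page-2 outranks two weaker block
matches on page-1, which is the kind of behaviour I am after. Whether summing
is the right way to combine block scores is itself an open question.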

If there are any other considerations, they're welcome.

Thanks in advance for your replies.
