Re: Implementing CMS search function using Lucene

Grant Ingersoll Sun, 13 Apr 2008 03:20:44 -0700

For starters, you might have a look at Jackrabbit (a content repository built on Lucene), as I know it powers several CMS systems.


More below.


On Apr 3, 2008, at 8:24 AM, Илья Казначеев wrote:

Hello.
We're designing a CMS in Java, and I'm trying to implement a site search
function using Lucene.

The basic concept is this:
- The site features numerous objects that we'd like to put into the index: pages, various text blocks on those pages, descriptions and keyword lists of those pages, static bits of HTML, goods sections with goods inside them, etc.
- There would be a search form that would be occasionally used by site
visitors.
- Visitors are highly unlikely to use advanced queries. I assume 95% of queries would be either a few keywords or a phrase to search. We have to find the
best matches for such queries.
- The thing I want to introduce is "phrase in quotes" to search for an exact phrase. Also, most of our sites are in Russian, so some, even if rudimentary,
support for Russian morphology is a plus.

I've dug into the examples and have the following set of questions:
- Our objects are fairly structured, so I would like to introduce a lot of
fields, something like five different ones for each object type.
But, as far as I can see, every Query searches only one field.
This is certainly bad because users surely want to search *all* the fields at
once. They aren't going to bother with queries.

You can search whatever fields you want. It's all a question of how you generate your queries. Typically, one has an "all" field as well.
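
To illustrate "how you generate your queries", here is a minimal sketch that builds a query string in QueryParser syntax from a list of field names supplied at runtime, so the field list is not hard-coded. The class MultiFieldQuery and its methods are hypothetical helpers, not part of Lucene. Requiring every term (AND between terms) is an assumption, and quoted phrases are not handled: the quote character is simply escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds a query string in
// QueryParser syntax that looks for every term in any of the given fields.
class MultiFieldQuery {

    // Backslash-escape characters that are special to QueryParser.
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if ("+-!(){}[]^\"~*?:\\&|".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // One OR group per term (the term in each field); groups are joined
    // with AND, which is an assumption about the desired matching.
    static String build(List<String> fields, String userInput) {
        List<String> clauses = new ArrayList<>();
        for (String term : userInput.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields no clauses
            }
            List<String> perField = new ArrayList<>();
            for (String field : fields) {
                perField.add(field + ":" + escape(term));
            }
            clauses.add("(" + String.join(" OR ", perField) + ")");
        }
        return String.join(" AND ", clauses);
    }
}
```

The resulting string would then be handed to QueryParser together with whatever analyzer was used at index time.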


Maybe I can add a query for every field, joined by an OR operation, but
wouldn't that be too slow?


Probably not.  I suppose if you are talking hundreds of fields

I don't want it to take more than half a second on a
reasonably sized index. Also, I don't want to hard-code the exact list of fields, as I might add them as I develop the system. Is this doable, would that work? Or will I have to stuff all text content from an object into one blob field and query
that? Which way is more reasonable?

I'd probably do both, then you can handle generic queries as well as field-specific ones.

- Our objects have their hierarchy, e.g., blocks belong to a page. Is there a way to make Lucene handle the parent-child relation, somehow summing hits in all children to find the best-matching parent? I assume not; in that case, is there a way for me to go through the list of matching documents, reducing it by adding up blocks'
scores to find the best-matching page?

You have to manage parent-child yourself. I am pretty sure Jackrabbit does this.
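
One way to manage it yourself, sketched under assumptions: store the owning page's id with each block document, then sum the block scores per page after the search. The classes ParentScores and Hit below are hypothetical, plain Java with no Lucene calls; in practice the parent id and score would be read from each search hit. Summing is only one possible aggregation, and taking the maximum would be another.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: rolls block-level hits up to their parent pages.
class ParentScores {

    // One search hit: the id of the page owning the block, and the hit's score.
    static final class Hit {
        final String parentId;
        final float score;

        Hit(String parentId, float score) {
            this.parentId = parentId;
            this.score = score;
        }
    }

    // Sums scores per parent id and returns parents ordered by total, best first.
    static List<Map.Entry<String, Float>> rankParents(List<Hit> hits) {
        Map<String, Float> totals = new LinkedHashMap<>();
        for (Hit h : hits) {
            totals.merge(h.parentId, h.score, Float::sum);
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return ranked;
    }
}
```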

- Is there a way to set weights for different fields? Let's say content has a weight of 1, title has a weight of 5, and picture caption has a weight
of 0.5. If not, can I do that by hand?


Field.setBoost()
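
Field.setBoost() applies the weight at index time. A query-time alternative is the ^ boost operator of the QueryParser syntax, which lets the weights change without reindexing. The sketch below only builds such a query string; BoostedQuery is a hypothetical helper, the field names are examples, and the term is assumed to be a single, already-escaped word.

```java
import java.util.Map;

// Hypothetical helper: builds a QueryParser-syntax string that searches one
// term in several fields, each with its own query-time boost.
class BoostedQuery {

    static String build(Map<String, Float> fieldBoosts, String term) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(e.getKey()).append(':').append(term);
            if (e.getValue() != 1.0f) {
                // A boost of 1 is the default, so it is left out.
                sb.append('^').append(e.getValue());
            }
        }
        return sb.toString();
    }
}
```

Passing an insertion-ordered map (for example a LinkedHashMap) keeps the clauses in a predictable order.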

- Is there something to support Russian morphology (it's all like "the last n
letters of a word might change, we should match all forms") for either
the indexer or the searcher?

Check contrib/analyzers, I see some Russian analyzers in there, but I can't speak to the quality.

Maybe "inexact match", QueryParser's ~ operator, would
be enough? I heard the Nutch project has something like that, but I wonder if I would be able to reuse parts of Nutch, and I surely can't use Nutch as a
whole.


Not sure.



If there are any other considerations, they're welcome.

Thanks in advance for your replies.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





