For starters, you might have a look at Jackrabbit (a content repository
built on Lucene), as I know it powers several CMSs.
More below.
On Apr 3, 2008, at 8:24 AM, Илья Казначеев wrote:
Hello.
We're designing a CMS in Java, and I'm trying to implement a site search
function using Lucene.
The basic concept is:
- The site features numerous objects that we'd like to put into the
index: pages, various text blocks on those pages, descriptions and
keyword lists of those pages, static bits of HTML, goods sections with
goods inside them, and so on.
- There would be a search form that site visitors would use
occasionally.
- Visitors are highly unlikely to use advanced queries. I assume 95% of
queries would be either a few keywords or a phrase. We have to find the
best matches for such queries.
- The one thing I want to introduce is "phrase in quotes" to search for
an exact phrase. Also, most of our sites are in Russian, so some support
for Russian morphology, even if rudimentary, is a plus.
I've dug into the examples and have the following set of questions:
- Our objects are fairly structured, so I would like to introduce a lot
of fields, something like five different ones for each object type.
But, as far as I can see, every Query searches only one field.
This is certainly bad, because users surely want to search *all* the
fields at once. They aren't going to bother with queries.
You can search whatever fields you want. It's all a question of how
you generate your queries. Typically, one has an "all" field as well.
Maybe I can add queries over every field, joined by an OR operation, but
wouldn't that be too slow?
Probably not. I suppose if you are talking hundreds of fields
I don't want a search to take more than half a second on a reasonably
sized index. Also, I don't want to hard-code the exact list of fields; I
might add them as I develop the system. Is this doable, and would it
work? Or will I have to stuff all the text content from an object into
one blob field and query that? Which way is more reasonable?
I'd probably do both, then you can handle generic queries as well as
field specific ones.
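
A minimal sketch of the "do both" idea in plain Java, with no Lucene
calls: every field of an object is kept as is, and the text of all of
them is also concatenated into one catch-all entry. The class and method
names are hypothetical; only the field name "all" comes from the
suggestion above. Each entry of the returned map would then be added to
the Lucene document as its own field, so the list of fields does not
have to be hard-coded.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene.
class CatchAllFields {
    static final String ALL = "all";

    // Returns a copy of the given fields plus an "all" entry that holds
    // every non-empty value joined by a single space.
    static Map<String, String> withAllField(Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>(fields);
        StringBuilder all = new StringBuilder();
        for (String value : fields.values()) {
            if (value == null || value.isEmpty()) {
                continue; // nothing to index for this field
            }
            if (all.length() > 0) {
                all.append(' ');
            }
            all.append(value);
        }
        out.put(ALL, all.toString());
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> page = new LinkedHashMap<>();
        page.put("title", "Garden tools");
        page.put("content", "Spades, rakes and hoes");
        System.out.println(withAllField(page));
    }
}
```

Generic keyword queries can then go against "all", while the individual
fields stay available for field-specific queries.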
- Our objects have a hierarchy, e.g., blocks belong to a page. Is there
a way to make Lucene handle the parent-child relation, somehow summing
the hits in all children to find the best-matching parent? I assume not;
is there then a way for me to go through the list of matching documents
and reduce it by adding up the blocks' scores to find the best-matching
page?
You have to manage parent-child yourself. I am pretty sure Jackrabbit
does this.
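
Managing it yourself could look roughly like the sketch below: plain
Java that sums block scores per page and ranks the pages by the total.
It assumes each indexed block carries a stored page identifier that you
read back from every hit; the BlockHit class, the page ids, and the
choice of summing (rather than, say, taking the maximum) are all
assumptions for illustration, not something Lucene provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical roll-up of block-level hits to page-level scores.
class PageScoreRollup {
    // One search hit for a block: the page it belongs to and its score.
    static final class BlockHit {
        final String pageId;
        final float score;

        BlockHit(String pageId, float score) {
            this.pageId = pageId;
            this.score = score;
        }
    }

    // Sums the scores of all hits per page and returns the pages sorted
    // by total score, best first.
    static List<Map.Entry<String, Float>> rankPages(List<BlockHit> hits) {
        Map<String, Float> totals = new HashMap<>();
        for (BlockHit hit : hits) {
            totals.merge(hit.pageId, hit.score, Float::sum);
        }
        List<Map.Entry<String, Float>> ranked = new ArrayList<>(totals.entrySet());
        ranked.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        return ranked;
    }

    public static void main(String[] args) {
        List<BlockHit> hits = new ArrayList<>();
        hits.add(new BlockHit("page-1", 0.5f));
        hits.add(new BlockHit("page-2", 0.9f));
        hits.add(new BlockHit("page-1", 0.75f));
        for (Map.Entry<String, Float> e : rankPages(hits)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Note that summing favours pages with many matching blocks, so some
normalisation may be worth trying if long pages dominate the results.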
- Is there a way to set weights for different fields? Let's say content
has a weight of 1, title has a weight of 5, and a picture caption has a
weight of 0.5. If not, can I do that by hand?
Field.setBoost()
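
Field.setBoost() applies at index time. As a complement, the query
parser syntax also accepts a ^ boost per clause, so weights can be
applied at query time as well. The sketch below only builds such a query
string from a map of field names to weights; the helper is hypothetical,
"caption" is a stand-in name for the picture text field, and the user
input is assumed to be already escaped for the query parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that expands one user query over several fields,
// each with its own boost, using Lucene query parser syntax.
class BoostedQueryString {
    // Produces e.g. title:(foo)^5.0 OR content:(foo)^1.0
    // userQuery is assumed to be escaped already.
    static String build(String userQuery, Map<String, Float> weights) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Float> e : weights.entrySet()) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(e.getKey())
              .append(":(").append(userQuery).append(")^")
              .append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Float> weights = new LinkedHashMap<>();
        weights.put("title", 5f);
        weights.put("content", 1f);
        weights.put("caption", 0.5f);
        System.out.println(build("garden tools", weights));
    }
}
```

Because the field list is just a map, new fields can be added without
touching the query code. Lucene also ships a MultiFieldQueryParser,
which may cover the same ground.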
- Is there something to support Russian morphology (it's all like "the
last n letters of a word might change, we should match all forms") for
either the indexer or the searcher?
Check contrib/analyzers, I see some Russian analyzers in there, but I
can't speak to the quality.
Maybe "inexact match", QueryParser's ~ operator, would be enough? I
heard the Nutch project has something like that, but I wonder whether I
would be able to reuse parts of Nutch; I certainly can't use Nutch as a
whole.
Not sure.
If there are any other considerations, they're welcome.
Thanks in advance for your replies.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ