I am a newbie to Lucene and I'm considering using it in an upcoming project.
I've read through the documentation but I still have a number of questions:
I'll do my best with some pointers below...
1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE
In my use case, I have a number of logical websites backed by the same
underlying content store. A Document may ultimately end up "belonging"
to one or more logical sites, but at a distinct URL for each. The
simplistic solution is to maintain indices for each logical site, but this
will result in some unwanted duplication and the need to update multiple
indices on "shared" content changes. Other than that, can anyone suggest
approaches for how to segment a single index to accommodate multiple logical
sites and allow queries within a particular site's scope? Are fields the
solution? How should the distinct per-site URLs be managed?
I don't think there is a definitive "best" way to do this. Per-site indexes are one option. Using a "site" field is another. Queries for a particular site could be done either by using a QueryFilter or by wrapping each query in a BooleanQuery together with a required TermQuery on the "site" field.
Sites could share documents by simply adding multiple Field.Keyword("site", site) fields to the document, one per site it belongs to.
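To make the "required site clause" idea concrete, here is a small plain-Java model of it. This is not Lucene API code: SiteScopeSketch, Doc, and search are names I made up for illustration, and the "index" is just a list. It only shows the semantics, namely that a hit must match the user's term AND carry the requested value in its multi-valued "site" field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model (NOT the Lucene API) of a multi-valued "site" field.
final class SiteScopeSketch {

    // A document with its body terms and every site it belongs to.
    static final class Doc {
        final String id;
        final Set<String> bodyTerms;
        final Set<String> sites;

        Doc(String id, String body, String... sites) {
            this.id = id;
            this.bodyTerms = new HashSet<>(Arrays.asList(body.toLowerCase().split("\\s+")));
            this.sites = new HashSet<>(Arrays.asList(sites));
        }
    }

    // Both clauses are required: the body must contain the term
    // AND the document must carry the requested site value.
    static List<String> search(List<Doc> index, String term, String site) {
        List<String> hits = new ArrayList<>();
        for (Doc d : index) {
            if (d.bodyTerms.contains(term.toLowerCase()) && d.sites.contains(site)) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> index = Arrays.asList(
            new Doc("faq-1", "Lucene index locking", "siteA", "siteB"), // shared document
            new Doc("art-7", "Lucene sorting basics", "siteB"));
        System.out.println(search(index, "lucene", "siteA"));
        System.out.println(search(index, "lucene", "siteB"));
    }
}
```

The shared document is stored once but shows up in the results for both sites, which is the point of the single-index approach.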
2. LOCALIZED CONTENT
I understand that at its core, Lucene can support content from any locale
and character set supported by Java. What is the best way of implementing
Lucene to handle a content base which includes numerous locales? One index
per locale or should all Documents be placed in a single index and tagged
with a "locale" field? Or is there another approach altogether?
Again, I don't think there is really a "best" way. How does the locale situation relate to the previously mentioned site separation? A "locale" field is also a perfectly reasonable way to go. I don't know of any other approach.
3. DOCUMENT URLS
Is the URL at which the original document can be retrieved generally (i.e.,
for linking search results to the original doc) stored as a non-indexed,
non-tokenized, stored Field in the Document?
It depends on whether you want to query for it or not. Use Field.Keyword (indexed, not tokenized, stored) if you want to be able to query for it. Use Field.UnIndexed (stored only, not indexed) if you want exactly the attributes you specified.
4. QUERY FILTERING & SORTING BY FIELD VALUE
In my application I have a pretty typical need to distinguish between
different document types (e.g., FAQs, Articles, Reviews, etc.) in order to
allow the user to restrict their results to particular types of documents or
to sort results by type. Are fields again the solution for this? Can
Queries filter or sort results/hits on exact field values (i.e.,
non-tokenized field values)?
Fields are generally the solution :) What else is there? Documents have Fields. Fields are where you put metadata about documents. A document type makes perfect sense to put in a field.
QueryFilter or the BooleanQuery AND trick mentioned above would allow you to narrow results down to a particular set of types. Sorting works on exact values, yes. If lexicographic or numeric sorting is not sufficient, you can write your own sorting implementation, which could key off external information if needed. To sort on a field, it needs to be indexed and non-tokenized (stored is irrelevant), and there must be only a single term for that field in each document. Check the Javadocs for the Sort class for more details on the sorting requirements.
5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT
How is Lucene to be deployed in a clustered web-app environment? Do all
cluster nodes require access to a networked filesystem containing the index
files or is there another solution? How is concurrency managed when the
index is being incrementally updated?
This is entirely up to you to manage. I'm sure developers building solutions with Lucene have employed all sorts of architectures.
Concurrency is managed via lock files that need to be shared among the apps interacting with the index. The short answer is that only a single process can index at a time, although multiple threads within that process can share one IndexWriter. You would probably want to build some sort of queuing infrastructure and have a single indexer, or index into separate indexes and merge them.
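As a rough sketch of the "queue plus single indexer" idea, something along these lines could work within one JVM. The names SingleIndexerQueue and IndexSink are hypothetical; IndexSink is just a stand-in for whatever code actually writes to your index. Across cluster nodes you would need some shared queue in front of this, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: funnel all index updates through one worker thread.
final class SingleIndexerQueue {

    // Stand-in for the code that really writes to the index.
    interface IndexSink {
        void add(String docId);
    }

    // A single-thread executor runs queued tasks one at a time, in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final IndexSink sink;

    SingleIndexerQueue(IndexSink sink) {
        this.sink = sink;
    }

    // Safe to call from any thread; the write itself happens on the worker.
    void submit(String docId) {
        worker.execute(() -> sink.add(docId));
    }

    // Stop accepting work and wait for queued updates to drain.
    // Returns false if the timeout elapsed or the wait was interrupted.
    boolean shutdown(long timeoutSeconds) {
        worker.shutdown();
        try {
            return worker.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> written = Collections.synchronizedList(new ArrayList<String>());
        SingleIndexerQueue queue = new SingleIndexerQueue(written::add);
        queue.submit("doc-1");
        queue.submit("doc-2");
        queue.shutdown(5);
        System.out.println(written);
    }
}
```

Request threads only enqueue work, so they never contend for the index lock; the one worker is the only thing that touches the writer.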
Any answers and suggestions are much appreciated. Thanks.
I hope this helps some.
Erik
