Re: Newbie Questions: Site Scoping, Page Type Filtering/Sorting, Localization, Clustering

Erik Hatcher Mon, 31 May 2004 06:14:30 -0700

On May 30, 2004, at 10:34 PM, Sasha Haghani wrote:

I am a newbie to Lucene and I'm considering using it in an upcoming project. I've read through the documentation but I still have a number of questions:


I'll do my best with some pointers below...

1. SEGMENTING AN INDEX & QUERIES BY SITE SCOPE In my use case, I have a number of logical websites backed by the same underlying content store. A Document may ultimately end up "belonging" to one or more logical sites, but at a distinct URL for each. The simplistic solution is to maintain indices for each logical site, but this will result in some unwanted duplication and the need to update multiple indices when "shared" content changes. Other than that, can anyone suggest approaches for how to segment a single index to accommodate multiple logical sites and allow queries within a particular site's scope? Are fields the solution? How should the distinct per-site URLs be managed?

I don't think there is a definitive "best" way to do this. Per-site indexes are one option. Using a "site" field is another. Queries for a particular site could be done either by using QueryFilter or by wrapping every query in a BooleanQuery with a required TermQuery on the "site" field.

Sites could share a document by simply adding one Field.Keyword("site", site) per site to that document.
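
To make the semantics concrete, here is a rough sketch in plain Java. It is a toy in-memory model and not Lucene code: the class SiteScopeSketch and its doc/search helpers are made up for illustration. It only shows what a multi-valued "site" field combined with a required site clause is meant to achieve.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model, NOT the Lucene API: it only illustrates the semantics
// of a multi-valued "site" field combined with a required site clause.
class SiteScopeSketch {

    // A document maps a field name to one or more exact (untokenized) values.
    static Map<String, List<String>> doc(String body, String... sites) {
        Map<String, List<String>> d = new HashMap<>();
        d.put("body", Arrays.asList(body));
        d.put("site", Arrays.asList(sites));   // one value per site sharing the document
        return d;
    }

    // Returns the bodies of documents that contain the word AND carry the required site value.
    static List<String> search(List<Map<String, List<String>>> index, String word, String site) {
        List<String> hits = new ArrayList<>();
        for (Map<String, List<String>> d : index) {
            String body = d.get("body").get(0);
            boolean matchesQuery = Arrays.asList(body.split("\\s+")).contains(word);
            boolean inSite = d.get("site").contains(site);   // plays the role of the required clause
            if (matchesQuery && inSite) {
                hits.add(body);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, List<String>>> index = Arrays.asList(
            doc("lucene faq", "siteA", "siteB"),   // shared by both sites, indexed once
            doc("lucene review", "siteB"));
        System.out.println(search(index, "lucene", "siteA"));   // prints [lucene faq]
        System.out.println(search(index, "lucene", "siteB"));   // prints [lucene faq, lucene review]
    }
}
```

The shared document is stored once and simply carries two "site" values, which is the duplication saving over per-site indexes.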

2. LOCALIZED CONTENT I understand that at its core, Lucene can support content from any locale and character set supported by Java. What is the best way of implementing Lucene to handle a content base which includes numerous locales? One index per locale, or should all Documents be placed in a single index and tagged with a "locale" field? Or is there another approach altogether?

Again, I don't think there is really a "best" way. How does the locale situation relate to the previously mentioned site separation? A "locale" field is also a perfectly reasonable way to go. I don't know of any other approach.

3. DOCUMENT URLS Is the URL at which the original document can be retrieved (i.e., for linking search results to the original doc) generally stored as a non-indexed, non-tokenized, stored Field in the Document?

It depends on whether you want to query for it or not. Use Field.Keyword if you want to be able to query for it (stored, indexed, not tokenized). Use Field.UnIndexed if you want exactly the attributes you specified (stored, but neither indexed nor tokenized).

4. QUERY FILTERING & SORTING BY FIELD VALUE In my application I have a pretty typical need to distinguish between different document types (e.g., FAQs, Articles, Reviews, etc.) in order to allow the user to restrict their results to particular types of documents or to sort results by type. Are fields again the solution for this? Can Queries filter or sort results/hits on exact field values (i.e., non-tokenized field values)?

Fields are generally the solution :) What else is there? Documents have Fields. Fields are where you put metadata about documents. A document type makes perfect sense to put in a field.

QueryFilter or the BooleanQuery AND trick mentioned above would allow you to narrow results down to a particular set of types. Sorting works on exact values, yes. If lexicographic or numeric sorting is not sufficient, you can write your own sorting implementation, which could key off external information if needed. To sort on a field, it needs to be indexed and non-tokenized (whether it is stored is irrelevant), and there must be only a single term for that field in each document. Check the Javadocs for the Sort class for more details on the sorting requirements.
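
As a rough illustration of restricting hits to certain types and then sorting on the exact type value, here is a plain-Java toy model. It is not Lucene code: TypeFilterSortSketch, Doc, and filterAndSort are hypothetical names, and the sample titles are made up. Each document carries exactly one untokenized "type" value, mirroring the single-term requirement for a sort field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model, NOT the Lucene API: exact-value filtering and sorting on a "type" field.
class TypeFilterSortSketch {

    static final class Doc {
        final String title;
        final String type;   // exactly one untokenized value per document, so sorting is well defined
        Doc(String title, String type) { this.title = title; this.type = type; }
    }

    // Keeps only hits whose type is allowed, then orders them lexicographically by type.
    static List<String> filterAndSort(List<Doc> hits, Set<String> allowedTypes) {
        List<Doc> kept = new ArrayList<>();
        for (Doc d : hits) {
            if (allowedTypes.contains(d.type)) {   // exact match on the whole value
                kept.add(d);
            }
        }
        kept.sort(Comparator.comparing((Doc d) -> d.type));   // stable: ties keep the original hit order
        List<String> titles = new ArrayList<>();
        for (Doc d : kept) {
            titles.add(d.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Doc> hits = Arrays.asList(
            new Doc("Tuning tips", "faq"),
            new Doc("Product review", "review"),
            new Doc("Intro to indexing", "article"),
            new Doc("Install guide", "faq"));
        Set<String> allowed = new HashSet<>(Arrays.asList("faq", "article"));
        System.out.println(filterAndSort(hits, allowed));
        // prints [Intro to indexing, Tuning tips, Install guide]
    }
}
```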

5. DEPLOYING LUCENE IN A CLUSTERED WEB-APP ENVIRONMENT How is Lucene to be deployed in a clustered web-app environment? Do all cluster nodes require access to a networked filesystem containing the index files or is there another solution? How is concurrency managed when the index is being incrementally updated?

This is entirely up to you to manage. I'm sure developers building solutions with Lucene have employed all sorts of architectures.

Concurrency is managed via lock files that need to be shared among the apps interacting with the index. The short answer is that only a single process (though it may have multiple threads sharing one IndexWriter) can index at a time. You would probably want to build some sort of queuing infrastructure and have a single indexer, or index into separate indexes and merge them.
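
One possible shape for the "queue plus single indexer" idea, sketched with only the Java standard library. This is an assumption-laden illustration, not a Lucene recipe: SingleIndexerSketch and its methods are hypothetical, and the in-memory list stands in for whatever actually writes to the index. In a real cluster the queue would have to be something the nodes can all reach (a message queue, a database table, and so on).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: many producers enqueue update requests, one worker thread applies them.
class SingleIndexerSketch {

    // Unique sentinel object; compared by identity so no real document id can collide with it.
    private static final String STOP = new String("STOP");

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final List<String> indexed = Collections.synchronizedList(new ArrayList<String>());
    private final Thread worker;

    SingleIndexerSketch() {
        worker = new Thread(() -> {
            try {
                for (String id = queue.take(); id != STOP; id = queue.take()) {
                    indexed.add(id);   // stand-in for adding the document to the index
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
    }

    // Safe to call from any thread; the caller never touches the index itself.
    void submit(String docId) {
        queue.add(docId);
    }

    // Drains the remaining requests, stops the worker, and returns what was indexed, in order.
    List<String> shutdown() {
        queue.add(STOP);
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(indexed);
    }

    public static void main(String[] args) {
        SingleIndexerSketch indexer = new SingleIndexerSketch();
        indexer.submit("doc-1");
        indexer.submit("doc-2");
        System.out.println(indexer.shutdown());   // prints [doc-1, doc-2]
    }
}
```

Because only the worker thread ever writes, the web-app nodes never compete for the index lock; they just enqueue.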

Any answers and suggestions are much appreciated. Thanks.


I hope this helps some.

        Erik


