From: <[EMAIL PROTECTED]>
Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: RE: Full Text Search for MS Word and Excel files?
Date: Thu, 26 Feb 2004 14:31:43 +0100

In the "store driven" indexing framework (which is different from the "event driven" one;
we still have to work out how to bring the two together :-) it looks like this:



==> ContentStore | PUT (UPDATE, DELETE) ==> ParentStore | ==> ContentIndexer


SEARCH ==> org.apache.slide.search ==> WordContentIndexer


So the content store is not affected in this scenario.


==> ContentStore ==> my.doc | PUT (UPDATE, DELETE) ==> ParentStore | ==> WordContentExtractor ==> ContentIndexer

The text your extractor produces is the input for Lucene. This is not content
data, it is only used for searching.
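
To make the split between extractor and indexer concrete, here is a minimal sketch in plain Java. All names in it (`TextExtractor`, `PlainTextExtractor`, `ToyIndex`, `StoreDrivenSketch`, and the `/files/my.doc` URI) are hypothetical and chosen only for illustration; they are not the Slide or Lucene API. The toy inverted index merely stands in for Lucene, and the extractor decodes UTF-8 bytes where a real Word extractor would have to parse the .doc format.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout: this is NOT the Slide or Lucene API.
interface TextExtractor {
    // Turns raw content bytes into plain text that is used only for searching.
    String extract(byte[] content);
}

// Stand-in for a Word extractor; it just decodes the bytes as UTF-8 text.
class PlainTextExtractor implements TextExtractor {
    public String extract(byte[] content) {
        return new String(content, StandardCharsets.UTF_8);
    }
}

// Toy inverted index standing in for Lucene: term -> URIs containing it.
class ToyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Would be called on PUT/UPDATE: replaces the index entries for the URI.
    void index(String uri, String text) {
        remove(uri);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(uri);
            }
        }
    }

    // Would be called on DELETE: drops the URI from every posting list.
    void remove(String uri) {
        for (Set<String> uris : postings.values()) {
            uris.remove(uri);
        }
    }

    // Returns the URIs whose extracted text contains the given term.
    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}

class StoreDrivenSketch {
    public static void main(String[] args) {
        TextExtractor extractor = new PlainTextExtractor();
        ToyIndex index = new ToyIndex();

        // The content itself would go to the content store unchanged;
        // only the extracted text is handed to the index.
        byte[] content = "Quarterly report draft".getBytes(StandardCharsets.UTF_8);
        index.index("/files/my.doc", extractor.extract(content));

        System.out.println(index.search("report"));
    }
}
```

The point of the design is that the extractor knows about the file format and nothing about searching, while the index knows about terms and nothing about Word, so a new document type should only need a new extractor.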



It sounds like I can index content using either the “store driven” approach or the “event driven” approach. So, is the only advantage of the “event driven” approach that I can treat Word metadata (author, date, etc.) as PROPPATCH’d WebDAV properties?


This might be useful, but even if we did this, I think we would need to let users override these properties from the UI they use to upload the file. Word’s notion of author will probably differ from our program’s notion of author, and the upload date is probably more important than Word’s creation date.
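
A minimal sketch of that override rule, assuming properties are plain string key/value pairs. `PropertyMerge`, its `merge` method, and the property names are hypothetical and not part of Slide; the sketch only shows that extracted values act as defaults and user-supplied values win.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not Slide API: combines extracted and user-supplied properties.
class PropertyMerge {
    // Extracted values fill the gaps; anything the user supplied takes precedence.
    static Map<String, String> merge(Map<String, String> extracted,
                                     Map<String, String> userSupplied) {
        Map<String, String> result = new LinkedHashMap<>(extracted);
        result.putAll(userSupplied);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("author", "Name stored in the Word file");
        extracted.put("title", "Draft");

        Map<String, String> userSupplied = new LinkedHashMap<>();
        userSupplied.put("author", "Name entered in the upload UI");

        // "author" comes from the user, "title" from the document.
        System.out.println(merge(extracted, userSupplied));
    }
}
```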

This leaves me wondering whether extracting this metadata is better done in a preprocessing library outside of Slide, at least for my application.

A few more questions:

You said the Lucene Indexer implementation is the holdup right now. If I write Indexers for the Office docs, will I just be extracting the text and passing it on to the main Lucene Indexer? In other words, do I need to work on the Lucene Indexer before I try to work on the Office Indexers, and would help with it be appreciated? (I know you said you are looking for a real expert… which is not me.)

Is org.apache.slide.search the right API to use to search from a web app?

Are indexes and full-text search on Word docs even possible with an RDBMS as the content store?
