Re: Full Text Search for MS Word and Excel files?

Daniel Florey Thu, 26 Feb 2004 22:16:29 -0800

Ryan Rhodes wrote:

From: <[EMAIL PROTECTED]>
Reply-To: "Slide Developers Mailing List" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Subject: RE: Full Text Search for MS Word and Excel files?
Date: Thu, 26 Feb 2004 14:31:43 +0100
In the "store driven" indexing framework (different from the "event driven" stuff; we still have to look at how to bring them together :-) it looks like:
                         ==> ContentStore
                         |
PUT (UPDATE, DELETE) ==> ParentStore
                         |
                         ==> ContentIndexer
SEARCH ==> org.apache.slide.search ==> WordContentIndexer

So the content store is not affected in this scenario.
                         ==> ContentStore ==> my.doc
                         |
PUT (UPDATE, DELETE) ==> ParentStore
                         |
                         ==> WordContentExtractor ==> ContentIndexer
The text your extractor produces is the input for Lucene. This is not content data, it is only used for searching.
It sounds like I can index content using either the “store driven” approach or the “event driven” approach. So, is the only advantage of the “event driven” approach that I can treat Word metadata (author, date, etc…) as PROPPATCH’d WebDAV properties?

Event driven indexing has the advantage that it is based on event collections. If you do a WebDAV PUT, the store method in Slide is called very often (even more if versioning is enabled). The events that are fired inside a transaction are therefore collected and filtered afterwards, so that only changes that really took place are recognized (e.g. if you create and remove the same document in a single transaction, no index update is needed). The second advantage is that it can be used asynchronously, so that you can PUT a document very quickly even if indexing consumes a lot of time.
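
To make the filtering idea concrete, here is a rough sketch of collapsing the events of one transaction into the net change per document. The names EventCollapser, Event and Type are made up for illustration and are not Slide's event classes; the rules for combining events are also an assumption about what "only changes that really took place" could mean.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Slide: reduces the events collected in one
// transaction to the net index operation needed per URI.
class EventCollapser {
    enum Type { CREATE, UPDATE, REMOVE }

    static final class Event {
        final Type type;
        final String uri;

        Event(Type type, String uri) {
            this.type = type;
            this.uri = uri;
        }
    }

    // Returns the net change per URI in first-seen order. A URI that was
    // created and then removed in the same transaction does not appear.
    static Map<String, Type> collapse(List<Event> events) {
        Map<String, Type> net = new LinkedHashMap<>();
        for (Event e : events) {
            Type prev = net.get(e.uri);
            switch (e.type) {
                case CREATE:
                    // Remove followed by create means the document was replaced.
                    net.put(e.uri, prev == Type.REMOVE ? Type.UPDATE : Type.CREATE);
                    break;
                case UPDATE:
                    // Updating a document created in this transaction is still a net create.
                    if (prev != Type.CREATE) {
                        net.put(e.uri, Type.UPDATE);
                    }
                    break;
                case REMOVE:
                    if (prev == Type.CREATE) {
                        net.remove(e.uri); // created and removed: nothing to index
                    } else {
                        net.put(e.uri, Type.REMOVE);
                    }
                    break;
            }
        }
        return net;
    }

    public static void main(String[] args) {
        List<Event> tx = Arrays.asList(
            new Event(Type.CREATE, "/files/a.doc"),
            new Event(Type.UPDATE, "/files/b.doc"),
            new Event(Type.UPDATE, "/files/b.doc"),
            new Event(Type.REMOVE, "/files/a.doc"));
        System.out.println(collapse(tx));
    }
}
```

The collapsed map is what an indexer would then work through, possibly on a background thread, after the transaction commits.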

This might be useful, but even if we did this I think we would need to allow users to override these properties from the UI they use to upload the file. Word’s notion of author will probably be different from our program’s notion of author. Date uploaded is probably more important than Word’s notion of date created… etc…

You are mixing up indexing and extracting. These two things are totally separate. You can write an extractor for Word documents and it will work with every indexer, because extracted properties are stored in Slide as usual. There is no need to use the extractor to extract the author from the Word document. You can do the normal PROPPATCH for setting the author as well.
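
A minimal sketch of that separation, assuming made-up interfaces: TextExtractor and TextIndexer below are hypothetical names, not Slide's extractor or indexer API. The stand-in extractor just decodes UTF-8 (a real Word extractor would parse the .doc format), and the in-memory index only stands in for Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: turns a document's bytes into plain text.
interface TextExtractor {
    String extractText(InputStream content) throws IOException;
}

// Hypothetical interface: stores text for searching; knows nothing about file formats.
interface TextIndexer {
    void index(String uri, String text);
    List<String> search(String term);
}

// Stand-in extractor that treats the content as UTF-8 text.
class Utf8TextExtractor implements TextExtractor {
    public String extractText(InputStream content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = content.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}

// Toy in-memory index with case-insensitive substring matching.
class InMemoryIndexer implements TextIndexer {
    private final Map<String, String> textByUri = new HashMap<>();

    public void index(String uri, String text) {
        textByUri.put(uri, text.toLowerCase(Locale.ROOT));
    }

    public List<String> search(String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : textByUri.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}

class ExtractThenIndexDemo {
    // Extracts text from the body, indexes it under the URI, then searches.
    static List<String> indexAndSearch(String uri, byte[] body, String term) {
        TextExtractor extractor = new Utf8TextExtractor();
        TextIndexer indexer = new InMemoryIndexer();
        try {
            indexer.index(uri, extractor.extractText(new ByteArrayInputStream(body)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return indexer.search(term);
    }

    public static void main(String[] args) {
        byte[] body = "Quarterly report draft".getBytes(StandardCharsets.UTF_8);
        System.out.println(indexAndSearch("/files/my.doc", body, "report"));
    }
}
```

Swapping in a different extractor or a different indexer does not touch the other side, which is the point of keeping the two apart.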

This leaves me wondering whether extracting this metadata is better done in a preprocessing library outside of Slide, at least for my application.

If you really want the user to be able to set the author "by hand", you are better off using PROPPATCH than an extractor. An extractor can be used to grab the text from the Word document to make it full-text searchable.

A few more questions:

You said the Lucene Indexer implementation is the hold up right now. If I write Indexers for the Office docs, will I just be extracting the text and passing it on to the main Lucene Indexer? In other words, do I need to work on the Lucene Indexer before I try to work on Office Indexers, and would help be appreciated on this? (I know you said you are looking for a real expert… which is not me.)

See above. You don't need to write an indexer, just an extractor, if Lucene-based indexing works. I have to improve the extractor interface so that you can use it to grab the document content (this is not possible at the moment).

Is org.apache.slide.search the right API to use to search from a web app?

You would be better off using webdavclient (DASL) or wvcm to access the search from your web app.
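
For the DASL route, the client sends a WebDAV SEARCH request whose body is an XML query. Below is a rough sketch of building such a body in the DAV:basicsearch form (the grammar later standardized in RFC 5323). The class DaslQuery, the /files scope and the choice of displayname are assumptions for illustration, and nothing here confirms that a given Slide version accepts the contains operator.

```java
// Hypothetical helper, not part of webdavclient: builds the XML body of a
// WebDAV SEARCH request using the DAV:basicsearch grammar.
class DaslQuery {
    // Escapes the characters that are unsafe inside XML element content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Full-text query for `text` in every resource below `scopeHref`.
    static String containsQuery(String scopeHref, String text) {
        return "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
            + "<D:searchrequest xmlns:D=\"DAV:\">\n"
            + "  <D:basicsearch>\n"
            + "    <D:select><D:prop><D:displayname/></D:prop></D:select>\n"
            + "    <D:from><D:scope>\n"
            + "      <D:href>" + escapeXml(scopeHref) + "</D:href>\n"
            + "      <D:depth>infinity</D:depth>\n"
            + "    </D:scope></D:from>\n"
            + "    <D:where><D:contains>" + escapeXml(text) + "</D:contains></D:where>\n"
            + "  </D:basicsearch>\n"
            + "</D:searchrequest>\n";
    }

    public static void main(String[] args) {
        System.out.println(containsQuery("/files", "quarterly report"));
    }
}
```

The resulting string would be sent as the body of a SEARCH request with an XML content type; how the request is issued depends on the client library you pick.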

Are indexes and full text search on word docs even possible with an RDBMS as the content store?

I think it should be possible to index documents independently of the underlying store. Martin and I are trying to find a way to enable this.


Regards,
Daniel

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


