Hi,
> I've read your proposal (and all the email related to it). One thing I'd like to advise
>is to distinguish between the crawler and the loader components.
> The crawler is responsible for gathering documents from several sources.
> The loader (or indexer) is responsible for loading the gathered documents to the
>index (I think in batch mode).
I see three different component types:
- file producer (crawler, database reader, filesystem reader)
- document handler (knows the syntax, and maybe the semantics, of the file content)
- indexer (Lucene)
Is batch mode really the way to go? I am thinking of something like pipes (but maybe I'm wrong).
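
To make the pipe idea concrete, here is a minimal sketch in Java of how the three component types could be chained so that each record flows through as soon as it is produced. All names (PipeSketch, FileProducer, DocumentHandler, Indexer, run) are hypothetical and not part of Lucene, and the indexer here only collects records instead of writing to a real index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these types exist in Lucene.
class PipeSketch {

    /** Produces raw records (crawler, database reader, filesystem reader). */
    interface FileProducer {
        Iterator<Map<String, Object>> produce();
    }

    /** Knows the syntax of one content type and enriches the record. */
    interface DocumentHandler {
        Map<String, Object> handle(Map<String, Object> record);
    }

    /** Consumes finished records; a real one would write to a Lucene index. */
    interface Indexer {
        void index(Map<String, Object> record);
    }

    /** Pushes every record through handler and indexer one at a time (no batch step). */
    static int run(FileProducer producer, DocumentHandler handler, Indexer indexer) {
        int count = 0;
        Iterator<Map<String, Object>> it = producer.produce();
        while (it.hasNext()) {
            indexer.index(handler.handle(it.next()));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> raw = new ArrayList<>();
        Map<String, Object> record = new HashMap<>();
        record.put("url", "http://www.sample.com/test.doc");
        raw.add(record);

        List<Map<String, Object>> indexed = new ArrayList<>();
        int n = run(raw::iterator,
                rec -> { rec.put("asText", "plain text"); return rec; },
                indexed::add);
        System.out.println(n + " record(s) indexed");
    }
}
```

Because each stage is a one-method interface, a crawler, a JDBC reader or a filesystem reader could be swapped in without touching the indexing side.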
>
> I think it's redundant to hardcode the indexing logic into every crawler component
>(ftp, http, jdbc, filesys crawler). It's an interesting question how the components
>can communicate. (Don't you think using Avalon is a good way?)
I think that configuring the indexing procedure, including the work of all
three component types, is the real adventure. The components themselves are relatively
easy to write. I first thought of Ant as the configuration framework, but I think that
would work only for batch mode. The main question is: what is the production
unit we are talking about? I don't think that it should be simple files. I think it
must be records of String, Date, Integer and Binary fields, which could be mapped to Lucene
fields.
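
Such a record could look roughly like the following. This is only a sketch under my own assumptions: the class names RecordSketch and Record are hypothetical, and Binary is represented as byte[].

```java
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "production unit": a record of typed fields.
class RecordSketch {

    static final class Record {
        // LinkedHashMap keeps the fields in insertion order.
        private final Map<String, Object> fields = new LinkedHashMap<>();

        /** Accepts only String, Date, Integer and byte[] (binary) values. */
        Record put(String name, Object value) {
            boolean supported = value instanceof String || value instanceof Date
                    || value instanceof Integer || value instanceof byte[];
            if (!supported) {
                throw new IllegalArgumentException("unsupported field type for " + name);
            }
            fields.put(name, value);
            return this;
        }

        Object get(String name) {
            return fields.get(name);
        }

        boolean has(String name) {
            return fields.containsKey(name);
        }
    }

    public static void main(String[] args) {
        Record r = new Record()
                .put("mime", "application/word")
                .put("url", "http://www.sample.com/test.doc")
                .put("data", new byte[] {1, 2, 3});
        System.out.println(r.get("mime"));
    }
}
```

Restricting the value types up front means every later stage can rely on a field being one of four known kinds.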
OK, I will give you some more details.
A crawler will produce something like:
mime: application/word
created:12.1.2001
data: <binary>
url:http://www.sample.com/test.doc
The document handler for Word docs will take this and transform it to:
mime: application/word
created:12.1.2001
url:http://www.sample.com/test.doc
author:Manfred Schäfer
title:'77 secrets of indexing documents'
asText: '... the document as plain text ...'
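
The handler step above can be sketched as a lookup by mime type. Everything here is hypothetical (HandlerSketch, HandlerRegistry, PlainTextStubHandler); in particular the stub simply treats the binary data as UTF-8 text, because extracting author, title and text from a real Word file would need an actual parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; a real Word handler would need an actual .doc parser.
class HandlerSketch {

    interface DocumentHandler {
        Map<String, Object> handle(Map<String, Object> record);
    }

    /** Stand-in handler: pretends the binary data is already UTF-8 text. */
    static final class PlainTextStubHandler implements DocumentHandler {
        @Override
        public Map<String, Object> handle(Map<String, Object> record) {
            Map<String, Object> out = new LinkedHashMap<>(record);
            byte[] data = (byte[]) out.remove("data"); // binary payload is not passed on
            out.put("asText", data == null ? "" : new String(data, StandardCharsets.UTF_8));
            return out;
        }
    }

    /** Picks a handler by the record's "mime" field; unknown types pass through unchanged. */
    static final class HandlerRegistry {
        private final Map<String, DocumentHandler> byMime = new HashMap<>();

        void register(String mime, DocumentHandler handler) {
            byMime.put(mime, handler);
        }

        Map<String, Object> dispatch(Map<String, Object> record) {
            DocumentHandler handler = byMime.get(record.get("mime"));
            return handler == null ? record : handler.handle(record);
        }
    }

    public static void main(String[] args) {
        HandlerRegistry registry = new HandlerRegistry();
        registry.register("application/word", new PlainTextStubHandler());

        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("mime", "application/word");
        raw.put("url", "http://www.sample.com/test.doc");
        raw.put("data", "hello".getBytes(StandardCharsets.UTF_8));

        System.out.println(registry.dispatch(raw).keySet());
    }
}
```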
Now we come to Lucene: the document fields must be mapped to Lucene fields.
LUCENE-FIELDS -> DOCUMENT-FIELDS
mimetype->mime
created->created
url->url
author->author
default->author, asText
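
The mapping table above could be applied roughly as follows. This sketch deliberately avoids the Lucene API and only computes name/value pairs that an indexer would then turn into Lucene fields; FieldMappingSketch, exampleMapping and apply are hypothetical names, and joining several source fields with a space is my assumption for the "default" line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: computes index field values without touching Lucene itself.
class FieldMappingSketch {

    /** Lucene field name -> document field names that feed it (same as the table). */
    static Map<String, List<String>> exampleMapping() {
        Map<String, List<String>> mapping = new LinkedHashMap<>();
        mapping.put("mimetype", Arrays.asList("mime"));
        mapping.put("created", Arrays.asList("created"));
        mapping.put("url", Arrays.asList("url"));
        mapping.put("author", Arrays.asList("author"));
        mapping.put("default", Arrays.asList("author", "asText"));
        return mapping;
    }

    /** Builds the index field values; several source fields are joined with a space. */
    static Map<String, String> apply(Map<String, List<String>> mapping,
                                     Map<String, Object> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : mapping.entrySet()) {
            List<String> parts = new ArrayList<>();
            for (String source : entry.getValue()) {
                Object value = record.get(source);
                if (value != null) {
                    parts.add(String.valueOf(value));
                }
            }
            if (!parts.isEmpty()) {
                out.put(entry.getKey(), String.join(" ", parts));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("mime", "application/word");
        record.put("author", "Manfred");
        record.put("title", "77 secrets of indexing documents");
        record.put("asText", "the document as plain text");
        System.out.println(apply(exampleMapping(), record));
    }
}
```

Document fields that no mapping line mentions (title in the example) simply do not reach the index.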
Working with Ant in batch mode could make use of XML to represent the
records above. Configuring a pipe system with an XML config file is not so simple.
I don't know Avalon, so I can't say anything about it. But I would prefer to have at
least the possibility of working with configuration only, without programming.
regards,
Manfred