Re: Proposal for Lucene / new component

Andrew C. Oliver Sun, 24 Feb 2002 08:39:38 -0800

On Sun, 2002-02-10 at 09:45, Manfred Schäfer wrote:
> Hi,
> 
> 
> > I've read your proposal (and all email related to it). One thing I'd like to advise
> > is to distinguish the crawler and the loader component.
> > The crawler is responsible for gathering documents from several sources.
> > The loader (or indexer) is responsible for loading the gathered documents into the
> > index (I think in batch mode).
> 
> I see three different component types:
>     - file producer (crawler, database reader, Filesystem reader)
>     - Document Handler (knows the syntax (maybe semantic) of file-content)
>     - Indexer (Lucene)
> 
> Is batch mode really the way? I think of something like pipes (but maybe I'm wrong).
>


I see something that smells a lot like AWT-style events, only without
threading (necessarily).
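
To make the "AWT-style events, no threading" idea concrete, here is a minimal sketch, not anything taken from the proposal. All names (RecordListener, RecordProducer, EventPipelineSketch) are hypothetical, and a record is assumed to be a plain map of field name to value. Each stage notifies its listeners synchronously on the caller's thread, so the crawler, document handler, and indexer chain together like pipes:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical listener interface; the name is an assumption for illustration.
interface RecordListener {
    void recordProduced(Map<String, Object> record);
}

// Any stage that emits records (crawler, database reader, document handler).
// Listeners are called directly on the caller's thread, AWT-listener style,
// so no threading is required.
class RecordProducer {
    private final List<RecordListener> listeners = new ArrayList<>();

    void addRecordListener(RecordListener listener) {
        listeners.add(listener);
    }

    void fire(Map<String, Object> record) {
        for (RecordListener listener : listeners) {
            listener.recordProduced(record);
        }
    }
}

class EventPipelineSketch {
    // Wires crawler -> document handler -> indexer stand-in and pushes one
    // record through. Returns the URLs that reached the "indexer".
    static List<String> run() {
        List<String> indexed = new ArrayList<>();
        RecordProducer crawler = new RecordProducer();
        RecordProducer handler = new RecordProducer();

        // Document handler: listens to the crawler, adds a text field,
        // then re-fires the enriched record to its own listeners.
        crawler.addRecordListener(record -> {
            Map<String, Object> enriched = new LinkedHashMap<>(record);
            enriched.put("asText", "plain text extracted from " + record.get("url"));
            handler.fire(enriched);
        });

        // Indexer stand-in: a real one would hand the record to Lucene.
        handler.addRecordListener(record -> indexed.add((String) record.get("url")));

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("mime", "application/word");
        record.put("url", "http://www.sample.com/test.doc");
        crawler.fire(record);
        return indexed;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Because dispatch is just a method call, a stage that wants threading could queue records inside its listener instead; nothing in the interface forces either choice.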

> >
> > I think it's redundant to hardcode the indexing logic into every crawler component
> > (ftp, http, jdbc, filesys crawler). It's an interesting question how the components
> > can communicate. (Don't you think using Avalon is a good way?)
> 
> I think that the configuration of the indexing procedure, including work for all
> three component types, is the real adventure. The components themselves are relatively
> easy to write. I first thought of Ant as the configuration framework, but I think that
> would work only for batch mode. The main question is: what is the production
> unit we are talking about? I don't think that this should be simple files. I think
> it must be records of String, Date, Integer, and Binary fields, which could be mapped to
> Lucene fields.
> 
> OK, I will give you some more details:
> 
> A crawler will produce something like
> 
> mime: application/word
> created:12.1.2001
> data: <binary>
> url:http://www.sample.com/test.doc
> 
> 
> The document handler for Word docs will take this and transform it to
> 
> mime: application/word
> created:12.1.2001
> url:http://www.sample.com/test.doc
> author:Manfred Schäfer
> title:'77 secrets of indexing documents'
> asText: '... the document as plain text ...'
> 
> Now we come to Lucene: the fields must be mapped to Lucene fields
> 
> LUCENE-FIELDS -> DOCUMENT-FIELDS
> mimetype->mime
> created->created
> url->url
> author->author
> default->author, asText
> 

Right... did the proposal not say that? If not, can you patch it to
make it a bit clearer?
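
For reference, the mapping quoted above could be applied roughly as follows. This is only a sketch under assumptions: the class and method names are hypothetical, records are treated as plain string maps, and the Lucene side is stood in for by another map so the example needs no Lucene jar. A Lucene field that lists several document fields (like "default") is assumed to get their values joined with a space.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldMappingSketch {
    // LUCENE-FIELD -> DOCUMENT-FIELDS, copied from the quoted table.
    static Map<String, List<String>> mapping() {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("mimetype", Arrays.asList("mime"));
        m.put("created", Arrays.asList("created"));
        m.put("url", Arrays.asList("url"));
        m.put("author", Arrays.asList("author"));
        m.put("default", Arrays.asList("author", "asText"));
        return m;
    }

    // The record as it leaves the document handler in the quoted example.
    static Map<String, String> sampleRecord() {
        Map<String, String> r = new LinkedHashMap<>();
        r.put("mime", "application/word");
        r.put("created", "12.1.2001");
        r.put("url", "http://www.sample.com/test.doc");
        r.put("author", "Manfred Sch\u00e4fer");
        r.put("title", "77 secrets of indexing documents");
        r.put("asText", "... the document as plain text ...");
        return r;
    }

    // Builds the index-side fields. Document fields missing from the record
    // are skipped; document fields not named in the mapping are dropped.
    static Map<String, String> apply(Map<String, String> record,
                                     Map<String, List<String>> mapping) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : mapping.entrySet()) {
            List<String> parts = new ArrayList<>();
            for (String source : entry.getValue()) {
                String value = record.get(source);
                if (value != null) {
                    parts.add(value);
                }
            }
            if (!parts.isEmpty()) {
                out.put(entry.getKey(), String.join(" ", parts));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        apply(sampleRecord(), mapping())
            .forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note that "title" is not in the quoted mapping, so this sketch leaves it out of the index; whether that is intended is a question for the proposal.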

> Working with Ant in batch mode could make use of XML for the representation of the
> records above. Configuring a pipe system with an XML config file is not so simple.
> I don't know Avalon, so I can't say anything about it. But I would favor having at
> least the possibility to work with configuration only, without programming.
> 

I'm trying to learn enough about Avalon to do this. I'm having a hard
time of it. After I read the conceptual documentation and see a couple
of code samples I'm like "now what?" I need a "hello Avalon" tutorial
to help me... Unfortunately I can't write one (chicken-and-egg kind of
thing). I am still having trouble figuring out how to do something like
this via Ant, or even whether Ant is the right tool... (I mean, I love it
for builds, but for this??) Maybe I have a mental block :-)

> regards,
> 
> Manfred
> 
> --
> To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
> 
-- 
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document 
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html 
                        - fix java generics!
The avalanche has already started. It is too late for the pebbles to
vote.
-Ambassador Kosh

