On Sun, 2002-02-10 at 09:45, Manfred Schäfer wrote:
> Hi,
>
> > I've read your proposal (and all email related to it). One thing I'd
> > like to advise is to distinguish the crawler and the loader component.
> > The crawler is responsible for gathering documents from several sources.
> > The loader (or indexer) is responsible for loading the gathered
> > documents into the index (I think in batch mode).
>
> I see three different component types:
> - file producer (crawler, database reader, filesystem reader)
> - document handler (knows the syntax (maybe the semantics) of the file content)
> - indexer (Lucene)
>
> Is batch mode really the way? I think of something like pipes (but maybe
> I'm wrong).
>
I see something that smells a lot like AWT-style events, only without
threading (necessarily).

> > I think it's redundant to hardcode the indexing logic into all crawler
> > components (ftp, http, jdbc, filesys crawler). It's an interesting
> > question how the components can communicate. (Don't you think using
> > avalon is a good way?)
>
> I think that the configuration of the indexing procedure, including work
> for all three component types, is the real adventure. The components
> themselves are relatively easy to write. I first thought of ant as the
> configuration framework, but I think that would work only for batch mode.
> The main question is: what is the production unit we are talking about?
> I don't think that this should be simple files. I think it must be
> records of String, Date, Integer, and Binary fields, which could be
> mapped to Lucene fields.
>
> OK, I will tell you some more details:
>
> a crawler will produce something like
>
> mime: application/word
> created: 12.1.2001
> data: <binary>
> url: http://www.sample.com/test.doc
>
> the document handler for word docs will take and transform this to
>
> mime: application/word
> created: 12.1.2001
> url: http://www.sample.com/test.doc
> author: Manfred Schäfer
> title: '77 secrets of indexing documents'
> asText: '... the document as plain text ...'
>
> now we come to lucene, the fields must be mapped to lucene fields
>
> LUCENE-FIELDS -> DOCUMENT-FIELDS
> mimetype -> mime
> created -> created
> url -> url
> author -> author
> default -> author, asText

Right... did the proposal not say that? If not, can you patch it and make
it a bit clearer?

> Working with ant in batch mode could make use of XML for the
> representation of the records above. Configuring a pipe system with an
> XML config file is not so simple.
> I don't know avalon, so I can't say anything about it. But I would favor
> having at least a possibility to work with configuration only, without
> programming.

I'm trying to learn enough about Avalon to do this.
I'm having a hard time of it. After reading the conceptual documentation
and seeing a couple of code samples, I'm like "now what?" I need a
"hello avalon" tutorial to help me. Unfortunately I can't write one (a
chicken-and-egg kind of thing). I'm still having trouble figuring out how
to do something like this via Ant, or even whether Ant is the right tool
(I mean, I love it for builds, but for this??). Maybe I have a mental
block :-)

> regards,
>
> Manfred

--
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document
                            format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html
                        - fix java generics!

The avalanche has already started. It is too late for the pebbles to vote.
-Ambassador Kosh
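
The record flow discussed above (a crawler record, a document handler,
the field mapping, then the indexer) might be sketched roughly as follows
in Java. This is only an illustration under stated assumptions, not the
proposal's design: RecordPipelineSketch, DocumentHandler, Indexer, and
TOY_HANDLER are hypothetical names, the handler is a toy that treats the
first payload line as the author instead of parsing a Word file, and the
indexer is a plain callback rather than Lucene's own API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecordPipelineSketch {

    /** Turns a raw record into an enriched one (hypothetical interface). */
    interface DocumentHandler {
        Map<String, Object> handle(Map<String, Object> record);
    }

    /** Receives the final index fields; a real one would wrap Lucene. */
    interface Indexer {
        void index(Map<String, String> fields);
    }

    /** Toy stand-in for a Word handler: first payload line is the author, the rest is text. */
    static final DocumentHandler TOY_HANDLER = record -> {
        Map<String, Object> out = new LinkedHashMap<>(record);
        byte[] data = (byte[]) out.remove("data"); // binary payload is not indexed itself
        String[] parts = new String(data, StandardCharsets.UTF_8).split("\n", 2);
        out.put("author", parts[0]);
        out.put("asText", parts.length > 1 ? parts[1] : "");
        return out;
    };

    /** Index field -> document fields, following the mapping table in the thread. */
    static Map<String, List<String>> defaultMapping() {
        Map<String, List<String>> m = new LinkedHashMap<>();
        m.put("mimetype", Arrays.asList("mime"));
        m.put("created", Arrays.asList("created"));
        m.put("url", Arrays.asList("url"));
        m.put("author", Arrays.asList("author"));
        m.put("default", Arrays.asList("author", "asText"));
        return m;
    }

    /** Builds index fields by joining the mapped document fields with spaces. */
    static Map<String, String> mapFields(Map<String, Object> record,
                                         Map<String, List<String>> mapping) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : mapping.entrySet()) {
            List<String> values = new ArrayList<>();
            for (String source : e.getValue()) {
                Object v = record.get(source);
                if (v != null) {
                    values.add(v.toString());
                }
            }
            if (!values.isEmpty()) { // skip index fields with no source data
                fields.put(e.getKey(), String.join(" ", values));
            }
        }
        return fields;
    }

    /** Pushes one record through handler, mapping, and indexer, pipe-style. */
    static void process(Map<String, Object> record, DocumentHandler handler,
                        Map<String, List<String>> mapping, Indexer indexer) {
        indexer.index(mapFields(handler.handle(record), mapping));
    }

    public static void main(String[] args) {
        Map<String, Object> raw = new LinkedHashMap<>();
        raw.put("mime", "application/word");
        raw.put("created", "12.1.2001");
        raw.put("url", "http://www.sample.com/test.doc");
        raw.put("data", "Manfred\nthe document as plain text"
                .getBytes(StandardCharsets.UTF_8));

        List<Map<String, String>> indexed = new ArrayList<>();
        process(raw, TOY_HANDLER, defaultMapping(), indexed::add);
        System.out.println(indexed.get(0));
    }
}
```

Because each record is pushed through as soon as it is produced, the same
pieces would work for a batch run or for an event/pipe style setup; only
the caller of process() changes. The mapping is plain data, so it could
in principle be loaded from a configuration file instead of being built
in code.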
