On Fri, 2002-02-08 at 05:26, Manfred Schäfer wrote:
> Hi,
>
> I would suggest two sub-projects:
>

I think "packages" would be a more appropriate description; I wouldn't call them "subprojects", so to speak.

> 1. Crawler - retrieving docs, wherever they are.....
>
> 2. DocumentHandler - extract text, create appropriate fields etc..
>

+1, that's what I was getting at in the proposal about DocumentFactory etc.

> The second is a layer on top of Lucene. The first is an autonomous package, which
> should be nicely integrated with Lucene/DocumentHandler, but should also be
> usable for other projects.
>

Hmmm... I'm not entirely sure I'd go that far. Well encapsulated for sure, but how usable it is by other projects is up to them, not us...

> I've included my code to show you what I've done. It isn't too useful yet,
> because it is integrated in our product, but you can get the idea. Actually I've
> written two things:
>
> 1: A robot for crawling a remote server via HTTP and writing all the data to the
> local filesystem, then importing it into our db and
> (at the same time) replacing all links with internal links. So we could emulate
> a web site from this crawled data!
> [com.synformation.script.utilities.importtool]
>

I looked through this! Great stuff! Do you own this code? Are you able to donate it to Lucene (APL and all)? It looks like a great starting point. We'd have to do some refactoring, but it looks pretty dern good to me. I haven't tried running it, just skimmed through.

> 2: (I've rewritten some of the code from 1 for that, so this is much cleaner) A
> customer needs a tool for importing local mini-websites on the file system via
> an applet, sending them to the web server and importing them as described in
> point 1. I've tried to write it in a way that it could include the functionality
> of point 1 (retrieving via HTTP), but that is mostly untested.
> [com.synformation.script.utilities.fileimport]
>

My brain didn't parse that..

> I don't say that you (we) should use this. But I think it's time to come to
> more concrete plans. I'm interested in helping on that for the crawler.
>

If you're able to donate it (legally), I kinda think there is a lot here. It of course needs to be refactored to meet some of the objectives we've outlined, but it's a darn good starting point IMHO!

> Best regards,
>
> manfred
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

--
http://www.superlinksoftware.com
http://jakarta.apache.org - port of Excel/Word/OLE 2 Compound Document format to java
http://developer.java.sun.com/developer/bugParade/bugs/4487555.html - fix java generics!

The avalanche has already started. It is too late for the pebbles to vote.
-Ambassador Kosh