I'd like to add my +1 to the proposal, and my +1 to keeping Lucene a library that can exist separately from the applications. Perhaps the applications should be separate targets in the Lucene project (and build process), or perhaps they can be separate projects. I think keeping them together would be good: Lucene's APIs may need to evolve to support these applications better, and co-locating them helps ensure that changes to the Lucene API are reflected in the applications as soon as they are made, rather than with the lag that can come about when the applications are treated as separate, dependent projects.
See below for some additional ideas for the crawler.

Mark Tucker wrote:

> I like what you included in your proposal and suggest doing all that
> (over time) and taking the following into consideration:
>
> Indexers/Crawlers
>
>   General Settings
>     SleeptimeBetweenCalls - can be used to avoid flooding a machine
>       with too many requests
>     IndexerTimeout - kill this crawler thread after a long period of
>       inactivity
>     IncludeFilter - include only items matching filter
>     ExcludeFilter - exclude items matching filter (can be used with
>       IncludeFilter)

I'm working on a crawler right now, actually, but it is a derivative of WebSPHINX. The original WebSPHINX has not been updated in a very long time, and it is currently licensed under the LGPL. Perhaps we can get permission from the copyright holders to move it to the APL (or do we even need to?). I made a number of bug fixes to it, added rudimentary support for cookies, and added support for HTTP redirects.

One thing I like about WebSPHINX is its forgiving HTML parser, which can deal with many kinds of broken HTML. It also has a very interesting framework for analyzing parsed content, but that goes beyond the requirements for use with Lucene. I use the crawler with Lucene, but there is a layer of application classes between the two, so the kind of integration that has been proposed here has not yet been done.

Anyway, I found that in addition to the Include and Exclude filters, it is helpful to be able to say that you want some page "expanded" (i.e. parsed and its links followed) but not "indexed" (i.e. added to Lucene's index). And vice versa: it is sometimes useful to index a page but not expand it. Also, filters can be evaluated once on links before they are followed, and then a second time on the final URLs of the pages retrieved. Normally the two are the same, but HTTP redirects can make the final URL something very different from the original link.

Perhaps one way to represent these conditions is to have the following "language" instead of include and exclude filters:

  "include:"  regex
  "exclude:"  regex
  "noindex:"  regex
  "noexpand:" regex

The first two work like the include/exclude filters; for things that pass those two, the other two add handling properties that are used in processing the link and the page. (A rough sketch follows below.)

Disclaimer: I'm experimenting with this now and these ideas are only about two days old, so please take them as such. Since we got into the discussion, I figured I'd put them on the table.

>   MaxItems - stops indexing after x items
>   MaxMegs - stops indexing after x MB of data
>
>   File System Indexer
>     URLReplacePrefix - can crawl c:\ but expose URL as
>       http://myserver/docs/

Question: does this information really belong in the index? Perhaps the root path should be specified and each document tagged with a path relative to it, while the URL to prefix the document paths with is given once per entire index and is easy to change (sketched below).

>   Web Indexer
>     HTTPUser
>     HTTPPassword
>     HTTPUserAgent
>     ProxyServer
>     ProxyUser
>     ProxyPassword
>     HTTPSCertificate
>     HTTPSPrivateKey

Apache Commons has the HttpClient package, which has some similar concepts and even implements them to some degree. I found it a bit rough still, and dependent on JDK 1.3, but I believe it can be fixed more easily than a new one can be written. It uses the notion of an HttpState, a state container for an HTTP user agent that holds things like authentication credentials and cookies.
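To make the filter "language" above a bit more concrete, here is a rough sketch of how such a rule set could be evaluated. All names are invented, and I've used JDK 1.4 regexes for brevity; treat it as a starting point, not a design:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Pattern;

    // Sketch of the proposed filter "language": a rule is a verb
    // (include, exclude, noindex, noexpand) plus a regex over the URL.
    public class LinkRules {
        private final List includes = new ArrayList();   // of Pattern
        private final List excludes = new ArrayList();
        private final List noindex  = new ArrayList();
        private final List noexpand = new ArrayList();

        public void addRule(String verb, String regex) {
            List target;
            if ("include".equals(verb))       target = includes;
            else if ("exclude".equals(verb))  target = excludes;
            else if ("noindex".equals(verb))  target = noindex;
            else if ("noexpand".equals(verb)) target = noexpand;
            else throw new IllegalArgumentException("unknown verb: " + verb);
            target.add(Pattern.compile(regex));
        }

        // First gate: should this URL be fetched at all?  Meant to be run
        // once on the raw link and again on the final URL after redirects.
        public boolean accept(String url) {
            return (includes.isEmpty() || matchesAny(includes, url))
                    && !matchesAny(excludes, url);
        }

        // For URLs that pass accept(): add the page to the Lucene index?
        public boolean shouldIndex(String url) {
            return !matchesAny(noindex, url);
        }

        // For URLs that pass accept(): parse the page and follow its links?
        public boolean shouldExpand(String url) {
            return !matchesAny(noexpand, url);
        }

        private static boolean matchesAny(List patterns, String url) {
            for (int i = 0; i < patterns.size(); i++) {
                if (((Pattern) patterns.get(i)).matcher(url).find()) {
                    return true;
                }
            }
            return false;
        }
    }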
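And here is the kind of thing I mean for URLReplacePrefix, with the prefix kept outside the index (again, all names invented):

    // The index stores only paths relative to the crawl root; the URL
    // prefix lives outside the index and can change without re-indexing.
    public class UrlMapper {
        private String prefix;                 // e.g. "http://myserver/docs/"

        public UrlMapper(String prefix) {
            this.prefix = prefix;
        }

        // Applied at display time to the relative path stored per document.
        public String toUrl(String relativePath) {
            return prefix + relativePath;      // "manual/intro.html" -> full URL
        }

        public void setPrefix(String prefix) { // cheap to change later
            this.prefix = prefix;
        }
    }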
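To illustrate the HttpState idea, something along these lines. The exact method signatures have been shifting between HttpClient releases, so take this as approximate, and the host, realm, and credentials are invented examples:

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.HttpState;
    import org.apache.commons.httpclient.UsernamePasswordCredentials;
    import org.apache.commons.httpclient.auth.AuthScope;
    import org.apache.commons.httpclient.methods.GetMethod;

    // Credentials and cookies live in the HttpState, not in the client,
    // so one crawler can maintain several "user agents" at once.
    public class AuthenticatedFetch {
        public static void main(String[] args) throws Exception {
            HttpState state = new HttpState();
            state.setCredentials(
                new AuthScope("intranet.example.com", 80, "docs"),
                new UsernamePasswordCredentials("user", "secret"));

            HttpClient client = new HttpClient();
            client.setState(state);

            GetMethod get = new GetMethod("http://intranet.example.com/docs/");
            int status = client.executeMethod(get);
            System.out.println("Status: " + status);
            get.releaseConnection();
        }
    }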
HTTPS support is easy to add with JSSE (which is the approach taken by the HttpClient from the Commons).

>   Other Possible Indexers
>     Microsoft Exchange 5.5/2000
>     Lotus Notes
>     Newsgroup (NNTP)
>     Documentum
>     ODBC/OLEDB
>     XML - index single XML that represents multiple documents

One idea that might prove useful is to add a "DocumentFetcher" alongside the DocumentIndexer. The two would go hand in hand: document entries created in Lucene by a particular Indexer can be understood by the corresponding Fetcher. The Fetcher would then encapsulate retrieving the source documents or creating useful pointers to them (like URLs).

Another idea is to split a document's storage and "envelope" from its content. The content is subject to a MIME type and can be handed to a parser, passed to a document factory, mapped to fields, etc. However, the logic of retrieving a PDF file from a Lotus Notes database (and creating a URL to point back to it) is different from that of getting the same PDF file from the file system. The same parser and document factory can still be used, though. (See the sketch at the end of this message.)

> Document Factory
>   General
>     The minimum properties for each document should be:
>       URL
>       Title
>       Abstract
>       Full Text
>       Score
>
>   HTML
>     Support for META tags including Dublin Core syntax
>
>   Other Possible Document Factories
>     Office Docs - DOC, XLS, PPT
>     PDF
>
> Thanks for the great proposal.

Yes! Absolutely! Great proposal!

--Dmitry
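P.S. Here is a rough sketch of the Fetcher/envelope idea. All names are invented; this is only meant to show how the two pieces would fit together:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.lucene.document.Document;

    // Each Indexer gets a matching Fetcher that understands the entries
    // the Indexer put into the index and can get back at the source
    // document (or produce a useful pointer to it, like a URL).
    interface DocumentFetcher {
        InputStream fetchContent(Document doc) throws IOException;
        String toDisplayUrl(Document doc);
    }

    // The content itself is only subject to a MIME type: the same parser
    // handles a PDF whether it came from Lotus Notes or the file system.
    interface ContentParser {
        String[] getSupportedMimeTypes();      // e.g. { "application/pdf" }
        Document parse(InputStream content) throws IOException;
    }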
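And a minimal sketch of a document factory producing the proposed minimum fields, using Lucene's Field factory methods. The field names are invented; note that the score is computed by Lucene at search time, so it would not be stored as a field:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class BasicDocumentFactory {
        public Document makeDocument(String url, String title,
                                     String abstractText, String fullText) {
            Document doc = new Document();
            doc.add(Field.Keyword("url", url));             // stored, not tokenized
            doc.add(Field.Text("title", title));            // stored and tokenized
            doc.add(Field.Text("abstract", abstractText));  // stored and tokenized
            doc.add(Field.UnStored("contents", fullText));  // indexed only
            return doc;
        }
    }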