acoliver 02/02/23 14:02:55 Modified: xdocs whoweare.xml xdocs/stylesheets project.xml Added: xdocs luceneplan.xml Log: added application extensions plan and myself as a committer Revision Changes Path 1.3 +1 -0 jakarta-lucene/xdocs/whoweare.xml Index: whoweare.xml =================================================================== RCS file: /home/cvs/jakarta-lucene/xdocs/whoweare.xml,v retrieving revision 1.2 retrieving revision 1.3 diff -u -r1.2 -r1.3 --- whoweare.xml 28 Sep 2001 20:47:31 -0000 1.2 +++ whoweare.xml 23 Feb 2002 22:02:55 -0000 1.3 @@ -38,6 +38,7 @@ <li><b>Dave Kor</b> (davekor at apache.org)</li> <li><b>Jon Stevens</b> (jon at latchkey.com)</li> <li><b>Tal Dayan</b> (zapta at apache.org)</li> +<li><a href="http://www.trilug.org/~acoliver">Andrew C. Oliver</a> (acoliver at apache dot org)</li> </ul> </section> 1.1 jakarta-lucene/xdocs/luceneplan.xml Index: luceneplan.xml =================================================================== <?xml version="1.0" encoding="UTF-8"?> <document> <properties> <title>Plan for enhancements to Lucene</title> <authors> <person email="[EMAIL PROTECTED]" name="Andrew C. Oliver" id="AO"/> </authors> </properties> <body> <section name="Purpose"> <p> The purpose of this document is to outline plans for making <a href="http://jakarta.apache.org/lucene"> Jakarta Lucene</a> work as a more general drop-in component. It makes the assumption that this is an objective for the Lucene user and development community. </p> <p> The best reference is <a href="http://www.htdig.org"> htDig</a>. Though it is not quite as sophisticated as Lucene, it has a number of features that make it desirable. It is, however, a traditional compiled C application, which makes it somewhat unpleasant to install on some platforms (like Solaris!). </p> <p> This plan is being submitted to the Lucene developer community for an initial reaction, advice, feedback and consent. Following this it will be submitted to the Lucene user community for support. 
Although I (Andy Oliver) am capable of providing these enhancements by myself, I would of course prefer to work on them in concert with others. </p> <p> While I'm laying out a fairly large feature set, these features can of course be implemented incrementally (and are probably best done that way). </p> </section> <section name="Goal and Objectives"> <p> The goal is to provide features to Lucene that allow it to be used as a drop-in search engine. It should provide many of the features of projects like <a href="http://www.htdig.org">htDig</a> while surpassing them with unique Lucene features and capabilities such as easy installation on any Java-supporting platform, and support for document fields and field searches. And of course, <a href="http://apache.org/LICENSE"> a pragmatic software license</a>. </p> <p> To reach this goal we'll implement code to support the following objectives that augment but do not replace the current Lucene feature set. </p> <ul> <li> Document Location Independence - meaning mapping real contexts to runtime contexts. Essentially, if the document is at /var/www/htdocs/mydoc.html, I probably want it indexed as http://www.bigevilmegacorp.com/mydoc.html. </li> <li> Standard methods of creating central indices - file system indexing is probably less useful in many environments than is *remote* indexing (for instance over HTTP). I would suggest that most folks would prefer that general functionality be supported by Lucene instead of having to write code for every indexing project. Obviously, if what they are doing is *special* they'll have to code, but general document indexing across webservers would not qualify. </li> <li> Document interpretation abstraction - currently one must handle document object construction via custom code. A standard interface for plugging in format handlers should be supported. </li> <li> MIME and file-extension to document interpretation mapping. </li> </ul> </section> <section name="Indexers"> <p> Indexers are standard crawlers. 
They crawl a file system, ftp site, web site, etc. to create the index. These standard indexers may not make ALL of Lucene's functionality available, though they should be able to make most of it available through configuration. </p> <!--<section name="AbstractIndexer">--> <p> <b> Abstract Indexer </b> </p> <p> The AbstractIndexer is the parent class for all Indexer classes. It provides implementations for the following functions/properties: </p> <ul> <li> index path - where to write the index. </li> <li> cui - create or update the index </li> <li> root context - the start of the pathname that should be replaced by the replace with property or dropped entirely. Example: /opt/tomcat/webapps </li> <li> replace with - when specified, replaces the root context. Example: http://jakarta.apache.org. </li> <li> replacement type - the type of the replace with path: relative, url or path. </li> <li> location - the location to start indexing at. </li> <li> doctypes - only index documents with these doctypes. If not specified, all registered mime-types are used. Example: "xml,doc,html" </li> <li> recursive - turned off if not specified. </li> <li> level - optional depth of directories or links to traverse. Assumed to be infinite by default. Recursive must be turned on or this is ignored. Range: 0 - Long.MAX_VALUE. </li> <li> properties - in addition to the settings (probably given on the command line), read further settings from this properties file. Command line options override the properties file in the case of duplicates. There should also be an environment variable or VM parameter to set this. </li> </ul> <!--</section>--> <!--<s2 title="FileSystemIndexer">--> <p> <b>FileSystemIndexer</b> </p> <p> This should extend the AbstractIndexer and support any additional options required for a filesystem index. 
</p> <!--</s2>--> <!--<s2 title="HTTPIndexer">--> <p> <b>HTTP Indexer </b> </p> <p> Supports the AbstractIndexer options as well as: </p> <ul> <li> span hosts - Whether to span hosts or not; by default this should be no. </li> <li> restrict domains - (ignored if span hosts is not enabled). Whether all spanned hosts must be in the same domain (default is off). </li> <li> try directories - Whether to attempt directory listings or not (so if you recurse and go to /nextcontext/index.html this option says to also try /nextcontext to get the dir listing) </li> <li> map extensions - (always/default/never/fallback). Whether to use extension mapping always, by default (falling back to the mime type), never, or only as a fallback when the mime type is not available (the default). </li> <li> ignore robots - ignore robots.txt, on or off (default - off) </li> </ul> <!-- </s2> --> </section> <section name="MIMEMap"> <p> A configurable registry of document types, their description, an identifier, mime-type and file extension. This should map both MIME -> factory and extension -> factory. </p> <p> This might be configured at compile time or by a properties file, etc. For example: </p> <table> <tr> <td>Description</td> <td>Identifier</td> <td>Extensions</td> <td>MimeType</td> <td>DocumentFactory</td> </tr> <tr> <td>"Word Document"</td> <td>"doc"</td> <td>"doc"</td> <td>"vnd.application/ms-word"</td> <td>POIWordDocumentFactory</td> </tr> <tr> <td>"HTML Document"</td> <td>"html"</td> <td>"html,htm"</td> <td></td> <td>HTMLDocumentFactory</td> </tr> </table> </section> <section name="DocumentFactory"> <p> An interface for classes which create document objects for particular file types. Examples: HTMLDocumentFactory, DOCDocumentFactory, XLSDocumentFactory, XMLDocumentFactory. </p> </section> <section name="FieldMapping classes"> <p> A class that maps standard fields from the DocumentFactories into *fields* in the Document objects they create. 
I suggest that a regular expression system or xpath might be the most universal way to do this. For instance, if I had an XML factory that represented XML elements as fields, I could map content from particular fields to other fields or suppress them entirely. We could even make this configurable. </p> <p> For example: </p> <ul> <li> htmldoc.properties </li> <li> suppress=* </li> <li> author=content:g/author\:\ ........................................./ </li> <li> author.suppress=false </li> <li> title=content:g/title\:\ ........................................./ </li> <li> title.suppress=false </li> </ul> <p> In this example we map html documents such that all fields are suppressed but author and title. We map author and title to anything in the content matching author: (and x characters). Okay, my regular expressions suck, but hopefully you get the idea. </p> </section> <section name="Final Thoughts"> <p> We might also consider eliminating the DocumentFactory entirely by making an AbstractDocument from which the current document object would inherit. I experimented with this locally, and it was a relatively minor code change and there was of course no difference in performance. The Document Factory classes would instead be instances of various subclasses of AbstractDocument. </p> <p> My inspiration for this is HTDig (http://www.htdig.org/). While this goes slightly beyond what HTDig provides by providing field mapping (where HTDig is just interested in Strings/numbers wherever they are found), it provides at least what I would need to use this as a drop-in for most places I contract at (with the obvious exception of a default set of content handlers which would of course develop naturally over time). </p> <p> I am certainly able to contribute to this effort if the development community is open to it. I'd suggest we do it iteratively in stages and not aim for all of this at once (for instance leave out the field mapping at first). 
</p> <p> Anyhow, please give me some feedback, counter suggestions, let me know if I'm way off base or out of line, etc. -Andy </p> </section> </body> </document> 1.7 +4 -0 jakarta-lucene/xdocs/stylesheets/project.xml Index: project.xml =================================================================== RCS file: /home/cvs/jakarta-lucene/xdocs/stylesheets/project.xml,v retrieving revision 1.6 retrieving revision 1.7 diff -u -r1.6 -r1.7 --- project.xml 26 Jan 2002 16:15:22 -0000 1.6 +++ project.xml 23 Feb 2002 22:02:55 -0000 1.7 @@ -22,6 +22,10 @@ <item name="Javadoc" href="/api/index.html"/> <item name="Contributions" href="/contributions.html"/> </menu> + + <menu name="Plans"> + <item name="Application Extensions" href="/luceneplan.html"/> + </menu> <menu name="Download"> <item name="Binaries" href="/site/binindex.html"/>