otis 2002/06/04 08:29:32 Modified: xdocs luceneplan.xml Log: - Fixed spelling a bit. - Nukes trailing blank spaces. Revision Changes Path 1.4 +68 -68 jakarta-lucene/xdocs/luceneplan.xml Index: luceneplan.xml =================================================================== RCS file: /home/cvs/jakarta-lucene/xdocs/luceneplan.xml,v retrieving revision 1.3 retrieving revision 1.4 diff -u -r1.3 -r1.4 --- luceneplan.xml 24 Feb 2002 16:33:21 -0000 1.3 +++ luceneplan.xml 4 Jun 2002 15:29:32 -0000 1.4 @@ -1,5 +1,5 @@ <?xml version="1.0" encoding="UTF-8"?> - + <document> <properties> <title>Plan for enhancements to Lucene</title> @@ -8,7 +8,7 @@ </authors> </properties> <body> - + <section name="Purpose"> <p> The purpose of this document is to outline plans for @@ -21,8 +21,8 @@ The best reference is <a href="http://www.htdig.org"> htDig</a>, though it is not quite as sophisticated as Lucene, it has a number of features that make it - desireable. It however is a traditional c-compiled app - which makes it somewhat unpleasent to install on some + desirable. It however is a traditional c-compiled app + which makes it somewhat unpleasant to install on some platforms (like Solaris!). </p> <p> @@ -30,42 +30,42 @@ community for an initial reaction, advice, feedback and consent. Following this it will be submitted to the Lucene user community for support. Although, I'm (Andy - Oliver) capable of providing these enhancements by - myself, I'd of course prefer to work on them in concert + Oliver) capable of providing these enhancements by + myself, I'd of course prefer to work on them in concert with others. </p> <p> - While I'm outlaying a fairly large featureset, these can + While I'm outlaying a fairly large feature set, these can be implemented incrementally of course (and are probably best if done that way). </p> </section> - + <section name="Goal and Objectives"> <p> The goal is to provide features to Lucene that allow it - to be used as a dropin search engine. It should provide + to be used as a drop-in search engine. It should provide many of the features of projects like <a href="http://www.htdig.org">htDig</a> while surpassing - them with unique Lucene features and capabillities such as + them with unique Lucene features and capabilities such as easy installation on and java-supporting platform, - and support for document fields and field searches. And + and support for document fields and field searches. And of course, <a href="http://apache.org/LICENSE"> a pragmatic software license</a>. </p> <p> To reach this goal we'll implement code to support the following objectives that augment but do not replace - the current Lucene featureset. + the current Lucene feature set. </p> <ul> <li> - Document Location Independance - meaning mapping + Document Location Independence - meaning mapping real contexts to runtime contexts. Essentially, if the document is at /var/www/htdocs/mydoc.html, I probably want it indexed as - http://www.bigevilmegacorp.com/mydoc.html. + http://www.bigevilmegacorp.com/mydoc.html. </li> <li> Standard methods of creating central indicies - @@ -73,21 +73,21 @@ many environments than is *remote* indexing (for instance http). I would suggest that most folks would prefer that general functionality be - suppored by Lucene instead of having to write + supported by Lucene instead of having to write code for every indexing project. Obviously, if what they are doing is *special* they'll have to - code, but general document indexing accross - webservers would not qualify. + code, but general document indexing across + web servers would not qualify. </li> <li> - Document interperatation abstraction - currently + Document interpretation abstraction - currently one must handle document object construction via custom code. A standard interface for plugging - in format handlers should be supported. + in format handlers should be supported. </li> <li> Mime and file-extension to document - interperatation mapping. + interpretation mapping. </li> </ul> </section> @@ -128,7 +128,7 @@ </li> <li> replacement type - the type of - replacewith path: relative, url or + replace with path: relative, URL or path. </li> <li> @@ -153,8 +153,8 @@ 0 - Long.MAX_VALUE. </li> <li> - SleeptimeBetweenCalls - can be used to - avoid flooding a machine with too many + SleeptimeBetweenCalls - can be used to + avoid flooding a machine with too many requests </li> <li> @@ -163,12 +163,12 @@ inactivity. </li> <li> - IncludeFilter - include only items - matching filter. (can occur mulitple + IncludeFilter - include only items + matching filter. (can occur multiple times) </li> <li> - ExcludeFilter - exclude only items + ExcludeFilter - exclude only items matching filter. (can occur multiple times) </li> @@ -196,9 +196,9 @@ (probably from the command line) read this properties file and get them from it. Command line options override - the properties file in the case of + the properties file in the case of duplicates. There should also be an - enivironment variable or VM parameter to + environment variable or VM parameter to set this. </li> </ul> @@ -209,8 +209,8 @@ </p> <p> This should extend the AbstractCrawler and - support any addtional options required for a - filesystem index. + support any additional options required for a + file system index. </p> <!--</s2>--> <!--<s2 title="HTTPIndexer">--> @@ -218,12 +218,12 @@ <b>HTTP Crawler </b> </p> <p> - Supports the AbstractCrawler options as well as: + Supports the AbstractCrawler options as well as: </p> <ul> <li> - span hosts - Wheter to span hosts or not, - by default this should be no. + span hosts - Whether to span hosts or not, + by default this should be no. </li> <li> restrict domains - (ignored if span @@ -237,11 +237,11 @@ recurse and go to /nextcontext/index.html this option says to also try /nextcontext to get the dir - lsiting) + listing) </li> <li> map extensions - - (always/default/never/fallback). Wether + (always/default/never/fallback). Whether to always use extension mapping, by default (fallback to mime type), NEVER or fallback if mime is not available @@ -254,12 +254,12 @@ </ul> <!-- </s2> --> </section> - + <section name="MIMEMap"> <p> A configurable registry of document types, their - description, an identifyer, mime-type and file - extension. This should map both MIME -> factory + description, an identifier, mime-type and file + extension. This should map both MIME -> factory and extension -> factory. </p> <p> @@ -287,7 +287,7 @@ <td>"html,htm"</td> <td></td> <td>HTMLDocumentFactory</td> - </tr> + </tr> </table> </section> <section name="DocumentFactory"> @@ -300,17 +300,17 @@ </section> <section name="FieldMapping classes"> <p> - A class taht maps standard fields from the + A class that maps standard fields from the DocumentFactories into *fields* in the Document objects they create. I suggest that a regular expression system or xpath might be the most universal way to do this. For instance if perhaps I had an XML factory that represented XML elements as fields, I could map content - from particular fields to ther fields or supress them + from particular fields to their fields or suppress them entirely. We could even make this configurable. </p> <p> - + for example: </p> <ul> @@ -333,48 +333,48 @@ title.suppress=false </li> </ul> - <p> - In this example we map html documents such that all - fields are suppressed but author and title. We map - author and title to anything in the content matching - author: (and x characters). Okay my regular expresions + <p> + In this example we map html documents such that all + fields are suppressed but author and title. We map + author and title to anything in the content matching + author: (and x characters). Okay my regular expresions suck but hopefully you get the idea. </p> </section> <section name="Final Thoughts"> <p> - We might also consider eliminating the DocumentFactory - entirely by making an AbstractDocument from which the - current document object would inherit from. I - experimented with this locally, and it was a relatively - minor code change and there was of course no difference - in performance. The Document Factory classes would - instead be instances of various subclasses of + We might also consider eliminating the DocumentFactory + entirely by making an AbstractDocument from which the + current document object would inherit from. I + experimented with this locally, and it was a relatively + minor code change and there was of course no difference + in performance. The Document Factory classes would + instead be instances of various subclasses of AbstractDocument. </p> <p> - My inspiration for this is HTDig (http://www.htdig.org/). - While this goes slightly beyond what HTDig provides by - providing field mapping (where HTDIG is just interested - in Strings/numbers wherever they are found), it provides - at least what I would need to use this as a dropin for - most places I contract at (with the obvious exception of - a default set of content handlers which would of course + My inspiration for this is HTDig (http://www.htdig.org/). + While this goes slightly beyond what HTDig provides by + providing field mapping (where HTDIG is just interested + in Strings/numbers wherever they are found), it provides + at least what I would need to use this as a drop-in for + most places I contract at (with the obvious exception of + a default set of content handlers which would of course develop naturally over time). </p> <p> - I am able to certainly contribute to this effort if the - development community is open to it. I'd suggest we do - it iteratively in stages and not aim for all of this at + I am able to certainly contribute to this effort if the + development community is open to it. I'd suggest we do + it iteratively in stages and not aim for all of this at once (for instance leave out the field mapping at first). </p> <p> - - Anyhow, please give me some feedback, counter - suggestions, let me know if I'm way off base or out of + + Anyhow, please give me some feedback, counter + suggestions, let me know if I'm way off base or out of line, etc. -Andy </p> </section> - + </body> </document>
-- To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>