http://nagoya.apache.org/bugzilla/show_bug.cgi?id=27326

[PATCH] minor performance enhancements for DocumentWriter.invertDocument()

           Summary: [PATCH] minor performance enhancements for
                    DocumentWriter.invertDocument()
           Product: Lucene
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: Enhancement
          Priority: Other
         Component: Index
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


This patch includes two small performance improvements:

1. Switch from Hashtable to HashMap and preset the capacity to avoid
   resizing the HashMap (barely measurable improvement, but easy).

2. Add a new Analyzer.tokenStream() method that takes a String instead of
   a Reader, and call this from within DocumentWriter.invertDocument().
   This allows subclasses of Analyzer to provide a more efficient
   tokenizer for Strings. (The default implementation just uses a
   StringReader.)

I was able to write a variant on LowercaseAnalyzer (not included) that's
about 10% faster for my dataset. It works by converting the entire field
value with String.toLowerCase() and then using String.substring() to
extract the string for each token. This avoids allocating individual
char[] arrays inside String for each token, because String.substring()
shares its char[] array with the original.
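
For readers without the patch at hand, the two ideas can be sketched roughly as follows. This is not the code from the patch and does not use the Lucene API: the class SubstringTokenizerSketch and its methods tokenize() and countTerms() are hypothetical names made up for illustration. It assumes tokens are maximal runs of letters, and it assumes a capacity of (expected size / 0.75) + 1 is enough to keep a HashMap with the default load factor from resizing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not Lucene's.
public class SubstringTokenizerSketch {

    // Lowercase the whole field value once, then cut each token out of the
    // lowercased string with substring() at letter/non-letter boundaries.
    static List<String> tokenize(String value) {
        // Locale.ROOT is an assumption made here for predictable results.
        String lower = value.toLowerCase(Locale.ROOT);
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none
        for (int i = 0; i < lower.length(); i++) {
            if (Character.isLetter(lower.charAt(i))) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                tokens.add(lower.substring(start, i));
                start = -1;
            }
        }
        if (start >= 0) {
            tokens.add(lower.substring(start)); // token running to the end
        }
        return tokens;
    }

    // Count term frequencies in a HashMap whose capacity is set up front,
    // so it should not need to resize while it is being filled.
    static Map<String, Integer> countTerms(List<String> tokens) {
        int capacity = (int) (tokens.size() / 0.75f) + 1;
        Map<String, Integer> counts = new HashMap<>(capacity);
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("Hello, World! hello again");
        System.out.println(tokens); // prints [hello, world, hello, again]
        System.out.println(countTerms(tokens));
    }
}
```

One caveat on the substring() approach: the shared char[] behaviour applies to the JDKs current when this report was filed. Since Java 7 update 6, String.substring() copies the characters into a new array, so on a modern JVM the allocation saving would need to be measured again.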