[ https://issues.apache.org/jira/browse/LUCENE-510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-510:
--------------------------------------

    Attachment: LUCENE-510.take2.patch

New rev of the patch. I think it's ready to commit. I'll wait a few days.

I made some performance improvements by factoring out a new UnicodeUtil class that does not allocate new objects for every conversion to/from UTF8.

One new issue I fixed is the handling of invalid UTF-16 strings. Specifically if the UTF16 text has invalid surrogate pairs, UTF-8 is unable to represent it (unlike the current modified UTF-8 Lucene format). I changed DocumentsWriter & UnicodeUtil to substitute the replacement char U+FFFD for such invalid surrogate characters. This affects terms, stored String fields and term vectors.

Indexing performance has a small slowdown (3.5%); details are below.

Unfortunately, time to enumerate terms was more affected. I made a simple test that enumerates all terms from the index (= ~3.3 million terms) created below:

  public class TestTermEnum {
    public static void main(String[] args) throws Exception {
      IndexReader r = IndexReader.open(args[0]);
      TermEnum terms = r.terms();
      int count = 0;
      long t0 = System.currentTimeMillis();
      while(terms.next())
        count++;
      long t1 = System.currentTimeMillis();
      System.out.println(count + " terms in " + (t1-t0) + " millis");
      r.close();
    }
  }

On trunk with current index format this takes 3104 msec (best of 5). With the patch with UTF8 index format it takes 3443 msec = 10.9% slower. I don't see any further ways to make this faster.
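To illustrate the U+FFFD substitution described above, here is a minimal sketch of a UTF-16 to UTF-8 encoder that reuses its output buffer between calls. It is not the UnicodeUtil class from the patch: the class name Utf16ToUtf8Sketch and its methods are hypothetical, and the patch may handle these cases differently.

```java
// Hypothetical sketch, not the patch's UnicodeUtil.
final class Utf16ToUtf8Sketch {
    // Output buffer kept between calls so a conversion normally allocates nothing.
    private byte[] buf = new byte[16];

    byte[] buffer() { return buf; }

    // Encodes s as UTF-8 into the reusable buffer and returns the byte count.
    // Any surrogate that is not part of a valid high+low pair becomes U+FFFD.
    int encode(CharSequence s) {
        int max = s.length() * 3;            // worst case: 3 bytes per UTF-16 code unit
        if (buf.length < max) buf = new byte[max];
        int upto = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int cp = c;
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    cp = Character.toCodePoint(c, s.charAt(++i));  // valid pair
                } else {
                    cp = 0xFFFD;                                   // unpaired high surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                cp = 0xFFFD;                                       // unpaired low surrogate
            }
            if (cp < 0x80) {
                buf[upto++] = (byte) cp;
            } else if (cp < 0x800) {
                buf[upto++] = (byte) (0xC0 | (cp >> 6));
                buf[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                buf[upto++] = (byte) (0xE0 | (cp >> 12));
                buf[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[upto++] = (byte) (0x80 | (cp & 0x3F));
            } else {
                buf[upto++] = (byte) (0xF0 | (cp >> 18));
                buf[upto++] = (byte) (0x80 | ((cp >> 12) & 0x3F));
                buf[upto++] = (byte) (0x80 | ((cp >> 6) & 0x3F));
                buf[upto++] = (byte) (0x80 | (cp & 0x3F));
            }
        }
        return upto;
    }

    public static void main(String[] args) {
        Utf16ToUtf8Sketch enc = new Utf16ToUtf8Sketch();
        System.out.println(enc.encode("a\uD800b"));      // 'a' + U+FFFD + 'b'
        System.out.println(enc.encode("\uD83D\uDE00"));  // one valid surrogate pair
    }
}
```

In this sketch the string "a\uD800b" encodes to five bytes (0x61, 0xEF 0xBF 0xBD, 0x62), because the lone high surrogate is replaced by U+FFFD, while a valid pair encodes to the usual four-byte sequence.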
Details on the indexing performance test:

  analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
  doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
  docs.file=/Volumes/External/lucene/wiki.txt
  doc.stored = true
  doc.term.vector = true
  doc.add.log.step=2000
  directory=FSDirectory
  autocommit=false
  compound=false
  ram.flush.mb=64

  { "Rounds"
    ResetSystemErase
    { "BuildIndex"
      CreateIndex
      { "AddDocs" AddDoc > : 200000
      - CloseIndex
    }
    NewRound
  } : 5

  RepSumByPrefRound BuildIndex

I ran it on a quad-core Intel Mac Pro, with 4 drive RAID 0 array, running OS 10.4.11, java 1.5, run with these command-line args:

  -server -Xbatch -Xms1024m -Xmx1024m

Best of 5 with current trunk is 921.2 docs/sec and with patch it's 888.7 = 3.5% slowdown.

> IndexOutput.writeString() should write length in bytes
> ------------------------------------------------------
>
>                 Key: LUCENE-510
>                 URL: https://issues.apache.org/jira/browse/LUCENE-510
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>    Affects Versions: 2.1
>            Reporter: Doug Cutting
>            Assignee: Michael McCandless
>         Attachments: LUCENE-510.patch, LUCENE-510.take2.patch, SortExternal.java, strings.diff, TestSortExternal.java
>
>
> We should change the format of strings written to indexes so that the length of the string is in bytes, not Java characters. This issue has been discussed at:
> http://www.mail-archive.com/java-dev@lucene.apache.org/msg01970.html
> We must increment the file format number to indicate this change. At least the format number in the segments file should change.
> I'm targetting this for 2.1, i.e., we shouldn't commit it to trunk until after 2.0 is released, to minimize incompatible changes between 1.9 and 2.0 (other than removal of deprecated features).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
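As a rough illustration of the format change requested in the quoted issue (string length written in bytes rather than Java characters), the sketch below writes a variable-length integer byte count followed by the UTF-8 bytes. The class ByteLengthStringSketch and its methods are hypothetical and use only the JDK. The exact on-disk encoding in the patch is not shown in this message, so the variable-length prefix here is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Lucene's IndexOutput.
final class ByteLengthStringSketch {
    // Writes i as a variable-length int: 7 bits per byte, high bit set on all but the last.
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        out.write(i);
    }

    // Length prefix counts UTF-8 bytes, so a reader can skip the string without decoding it.
    static byte[] writeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Reads back a string written by writeString, starting at offset 0.
    static String readString(byte[] data) {
        int pos = 0;
        int b = data[pos++];
        int len = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos++];
            len |= (b & 0x7F) << shift;
        }
        return new String(data, pos, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "na\u00EFve";                   // 5 Java chars, 6 UTF-8 bytes
        byte[] written = writeString(s);
        System.out.println(s.length() + " chars, prefix says " + written[0] + " bytes");
        System.out.println(readString(written).equals(s));
    }
}
```

With a byte-count prefix, the five-character string "na\u00EFve" is stored with a prefix of 6, and a reader that only needs to skip the string can advance by that many bytes without decoding any characters.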