[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700460#action_12700460 ]
Michael McCandless commented on LUCENE-1591:
--------------------------------------------

For the record, here's the patch I had applied to XercesJ 2.9.1 sources:

{code}
--- UTF8Reader.java 2006-11-23 00:36:53.000000000 +0100
+++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java 2008-04-04 00:40:58.000000000 +0200
@@ -534,6 +534,16 @@
                         invalidByte(4, 4, b2);
                     }
 
+                    // check if output buffer is large enough to hold 2 surrogate chars
+                    if( out + 1 >= offset + length ){
+                        fBuffer[0] = (byte)b0;
+                        fBuffer[1] = (byte)b1;
+                        fBuffer[2] = (byte)b2;
+                        fBuffer[3] = (byte)b3;
+                        fOffset = 4;
+                        return out - offset;
+                    }
+
                     // decode bytes into surrogate characters
                     int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
                     if (uuuuu > 0x10) {
{code}

> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: commons-compress-dev20090413.jar,
> commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch,
> LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch,
> LUCENE-1591.patch
>
>
> bzip compression can aid the benchmark package by not requiring extracting
> bzip files (such as enwiki) in order to index them. The plan is to add a
> config parameter bzip.compression=true/false and in the relevant tasks either
> decompress the input file or compress the output file using the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to
> GZIPOutputStream and GZIPInputStream which compress/decompress files using
> the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip
> algorithm (~20% better compression), although it does the
> compression/decompression a bit slower.
> I will post a patch which adds this parameter and implement it in
> LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the
> capability to DocMaker or some of the super classes, so it can be inherited
> by all sub-classes.
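
For readers who want to see what the UTF8Reader patch in the comment above guards against, here is a minimal standalone sketch. It is not the Xerces code: the class name SurrogateRoomCheck and the method decodeFourBytes are hypothetical, and it skips the byte validation that the real reader performs. It only shows the same kind of room check, where a 4-byte UTF-8 sequence needs two free char slots and the decoder must stop early if only one is left.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; this is not the Xerces UTF8Reader. */
public class SurrogateRoomCheck {

    /**
     * Decodes one 4-byte UTF-8 sequence into a surrogate pair at out[pos] and
     * out[pos + 1]. Returns 2 when the pair was written, or 0 when fewer than
     * two chars of space remain before 'end' (exclusive). Nothing is written
     * in the second case, so the caller can keep the bytes and retry later.
     * No validation of the input bytes is done here (assumed well-formed).
     */
    public static int decodeFourBytes(int b0, int b1, int b2, int b3,
                                      char[] out, int pos, int end) {
        // Same shape of test as the patch: index pos + 1 must still be in range.
        if (pos + 1 >= end) {
            return 0;
        }
        // Standard UTF-8 layout: 3 + 6 + 6 + 6 payload bits.
        int codePoint = ((b0 & 0x07) << 18)
                      | ((b1 & 0x3F) << 12)
                      | ((b2 & 0x3F) << 6)
                      |  (b3 & 0x3F);
        out[pos] = Character.highSurrogate(codePoint);
        out[pos + 1] = Character.lowSurrogate(codePoint);
        return 2;
    }

    public static void main(String[] args) {
        // U+1F600 lies outside the BMP, so it encodes to four UTF-8 bytes.
        byte[] utf8 = new String(Character.toChars(0x1F600))
                .getBytes(StandardCharsets.UTF_8);
        char[] buf = new char[3];

        // Only one slot left (index 2 of 3): the decoder must refuse.
        int written = decodeFourBytes(utf8[0] & 0xFF, utf8[1] & 0xFF,
                                      utf8[2] & 0xFF, utf8[3] & 0xFF, buf, 2, 3);
        System.out.println("one slot free, chars written: " + written);

        // Two slots left (indices 1 and 2): the pair fits.
        written = decodeFourBytes(utf8[0] & 0xFF, utf8[1] & 0xFF,
                                  utf8[2] & 0xFF, utf8[3] & 0xFF, buf, 1, 3);
        System.out.println("two slots free, chars written: " + written);
    }
}
```

The patch itself goes one step further than this sketch: instead of just reporting that nothing was written, it saves the four bytes in fBuffer and sets fOffset so that the next read call can decode them.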