[ https://issues.apache.org/jira/browse/LUCENE-1591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12699024#action_12699024 ]
Shai Erera commented on LUCENE-1591:
------------------------------------

Well ... that worries me. When I open the bz2 file (with Notepad++), I see the same line, but on my machine readLine() fails with that MIE. It's as if on my machine the readLine() call attempts to fill the buffer of the BR and then hits the exception, while on your machine it just stops in the middle.

So I wonder how to fix it. LineDocMaker's logic is OK - makeDocument() just reads lines. There's no point adding code that tries to compensate for OS-specific weirdness.

Perhaps we can change the 'else' part (which assigns "" to title, body and date) to throw a RuntimeException (or MIE) in that case, since obviously this shouldn't happen, and if it does, it's really a bug in the file format?

Or I can just remove the test ... but I think the above suggestion makes sense and will solve it.

Mike, if you agree, can you quickly apply that to your env. and note whether the test fails? (It must fail, but I just want to be sure.)

> Enable bzip compression in benchmark
> ------------------------------------
>
>                 Key: LUCENE-1591
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1591
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 2.9
>
>         Attachments: commons-compress-dev20090413.jar,
> commons-compress-dev20090413.jar, LUCENE-1591.patch, LUCENE-1591.patch,
> LUCENE-1591.patch, LUCENE-1591.patch, LUCENE-1591.patch
>
>
> bzip compression can aid the benchmark package by not requiring extracting
> bzip files (such as enwiki) in order to index them. The plan is to add a
> config parameter bzip.compression=true/false and in the relevant tasks either
> decompress the input file or compress the output file using the bzip streams.
> It will add a dependency on ant.jar which contains two classes similar to
> GZIPOutputStream and GZIPInputStream which compress/decompress files using
> the bzip algorithm.
> bzip is known to be superior in its compression performance to the gzip
> algorithm (~20% better compression), although it does the
> compression/decompression a bit slower.
> I will post a patch which adds this parameter and implements it in
> LineDocMaker, EnwikiDocMaker and WriteLineDoc task. Maybe even add the
> capability to DocMaker or some of the super classes, so it can be inherited
> by all sub-classes.
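
For illustration only, the fail-fast change suggested in the comment (throw instead of assigning "" to title, date and body when a line cannot be split) might look roughly like the sketch below. LineParseSketch, parseLine, the tab separator and the title/date/body field order are all assumptions made for this example; this is not the actual LineDocMaker code.

```java
/** Hypothetical sketch of fail-fast line parsing; not the actual LineDocMaker source. */
final class LineParseSketch {
    // Assumed field separator; the real line-file format may use something else.
    static final char SEP = '\t';

    /**
     * Splits a line into {title, date, body} (assumed order).
     * Throws instead of silently returning empty fields when a separator is missing.
     */
    static String[] parseLine(String line) {
        int first = line.indexOf(SEP);
        int second = (first == -1) ? -1 : line.indexOf(SEP, first + 1);
        if (second == -1) {
            // A truncated or corrupt line is treated as a bug in the input file.
            throw new RuntimeException("line: [" + line + "] is in an invalid format");
        }
        return new String[] {
            line.substring(0, first),
            line.substring(first + 1, second),
            line.substring(second + 1)
        };
    }

    public static void main(String[] args) {
        String[] fields = parseLine("title\t2009-04-15\tbody text");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
        try {
            parseLine("truncated line without separators");
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this shape, a line that was cut off in the middle surfaces as an exception on every platform, rather than producing an empty document on one machine and a decoding error on another.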