Hi, I'm the author of CLucene (a C++ port of Lucene). I've been following the 'using byte count as prefix' discussion, and I think it ties into something we are trying to achieve.
We are trying to optimise the way index writing works, and we also want to be able to index and store fields which use a Reader object. The second part is, in theory, easy to solve: we can use a stream filter to buffer the reads that the analyser makes, and integrate the FieldsWriter into the invertDocument function so that the buffers are written while the analysers run. Since there is no way of knowing the length of the reader in advance, we would then have to go back and write the field length.

Here is where the problem is, though: this is not currently possible, because we use a VInt for the field data length, and the number of bytes a VInt occupies depends on its value. If we could use fixed-length integers for the field data length, two things would become much easier:

1) Memory optimisations. Compressed fields, for example, would benefit: we would not have to hold the entire compressed output in memory, but could write it directly to the fields output stream.

2) Storing AND indexing a field from a reader in a single pass, removing the need to read it twice (which may not be possible for some reader implementations).

The second feature is very important for us!

So I would like to propose a discussion on how this could be achieved. My idea is to set a bit in the config, something like FIELD_DONT_USE_VINT. I don't think using a fixed-length Int for every field is necessary; those few extra (unnecessary) bytes for each field would add up to a lot. A fixed-length Int would only be used when strictly necessary, and the implementation could decide when to use it.

These are the rough changes that I think would need to be made:

    final Document doc(int n) throws IOException {
      ...
      byte bits = fieldsStream.readByte();
      boolean dontUseVint = (bits & FieldsWriter.FIELD_DONT_USE_VINT) != 0;
      ...
      << Binary fields, whether compressed or plain binary, are an easy change... >>
      if ((bits & FieldsWriter.FIELD_IS_BINARY) != 0) {
        final byte[] b = new byte[dontUseVint ? fieldsStream.readInt()
                                              : fieldsStream.readVInt()];   << CHANGE HERE
        ...
      if (compressed) {
        final byte[] b = new byte[dontUseVint ? fieldsStream.readInt()
                                              : fieldsStream.readVInt()];   << CHANGE HERE
        ...
      << Reading a field value as a string >>
      String value;
      if (dontUseVint) {
        << I'm not completely sure about this section, since changes
           relating to 'bytecount as prefix' would affect this >>
        int length = fieldsStream.readInt();
        char[] chars = new char[length];
        fieldsStream.readChars(chars, 0, length);
        value = new String(chars, 0, length);
      } else {
        value = fieldsStream.readString();
      }
      Field f = new Field(fi.name,     // name
                          value,       // read value << CHANGE HERE - use different string length
                          store,
                          index,
                          termVector);
      ...

Now, before Lucene 2.0 is released, is probably the best time to implement something like this. I think it wouldn't be a complicated change; for now, we don't need to make any changes to the FieldsWriter (optimisations that use this can be done later).

ben

On 5/7/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> Got it. This was the problem, in TermInfosWriter.writeTerm():
>
>     -    lastTerm = term;
>     +    lastBytes = bytes;
>        }
>
> Without lastTerm being updated, the auxiliary term dictionary got
> screwed up. This problem only manifested on large tests because small
> tests never moved past the first entry, which is always a field number
> of -1 and an empty string.
>
> I'll post a full working patch to JIRA as soon as I'm at a location
> where I can connect my laptop to the net.
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
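P.S. To make the back-patching idea from the proposal concrete, here is a minimal Java sketch. It is only an illustration under assumptions, not CLucene or Lucene code: the class FixedLengthPrefixSketch and its methods (vIntSize, writeField, readField) are hypothetical names, java.io.RandomAccessFile stands in for the fields output stream, and characters are written as 2-byte UTF-16 units rather than in Lucene's own string encoding. The point it tries to show is that a 4-byte placeholder can be overwritten once the reader is exhausted, whereas a VInt placeholder cannot, because its size depends on the value.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.StringReader;
import java.nio.ByteBuffer;

// Hypothetical sketch: none of these names exist in Lucene or CLucene.
final class FixedLengthPrefixSketch {

    // Number of bytes a VInt-style encoding (7 payload bits per byte) needs.
    // It varies with the value, which is why a VInt cannot be patched in place.
    static int vIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            size++;
        }
        return size;
    }

    // Streams the reader to 'out' in one pass, then goes back and fills in the
    // 4-byte length prefix. Returns the file position where the field starts.
    static long writeField(RandomAccessFile out, Reader reader) throws IOException {
        long lengthPos = out.getFilePointer();
        out.writeInt(0);                       // placeholder, always 4 bytes
        char[] buf = new char[1024];
        int total = 0;
        int n;
        while ((n = reader.read(buf)) != -1) {
            // Assumption: chars are stored as big-endian UTF-16 code units.
            ByteBuffer bytes = ByteBuffer.allocate(n * 2);
            bytes.asCharBuffer().put(buf, 0, n);
            out.write(bytes.array(), 0, n * 2);
            total += n;
        }
        long end = out.getFilePointer();
        out.seek(lengthPos);
        out.writeInt(total);                   // back-patch the real length
        out.seek(end);                         // continue after the field data
        return lengthPos;
    }

    // Reads back a field written by writeField, starting at 'pos'.
    static String readField(RandomAccessFile in, long pos) throws IOException {
        in.seek(pos);
        int length = in.readInt();
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = in.readChar();
        }
        return new String(chars, 0, length);
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("fields", ".bin");
        file.deleteOnExit();
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            long pos = writeField(raf, new StringReader("stored and indexed in one pass"));
            System.out.println(readField(raf, pos));
        }
        System.out.println("VInt bytes for 127: " + vIntSize(127));
        System.out.println("VInt bytes for 128: " + vIntSize(128));
    }
}
```

The cost of this approach is the one already mentioned in the proposal: every field that uses the fixed-length form pays 4 bytes for its length, which is why it would be enabled per field through a flag bit rather than globally.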