No progress yet. I think my next move is to do what I did when trying to get KinoSearch to write Lucene-compatible indexes:
1) Generate an optimized, split-file-format Lucene index from a pathological test corpus.

2) Hack KinoSearch so that it ought to produce an index identical to the Lucene-generated one except for the segments file (which contains a timestamp). This involves overriding the segment-naming routine, setting termIndexInterval to 128, and thwarting CompoundFileWriter's attempts to merge the index files. Also, it's tricky to get multiple fields to match up number-wise, so I generally just use one... Then generate the KinoSearch index.

3) Run a script which performs a byte-by-byte comparison of each index file and reports the first byte where something differs.

4) Dive in with a hexdumper. Calculate VInts mentally. Memorize the data formats for each index file. Think like a TermInfosWriter. Twiddle the test corpus so that it produces the smallest possible index while still exposing differences.

5) Consume many aspirin.

6) Tweak 'n' repeat until the indexes are identical.

7) Tweak 'n' repeat until identical searches produce identical results.

The only differences are that this time KinoSearch will provide the authoritative index, since it already uses bytecounts (I'll use version 0.05, since the current version 0.10 has changes to .fdt), and that I won't be able to use Luke to verify the search results.

Maybe some version of the pathological test corpus and the sample index should be provided as a help for implementers.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
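P.S. The comparison script in step 3 is nothing fancy; a minimal sketch (in Python rather than Perl, and with a hypothetical `first_diff` helper name) might look like this:

```python
def first_diff(path_a, path_b):
    """Compare two files byte-by-byte.

    Returns the offset of the first byte where the files differ
    (a file ending early counts as a difference at that offset),
    or None if the files are byte-for-byte identical.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(1)
            b = fb.read(1)
            if a != b:
                return offset
            if not a:  # both files hit EOF at the same offset
                return None
            offset += 1
```

From there it's straight to the hexdumper at the reported offset.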
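P.P.S. For anyone who'd rather not calculate VInts mentally in step 4: a VInt stores 7 bits of payload per byte, least-significant group first, with the high bit set on every byte except the last. A little decoder (again a Python sketch, not anything from either codebase) saves some aspirin:

```python
def read_vint(data, offset=0):
    """Decode one VInt from `data` starting at `offset`.

    Each byte contributes its low 7 bits; groups are stored
    least-significant first; a set high bit means more bytes follow.
    Returns (value, offset_past_the_vint).
    """
    value = 0
    shift = 0
    while True:
        b = data[offset]
        offset += 1
        value |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: this was the last byte
            return value, offset
        shift += 7
```

So 0x80 0x01 decodes to 128, and anything up to 127 fits in a single byte.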