On Jan 15, 2006, at 3:34 PM, Robert Kirchgessner wrote:
> There was even a patch for that problem: http://issues.apache.org/jira/browse/LUCENE-211
This is a large and somewhat hard-to-read patch. Some stuff looks familiar. It looks like he's concatenating the field name with the token text, which is sort of the right idea, though you need to take some precautions for field names of differing lengths, and I didn't immediately spot any such precautions in the patch. (KinoSearch uses the field number, which corresponds to the lexically sorted field name at index-time, encoded as a big-endian 16-bit int.)
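To illustrate the idea (not KinoSearch's actual code; TermKey, buildKey, and the field numbers are made up for the example), here's a minimal sketch of composing a term's sort key from a fixed-width big-endian field number plus the term text, so that byte-wise comparison orders terms by field first and field names of different lengths can't garble the order:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch: compose a term sort key from a 16-bit big-endian
    // field number prefix plus the term text.  The fixed-width prefix keeps
    // terms grouped by field and sorting correctly regardless of field-name
    // length, unlike naive "fieldname + tokentext" concatenation.
    public class TermKey {
        static byte[] buildKey(int fieldNum, String termText) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeShort(fieldNum);                         // big-endian 16-bit field number
            out.write(termText.getBytes(StandardCharsets.UTF_8)); // term text after the prefix
            return bytes.toByteArray();
        }

        // Simple unsigned byte-wise comparator.
        static int compare(byte[] x, byte[] y) {
            int len = Math.min(x.length, y.length);
            for (int i = 0; i < len; i++) {
                int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
                if (diff != 0) return diff;
            }
            return x.length - y.length;
        }

        public static void main(String[] args) throws IOException {
            byte[] a = buildKey(0, "apple");    // field 0 ("body", say)
            byte[] b = buildKey(1, "aardvark"); // field 1 ("title", say)
            // Byte-wise comparison now sorts by field first, then term text:
            // a < b even though "aardvark" < "apple" lexically.
            System.out.println(compare(a, b) < 0);  // prints: true
        }
    }

A fixed-width prefix sidesteps the collisions and interleaving that variable-length field names can cause when they're simply prepended to the term text.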
The interesting thing to me is that it doesn't seem to feed an external sorter. If I understand the concept correctly, he's feeding a sort pool for minMergeDocuments documents, creating a small mini-index (minMergeDocuments in size), then falling back to the primary merge model. If that isn't what the patch does, well... the concept would still work, and it would be nice not to need an external sorter.
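For what it's worth, here's a rough sketch of that batch-and-merge concept as I read it -- buffer up to a threshold, sort the batch in memory, flush it as a sorted run (the "mini-index"), then k-way merge the runs. The class, the names, and the RUN_SIZE threshold are illustrative, not taken from the patch or from KinoSearch:

    import java.io.*;
    import java.nio.file.*;
    import java.util.*;

    // Illustrative sketch: buffer postings in memory, sort and flush a small
    // run each time the buffer fills, then k-way merge the runs into one
    // sorted stream -- no dedicated external sorter required.
    public class RunMerger {
        static final int RUN_SIZE = 4;                  // stand-in for minMergeDocuments
        private final List<String> buffer = new ArrayList<>();
        private final List<Path> runs = new ArrayList<>();

        void add(String posting) throws IOException {
            buffer.add(posting);
            if (buffer.size() >= RUN_SIZE) flushRun();  // flush a sorted mini run
        }

        private void flushRun() throws IOException {
            if (buffer.isEmpty()) return;
            Collections.sort(buffer);                   // in-memory sort of the batch
            Path run = Files.createTempFile("run", ".txt");
            Files.write(run, buffer);                   // the "mini-index"
            runs.add(run);
            buffer.clear();
        }

        // One cursor per run; the heap always yields the smallest current head.
        private static final class Cursor implements Comparable<Cursor> {
            final BufferedReader in;
            String head;
            Cursor(BufferedReader in) throws IOException { this.in = in; head = in.readLine(); }
            boolean advance() throws IOException { head = in.readLine(); return head != null; }
            public int compareTo(Cursor other) { return head.compareTo(other.head); }
        }

        void merge(Writer out) throws IOException {
            flushRun();                                  // don't lose the final partial batch
            PriorityQueue<Cursor> heap = new PriorityQueue<>();
            for (Path run : runs) {
                Cursor c = new Cursor(Files.newBufferedReader(run));
                if (c.head != null) heap.add(c);
            }
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                out.write(c.head);
                out.write('\n');
                if (c.advance()) heap.add(c);            // reinsert with its next line
                else c.in.close();
            }
            out.flush();
        }

        public static void main(String[] args) throws IOException {
            RunMerger m = new RunMerger();
            for (String s : new String[] {"pear", "apple", "fig", "kiwi", "date", "cherry"}) {
                m.add(s);
            }
            m.merge(new OutputStreamWriter(System.out)); // prints everything in sorted order
        }
    }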
> Yes, the binary format is fully compatible with that of Lucene, as is the read/write/search logic.
So...

* You use Sun's "Modified UTF-8" (not true UTF-8) to encode character data.

* The VInt counts at the head of strings represent Java chars, not Unicode code points or bytes. (See the sketch below.)

* You've run tests with source material containing null bytes, Unicode characters outside the Basic Multilingual Plane, and corrupt character data (e.g., broken UTF-8), and you are confident that indexes produced by Lucene and PHPLucene from such data are mutually compatible.
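To make the first two points concrete, here's a sketch of a writer that follows that layout -- a VInt count of Java chars followed by each char in Modified UTF-8. The class and method names are illustrative, not Lucene's API; the main() shows how a single supplementary code point ends up counted as two chars and encoded in six bytes where true UTF-8 would use four:

    import java.io.ByteArrayOutputStream;

    // Sketch matching the description above: a VInt count of Java chars
    // (UTF-16 code units, not code points or bytes), followed by each char in
    // "Modified UTF-8" (U+0000 becomes the two-byte sequence C0 80, and
    // supplementary characters become two 3-byte surrogate encodings).
    public class ModifiedUtf8Writer {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();

        void writeString(String s) {
            writeVInt(s.length());                // char count, NOT code point count
            for (int i = 0; i < s.length(); i++) {
                writeChar(s.charAt(i));           // surrogates written individually
            }
        }

        void writeVInt(int i) {
            while ((i & ~0x7F) != 0) {            // 7 bits per byte, high bit = "more"
                out.write((i & 0x7F) | 0x80);
                i >>>= 7;
            }
            out.write(i);
        }

        void writeChar(char c) {
            if (c >= 0x0001 && c <= 0x007F) {
                out.write(c);                     // plain ASCII: one byte
            } else if (c <= 0x07FF) {             // includes U+0000 -> C0 80
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {                              // includes each surrogate half
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }

        public static void main(String[] args) {
            // U+1D11E MUSICAL SYMBOL G CLEF: one code point, two Java chars,
            // and six encoded bytes here, where true UTF-8 would use four.
            String s = "\uD834\uDD1E";
            System.out.println("code points: " + s.codePointCount(0, s.length())); // 1
            System.out.println("Java chars:  " + s.length());                      // 2
            ModifiedUtf8Writer w = new ModifiedUtf8Writer();
            w.writeString(s);
            System.out.println("bytes (incl. 1-byte VInt): " + w.out.size());      // 7
        }
    }

That mismatch between char counts and byte/code point counts is exactly where two independent implementations are most likely to diverge, hence the questions above.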
> By the way, though the project emerged as a Lucene implementation in PHP, I soon switched to writing a pure C library with a binding to PHP. Now it's mostly a C project.
KinoSearch has taken a similar path of late, adding more and more XS (Perl's C API).
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/