On Jan 15, 2006, at 3:34 PM, Robert Kirchgessner wrote:

There was even a patch to that problem:

http://issues.apache.org/jira/browse/LUCENE-211

This is a large and somewhat hard-to-read patch. Some of it looks familiar. It looks like he's concatenating the field name with the token text, which is sort of the right idea, though you need to take some precautions for field names of differing lengths, and I didn't immediately detect any. (KinoSearch uses the field number, which corresponds to the lexically sorted field name at index time, encoded as a big-endian 16-bit int.)
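To illustrate the field-name length problem, here is a minimal Python sketch. It is not taken from the patch or from KinoSearch; the function names and the sample fields "a" and "ab" are made up for the example, and term text is encoded as plain UTF-8 for simplicity.

```python
import struct

def naive_key(field_name: str, term_text: str) -> bytes:
    # Hypothetical naive approach: glue field name and term text together.
    return (field_name + term_text).encode("utf-8")

def numbered_key(field_num: int, term_text: str) -> bytes:
    # Fixed-width prefix: field number as a big-endian unsigned 16-bit int.
    return struct.pack(">H", field_num) + term_text.encode("utf-8")

# Field numbers follow the lexical order of the field names.
field_nums = {name: num for num, name in enumerate(sorted(["a", "ab"]))}

# (field name, term text) pairs; field "a" should sort ahead of field "ab".
postings = [("ab", "a"), ("a", "c")]

by_naive = sorted(postings, key=lambda p: naive_key(p[0], p[1]))
by_number = sorted(postings, key=lambda p: numbered_key(field_nums[p[0]], p[1]))

print(by_naive)   # "aba" sorts before "ac", so field "ab" wrongly comes first
print(by_number)  # the fixed-width prefix keeps field "a" first
```

Because the prefix is always two bytes, the comparison never runs from the field part into the term text, which is what goes wrong with the naive key.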

The interesting thing to me is that it doesn't seem to feed an external sorter. If I understand the concept correctly, he's feeding a sortpool for minMergeDocuments documents, creating a mini-index (minMergeDocuments in size), then falling back to the primary merge model. If that isn't what the patch does, well... that concept would still work, and it would be nice not to need an external sorter.
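A rough Python sketch of that concept, as I understand it and not as the patch necessarily implements it: postings are pooled in memory for a fixed number of documents, sorted and flushed as a mini-index, and the mini-indexes are merged afterwards. The names build_mini_segments and merge_segments are hypothetical, min_merge_docs stands in for minMergeDocuments, and a "segment" here is just a sorted list of (term, doc_id) pairs.

```python
import heapq

def build_mini_segments(docs, min_merge_docs):
    """Pool postings for min_merge_docs documents, then sort and flush."""
    segments = []
    pool = []
    docs_in_pool = 0
    for doc_id, terms in enumerate(docs):
        pool.extend((term, doc_id) for term in terms)
        docs_in_pool += 1
        if docs_in_pool == min_merge_docs:
            segments.append(sorted(pool))  # in-memory sort, no external sorter
            pool, docs_in_pool = [], 0
    if pool:
        segments.append(sorted(pool))      # flush the partial last pool
    return segments

def merge_segments(segments):
    """Stand-in for the primary merge model: k-way merge of sorted runs."""
    return list(heapq.merge(*segments))

docs = [["b", "a"], ["a"], ["c"], ["a", "b"]]
segments = build_mini_segments(docs, min_merge_docs=3)
merged = merge_segments(segments)
print(segments)
print(merged)
```

The sort only ever sees one pool's worth of postings at a time, so memory use is bounded by the pool size rather than by the size of the whole collection.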

Yes, the binary format is fully compatible with that of Lucene, as
is the read/write/search logic.

So...

   * You use Sun's "Modified UTF-8" (not true UTF-8) to
     encode character data.
   * The VInt counts at the head of strings represent Java
     chars, not Unicode code points or bytes.
   * You've run tests with source material containing
     null bytes, Unicode characters outside the Basic
     Multilingual Plane, and corrupt character data (e.g.,
     broken UTF-8), and you are confident that indexes produced
     by Lucene and PHPLucene from such data are mutually compatible.
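To make the first two points concrete, here is a small Python sketch of a string write as described above: a VInt holding the number of Java chars (UTF-16 code units), followed by the character data in Modified UTF-8. The function names are mine and this is not code from Lucene or PHPLucene, just an illustration of why a null byte or a character outside the Basic Multilingual Plane produces different bytes than true UTF-8.

```python
def utf16_units(text: str):
    """Yield the UTF-16 code units (Java chars) of a string."""
    raw = text.encode("utf-16-be", "surrogatepass")
    for i in range(0, len(raw), 2):
        yield (raw[i] << 8) | raw[i + 1]

def modified_utf8(text: str) -> bytes:
    """Encode each Java char separately, as Modified UTF-8 does."""
    out = bytearray()
    for unit in utf16_units(text):
        if unit == 0:
            out += b"\xc0\x80"            # null becomes two bytes
        elif unit < 0x80:
            out.append(unit)
        elif unit < 0x800:
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:                              # includes each surrogate half
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)

def vint(n: int) -> bytes:
    """Variable-length int: 7 bits per byte, high bit set means 'more'."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def write_string(text: str) -> bytes:
    char_count = sum(1 for _ in utf16_units(text))  # Java chars, not bytes
    return vint(char_count) + modified_utf8(text)

sample = "a\x00\U0001D11E"   # 'a', a null, and one character outside the BMP
print(write_string(sample).hex())
print(sample.encode("utf-8").hex())
```

For the sample string, the VInt prefix is 4 (one char each for "a" and the null, two for the surrogate pair), while the string has 3 code points and its Modified UTF-8 form has 9 bytes. A reader that assumes either of the other two counts will lose its place in the file.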

By the way, though the project
emerged as a Lucene implementation in PHP, I soon switched
to writing a pure C library with a binding to PHP. Now it's
mostly a C project.

KinoSearch has taken a similar path of late, adding more and more XS (Perl's C API).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

