On Mon, Sep 22, 2008 at 12:35:53AM -0600, Nathan Kurz wrote:
>> Let's set a goal of implementing PForDelta, and figure out how to get there
>> from here.
>
> Sure, that seems like a fine goal. I'm not sure if you meant it this
> way, but I think it would be great to add support for PForDelta to the
> existing VByte support rather than just replacing it. While PForDelta
> might fully replace VByte eventually, it would be good to design an
> architecture that allows for multiple file formats, so that tricks
> like using Lucene index files directly are theoretically possible,
Supporting the Lucene index file format would be very costly -- it's complex,
fragile, opaque, and a moving target. It is not designed for interchange; the
spec doc is a post hoc description of what Lucene does rather than the product
of a goal-oriented design process. It contains no human readable or editable
components, making it difficult to debug or spelunk. It gets updated
frequently to accommodate minor optimizations, the rationale being that Lucene
is a low-level component of many other projects and the widespread trickle-down
performance benefits justify the increased maintenance costs and complexity.

Instead, we should strive to spec out a better index format ourselves:
* Human readable metadata.
* More files but simpler binary formats.
* Global field semantics.
* Explicit mechanisms for extensibility.
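
To make the first and last bullets concrete, here is a rough sketch of what
human-readable, versioned segment metadata might look like. Nothing here is
decided: JSON as the encoding, the key names ("format", "doc_count", "fields",
"extensions"), and both function names are assumptions made up for
illustration, and Python is used only because it is compact.

```python
import json

# Hypothetical sketch only: these keys and names are not from any spec.
SUPPORTED_FORMAT = 1


def dump_segment_meta(doc_count, fields, extensions=None):
    """Serialize segment metadata as text a person can read and edit."""
    meta = {
        "format": SUPPORTED_FORMAT,      # explicit file format version
        "doc_count": doc_count,
        "fields": fields,                # field name -> type
        "extensions": extensions or {},  # namespaced data owned by plugins
    }
    return json.dumps(meta, indent=2, sort_keys=True)


def load_segment_meta(text):
    """Parse metadata, refusing versions newer than this reader knows."""
    meta = json.loads(text)
    fmt = meta.get("format")
    if not isinstance(fmt, int) or fmt > SUPPORTED_FORMAT:
        raise ValueError("unsupported segment format: %r" % (fmt,))
    # Entries under "extensions" that this reader does not recognize are
    # simply carried along, so extensions do not break older readers.
    return meta


if __name__ == "__main__":
    text = dump_segment_meta(3, {"title": "text"},
                             {"com.example.plugin": {"x": 1}})
    print(text)
    print(load_segment_meta(text)["doc_count"])
```

The point of the sketch is that a file like this can be inspected with a
pager and fixed with a text editor, and that the version check plus a
reserved "extensions" slot give us an explicit place to hang new features.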
It would be easier to write some Lucene contrib modules to support a
well-designed Lucy format than to chase Lucene's format with C code. And if
we nail it and write a really good spec, we may be able to persuade the Java
Lucene community to come to us.

I'll add a new directory in SVN, trunk/devel/file_format, where we can put the
spec documents. For now, HTML is probably best; we can switch to XML to
facilitate transforms once the spec starts to mature. I'll also fire up JIRA
issues for discussing each section as it gets added.

Here's a provisional TOC:

1. Conventions
2. Overview
3. Virtual File System (VFS)
4. Segment Files
5. Schema
6. Primitives
7. File Format Details
   a. snapshot
   b. write.lock
   c. schema
   d. segment metadata
   e. compound file
   f. document storage
   g. lexicon
   h. postings
8. Extensibility

We can kick things off with the VFS section, since I've already written a draft.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/