On Jun 29, 2006, at 2:22 PM, Michael Busch wrote:
- Is there a concrete design?
Not that I am aware of.
I have the feeling that many people are interested in having a
flexible index format. There are already various use cases:
- Efficient parametric search
This comes at the expense of a significant increase in file size and
a performance hit. Think of a book index that lists not only page
numbers but also a category for each entry.
axle => 3, 67, 89, 244
vs...
axle => 3 cars, 67 cars, 89 trucks, 244 cars
Scanning through the latter is going to be more expensive. It might
be worth it in specific cases, but it's not the long-hoped-for
panacea that would give Lucene all the features of an RDBMS without
incurring any costs. :)
- Part Of Speech (POS) annotations with each position
This is an example of where it might be worth it... to Grant, and
Grant only.
Personally, I'm less interested in adding new features than I am in
solidifying and improving the core.
The benefits I care about are:
  * Decouple Lucene from its file format.
      o Make back-compatibility easier.
      o Make refactoring easier.
      o All the other goodness that comes with loose coupling.
  * Improve IR precision by writing a Boolean Scorer that
    takes position into account, a la Brin/Page '98.
  * Decrease time to launch a Searcher from rest.
  * Simplify Lucene, conceptually.
      o Indexes would have three parts: Term dictionary,
        Postings, and Storage.
      o Each part could be pluggable, following this format:
        <header><object>+
          * The de-serialization for each object is determined by
            a plugin spec'd in the header.
          * It's probably better to have separate header and data
            files.
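
To make the <header><object>+ idea concrete, here is a minimal sketch
in Python. Everything in it is an assumption for illustration: the
JSON header, the "codec" key, the DECODERS registry, and the two toy
record layouts are hypothetical and are not part of Lucene or of any
agreed design. It only shows a header selecting the de-serializer for
the objects that follow, with header and data kept separate.

```python
import io
import json
import struct

# Hypothetical plugin registry: codec name -> function reading ONE object.
DECODERS = {
    # a bare 4-byte document number
    "doc": lambda inp: struct.unpack(">I", inp.read(4))[0],
    # a 4-byte document number followed by a 1-byte boost
    "doc_boost": lambda inp: struct.unpack(">IB", inp.read(5)),
}

def load_part(header_bytes, data_bytes):
    """Decode <object>+ from data_bytes using the plugin named in the header."""
    header = json.loads(header_bytes)
    decode = DECODERS[header["codec"]]      # chosen once, not per object
    inp = io.BytesIO(data_bytes)
    objects = []
    while inp.tell() < len(data_bytes):
        objects.append(decode(inp))
    return objects

header = json.dumps({"codec": "doc_boost"}).encode()
data = struct.pack(">IB", 3, 10) + struct.pack(">IB", 67, 20)
print(load_part(header, data))
```

Because the codec is looked up once from the header, swapping in a new
encoding would mean registering another entry rather than touching the
reader loop.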
I would suggest splitting the whole work into smaller work items
with clearly defined milestones. Thus I suggest the following steps:
1. Introduce postings file with the following format:
<DocDelta, Payload>*
DocDelta --> VInt
DocDelta/2 is the difference between this document number and
the previous document number.
Payload --> Byte, if DocDelta is even
Payload --> <Payload_Length, Payload_Data>, if DocDelta is odd
Payload_Length --> VInt
Payload_Data --> Byte^Payload_Length
Good stuff! Now, if you put that whole thing in a plugin, you'll
have the chance to refine it even after deployment if you think of a
way to improve it -- by adding another plugin. And, if it becomes
too unwieldy and inflexible, you're not stuck with it.
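
For reference, the <DocDelta, Payload>* layout from step 1 could be
written and read roughly as below. This is a sketch in Python under
stated assumptions: VInt is taken to be the usual 7-bits-per-byte
encoding with the high bit as a continuation flag, the first delta is
measured from document 0, and the reader simply runs to the end of the
buffer (a real reader would presumably get the posting count from the
term dictionary). The function names are made up for the example.

```python
import io

def write_vint(out, value):
    """Write a non-negative int, 7 bits per byte, low-order bits first."""
    while value > 0x7F:
        out.write(bytes([(value & 0x7F) | 0x80]))  # high bit: more bytes follow
        value >>= 7
    out.write(bytes([value]))

def read_vint(inp):
    result, shift = 0, 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

def write_postings(out, postings):
    """postings: (doc_num, payload_bytes) pairs in increasing doc_num order."""
    last = 0
    for doc, payload in postings:
        delta = doc - last
        last = doc
        if len(payload) == 1:
            write_vint(out, delta << 1)            # even: one-byte payload
        else:
            write_vint(out, (delta << 1) | 1)      # odd: length-prefixed payload
            write_vint(out, len(payload))
        out.write(payload)

def read_postings(data):
    inp = io.BytesIO(data)
    postings, doc = [], 0
    while inp.tell() < len(data):
        code = read_vint(inp)
        doc += code >> 1                           # DocDelta/2 is the doc gap
        length = read_vint(inp) if code & 1 else 1
        postings.append((doc, inp.read(length)))
    return postings

buf = io.BytesIO()
write_postings(buf, [(3, b"\x01"), (67, b"\x01"), (89, b"\x02\x03"), (244, b"")])
print(read_postings(buf.getvalue()))
```

The low bit of DocDelta is what keeps the common one-byte payload
cheap: only the odd case pays for a length prefix.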
Furthermore, it should be possible to enable/disable payloads
at the field level.
Maybe each field should get its own file, and its own encoding/
decoding object. Then you don't have to check each object/record to
see which codec to use.
Or maybe there should be an array of codec objects, indexed by field
number.
fieldNum = input->readVint();
decoders[fieldNum].read(input);
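
Fleshed out a little, the array-of-codecs variant might look like the
following Python sketch. The two decoder classes and their record
layouts are hypothetical, invented only to show the dispatch; the
VInt reader assumes the usual 7-bits-per-byte encoding.

```python
import io

def read_vint(inp):
    """Read a 7-bits-per-byte variable-length int (high bit = continue)."""
    result, shift = 0, 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result
        shift += 7

class DocOnlyDecoder:
    """Hypothetical codec: a record is just a doc delta."""
    def read(self, inp):
        return ("doc", read_vint(inp))

class DocFreqDecoder:
    """Hypothetical codec: a record is a doc delta followed by a frequency."""
    def read(self, inp):
        return ("doc+freq", read_vint(inp), read_vint(inp))

def read_records(data, decoders):
    """Each record starts with a field number that selects its decoder."""
    inp = io.BytesIO(data)
    records = []
    while inp.tell() < len(data):
        field_num = read_vint(inp)
        records.append((field_num, decoders[field_num].read(inp)))
    return records

decoders = [DocOnlyDecoder(), DocFreqDecoder()]   # indexed by field number
print(read_records(bytes([0, 5, 1, 2, 7, 0, 1]), decoders))
```

Note the cost this variant carries: every record spends at least one
byte on the field number, which the one-file-per-field layout avoids.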
2. Add multilevel skipping (tree structure) for the postings-file.
One-level skipping, as currently used in Lucene, is probably
not efficient enough for the new postings file, because it can
be very big. Question: Should we include skipping information
directly in the postings file, or should we introduce a new file
containing the skipping information? I think it should improve
cache performance to have the skip tree in a different file.
Interesting. I think I'd punt and leave it up to the plugin. Maybe
you'd have an extra large header if there was a lot of stuff to be
cached.
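
As a rough illustration of what multilevel skipping in step 2 buys, here
is a small in-memory sketch in Python. It is not Lucene's skip format:
the fixed interval of 4, the level layout, and the function names are
assumptions made for the example, and the skip levels are held in lists
separate from the postings, loosely mirroring the separate-file option.

```python
def build_skip_levels(docs, interval=4):
    """levels[k][j] is the doc number at posting index (j + 1) * interval**(k + 1)."""
    levels = []
    step = interval
    while step < len(docs):
        levels.append([docs[i] for i in range(step, len(docs), step)])
        step *= interval
    return levels

def skip_to(docs, levels, target, interval=4):
    """Return the index of the first posting with doc >= target, or len(docs)."""
    pos = 0
    for k in range(len(levels) - 1, -1, -1):      # coarsest level first
        step = interval ** (k + 1)
        j = pos // step                           # first entry on this level past pos
        while j < len(levels[k]) and levels[k][j] <= target:
            pos = (j + 1) * step                  # jump ahead without touching postings
            j += 1
    while pos < len(docs) and docs[pos] < target: # short linear scan to finish
        pos += 1
    return pos

docs = list(range(0, 300, 3))                     # 100 sorted doc numbers
levels = build_skip_levels(docs)
print(skip_to(docs, levels, 200))
```

Each level narrows the range by a factor of the interval, so the number
of entries examined grows with the number of levels rather than with
the length of the postings list.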
3. Optional: Add a type-system for the payloads to make it
easier to develop PostingsWriter/Reader plugins.
IMO, this should wait. It's going to be freakishly difficult to get
this stuff to work and maintain the commitments that Doug has laid
out for backwards compatibility. There's also going to be trade-
offs, and so I'd anticipate contentious, interminable debate along
the lines of the recent Java 1.4/1.5 thread once there's real code
and it becomes clear who's lost a clock tick or two.
Actually, I think pushing this forward is going to be so difficult
that I'll be focusing my attention on implementing it elsewhere.
4. Make the PostingsWriter/Reader pluggable and develop default
PostingsWriter/Reader plugins that store frequencies, positions,
and norms as payloads in the postings file. This should be
configurable, to enable the different options Doug suggested:
a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+
Got any ideas as to how the Field constructors should look?
5. Develop new or extend existing PostingsWriter/Reader plugins for
desired features like XML search, POS, multi-faceted search, ...
People will definitely want to scratch their own itches, but I'd
argue that this stuff should start out private. And maybe stay that
way!
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/