Re: Flexible index format / Payloads Cont'd

robert engels Mon, 31 Jul 2006 08:28:57 -0700

Doing this breaks compatibility with non-Java Lucene implementations. Not sure it matters, but I thought I would point it out. I have always thought that Lucene should be compatible at an API level only, and MAYBE create a network access protocol for queries and updates.


On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:

On Friday 21 July 2006 12:37, Marvin Humphrey wrote:

On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:

In fact, that was my first implementation. The problem with that is
you can only store one value. But thinking a little more about it,
storing one or more values is not an issue, because with the solution
I proposed, no space is saved at all.

In fact, when I thought about this format of field metadata, I was
thinking about a way to let the Lucene user specify how to store it in
the Lucene index format. For instance, the simple one would specify
that it's a pointer to some metadata (as you proposed), another one
would specify that there are two pointers (in my use case, one for the
type, the other for the language), and another one would specify that
it will be stored directly, as it is actually an integer (so no need to
make a pointer to an integer). But it was just a thought, I don't know
if it is possible. WDYT?


I'm thinking that there would be a codecs file, say with the
extension .cdx and this format:

   Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
   CodecCount     --> Uint32
   CodecClassName --> String

That file would be read in its entirety when the index was
initialized and expanded into an array of codec objects, one per
CodecClassName.
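
A minimal sketch of reading such a file, under stated assumptions: `CodecsFileSketch` is a hypothetical helper, not a Lucene class; `writeUTF`/`readUTF` stand in for Lucene's own String encoding; and the count is read as a signed 32-bit int, which covers the Uint32 range only up to 2^31 - 1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the proposed .cdx layout; not part of Lucene.
class CodecsFileSketch {

    // Write CodecCount, then one string per codec class name.
    static byte[] write(List<String> codecClassNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(codecClassNames.size());
            for (String name : codecClassNames) {
                out.writeUTF(name); // stand-in for Lucene's String format
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read the whole file up front, as it would be done at index initialization.
    static List<String> read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int codecCount = in.readInt();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < codecCount; i++) {
                names.add(in.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each class name read this way could then be turned into a codec object (for instance by reflection), giving the array indexed by codec number.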

The .fdx file would add an additional int per doc...

   FieldIndex (.fdx) -->  <FieldValuesPosition,
                           FieldValuesCodecNumber>SegSize
   FieldValuesPosition    --> Uint64
   FieldValuesCodecNumber --> Uint32

Now, before you read any data from the .fdt file, you know how to
interpret it.  You seek the .fdt IndexInput to the right spot, then
feed it to the appropriate codec object from the codecs array.  The
codec does the rest.  In your case, you might write a codec that
would read a few bytes and strings of metadata up front.  Or you
might have several different codecs, the identity of which indicates
fixed values for certain metadata fields: FrenchDocument,
ArabicDocument, etc.
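
The lookup described above could be sketched roughly as follows. Everything here is illustrative: `FieldCodec`, `FieldIndexSketch`, and `tagged` are hypothetical names, `ByteBuffer` stands in for Lucene's IndexInput, and the stored data is assumed to be a one-byte length followed by UTF-8 text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical codec interface: decodes one document's stored data from .fdt.
interface FieldCodec {
    String decode(ByteBuffer fdt);
}

class FieldIndexSketch {
    // One .fdx entry: FieldValuesPosition (8 bytes) + FieldValuesCodecNumber (4 bytes).
    static final int ENTRY_SIZE = Long.BYTES + Integer.BYTES;

    // Look up the entry for docNum, seek into .fdt, and hand off to the codec.
    static String readDoc(ByteBuffer fdx, ByteBuffer fdt, FieldCodec[] codecs, int docNum) {
        int base = docNum * ENTRY_SIZE;
        long position = fdx.getLong(base);
        int codecNumber = fdx.getInt(base + Long.BYTES);
        ByteBuffer view = fdt.duplicate(); // independent position, shared bytes
        view.position((int) position);     // sketch only: assumes .fdt fits in an int offset
        return codecs[codecNumber].decode(view);
    }

    // Example codec whose identity fixes the language, like FrenchDocument above.
    static FieldCodec tagged(String language) {
        return in -> {
            int length = in.get() & 0xFF;  // assumed 1-byte length prefix
            byte[] text = new byte[length];
            in.get(text);
            return "[" + language + "]" + new String(text, StandardCharsets.UTF_8);
        };
    }
}
```

Because every .fdx entry has a fixed width, the entry for a document is found by multiplication alone; the cost is the extra four bytes per document.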

Would that scheme meet your needs?

That looks good, but there is one restriction: it has to be per document.

Let me explain my needs a little more.

In fact my app has to index some data which is structured as an RDF graph. Each RDF resource has a title and a description, each title and description being in different languages. The model we chose is to map an RDF resource onto a document. Then the field name is the URI of the RDF property, and the field value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc

So, in a document I will have several fields with different languages. For my use case, in fact I need only one "codec". It is a codec that will get 3 values, 2 of them being optional: a language, a type, and a value.
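
One possible encoding for such a three-value field, sketched under assumptions: a leading flags byte records which optional values are present, and `writeUTF` stands in for Lucene's String format. `TypedLiteral` and `TypedLiteralCodec` are hypothetical names, not existing Lucene classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value holder: language and type may be null, value may not.
class TypedLiteral {
    final String language;
    final String type;
    final String value;

    TypedLiteral(String language, String type, String value) {
        this.language = language;
        this.type = type;
        this.value = value;
    }
}

class TypedLiteralCodec {
    static final int HAS_LANGUAGE = 1; // bit 0 of the flags byte
    static final int HAS_TYPE = 2;     // bit 1 of the flags byte

    static byte[] encode(TypedLiteral literal) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int flags = (literal.language != null ? HAS_LANGUAGE : 0)
                      | (literal.type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (literal.language != null) out.writeUTF(literal.language);
            if (literal.type != null) out.writeUTF(literal.type);
            out.writeUTF(literal.value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static TypedLiteral decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int flags = in.readUnsignedByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            String value = in.readUTF();
            return new TypedLiteral(language, type, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, a field such as title:[fr]truc costs one flags byte plus the two strings, and a field with neither language nor type costs only one byte more than the bare value.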

In fact I was thinking about a more generic version that would preserve format compatibility, keeping .fdx as is:

FieldData (.fdt) -->  <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter would be the current one; it would read the RawData as Bits, Value, with Value -->  String | BinaryValue, ...

Then, for my app, I will provide a custom FieldsDataWriter that will do exactly what I want.

What I don't know yet is how much this breaks the API... because if I want to provide my own FieldsDataWriter, I would also want to have my own implementation of Fieldable...
If you think this is a good idea, I will try to implement it.

cheers,
Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


