Doing this breaks compatibility with non-Java Lucene implementations. Not sure it matters, but I thought I would point it out. I have always thought that Lucene should be compatible at the API level only, and MAYBE define a network access protocol for queries and updates.

On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:

On Friday, 21 July 2006 at 12:37, Marvin Humphrey wrote:
On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
In fact, that was my first implementation. The problem with it is that
you can only store one value. But thinking a little more about it,
storing one or more values is not an issue, because with the solution
I proposed no space is saved at all.
In fact, when I thought about this format for field metadata, I was
thinking about a way to let the Lucene user specify how it is stored in
the Lucene index format. For instance, the simplest one would specify
that it is a pointer to some metadata (as you proposed), another would
specify that there are two pointers (in my use case, one for the type
and one for the language), and another would specify that the value is
stored directly because it is actually an integer (so there is no need
for a pointer to an integer). But it was just a thought; I don't know
if it is possible. WDYT?

I'm thinking that there would be a codecs file, say with the
extension .cdx and this format:

   Codecs (.cdx)  --> CodecCount, <CodecClassName>CodecCount
   CodecCount     --> Uint32
   CodecClassName --> String

That file would be read in its entirety when the index was
initialized and expanded into an array of codec objects, one per
CodecClassName.
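
The .cdx layout above could be written and read back roughly as follows. This is only a sketch under assumptions: the class name `CodecsFile` is hypothetical, `writeInt` stores a signed 32-bit value where the proposal says Uint32, and `writeUTF` stands in for whatever String encoding the index format would really use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the proposed .cdx file; not part of Lucene.
final class CodecsFile {
    private CodecsFile() {}

    // Writes CodecCount, then CodecCount class names.
    static byte[] write(List<String> codecClassNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(codecClassNames.size()); // CodecCount
            for (String name : codecClassNames) {
                out.writeUTF(name); // assumption: stand-in String encoding
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the whole file at once, as the proposal suggests doing at index open.
    static List<String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            List<String> names = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                names.add(in.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The position of a name in the returned list is the codec number that an .fdx entry would refer to. Turning each name into a codec object (for example by reflection) is left out here.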

The .fdx file would add an additional int per doc...

   FieldIndex (.fdx) -->  <FieldValuesPosition,
                           FieldValuesCodecNumber>SegSize
   FieldValuesPosition    --> Uint64
   FieldValuesCodecNumber --> Uint32

Now, before you read any data from the .fdt file, you know how to
interpret it.  You seek the .fdt IndexInput to the right spot, then
feed it to the appropriate codec object from the codecs array.  The
codec does the rest.  In your case, you might write a codec that
would read a few bytes and strings of metadata up front.  Or you
might have several different codecs, the identity of which indicates
fixed values for certain metadata fields: FrenchDocument,
ArabicDocument, etc.
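
A minimal sketch of the lookup described above, assuming in-memory `ByteBuffer`s in place of Lucene's IndexInput. The names `FieldCodec`, `StoredFieldsSketch` and `fixedLanguage` are hypothetical, and the .fdt payload (a length-prefixed UTF-8 string) is invented purely for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical codec contract: decode one document's data from .fdt.
interface FieldCodec {
    String decode(ByteBuffer fdt);
}

final class StoredFieldsSketch {
    // One .fdx entry: FieldValuesPosition (8 bytes) + FieldValuesCodecNumber (4 bytes).
    static final int ENTRY_BYTES = Long.BYTES + Integer.BYTES;

    private final ByteBuffer fdx;
    private final ByteBuffer fdt;
    private final FieldCodec[] codecs;

    StoredFieldsSketch(ByteBuffer fdx, ByteBuffer fdt, FieldCodec[] codecs) {
        this.fdx = fdx;
        this.fdt = fdt;
        this.codecs = codecs;
    }

    String document(int docId) {
        int entry = docId * ENTRY_BYTES;          // entries are fixed width
        long position = fdx.getLong(entry);       // where the data starts in .fdt
        int codecNumber = fdx.getInt(entry + Long.BYTES);
        ByteBuffer input = fdt.duplicate();       // independent read position
        input.position(Math.toIntExact(position));
        return codecs[codecNumber].decode(input); // the codec does the rest
    }

    // A codec whose identity fixes a metadata value, in the spirit of FrenchDocument.
    static FieldCodec fixedLanguage(String language) {
        return fdt -> language + ":" + readString(fdt);
    }

    // Assumption: strings are stored as a 4-byte length followed by UTF-8 bytes.
    static String readString(ByteBuffer in) {
        byte[] bytes = new byte[in.getInt()];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Because every .fdx entry has the same width, the entry for a document is found by multiplication alone, and the codec is known before any byte of .fdt is read.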

Would that scheme meet your needs?

That looks good, but there is one restriction: it has to be per document.
Let me explain my needs a little more.

In fact, my app has to index data that is structured as an RDF graph. Each RDF resource has a title and a description, and each title and description is available in several languages. The model we chose is to map an RDF resource to a document. The field name is then the URI of the RDF property, and the field value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com   title:[en]foo   title:[fr]truc
So, in one document I will have several fields in different languages. For my use case, I in fact need only one "codec": one that takes 3 values, 2 of them optional: a language, a type, and a value.
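
One way such a single codec could lay out the three values is a leading flag byte that records which optional values are present. This is a sketch under assumptions, not the format proposed in this thread: the class `TypedValue`, the flag constants, and the use of `writeUTF` for strings are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical field value carrying an optional language and an optional type.
final class TypedValue {
    static final int HAS_LANGUAGE = 0x1;
    static final int HAS_TYPE = 0x2;

    final String language; // null when absent
    final String type;     // null when absent
    final String value;    // always present

    TypedValue(String language, String type, String value) {
        this.language = language;
        this.type = type;
        this.value = Objects.requireNonNull(value);
    }

    // Layout: flags byte, [language], [type], value.
    byte[] encode() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int flags = (language != null ? HAS_LANGUAGE : 0)
                      | (type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value); // note: writeUTF caps a string at 65535 encoded bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static TypedValue decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int flags = in.readUnsignedByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            return new TypedValue(language, type, in.readUTF());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, `title:[fr]truc` would be `new TypedValue("fr", null, "truc")`, and a field with neither language nor type costs only the one flag byte on top of its value.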

In fact, I was thinking about a more generic version that would preserve
format compatibility, keeping .fdx as is:

FieldData (.fdt) -->  <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter would be the current one; it would treat
RawData as Bits, Value, with Value -->  String | BinaryValue,....
Then, for my app, I would provide a custom FieldsDataWriter that does
exactly what I want.

What I don't know yet is how this breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.

cheers,
Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


