Doing this breaks compatibility with non-Java Lucene implementations.
Not sure it matters, but I thought I would point it out. I have
always thought that Lucene should be compatible at an API level only,
and MAYBE create a network access protocol for queries and updates.
On Jul 31, 2006, at 10:25 AM, Nicolas Lalevée wrote:
On Friday, July 21, 2006, at 12:37, Marvin Humphrey wrote:
On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
In fact, that was my first implementation. The problem with that is
you can only store one value. But thinking a little more about it,
storing one or more values is not an issue, because with the solution
I proposed, no space is saved at all.
In fact, when I thought about this format of field metadata, I was
thinking about a way to let the Lucene user specify how to store it in
the Lucene index format. For instance, the simplest one would specify
that it's a pointer to some metadata (as you proposed), another one
would specify that there are two pointers (in my use case, one for the
type, the other one for the language), and another one would specify
that it will be stored directly, as it is actually an integer (so no
need to make a pointer to an integer). But it was just a thought, I
don't know if it is possible. WDYT?
I'm thinking that there would be a codecs file, say with the
extension .cdx and this format:
Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
CodecCount --> Uint32
CodecClassName --> String
That file would be read in its entirety when the index was
initialized and expanded into an array of codec objects, one per
CodecClassName.
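To make the proposed .cdx layout concrete, here is a minimal Java
sketch that writes and reads it. Everything in it is an assumption for
illustration: the class name CodecsFileSketch is hypothetical, a plain
ByteBuffer stands in for Lucene's index input/output, Uint32 is mapped
to a Java int, and String is written as a 4-byte length followed by
UTF-8 bytes rather than Lucene's own string encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed .cdx layout; not part of Lucene.
final class CodecsFileSketch {

    // Codecs (.cdx) --> CodecCount, then CodecCount class names.
    static byte[] write(List<String> codecClassNames) {
        List<byte[]> encoded = new ArrayList<>();
        int size = Integer.BYTES; // CodecCount
        for (String name : codecClassNames) {
            byte[] utf8 = name.getBytes(StandardCharsets.UTF_8);
            encoded.add(utf8);
            size += Integer.BYTES + utf8.length; // length prefix + bytes
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(codecClassNames.size());
        for (byte[] utf8 : encoded) {
            out.putInt(utf8.length);
            out.put(utf8);
        }
        return out.array();
    }

    // Reads the whole file up front, as the proposal describes.
    static List<String> read(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file);
        int codecCount = in.getInt();
        List<String> names = new ArrayList<>(codecCount);
        for (int i = 0; i < codecCount; i++) {
            byte[] utf8 = new byte[in.getInt()];
            in.get(utf8);
            names.add(new String(utf8, StandardCharsets.UTF_8));
        }
        return names;
    }
}
```

Turning each name into a codec object (for example by reflection)
would be a separate step on top of read(), and is left out here.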
The .fdx file would add an additional int per doc...
FieldIndex (.fdx) --> <FieldValuesPosition, FieldValuesCodecNumber>SegSize
FieldValuesPosition --> Uint64
FieldValuesCodecNumber --> Uint32
Now, before you read any data from the .fdt file, you know how to
interpret it. You seek the .fdt IndexInput to the right spot, then
feed it to the appropriate codec object from the codecs array. The
codec does the rest. In your case, you might write a codec that
would read a few bytes and strings of metadata up front. Or you
might have several different codecs, the identity of which indicates
fixed values for certain metadata fields: FrenchDocument,
ArabicDocument, etc.
Would that scheme meet your needs?
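The lookup described above (read the .fdx entry, seek into .fdt, hand
the input to the codec with that number) could be sketched roughly as
follows. This is only an illustration under stated assumptions:
FieldIndexSketch and DocCodec are hypothetical names, byte arrays and
ByteBuffer stand in for the real index files, and Uint64/Uint32 are
mapped to Java long/int.

```java
import java.nio.ByteBuffer;

// Hypothetical codec interface: decodes one document's stored data.
interface DocCodec {
    String decode(ByteBuffer fdt);
}

// Hypothetical sketch of the proposed .fdx entries; not part of Lucene.
final class FieldIndexSketch {
    // One entry per doc: FieldValuesPosition (8 bytes) + FieldValuesCodecNumber (4 bytes).
    static final int ENTRY_BYTES = Long.BYTES + Integer.BYTES;

    static byte[] write(long[] positions, int[] codecNumbers) {
        ByteBuffer out = ByteBuffer.allocate(positions.length * ENTRY_BYTES);
        for (int doc = 0; doc < positions.length; doc++) {
            out.putLong(positions[doc]);
            out.putInt(codecNumbers[doc]);
        }
        return out.array();
    }

    static long position(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getLong(doc * ENTRY_BYTES);
    }

    static int codecNumber(byte[] fdx, int doc) {
        return ByteBuffer.wrap(fdx).getInt(doc * ENTRY_BYTES + Long.BYTES);
    }

    // Seek the .fdt data to the doc's position, then let its codec do the rest.
    static String readDoc(byte[] fdx, byte[] fdt, DocCodec[] codecs, int doc) {
        ByteBuffer in = ByteBuffer.wrap(fdt);
        in.position((int) position(fdx, doc)); // sketch only: assumes .fdt fits in an array
        return codecs[codecNumber(fdx, doc)].decode(in);
    }
}
```

The codecs array here plays the role of the array built from the .cdx
file; the codec number stored in .fdx is simply an index into it.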
That looks good, but there is one restriction: it has to be per
document.
Let me explain my needs a little more.
In fact my app has to index some data which is structured as an RDF
graph. Each RDF resource has a title and a description, each title and
description being in different languages. The model we chose is to map
an RDF resource to a document. Then the field name is the URI of the
RDF property, and the field value is the literal or another resource.
For instance:
doc1 : URI:http://foo.com title:[en]foo title:[fr]truc
So, in a document I will have several fields with different
languages. For my use case, in fact I need only one "codec". It is a
codec that will get 3 values, 2 of them being optional: a language, a
type, and a value.
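One possible encoding for such a three-value field is a flags byte
that says which optional parts are present. The sketch below is an
assumption, not something taken from Lucene: the class name
FieldValueCodecSketch, the flag constants, and the length-prefixed
UTF-8 string encoding are all made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical per-field codec: optional language, optional type, required value.
final class FieldValueCodecSketch {
    static final int HAS_LANGUAGE = 1;
    static final int HAS_TYPE = 2;

    static byte[] encode(String language, String type, String value) {
        byte[] lang = language == null ? null : language.getBytes(StandardCharsets.UTF_8);
        byte[] typ = type == null ? null : type.getBytes(StandardCharsets.UTF_8);
        byte[] val = value.getBytes(StandardCharsets.UTF_8);
        int size = 1 + Integer.BYTES + val.length; // flags + value
        if (lang != null) size += Integer.BYTES + lang.length;
        if (typ != null) size += Integer.BYTES + typ.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        out.put((byte) ((lang != null ? HAS_LANGUAGE : 0) | (typ != null ? HAS_TYPE : 0)));
        if (lang != null) putBytes(out, lang);
        if (typ != null) putBytes(out, typ);
        putBytes(out, val);
        return out.array();
    }

    // Returns {language, type, value}; absent optional parts come back as null.
    static String[] decode(byte[] raw) {
        ByteBuffer in = ByteBuffer.wrap(raw);
        int flags = in.get();
        String language = (flags & HAS_LANGUAGE) != 0 ? getString(in) : null;
        String type = (flags & HAS_TYPE) != 0 ? getString(in) : null;
        String value = getString(in);
        return new String[] { language, type, value };
    }

    private static void putBytes(ByteBuffer out, byte[] utf8) {
        out.putInt(utf8.length);
        out.put(utf8);
    }

    private static String getString(ByteBuffer in) {
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}
```

With this layout, title:[fr]truc from the example would be encoded as
encode("fr", null, "truc"), and a field without language or type costs
only the one extra flags byte.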
In fact I was thinking about a more generic version that will keep
format compatibility, keeping .fdx as is:
FieldData (.fdt) --> <DocFieldData>SegSize
DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
And the default FieldsDataWriter will be the current one; it will read
the RawData as Bits, Value, with Value --> String | BinaryValue, ...
Then, for my app, I will provide a custom FieldsDataWriter that will
do exactly what I want.
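As a rough illustration of that plug-in idea, the sketch below keeps
the DocFieldData framing fixed and delegates only the RawData part to
a replaceable object. All names (RawDataFormat, DefaultRawDataFormat,
DocFieldDataSketch) are hypothetical and not Lucene API; fixed-width
ints and length-prefixed UTF-8 are used for simplicity, and the
default format handles only the String case of Value.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical plug-in point: decides how one field's RawData is laid out.
interface RawDataFormat {
    void write(ByteBuffer out, String value);
    String read(ByteBuffer in);
}

// Stand-in for the default layout: RawData --> Bits, Value (String only).
final class DefaultRawDataFormat implements RawDataFormat {
    public void write(ByteBuffer out, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        out.put((byte) 0); // Bits: no flags set in this sketch
        out.putInt(utf8.length);
        out.put(utf8);
    }

    public String read(ByteBuffer in) {
        in.get(); // Bits, ignored in this sketch
        byte[] utf8 = new byte[in.getInt()];
        in.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }
}

final class DocFieldDataSketch {
    // DocFieldData --> FieldCount, then FieldCount pairs of FieldNum, RawData.
    static void writeDoc(ByteBuffer out, int[] fieldNums, String[] values, RawDataFormat format) {
        out.putInt(fieldNums.length);
        for (int i = 0; i < fieldNums.length; i++) {
            out.putInt(fieldNums[i]);
            format.write(out, values[i]);
        }
    }

    static String[] readDoc(ByteBuffer in, RawDataFormat format) {
        String[] values = new String[in.getInt()];
        for (int i = 0; i < values.length; i++) {
            in.getInt(); // FieldNum, unused in this sketch
            values[i] = format.read(in);
        }
        return values;
    }
}
```

An application with its own needs would pass a different RawDataFormat
to the same writeDoc/readDoc calls, while an index written with the
default one keeps its existing layout.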
What I don't know yet is how it breaks the API... because if I want to
provide my own FieldsDataWriter, I would also want to have my own
implementation of Fieldable...
If you think this is a good idea, I will try to implement it.
cheers,
Nicolas
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]