On Friday 21 July 2006 12:37, Marvin Humphrey wrote:
> On Jul 21, 2006, at 1:23 AM, Nicolas Lalevée wrote:
> > In fact, that was my first implementation. The problem with that is
> > you can only store one value. But thinking a little more about it,
> > storing one or more values is not an issue, because with the solution
> > I proposed, no space is saved at all.
> > In fact, when I thought about this format of field metadata, I was
> > thinking about a way to make the Lucene user specify how to store it
> > in the Lucene index format. For instance, the simple one would specify
> > that it's a pointer on some metadata (as you proposed), another one
> > would specify that there are two pointers (in my use case, one for
> > type, the other one for the language), and another one would specify
> > that it will be stored directly, as it is actually an integer (so no
> > need to make a pointer on an integer). But it was just a thought, I
> > don't know if it is possible. WDYT?
>
> I'm thinking that there would be a codecs file, say with the
> extension .cdx and this format:
>
>     Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
>     CodecCount --> Uint32
>     CodecClassName --> String
>
> That file would be read in its entirety when the index was
> initialized and expanded into an array of codec objects, one per
> CodecClassName.
>
> The .fdx file would add an additional int per doc...
>
>     FieldIndex (.fdx) --> <FieldValuesPosition,
>                            FieldValuesCodecNumber>SegSize
>     FieldValuesPosition --> Uint64
>     FieldValuesCodecNumber --> Uint32
>
> Now, before you read any data from the .fdt file, you know how to
> interpret it. You seek the .fdt IndexInput to the right spot, then
> feed it to the appropriate codec object from the codecs array. The
> codec does the rest. In your case, you might write a codec that
> would read a few bytes and strings of metadata up front. Or you
> might have several different codecs, the identity of which indicates
> fixed values for certain metadata fields: FrenchDocument,
> ArabicDocument, etc.
>
> Would that scheme meet your needs?
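
To check that I read the proposal correctly, here is a minimal Java sketch of the .cdx layout and of the size of one .fdx entry. It is only an illustration under assumptions: the class and method names are invented, it uses plain java.io streams rather than Lucene's IndexInput/IndexOutput, and strings are written with writeUTF, which is not Lucene's String encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: all names here are invented.
class CodecTableSketch {

    // Codecs (.cdx) --> CodecCount, <CodecClassName>CodecCount
    static byte[] writeCodecs(List<String> classNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            // CodecCount: written as a 32-bit int (Java ints are signed).
            out.writeInt(classNames.size());
            for (String name : classNames) {
                // Assumption: java.io modified UTF-8, not Lucene's String format.
                out.writeUTF(name);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the whole table back; the list index is the codec number.
    static List<String> readCodecs(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int count = in.readInt();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                names.add(in.readUTF());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One .fdx entry per document:
    // FieldValuesPosition (64 bits) + FieldValuesCodecNumber (32 bits).
    static final int FDX_ENTRY_BYTES = 8 + 4;

    // Byte offset of a document's entry in such a fixed-width .fdx file.
    static long fdxEntryOffset(int docNumber) {
        return (long) docNumber * FDX_ENTRY_BYTES;
    }
}
```

With fixed-width entries, finding the codec number for a document is a single seek, which is what makes the "know how to interpret it before reading .fdt" step cheap.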

That looks good, but there is one restriction: it has to be per document.

Let me explain my needs a little more. My app has to index data that is structured as an RDF graph. Each RDF resource has a title and a description, and each title and description exists in several languages. The model we chose is to map an RDF resource to a document. The field name is then the URI of the RDF property, and the field value is the literal or another resource. For instance:

    doc1:
        URI:http://foo.com
        title:[en]foo
        title:[fr]truc

So, in a single document I will have several fields with different languages.

For my use case, I actually need only one "codec": a codec that handles 3 values, 2 of them being optional: a language, a type, and a value.

In fact I was thinking about a more generic version that would preserve format compatibility, keeping .fdx as it is:

    FieldData (.fdt) --> <DocFieldData>SegSize
    DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount

The default FieldsDataWriter would be the current one; it would read the RawData as Bits, Value, with Value --> String | BinaryValue, ...

Then, for my app, I would provide a custom FieldsDataWriter that does exactly what I want.

What I don't know yet is how much this breaks the API, because if I want to provide my own FieldsDataWriter, I would also want to have my own implementation of Fieldable.

If you think this is a good idea, I will try to implement it.

cheers,
Nicolas
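
P.S. A rough Java sketch of the single "codec" described above, with an optional language, an optional type, and a value packed into one field's RawData. Everything in it is an assumption made for illustration: the class name, the flag byte layout, and the use of java.io streams with writeUTF instead of Lucene's own IndexOutput and String format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Lucene code: names and byte layout are invented.
class TypedFieldCodecSketch {

    // Flag bits stored in the first byte of RawData.
    static final int HAS_LANGUAGE = 1;
    static final int HAS_TYPE = 2;

    // Encodes one field value; language and type may be null (optional).
    static byte[] encode(String language, String type, String value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            int flags = (language != null ? HAS_LANGUAGE : 0)
                      | (type != null ? HAS_TYPE : 0);
            out.writeByte(flags);
            if (language != null) out.writeUTF(language);
            if (type != null) out.writeUTF(type);
            out.writeUTF(value);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes RawData into {language, type, value}; absent parts are null.
    static String[] decode(byte[] rawData) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(rawData));
            int flags = in.readUnsignedByte();
            String language = (flags & HAS_LANGUAGE) != 0 ? in.readUTF() : null;
            String type = (flags & HAS_TYPE) != 0 ? in.readUTF() : null;
            String value = in.readUTF();
            return new String[] { language, type, value };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the example document, title:[fr]truc would be encode("fr", null, "truc"), and a field with neither language nor type costs only the one flag byte of overhead.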