On Thu, Apr 30, 2009 at 7:17 PM, Marvin Humphrey <[email protected]> wrote:
> On Sat, Apr 11, 2009 at 11:16:28AM -0400, Michael McCandless wrote:
>
>> > But then, let's consider discrete vs. compound with regards to transparency.
>> >
>> > When we're talking about discrete segment files, we're only talking about
>> > binary data -- because the metadata is all in segmeta.json.  Those binary
>> > files are hard to examine without a tool anyway -- hexdumping is hard core. :)
>> >
>> > So, transparency-wise, perhaps not so much is gained by going discrete.
>>
>> You can list their size, and see their presence or not.
>
> Well, in the current KS format, there are two "real" files which make up the
> compound system:
>
>  * cf.dat -- binary data.
>  * cfmeta.json -- list of file names mapped to offset and length.
>
> So, opening the cfmeta.json file is analogous to a directory listing, though
> slightly less information-rich and intuitive.

Nice!
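
For readers following along, here is a minimal Python sketch of the compound
layout described above: one binary blob plus a JSON index mapping each virtual
file name to an offset and length. The JSON key names ("offset", "length"), the
helper functions, and the example file names are assumptions made for
illustration, not the actual KS on-disk format.

```python
import json
import os
import tempfile


def write_compound(directory, files):
    """Pack {name: bytes} into cf.dat and record where each one landed."""
    meta = {}
    offset = 0
    with open(os.path.join(directory, "cf.dat"), "wb") as dat:
        for name, payload in files.items():
            dat.write(payload)
            # Assumed index layout: name -> {"offset": ..., "length": ...}
            meta[name] = {"offset": offset, "length": len(payload)}
            offset += len(payload)
    with open(os.path.join(directory, "cfmeta.json"), "w") as fh:
        json.dump(meta, fh, indent=2)
    return meta


def list_compound(directory):
    """The 'directory listing': virtual file names and sizes from the index."""
    with open(os.path.join(directory, "cfmeta.json")) as fh:
        meta = json.load(fh)
    return sorted((name, entry["length"]) for name, entry in meta.items())


def read_virtual_file(directory, name):
    """Seek to the recorded offset in cf.dat and read exactly `length` bytes."""
    with open(os.path.join(directory, "cfmeta.json")) as fh:
        entry = json.load(fh)[name]
    with open(os.path.join(directory, "cf.dat"), "rb") as dat:
        dat.seek(entry["offset"])
        return dat.read(entry["length"])


if __name__ == "__main__":
    seg_dir = tempfile.mkdtemp()
    # Hypothetical virtual file names, chosen only for the demo.
    write_compound(seg_dir, {"postings.dat": b"abc", "lexicon.dat": b"hello"})
    for name, size in list_compound(seg_dir):
        print(name, size)
```

Since the index is plain JSON, opening it in a text editor gives the same
names and sizes that list_compound() returns, which is the transparency point
being made above.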

>> > FWIW...  I've already implemented a ByteBufDocReader proof-of-concept class in
>> > pure Perl; instead of serializing all fields marked as "stored", it stores one
>> > fixed-width byte array per document -- so doc storage is essentially a
>> > flatfile.  I'm also pretty close to finishing a ZlibDocReader that uses Zlib
>> > compression. (The "compressed" field spec flag has been removed.)
>>
>> How can doc storage be fixed width?  (text fields have different
>> length).
>
> It's not real doc storage.  The Stored() attribute is ignored by this
> implementation; only one fixed length byte array gets written for each doc.
>
> The main use case for this is when documents are stored externally -- perhaps
> in a database, or potentially, on separate doc servers.  For large search
> clusters, dedicated doc/highlight servers are a good idea, and this is a start
> in that direction.

I see.
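
To make the fixed-width idea concrete, here is a rough Python sketch (not the
Perl ByteBufDocReader itself). It assumes each document's record is an 8-byte
key pointing at an externally stored document; the class name, the record
width, and the key encoding are all hypothetical. The point is only that a
record can be found by multiplying the doc number by the width, with no
per-document index.

```python
import io
import struct


class FixedWidthDocStore:
    """Flat file of equal-sized records; doc N lives at byte N * width."""

    def __init__(self, width, stream=None):
        self.width = width
        # An in-memory stream stands in for a real file in this sketch.
        self.stream = stream if stream is not None else io.BytesIO()

    def add_doc(self, payload):
        if len(payload) != self.width:
            raise ValueError("record must be exactly %d bytes" % self.width)
        self.stream.seek(0, io.SEEK_END)
        doc_id = self.stream.tell() // self.width
        self.stream.write(payload)
        return doc_id

    def fetch(self, doc_id):
        if doc_id < 0:
            raise IndexError("no such doc: %d" % doc_id)
        self.stream.seek(doc_id * self.width)
        record = self.stream.read(self.width)
        if len(record) != self.width:
            raise IndexError("no such doc: %d" % doc_id)
        return record


if __name__ == "__main__":
    store = FixedWidthDocStore(width=8)
    # Assumed payload: a big-endian 64-bit key into an external database.
    for external_key in (1001, 1002, 1003):
        store.add_doc(struct.pack(">Q", external_key))
    print(struct.unpack(">Q", store.fetch(1))[0])
```

The search index then only hands back the key, and the real document is
fetched from the database or doc server that owns it.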

>> So you removed "compressed" from FieldSpec and instead the user swaps
>> out the DocReader component?  I wonder how compression compares if you
>> did column-stride body text vs row stride body text plus all other
>> fields.
>
> Let a thousand flowers bloom -- make doc storage pluggable, and let people
> experiment.

Hear, hear!
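
As a rough illustration of what "pluggable" could look like, here is a small
Python sketch in which compression is a property of the swapped-in component
rather than a per-field flag. The class names and the JSON serialization are
hypothetical and are not the KS DocReader API; only the zlib calls are real
library functions.

```python
import json
import zlib


class PlainDocCodec:
    """Serializes a document's fields without compression."""

    def encode(self, fields):
        return json.dumps(fields, sort_keys=True).encode("utf-8")

    def decode(self, blob):
        return json.loads(blob.decode("utf-8"))


class ZlibDocCodec(PlainDocCodec):
    """Same interface, but every stored document is zlib-compressed."""

    def encode(self, fields):
        return zlib.compress(super().encode(fields))

    def decode(self, blob):
        return super().decode(zlib.decompress(blob))


class DocStore:
    """Holds encoded documents; the codec is chosen by the caller."""

    def __init__(self, codec):
        self.codec = codec
        self.blobs = []

    def add_doc(self, fields):
        self.blobs.append(self.codec.encode(fields))
        return len(self.blobs) - 1

    def fetch(self, doc_id):
        return self.codec.decode(self.blobs[doc_id])


if __name__ == "__main__":
    doc = {"title": "example", "body": "lorem ipsum " * 200}
    for codec in (PlainDocCodec(), ZlibDocCodec()):
        store = DocStore(codec)
        doc_id = store.add_doc(doc)
        print(type(codec).__name__, len(store.blobs[doc_id]))
```

Trying a different layout, such as column-stride body text, would then mean
writing another component with the same two methods and comparing the sizes.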

Mike
