On Thu, Apr 30, 2009 at 7:17 PM, Marvin Humphrey <[email protected]> wrote:
> On Sat, Apr 11, 2009 at 11:16:28AM -0400, Michael McCandless wrote:
>
>> > But then, let's consider discrete vs. compound with regards to
>> > transparency.
>> >
>> > When we're talking about discrete segment files, we're only talking about
>> > binary data -- because the metadata is all in segmeta.json. Those binary
>> > files are hard to examine without a tool anyway -- hexdumping is hard
>> > core. :)
>> >
>> > So, transparency-wise, perhaps not so much is gained by going discrete.
>>
>> You can list their size, and see their presence or not.
>
> Well, in the current KS format, there are two "real" files which make up the
> compound system:
>
>   * cf.dat -- binary data.
>   * cfmeta.json -- list of file names mapped to offset and length.
>
> So, opening the cfmeta.json file is analogous to a directory listing, though
> slightly less information-rich and intuitive.

Nice!

>> > FWIW... I've already implemented a ByteBufDocReader proof-of-concept
>> > class in pure Perl; instead of serializing all fields marked as
>> > "stored", it stores one fixed-width byte array per document -- so doc
>> > storage is essentially a flatfile. I'm also pretty close to finishing
>> > a ZlibDocReader that uses Zlib compression. (The "compressed" field
>> > spec flag has been removed.)
>>
>> How can doc storage be fixed width? (text fields have different
>> length).
>
> It's not real doc storage. The Stored() attribute is ignored by this
> implementation; only one fixed length byte array gets written for each doc.
>
> The main use case for this is when documents are stored externally -- perhaps
> in a database, or potentially, on separate doc servers. For large search
> clusters, dedicated doc/highlight servers are a good idea, and this is a start
> in that direction.

I see.

>> So you removed "compressed" from FieldSpec and instead the user swaps
>> out the DocReader component? I wonder how compression compares if you
>> did column-stride body text vs row stride body text plus all other
>> fields.
>
> Let a thousand flowers bloom -- make doc storage pluggable, and let people
> experiment.

Hear, hear!

Mike
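
For readers following the fixed-width doc storage idea quoted above, here is a
minimal sketch in Python of what "one fixed length byte array per doc" in a
flat file could look like. It is not the ByteBufDocReader code (which is Perl
and is not shown in this thread); the class name FixedWidthDocStore, its
methods, and the choice of an 8-byte packed database key as the payload are
all assumptions made for illustration.

```python
import io
import struct


class FixedWidthDocStore:
    """Hypothetical flat-file store: exactly `width` bytes per document."""

    def __init__(self, stream, width):
        if width <= 0:
            raise ValueError("width must be positive")
        self.stream = stream  # any seekable binary stream
        self.width = width

    def add_doc(self, payload):
        """Append one record and return its doc id (its position in the file)."""
        if len(payload) != self.width:
            raise ValueError("payload must be exactly %d bytes" % self.width)
        self.stream.seek(0, io.SEEK_END)
        doc_id = self.stream.tell() // self.width
        self.stream.write(payload)
        return doc_id

    def fetch(self, doc_id):
        """Read a record back; the offset is simply doc_id * width."""
        if doc_id < 0:
            raise IndexError("doc id must be non-negative")
        self.stream.seek(doc_id * self.width)
        record = self.stream.read(self.width)
        if len(record) != self.width:
            raise IndexError("no such doc: %d" % doc_id)
        return record


# Usage: store an external database key (assumed here to be an unsigned
# 64-bit integer) for each document instead of the document itself.
store = FixedWidthDocStore(io.BytesIO(), width=8)
first_id = store.add_doc(struct.pack(">Q", 1001))
second_id = store.add_doc(struct.pack(">Q", 1002))
(external_key,) = struct.unpack(">Q", store.fetch(second_id))
```

Because every record has the same width, no per-document offset table is
needed, which is what makes the storage "essentially a flatfile"; the search
node hands back the key and the external database or doc server supplies the
real document.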

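
The cf.dat / cfmeta.json arrangement described earlier in the thread (one
binary blob plus a JSON map of file names to offset and length) can be
sketched roughly as follows. The exact layout of the real cfmeta.json is not
given in this message, so the "offset" and "length" keys, the function names,
and the virtual file names below are assumptions for illustration only.

```python
import json


def pack_compound(files):
    """Concatenate virtual files into one blob and build the JSON metadata.

    `files` maps a virtual file name to its bytes. Returns (blob, meta_json).
    """
    blob = bytearray()
    meta = {}
    for name, data in files.items():
        meta[name] = {"offset": len(blob), "length": len(data)}
        blob.extend(data)
    return bytes(blob), json.dumps(meta, indent=2, sort_keys=True)


def list_compound(meta_json):
    """The 'directory listing' view: (name, offset, length), sorted by name."""
    meta = json.loads(meta_json)
    return sorted((name, e["offset"], e["length"]) for name, e in meta.items())


def read_virtual_file(blob, meta_json, name):
    """Slice one virtual file back out of the blob using the metadata."""
    entry = json.loads(meta_json)[name]
    start = entry["offset"]
    return blob[start:start + entry["length"]]


# Hypothetical virtual file names, not taken from the KS format.
blob, meta_json = pack_compound({
    "postings.dat": b"abc",
    "lexicon.dat": b"de",
})
listing = list_compound(meta_json)
```

Reading the JSON gives presence and size for each virtual file, which is the
sense in which opening cfmeta.json is analogous to a directory listing.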