On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>> >> Also, will Lucy store the original stats?
>> >
>> > These?
>> >
>> >   * Total number of tokens in the field.
>> >   * Number of unique terms in the field.
>> >   * Doc boost.
>> >   * Field boost.
>>
>> Also sum(tf).  Robert can generate more :)
>
> Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
> equivalent?  I guess there might be analyzers for which that isn't true,
> e.g. those which perform synonym-injection?
>
> In any case, "sum(tf)" is probably a better definition, because it makes
> no ancillary claims...
Sorry, yes they are.

>> > Incidentally, what are you planning to do about field boost if it's not
>> > always 1.0?  Are you going to store full 32-bit floats?
>>
>> For starters, yes.
>
> OK, how are those going to be encoded?  IEEE 754?  Big-endian?
>
> http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

For starters, I think so.  Lucene's ints are big-endian today.

>> We may (later) want to make a new attr that sets
>> the #bits (levels/precision) you want... then uses packed ints to
>> encode.
>
> I'm concerned that the bit-wise entropy of floats may make them a poor
> match for compression via packed ints.  We'll probably get a compressed
> representation which is larger than the original.
>
> Are there any standard algorithms out there for compressing IEEE 754
> floats?  RLE works, but only with certain data patterns.
>
> ... [ time passes ] ...
>
> Hmm, maybe not:
>
> http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

Sorry, I was proposing a fixed-point boost, where you specify how many
levels (in bits, powers of 2) you want.

>> I was specifically asking if Lucy will allow the user to force true
>> average to be recomputed, ie, at commit time from the writer.
>
> That's theoretically possible.  We'd have to implement the reader the
> same way we have DeletionsReader -- the most recent segment may contain
> data which applies to older segments.

OK.

> Here's the DeletionsReader code, which searches backwards through the
> segments looking for a particular file:
>
>     /* Start with deletions files in the most recently added segments and work
>      * backwards.  The first one we find which addresses our segment is the
>      * one we need.
>      */
>     for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
>         Segment *other_seg = (Segment*)VA_Fetch(segments, i);
>         Hash *metadata
>             = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
>         if (metadata) {
>             Hash *files = (Hash*)CERTIFY(
>                 Hash_Fetch_Str(metadata, "files", 5), HASH);
>             Hash *seg_files_data
>                 = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
>             if (seg_files_data) {
>                 Obj *count = (Obj*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
>                 del_count = (i32_t)Obj_To_I64(count);
>                 del_file = (CharBuf*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
>                 break;
>             }
>         }
>     }

Hmm -- similar to tombstones?  But, different in that the most recently
written file has *all* deletions for that old segment?  I.e. you don't have
to OR together N generations of written deletions... only 1 file has all
current deletions for the segment?

This is somewhat wasteful of disk space, though?  Hmm, unless your deletion
policy can reclaim the now-stale deletions files from past flushed segments?

> What we'd do is write the regenerated boost bytes for *all* segments to
> the most recent segment.  It would be roughly analogous to building up an
> NRT reader.

Right, except Lucy must go through the filesystem.

>> > What's trickier is that Schemas are not normally mutable, and that they
>> > are part of the index.  You don't have to supply an Analyzer, or a
>> > Similarity, or anything else when opening a Searcher -- you just provide
>> > the location of the index, and the Schema gets deserialized from the
>> > latest schema_NNN.json file.  That has many advantages, e.g. inadvertent
>> > Analyzer conflicts are pretty much a thing of the past for us.
>>
>> That's nice... though... is it too rigid?  Do users even want to pick
>> a different analyzer at search time?
>
> It's not common.
> To my mind, the way a field is tokenized is part of its field definition,
> thus the Analyzer is part of the field definition, thus the Analyzer is
> part of the Schema and needs to be stored with the index.

OK.

> Still, we support different Analyzers at search time by way of
> QueryParser.  QueryParser's constructor requires a Schema, but also
> accepts an optional Analyzer which, if supplied, will be used instead of
> the Analyzers from the Schema.

Ahh, OK -- there's an out.

>> > Maybe aggressive automatic data-reduction makes more sense in the
>> > context of "flexible matching", which is more expansive than "flexible
>> > scoring"?
>>
>> I think so.  Maybe it shouldn't be called a Similarity (which to me
>> (though, carrying a heavy curse-of-knowledge burden...) means
>> "scoring")?  Matcher?
>
> Heh.  "Matcher" is taken.  It's a crucial class, too, roughly combining
> the roles of Lucene's Scorer and DocIdSetIterator.
>
> The first alternative that comes to mind is "Relevance", because not only
> can one thing's relevance to another be continuously variable (i.e.
> score), it can also be binary: relevant/not-relevant (i.e. match).
>
> But I don't see why "Relevance", "Matcher", or anything else would be so
> much better than "Similarity".  I think this is your hang-up. ;)

Yeah, OK.

>> > I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice
>> > feature, but I don't think we've worked out all the problems yet.  If
>> > we can, I might switch to +1 (FWIW).
>>
>> What problems remain, for Lucene?
>
> Storage, formatting, and compression of boosts.
>
> I'm also concerned about making significant changes to the file format
> when you've indicated they're "for starters".  IMO, file format changes
> ought to clear a higher bar than that.  But I expect us to dissent on
> that point.

I think we do dissent on this... progress not perfection ;)  I see the file
format as an impl detail, not as a public API.
It's free to change, and because it's easy to version, changing it isn't
that bad.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org