OK (they already have configuration parameters). Leo, if you can submit a patch to extend the BM25 similarity, I would welcome it.
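For reference, the existing knobs are the k1 and b constructor arguments of BM25Similarity, and an alternative similarity is plugged in the same way on the index and search sides. A minimal wiring sketch, assuming the 6.x-era API (the analyzer, directory, and field below are just placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.lucene.store.RAMDirectory;

    public class SimilarityWiring {
      public static void main(String[] args) throws Exception {
        // BM25Similarity already exposes its two tuning parameters, k1 and b.
        Similarity sim = new BM25Similarity(1.2f, 0.75f);

        // An alternative similarity (e.g., one that keeps the exact document
        // length) would be wired in exactly the same way, on both sides.
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setSimilarity(sim);                    // used when norms are written
        try (IndexWriter writer = new IndexWriter(dir, iwc)) {
          Document doc = new Document();
          doc.add(new TextField("body", "hello bm25 world", Field.Store.NO));
          writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          searcher.setSimilarity(sim);             // used when documents are scored
        }
      }
    }

An extended BM25 patch would slot in through those same two setSimilarity calls rather than through a new flag, which fits Robert's point below about preferring alternative implementations.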
On Sat, Jul 9, 2016 at 7:11 AM Robert Muir <[email protected]> wrote:

> Our similarities do not need a boolean flag. Instead we should focus
> on making them as simple as possible: there can always be alternative
> implementations.
>
> On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:
>
>> I agree that using one byte by default is questionable on modern
>> machines and given common text field sizes as well. I think my
>> understanding of how norms are encoded/accessed may be wrong from what
>> I had said. Lucene53NormsFormat supports longs, I see, and it is clever
>> about observing the maximum bytes-per-value needed, so there is no need
>> for some new format. It is the Similarity implementations (BM25 is one,
>> but others do this too) that choose to encode a smaller value. It would
>> be nice to have this be toggleable! Maybe just a boolean flag?
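To make the "encode a smaller value" point concrete: in the 6.x line the BM25 similarity folds the field length into 1/sqrt(length) and stores it as a lossy one-byte float, so many distinct lengths collapse into the same bucket. A minimal round-trip sketch, assuming the SmallFloat.floatToByte315/byte315ToFloat utilities (the exact rounding inside BM25Similarity may differ slightly):

    import org.apache.lucene.util.SmallFloat;

    // Round-trips a few field lengths through the one-byte norm encoding and
    // prints the length the scorer effectively sees after decoding.
    public class NormRoundTrip {
      public static void main(String[] args) {
        for (int length : new int[] {10, 11, 12, 50, 500, 501, 550, 5000}) {
          byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(length));
          float decoded = SmallFloat.byte315ToFloat(norm);
          int effectiveLength = Math.round(1f / (decoded * decoded));
          System.out.printf("true length = %4d -> decoded length ~ %4d%n",
              length, effectiveLength);
        }
      }
    }

With only 256 representable values, long documents that differ by hundreds of words can end up with identical norms; that quantization is the approximation debated below.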
>>
>> On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>>
>>> Hi David,
>>>
>>> thank you for picking it up. Now we are having a more meaningful
>>> discussion regarding the "waste".
>>>
>>>> Leo,
>>>> There may be confusion here as to where the space is wasted. 1 vs. 8
>>>> bytes per doc on disk is peanuts, sure, but in RAM it is not, and that
>>>> is the concern. AFAIK the norms are memory-mapped in, and we need to
>>>> ensure it's trivial to know which offset to go to on disk based on a
>>>> document id, which precludes compression, but maybe you have ideas to
>>>> improve that.
>>>
>>> First, my understanding is that all essential parts of the Lucene
>>> index are memory-mapped, in particular the inverted index (in the most
>>> common scenario, at least); otherwise search performance is miserable.
>>> That said, memory-mapping a few extra bytes per document shouldn't
>>> make a noticeable difference.
>>>
>>> Also, judging by the code in the class Lucene53NormsProducer and a
>>> debug session, Lucene only maps a compressed segment containing norm
>>> values. Norms are stored using 1, 2, 4, or 8 bytes and are decompressed
>>> into an 8-byte long, probably on a per-slice basis.
>>>
>>> Anyway, situations in which you get more than 65,536 words per
>>> document are quite rare, and documents with four billion words (or
>>> more) are exotic. If you have such enormous documents, saving on
>>> document normalization factors won't be your first priority; you would
>>> sooner think about splitting such a huge document, which contains
>>> every possible keyword, into something more manageable.
>>>
>>> To sum up, for 99.999% of users, squeezing normalization factors into
>>> a single byte has no benefit at all. Memoization does seem to speed
>>> things up a bit, but I suspect this may disappear with new generations
>>> of CPUs.
>>>
>>>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I
>>>> haven't used this, but I know where to look.)
>>>
>>> Ok, thanks.
>>>
>>>> ~ David
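For completeness, the Codec.normsFormat route would look roughly like the sketch below: wrap the default codec and swap in a custom norms encoding (WiderNormsFormat is hypothetical, something one would still have to write). As the exchange above notes, though, the stock Lucene53NormsFormat already handles wide values; the truncation happens in the Similarity.

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.NormsFormat;

    // Wraps the default codec, delegating everything except the norms format.
    public final class WideNormsCodec extends FilterCodec {
      private final NormsFormat normsFormat = new WiderNormsFormat(); // hypothetical

      public WideNormsCodec() {
        super("WideNormsCodec", Codec.getDefault());
      }

      @Override
      public NormsFormat normsFormat() {
        return normsFormat;
      }
    }

    // Index side: new IndexWriterConfig(analyzer).setCodec(new WideNormsCodec());

A custom codec name also has to be registered through Lucene's SPI mechanism (META-INF/services) so that segments written with it can be opened later.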
>>>>
>>>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> for some reason I didn't get a reply from the mailing list directly,
>>>>> so I have to send a new message. I would appreciate it if something
>>>>> could be fixed so that I get a reply as well.
>>>>>
>>>>> First of all, I don't buy the claim that the issue is well known. I
>>>>> would actually argue that nobody except a few Lucene devs knows
>>>>> about it. There is also a bug in Lucene's tutorial example. This
>>>>> needs to be fixed as well.
>>>>>
>>>>> Neither do I find your arguments convincing. In particular, I don't
>>>>> think that there is any serious waste of space. Please see my
>>>>> detailed comments below, and note that I definitely don't know all
>>>>> the internals well, so I would appreciate it if you could explain
>>>>> them better.
>>>>>
>>>>>> The downsides are documented and known. But I don't think you are
>>>>>> fully documenting the tradeoffs here: by encoding up to a 64-bit
>>>>>> long, you can use up to *8x more memory and disk space* than what
>>>>>> Lucene does by default, and that is per field.
>>>>>
>>>>> This is not true. First of all, the increase applies only to textual
>>>>> fields; simple fields like integers don't use normalization factors,
>>>>> so there is no increase for them. In the worst case, you will have 7
>>>>> extra bytes for a *text* field, and that is not an 8x increase.
>>>>>
>>>>> If you do *compress* the length of the text field, then the overhead
>>>>> depends on the field's size: one extra byte is required for fields
>>>>> with more than 256 words, two extra bytes for fields with more than
>>>>> 65,536 words, and so on. Compared to the field sizes themselves, a
>>>>> several-byte increase is simply *laughable*.
>>>>>
>>>>> If Lucene saves the normalization factor *without* compression, it
>>>>> should already be using 8 bytes, so storing the full document length
>>>>> won't make a difference.
>>>>>
>>>>>> So that is a real trap. Maybe throw an exception there instead if
>>>>>> the boost != 1F (just don't support it), and add a guard for
>>>>>> "supermassive" documents, so that at most only 16 bits are ever
>>>>>> used instead of 64. The document need not really be massive; it can
>>>>>> happen just from a strange analysis chain (n-grams etc.) that you
>>>>>> get large values here.
>>>>>
>>>>> As mentioned above, storing a few extra bytes for supermassive
>>>>> documents doesn't affect the overall storage by more than a tiny
>>>>> fraction of a percent.
>>>>>
>>>>>> I have run comparisons in the past on standard collections to see
>>>>>> what happens with this "feature", and differences were very small.
>>>>>> I think in practice people do far more damage by sharding their
>>>>>> document collections but not using a distributed interchange of
>>>>>> IDF, causing results from different shards to be incomparable :)
>>>>>
>>>>> Ok, this is not what I see on my data. I see *more than* a 10%
>>>>> degradation, and that is not peanuts. Do we want to re-run
>>>>> experiments on standard collections? Don't forget that Lucene is now
>>>>> used as a baseline to compare against: people claim to beat BM25
>>>>> while they beat something inferior.
>>>>>
>>>>>> As far as the BM25 tableization, that is *not* the justification
>>>>>> for using an 8-byte encoding. The 8-byte encoding was already there
>>>>>> long ago (to save memory/space) when BM25 was added; that small
>>>>>> optimization just takes advantage of it. The optimization exists
>>>>>> just so that BM25 will have performance comparable to
>>>>>> ClassicSimilarity.
>>>>>
>>>>> Sorry, I don't understand this comment. What kind of 8-byte encoding
>>>>> are you talking about? Do you mean a single-byte encoding? That is
>>>>> what the current BM25 similarity seems to use. I also don't quite
>>>>> understand what is a justification for what; please clarify.
>>>>>
>>>>>> Either way, the document's length can be stored with more accuracy,
>>>>>> without wasting space, especially if you don't use index-time
>>>>>> boosting. But the default encoding supports features like that
>>>>>> because Lucene supports all these features.
>>>>>
>>>>> Sorry, I don't get this again. Which features should Lucene support?
>>>>> If you would like to use boosting in exactly the same way you used
>>>>> it before (though I wouldn't recommend doing so), you can do this;
>>>>> in fact, my implementation tries to mimic this as much as possible.
>>>>> If you mean something else, please clarify.
>>>>>
>>>>> Also, how does one save the document length with more accuracy? Is
>>>>> there a special API or something?
>>>>>
>>>>> Thank you!
>>>>>
>>>>>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>>>>>
>>>>>>> Hi everybody,
>>>>>>>
>>>>>>> Some time ago, I had to re-implement some Lucene similarities (in
>>>>>>> particular BM25 and the older cosine). I noticed that the
>>>>>>> re-implemented version (despite using the same formula) performed
>>>>>>> better on my data set. The main difference was that my version did
>>>>>>> not approximate the document length.
>>>>>>>
>>>>>>> Recently, I have implemented a modification of the current Lucene
>>>>>>> BM25 that doesn't use this approximation either. I compared the
>>>>>>> existing and the modified similarities (again, on some of my
>>>>>>> quirky data sets). The results are as follows:
>>>>>>>
>>>>>>> 1) The modified Lucene BM25 similarity is, indeed, a tad slower
>>>>>>> (3-5% in my tests).
>>>>>>> 2) The modified Lucene BM25 is also more accurate.
>>>>>>>
>>>>>>> (I don't see a good reason why memoization and document-length
>>>>>>> approximation should result in any efficiency gain at all, but
>>>>>>> this is what seems to happen with the current hardware.)
>>>>>>>
>>>>>>> If this potential accuracy degradation concerns you, additional
>>>>>>> experiments using more standard collections can be done (e.g.,
>>>>>>> some TREC collections).
>>>>>>>
>>>>>>> In any case, the reproducible example (which also links to a more
>>>>>>> detailed explanation) is in my repo:
>>>>>>> https://github.com/searchivarius/AccurateLuceneBM25
>>>>>>>
>>>>>>> Many thanks!
>>>>>>>
>>>>>>> ---
>>>>>>> Leo
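To spell out the tableization point and the proposed alternative in plain arithmetic: because the stored norm is a single byte, the length-dependent part of BM25's denominator can be precomputed into a 256-entry table once per query, while an exact-length variant computes it per matching document. A standalone sketch of the two paths (ordinary BM25 arithmetic, not Lucene's actual classes; the decodedLengths array stands in for whatever decoding the similarity uses):

    // BM25: score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgdl))
    public class Bm25NormPaths {
      static final float K1 = 1.2f, B = 0.75f;

      // Memoized path: with a one-byte norm there are only 256 possible
      // lengths, so the whole length factor is precomputed once per query.
      static float[] buildCache(float avgdl, float[] decodedLengths /* 256 */) {
        float[] cache = new float[256];
        for (int i = 0; i < 256; i++) {
          cache[i] = K1 * (1 - B + B * decodedLengths[i] / avgdl);
        }
        return cache;
      }

      static float scoreCached(float idf, float tf, int normByte, float[] cache) {
        return idf * tf * (K1 + 1) / (tf + cache[normByte & 0xFF]);
      }

      // Exact path: keep the true length and compute the factor per document.
      // Slightly more arithmetic per hit, but no quantization error.
      static float scoreExact(float idf, float tf, long exactLength, float avgdl) {
        return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * exactLength / avgdl));
      }
    }

The cached path trades a small amount of per-hit arithmetic for the quantization error discussed throughout this thread.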
--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
