Our similarities do not need a boolean flag. Instead we should focus on making them as simple as possible: there can always be alternative implementations.
On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:

> I agree that using one byte by default is questionable on modern machines and given common text field sizes as well. I think my understanding of how norms are encoded/accessed may be wrong from what I had said. Lucene53NormsFormat supports Long, I see, and it's clever about observing the max bytes-per-value needed. No need for some new format. It's the Similarity impls (BM25 is one but others do this too) that choose to encode a smaller value. It would be nice to have this be toggle-able! Maybe just a boolean flag?
>
> On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>>
>> Hi David,
>>
>> thank you for picking this up. Now we are having a more meaningful discussion regarding the "waste".
>>
>>> Leo,
>>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>>
>> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario, at least). Otherwise, search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>>
>> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>>
>> Anyway, situations in which you get more than 65536 words per document are quite rare, and documents with 4 billion words (or more) are exotic. If you have such enormous documents, saving on document normalization factors won't be your first priority: you would probably think about ways of splitting such a huge document, which contains every possible keyword, into something more manageable.
>>
>> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this may disappear with new generations of CPUs.
>>
>>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I haven't used this, but I know where to look.)
>>
>> Ok, thanks.
>>
>>> ~ David
>>>
>>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if this could be fixed so that I get replies as well.
>>> >
>>> > First of all, I don't buy the claim that the issue is well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>>> >
>>> > Neither do I find your arguments convincing. In particular, I don't think there is any serious waste of space. Please see my detailed comments below.
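
As a concrete illustration of the single-byte encoding being discussed: below is a minimal standalone sketch, assuming Lucene 6.x, where BM25Similarity (like ClassicSimilarity) squeezes boost / sqrt(fieldLength) into one byte via SmallFloat.floatToByte315 and decodes it back through a 256-entry table. The class name NormPrecisionDemo is invented for this example; it only shows which "effective" length the similarity ends up working with.

  import org.apache.lucene.util.SmallFloat;

  public class NormPrecisionDemo {
    public static void main(String[] args) {
      // Encode the length the way the one-byte norm does (boost assumed to be 1),
      // then decode it back to see the effective length BM25 actually uses.
      for (int len : new int[] {100, 110, 125, 1000, 1100, 5000, 70000}) {
        byte encoded = SmallFloat.floatToByte315(1f / (float) Math.sqrt(len));
        float f = SmallFloat.byte315ToFloat(encoded);
        float effectiveLen = 1f / (f * f);  // nearby lengths collapse into the same bucket
        System.out.printf("exact length = %6d   effective length = %10.2f%n", len, effectiveLen);
      }
    }
  }

With only 256 possible byte values, many distinct document lengths decode to the same effective length, which is the approximation error discussed below.
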
>>> > Please note that I definitely don't know all the internals well, so I would appreciate it if you could explain them better.
>>> >
>>> >> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what lucene does by default, and that is per-field.
>>> >
>>> > This is not true. First of all, the increase applies only to text fields. Simple fields like integers don't use normalization factors, so there is no increase for them.
>>> >
>>> > In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>>> >
>>> > If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields with more than 65536 words, and so on and so forth. *Compared to the field sizes, a several-byte* increase is simply *laughable*.
>>> >
>>> > If Lucene saves the normalization factor *without compression*, it should already use 8 bytes, so storing the full document length won't make a difference.
>>> >
>>> >> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive, it can happen just from a strange analysis chain (n-grams etc) that you get large values here.
>>> >
>>> > As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>>> >
>>> >> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
>>> >
>>> > Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against. People claim to beat BM25 while they actually beat something inferior.
>>> >
>>> >> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added, that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>>> >
>>> > Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? That is what the current BM25 similarity seems to use.
>>> >
>>> > I also don't quite understand what is a justification for what; please clarify.
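
To make the byte arithmetic above concrete, here is a hypothetical helper (not Lucene code; Lucene53NormsFormat has its own width-selection logic) that picks the smallest whole number of bytes able to hold an exact field length, mirroring the 256 / 65536 thresholds mentioned in the quoted text:

  public class ExactLengthWidth {
    // Smallest of 1, 2, 4, or 8 bytes that can hold the largest exact field
    // length in a segment.
    static int bytesPerValue(long maxFieldLength) {
      if (maxFieldLength < (1L << 8))  return 1;  // fewer than 256 terms
      if (maxFieldLength < (1L << 16)) return 2;  // fewer than 65536 terms
      if (maxFieldLength < (1L << 32)) return 4;  // fewer than ~4.3 billion terms
      return 8;
    }

    public static void main(String[] args) {
      for (long len : new long[] {200L, 50_000L, 10_000_000L, 5_000_000_000L}) {
        System.out.println(len + " terms -> " + bytesPerValue(len) + " byte(s) per document");
      }
    }
  }
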
>>> >> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because lucene supports all these features.
>>> >
>>> > Sorry, I don't get this again. Which features should Lucene support? If you like to use boosting in exactly the same way you used it before (though I won't recommend doing so), you can do this. In fact, my implementation tries to mimic this as much as possible. If you mean something else, please clarify.
>>> >
>>> > Also, how does one save the document length with more accuracy? Is there a special API or something?
>>> >
>>> > Thank you!
>>> >
>>> >> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>> >> > Hi everybody,
>>> >> >
>>> >> > Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>>> >> >
>>> >> > Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again on some of my quirky data sets). The results are as follows:
>>> >> >
>>> >> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>>> >> > 2) The modified Lucene BM25 is also more accurate.
>>> >> > (I don't see a good reason why memoization and document-length approximation should result in any efficiency gain at all, but this is what seems to happen with current hardware.)
>>> >> >
>>> >> > If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>>> >> >
>>> >> > In any case, the reproducible example (which also links to a more detailed explanation) is in my repo:
>>> >> > https://github.com/searchivarius/AccurateLuceneBM25
>>> >> >
>>> >> > Many thanks!
>>> >> >
>>> >> > ---
>>> >> > Leo
>>> >
>>> > ---
>>> > Leo
>>> >
>>> --
>>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>>
>> ---
>> Leo
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
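
For completeness, here is a minimal, self-contained sketch of the speed/accuracy trade-off described in the quoted messages. It is not Lucene's source and not the code from the linked repository; the constants and names are made up. With one-byte norms there are only 256 possible decoded lengths, so the length-dependent part of the BM25 denominator can be precomputed once per query; with exact lengths it has to be recomputed for every scored document, which helps explain why the exact variant is reported to be only a few percent slower.

  public class Bm25LengthFactorSketch {
    static final float K1 = 1.2f;   // common BM25 defaults, chosen for illustration
    static final float B = 0.75f;

    // Memoized variant: precompute k1 * ((1 - b) + b * docLen / avgDocLen) for
    // each of the (typically 256) decodable lengths, then score via array lookup.
    static float[] buildCache(float[] decodedLengths, float avgDocLen) {
      float[] cache = new float[decodedLengths.length];
      for (int i = 0; i < cache.length; i++) {
        cache[i] = K1 * ((1 - B) + B * decodedLengths[i] / avgDocLen);
      }
      return cache;
    }

    // Exact variant: the same factor computed from the true document length,
    // at the cost of a multiply and a divide per scored document.
    static float exactFactor(long exactDocLen, float avgDocLen) {
      return K1 * ((1 - B) + B * exactDocLen / avgDocLen);
    }

    public static void main(String[] args) {
      System.out.println("factor for a 731-term doc (avg 500): " + exactFactor(731L, 500f));
    }
  }
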
