Leo,

There may be confusion here as to where the space is wasted. 1 vs. 8 bytes per
doc on disk is peanuts, sure, but in RAM it is not, and that is the concern.
AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to
know which offset to go to on disk based on a document id, which precludes
compression, but maybe you have ideas to improve that.
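
For concreteness, here is roughly what the default one-byte norm does. This is
just a sketch based on the SmallFloat helpers as they exist in Lucene 6.x, not
the verbatim BM25Similarity source; the point is that the field length gets
squeezed into a lossy 8-bit float, so nearby lengths can collapse onto the
same byte:

import org.apache.lucene.util.SmallFloat;

public class OneByteNormSketch {

  // Roughly what the default similarity does at index time:
  // fold boost and field length into boost / sqrt(length), then quantize to one byte.
  static byte encode(float boost, int fieldLength) {
    return SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));
  }

  // Recover an approximate length from that byte (the inverse of the encode step).
  static float approxLength(byte norm) {
    float f = SmallFloat.byte315ToFloat(norm);
    return 1f / (f * f);
  }

  public static void main(String[] args) {
    // Many nearby lengths map to the same byte, which is exactly the
    // accuracy loss being debated in this thread.
    for (int len = 100; len <= 110; len++) {
      byte b = encode(1f, len);
      System.out.printf("length=%d  norm byte=%d  decoded length ~= %.1f%n",
          len, b, approxLength(b));
    }
  }
}
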
To use your own norms encoding, see Codec.normsFormat (a rough sketch is at
the bottom of this message). Disclaimer: I haven't used this, but I know where
to look.

~ David

On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:

> Hi,
>
> for some reason I didn't get a reply from the mailing list directly, so I
> have to send a new message. I would appreciate it if this could be fixed so
> that I get replies as well.
>
> First of all, I don't buy the claim that the issue is well known. I would
> actually argue that nobody except a few Lucene devs knows about it. There
> is also a bug in Lucene's tutorial example. This needs to be fixed as well.
>
> Neither do I find your arguments convincing. In particular, I don't think
> that there is any serious waste of space. Please see my detailed comments
> below. Note that I definitely don't know all the internals well, so I would
> appreciate it if you could explain them better.
>
>> The downsides are documented and known. But I don't think you are
>> fully documenting the tradeoffs here: by encoding up to a 64-bit long,
>> you can use up to *8x more memory and disk space* than what Lucene
>> does by default, and that is per-field.
>
> This is not true. First of all, the increase applies only to textual
> fields. Simple fields like integers don't use normalization factors, so
> there is no increase for them.
>
> In the worst case, you will have 7 extra bytes for a *text* field.
> However, this is not an 8x increase.
>
> If you do *compress* the length of the text field, then its size will
> depend on the size of the text field. For example, one extra byte will be
> required for fields that contain more than 256 words, two extra bytes for
> fields having more than 65536 words, and so on and so forth. Compared to
> the field sizes, a several-byte increase is simply *laughable*.
>
> If Lucene saves the normalization factor *without compression*, it should
> already be using 8 bytes, so storing the full document length won't make a
> difference.
>
>> So that is a real trap. Maybe throw an exception there instead if the
>> boost != 1F (just don't support it), and add a guard for "supermassive"
>> documents, so that at most only 16 bits are ever used instead of 64. The
>> document need not really be massive; it can happen just from a strange
>> analysis chain (n-grams etc.) that you get large values here.
>
> As mentioned above, storing a few extra bytes for supermassive documents
> doesn't affect the overall storage by more than a tiny fraction of a
> percent.
>
>> I have run comparisons in the past on standard collections to see what
>> happens with this "feature" and differences were very small. I think
>> in practice people do far more damage by sharding their document
>> collections but not using a distributed interchange of IDF, causing
>> results from different shards to be incomparable :)
>
> OK, this is not what I see on my data. I see *more than* a 10%
> degradation. This is not peanuts. Do we want to re-run experiments on
> standard collections? Don't forget that Lucene is now used as a baseline
> to compare against. People claim to beat BM25 while they beat something
> inferior.
>
>> As far as the BM25 tableization, that is *not* the justification for
>> using an 8-byte encoding. The 8-byte encoding was already there long
>> ago (to save memory/space) when BM25 was added; that small
>> optimization just takes advantage of it. The optimization exists just
>> so that BM25 will have comparable performance to ClassicSimilarity.
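
(A note on the "tableization" mentioned just above: because a norm is a single
byte, there are only 256 possible encoded lengths, so the length-dependent
part of BM25's denominator can be precomputed once per query instead of per
scored document. A rough sketch with made-up statistics follows; it is not the
actual BM25Similarity source:)

import org.apache.lucene.util.SmallFloat;

public class Bm25TableSketch {
  public static void main(String[] args) {
    final float k1 = 1.2f, b = 0.75f;
    final float avgdl = 28f; // made-up average field length; normally taken from collection statistics

    // Precompute the length-dependent denominator term for all 256 norm bytes.
    float[] cache = new float[256];
    for (int i = 0; i < 256; i++) {
      float f = SmallFloat.byte315ToFloat((byte) i);
      // byte 0 decodes to 0; treat it as a huge length to avoid dividing by zero
      float approxLen = (f == 0f) ? Float.MAX_VALUE / 8 : 1f / (f * f);
      cache[i] = k1 * ((1 - b) + b * approxLen / avgdl);
    }

    // At scoring time, length normalization becomes a single array lookup.
    byte normByte = SmallFloat.floatToByte315(1f / (float) Math.sqrt(30)); // e.g. a 30-term document
    float freq = 2f, idf = 3.1f; // made-up term statistics
    float score = idf * freq * (k1 + 1) / (freq + cache[normByte & 0xFF]);
    System.out.println("BM25 contribution = " + score);
  }
}
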
>
> Sorry, I don't understand this comment. What kind of 8-byte encoding are
> you talking about? Do you mean a single-byte encoding? That is what the
> current BM25 similarity seems to use.
>
> I also don't quite understand what is the justification for what here;
> please clarify.
>
>> Either way, the document's length can be stored with more accuracy,
>> without wasting space, especially if you don't use index-time
>> boosting. But the default encoding supports features like that because
>> Lucene supports all these features.
>
> Sorry, I don't get this either. Which features should Lucene support? If
> you would like to use boosting in exactly the same way you used it before
> (though I wouldn't recommend doing so), you can do that. In fact, my
> implementation tries to mimic this as much as possible. If you mean
> something else, please clarify.
>
> Also, how does one save the document length with more accuracy? Is there
> a special API or something?
>
> Thank you!
>
>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>> > Hi everybody,
>> >
>> > Some time ago, I had to re-implement some Lucene similarities (in
>> > particular BM25 and the older cosine). I noticed that the re-implemented
>> > version (despite using the same formula) performed better on my data
>> > set. The main difference was that my version did not approximate the
>> > document length.
>> >
>> > Recently, I have implemented a modification of the current Lucene BM25
>> > that doesn't use this approximation either. I compared the existing and
>> > the modified similarities (again on some of my quirky data sets). The
>> > results are as follows:
>> >
>> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in
>> > my tests).
>> > 2) The modified Lucene BM25 is also more accurate.
>> > (I don't see a good reason why memoization and document-length
>> > approximation should result in any efficiency gain at all, but this is
>> > what seems to happen on current hardware.)
>> >
>> > If this potential accuracy degradation concerns you, additional
>> > experiments using more standard collections can be done (e.g., some
>> > TREC collections).
>> >
>> > In any case, the reproducible example (which also links to a more
>> > detailed explanation) is in my repo:
>> > https://github.com/searchivarius/AccurateLuceneBM25
>> >
>> > Many thanks!
>> >
>> > ---
>> > Leo
>
> ---
> Leo

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
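
P.S. Here is a minimal, untested sketch of the Codec.normsFormat route
mentioned at the top of this message: wrap the default codec with FilterCodec
and swap in your own norms encoding. MyNormsFormat is a hypothetical
placeholder; implementing NormsFormat (normsConsumer/normsProducer) is where
you would decide how the lengths actually get stored.

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.NormsFormat;

public class CustomNormsCodec extends FilterCodec {

  private final NormsFormat customNorms = new MyNormsFormat(); // hypothetical placeholder

  public CustomNormsCodec() {
    // Delegate everything else (postings, stored fields, ...) to the current default codec.
    super("CustomNormsCodec", Codec.getDefault());
  }

  @Override
  public NormsFormat normsFormat() {
    return customNorms;
  }
}

// You would plug it in at index time via IndexWriterConfig#setCodec(new CustomNormsCodec()).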
