Leo,

There may be confusion here as to where the space is wasted. 1 vs. 8 bytes per
doc on disk is peanuts, sure, but in RAM it is not, and that is the concern.
AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to
know which offset to go to on disk based on a document id, which precludes
compression, but maybe you have ideas to improve that.
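
For concreteness, here is roughly what the default one-byte norm does. This is
just a sketch based on the SmallFloat helpers as they exist in Lucene 6.x, not
the verbatim BM25Similarity source; the point is that the field length gets
squeezed into a lossy 8-bit float, so nearby lengths can collapse onto the
same byte:

import org.apache.lucene.util.SmallFloat;

public class OneByteNormSketch {

  // Roughly what the default similarity does at index time:
  // fold boost and field length into boost / sqrt(length), then quantize to one byte.
  static byte encode(float boost, int fieldLength) {
    return SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));
  }

  // Recover an approximate length from that byte (the inverse of the encode step).
  static float approxLength(byte norm) {
    float f = SmallFloat.byte315ToFloat(norm);
    return 1f / (f * f);
  }

  public static void main(String[] args) {
    // Many nearby lengths map to the same byte, which is exactly the
    // accuracy loss being debated in this thread.
    for (int len = 100; len <= 110; len++) {
      byte b = encode(1f, len);
      System.out.printf("length=%d  norm byte=%d  decoded length ~= %.1f%n",
          len, b, approxLength(b));
    }
  }
}
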
To use your own norms encoding, see Codec.normsFormat (a rough sketch is at
the bottom of this message). Disclaimer: I haven't used this, but I know where
to look.

~ David

On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:

> Hi,
>
> for some reason I didn't get a reply from the mailing list directly, so I
> have to send a new message. I would appreciate it if this could be fixed so
> that I get replies as well.
>
> First of all, I don't buy the claim that the issue is well known. I would
> actually argue that nobody except a few Lucene devs knows about it. There
> is also a bug in Lucene's tutorial example. This needs to be fixed as well.
>
> Neither do I find your arguments convincing. In particular, I don't think
> that there is any serious waste of space. Please see my detailed comments
> below. Note that I definitely don't know all the internals well, so I would
> appreciate it if you could explain them better.
>
>> The downsides are documented and known. But I don't think you are
>> fully documenting the tradeoffs here: by encoding up to a 64-bit long,
>> you can use up to *8x more memory and disk space* than what Lucene
>> does by default, and that is per-field.
>
> This is not true. First of all, the increase applies only to textual
> fields. Simple fields like integers don't use normalization factors, so
> there is no increase for them.
>
> In the worst case, you will have 7 extra bytes for a *text* field.
> However, this is not an 8x increase.
>
> If you do *compress* the length of the text field, then its size will
> depend on the size of the text field. For example, one extra byte will be
> required for fields that contain more than 256 words, two extra bytes for
> fields having more than 65536 words, and so on and so forth. Compared to
> the field sizes, a several-byte increase is simply *laughable*.
>
> If Lucene saves the normalization factor *without compression*, it should
> already be using 8 bytes, so storing the full document length won't make a
> difference.
>
>> So that is a real trap. Maybe throw an exception there instead if the
>> boost != 1F (just don't support it), and add a guard for "supermassive"
>> documents, so that at most only 16 bits are ever used instead of 64. The
>> document need not really be massive; it can happen just from a strange
>> analysis chain (n-grams etc.) that you get large values here.
>
> As mentioned above, storing a few extra bytes for supermassive documents
> doesn't affect the overall storage by more than a tiny fraction of a
> percent.
>
>> I have run comparisons in the past on standard collections to see what
>> happens with this "feature" and differences were very small. I think
>> in practice people do far more damage by sharding their document
>> collections but not using a distributed interchange of IDF, causing
>> results from different shards to be incomparable :)
>
> OK, this is not what I see on my data. I see *more than* a 10%
> degradation. This is not peanuts. Do we want to re-run experiments on
> standard collections? Don't forget that Lucene is now used as a baseline
> to compare against. People claim to beat BM25 while they beat something
> inferior.
>
>> As far as the BM25 tableization, that is *not* the justification for
>> using an 8-byte encoding. The 8-byte encoding was already there long
>> ago (to save memory/space) when BM25 was added; that small
>> optimization just takes advantage of it. The optimization exists just
>> so that BM25 will have comparable performance to ClassicSimilarity.
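
(A note on the "tableization" mentioned just above: because a norm is a single
byte, there are only 256 possible encoded lengths, so the length-dependent
part of BM25's denominator can be precomputed once per query instead of per
scored document. A rough sketch with made-up statistics follows; it is not the
actual BM25Similarity source:)

import org.apache.lucene.util.SmallFloat;

public class Bm25TableSketch {
  public static void main(String[] args) {
    final float k1 = 1.2f, b = 0.75f;
    final float avgdl = 28f; // made-up average field length; normally taken from collection statistics

    // Precompute the length-dependent denominator term for all 256 norm bytes.
    float[] cache = new float[256];
    for (int i = 0; i < 256; i++) {
      float f = SmallFloat.byte315ToFloat((byte) i);
      // byte 0 decodes to 0; treat it as a huge length to avoid dividing by zero
      float approxLen = (f == 0f) ? Float.MAX_VALUE / 8 : 1f / (f * f);
      cache[i] = k1 * ((1 - b) + b * approxLen / avgdl);
    }

    // At scoring time, length normalization becomes a single array lookup.
    byte normByte = SmallFloat.floatToByte315(1f / (float) Math.sqrt(30)); // e.g. a 30-term document
    float freq = 2f, idf = 3.1f; // made-up term statistics
    float score = idf * freq * (k1 + 1) / (freq + cache[normByte & 0xFF]);
    System.out.println("BM25 contribution = " + score);
  }
}
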
>
> Sorry, I don't understand this comment. What kind of 8-byte encoding are
> you talking about? Do you mean a single-byte encoding? That is what the
> current BM25 similarity seems to use.
>
> I also don't quite understand what is the justification for what here;
> please clarify.
>
>> Either way, the document's length can be stored with more accuracy,
>> without wasting space, especially if you don't use index-time
>> boosting. But the default encoding supports features like that because
>> Lucene supports all these features.
>
> Sorry, I don't get this either. Which features should Lucene support? If
> you would like to use boosting in exactly the same way you used it before
> (though I wouldn't recommend doing so), you can do that. In fact, my
> implementation tries to mimic this as much as possible. If you mean
> something else, please clarify.
>
> Also, how does one save the document length with more accuracy? Is there
> a special API or something?
>
> Thank you!
>
>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>> > Hi everybody,
>> >
>> > Some time ago, I had to re-implement some Lucene similarities (in
>> > particular BM25 and the older cosine). I noticed that the re-implemented
>> > version (despite using the same formula) performed better on my data
>> > set. The main difference was that my version did not approximate the
>> > document length.
>> >
>> > Recently, I have implemented a modification of the current Lucene BM25
>> > that doesn't use this approximation either. I compared the existing and
>> > the modified similarities (again on some of my quirky data sets). The
>> > results are as follows:
>> >
>> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in
>> > my tests).
>> > 2) The modified Lucene BM25 is also more accurate.
>> > (I don't see a good reason why memoization and document-length
>> > approximation should result in any efficiency gain at all, but this is
>> > what seems to happen on current hardware.)
>> >
>> > If this potential accuracy degradation concerns you, additional
>> > experiments using more standard collections can be done (e.g., some
>> > TREC collections).
>> >
>> > In any case, the reproducible example (which also links to a more
>> > detailed explanation) is in my repo:
>> > https://github.com/searchivarius/AccurateLuceneBM25
>> >
>> > Many thanks!
>> >
>> > ---
>> > Leo
>
> ---
> Leo

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
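
P.S. Here is a minimal, untested sketch of the Codec.normsFormat route
mentioned at the top of this message: wrap the default codec with FilterCodec
and swap in your own norms encoding. MyNormsFormat is a hypothetical
placeholder; implementing NormsFormat (normsConsumer/normsProducer) is where
you would decide how the lengths actually get stored.

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.NormsFormat;

public class CustomNormsCodec extends FilterCodec {

  private final NormsFormat customNorms = new MyNormsFormat(); // hypothetical placeholder

  public CustomNormsCodec() {
    // Delegate everything else (postings, stored fields, ...) to the current default codec.
    super("CustomNormsCodec", Codec.getDefault());
  }

  @Override
  public NormsFormat normsFormat() {
    return customNorms;
  }
}

// You would plug it in at index time via IndexWriterConfig#setCodec(new CustomNormsCodec()).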
