I agree that using one byte by default is questionable on modern machines, and given common text field sizes as well. I think my earlier description of how norms are encoded/accessed was off. Lucene53NormsFormat supports a long, I see, and it's clever about observing the max bytes-per-value actually needed, so there's no need for a new format. It's the Similarity impls (BM25 is one, but others do this too) that choose to encode a smaller value (roughly sketched below). It would be nice to have this be toggle-able! Maybe just a boolean flag?
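For concreteness, here is roughly where that single-byte decision happens in the 6.x line, paraphrased from memory of the sources rather than copied verbatim, so treat it as an approximate sketch and not the shipped code: the norms format will happily persist a full long per document, but the BM25 similarity squeezes boost/sqrt(fieldLength) into one byte via SmallFloat before the norm ever reaches the codec.

```java
// Approximately what the 6.x BM25 similarity does when it writes a norm
// (paraphrased sketch; see the real Lucene sources for the authoritative version).
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.util.SmallFloat;

public class NormEncodingSketch {

  // The Similarity, not the norms format, decides to use one byte:
  // the field length is folded into boost / sqrt(length) and quantized to 8 bits.
  static byte encodeNormValue(float boost, int fieldLength) {
    return SmallFloat.floatToByte315(boost / (float) Math.sqrt(fieldLength));
  }

  // computeNorm() returns a long, so the codec could store up to 8 bytes,
  // but the value handed to it already fits in a single byte.
  static long computeNorm(FieldInvertState state) {
    int numTerms = state.getLength();  // (discountOverlaps handling omitted)
    return encodeNormValue(state.getBoost(), numTerms);
  }
}
```

A "boolean flag" version would presumably just skip the SmallFloat call here and return the raw length from computeNorm(), with the scoring side decoding accordingly.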
On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:

> Hi David,
>
> thank you for picking it up. Now we are having a more meaningful discussion regarding the "waste".
>
> Leo,
>
>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not, and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>
> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario at least). Otherwise, the search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>
> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>
> Anyways, situations in which you will get more than 65536 words per document are quite rare. Situations with documents having 4 billion words (or more) are exotic. If you have such enormous documents, again, saving on document normalization factors won't be your first priority; you would probably think about ways of splitting such a huge document containing every possible keyword into something more manageable.
>
> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this may disappear with new generations of CPUs.
>
>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I haven't used this, but I know where to look.)
>
> Ok, thanks.
>
>> ~ David
>>
>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I'd appreciate it if something could be fixed so that I get a reply as well.
>>>
>>> First of all, I don't buy the claim about the issue being well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>>>
>>> Neither do I find your arguments convincing. In particular, I don't think that there is any serious waste of space. Please see my detailed comments below. Note that I definitely don't know all the internals well, so I'd appreciate it if you could explain them better.
>>>
>>>> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what lucene does by default, and that is per-field.
>>>
>>> This is not true. First of all, the increase is only for the textual fields. Simple fields like integers don't use normalization factors, so there is no increase for them.
>>>
>>> In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>>>
>>> If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields having more than 65536 words, and so on and so forth. Compared to the field sizes, a several-byte increase is simply *laughable*.
>>>
>>> If Lucene saves the normalization factor *without compression*, it should already use 8 bytes now, so storing the full document length won't make a difference.
>>>
>>>> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive; it can happen just from a strange analysis chain (n-grams etc) that you get large values here.
>>>
>>> As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>>>
>>>> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
>>>
>>> Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against: people claim to beat BM25 while they beat something inferior.
>>>
>>>> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added; that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>>>
>>> Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? That is what the current BM25 similarity seems to use.
>>>
>>> I also don't quite understand what is a justification for what; please clarify.
>>>
>>>> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because lucene supports all these features.
>>>
>>> Sorry, I don't get this again. Which features should Lucene support? If you'd like to use boosting in exactly the same way you used it before (though I won't recommend doing so), you can do this. In fact, my implementation tries to mimic this as much as possible. If you mean something else, please clarify.
>>>
>>> Also, how does one save the document length with more accuracy? Is there a special API or something?
>>>
>>> Thank you!
>>>
>>>> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>>>
>>>>> Hi everybody,
>>>>>
>>>>> Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>>>>>
>>>>> Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again on some of my quirky data sets). The results are as follows:
>>>>>
>>>>> 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>>>>> 2) The modified Lucene BM25 is also more accurate.
>>>>>
>>>>> (I don't see a good reason why memoization and document length approximation should result in any efficiency gain at all, but this is what seems to happen with the current hardware.)
>>>>>
>>>>> If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>>>>>
>>>>> In any case, the reproducible example (which also links to a more detailed explanation) is in my repo: https://github.com/searchivarius/AccurateLuceneBM25
>>>>>
>>>>> Many thanks!
>>>>>
>>>>> ---
>>>>> Leo
>>
>> --
>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>
> ---
> Leo

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
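For anyone curious how much precision the default single-byte norm gives up, here is a small illustration built around Lucene's SmallFloat (the float-to-byte scheme the 6.x BM25 similarity relies on). It is a standalone sketch written for this thread, not code from either implementation, and the exact bucket boundaries depend on the Lucene version; the round trip simply counts how many distinct document lengths collapse onto each encoded byte.

```java
// Standalone sketch: round-trip document lengths through the single-byte
// norm encoding (boost = 1) and count how many distinct raw lengths end up
// sharing the same encoded byte. Requires lucene-core on the classpath.
import java.util.Map;
import java.util.TreeMap;
import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
  public static void main(String[] args) {
    Map<Byte, Integer> collisions = new TreeMap<>();
    for (int length = 1; length <= 10_000; length++) {
      // What the 6.x BM25 similarity stores: one byte of 1 / sqrt(length).
      byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(length));
      collisions.merge(norm, 1, Integer::sum);
    }
    // Each entry: encoded byte -> how many different true lengths map to it.
    // The "length" the scorer later reconstructs is 1 / decoded^2, i.e. a coarse bucket.
    collisions.forEach((norm, count) -> {
      float decoded = SmallFloat.byte315ToFloat(norm);
      float approxLength = 1f / (decoded * decoded);
      System.out.printf("byte %4d -> ~length %10.1f (%d distinct lengths)%n",
          norm, approxLength, count);
    });
  }
}
```

Because the decode side only ever sees 256 possible values, the BM25 similarity can precompute a per-byte cache (the "memoization" or "tableization" discussed above); widening the norm gives up that table in exchange for the exact length.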

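If someone wants to experiment with an exact-length similarity such as the one in the linked AccurateLuceneBM25 repo (or any custom Similarity), it has to be configured consistently at index time and at search time, because the norm encoding is baked into the index and changing it means re-indexing. A minimal wiring sketch under those assumptions (the "index" path and the stand-in BM25Similarity are just placeholders; swap in the custom implementation where noted):

```java
// Minimal wiring sketch (Lucene 6.x era APIs).
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SimilarityWiringSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder: replace with an exact-length Similarity implementation
    // (e.g., the one from the repo linked above).
    Similarity sim = new BM25Similarity();

    // Index time: norms are computed by the Similarity configured here,
    // so changing the encoding requires re-indexing.
    Directory dir = FSDirectory.open(Paths.get("index"));
    IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
    cfg.setSimilarity(sim);
    try (IndexWriter writer = new IndexWriter(dir, cfg)) {
      // ... add documents ...
    }

    // Search time: the same Similarity must decode the norms it wrote.
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setSimilarity(sim);
      // ... run queries ...
    }
  }
}
```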