Our similarities do not need a boolean flag. Instead we should focus on making them as simple as possible: there can always be alternative implementations.
On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:

> I agree that using one byte by default is questionable on modern machines and given common text field sizes as well. I think my understanding of how norms are encoded/accessed may be wrong from what I had said. Lucene53NormsFormat supports Long, I see, and it's clever about observing the max bytes-per-value needed. No need for some new format. It's the Similarity impls (BM25 is one but others do this too) that choose to encode a smaller value. It would be nice to have this be toggle-able! Maybe just a boolean flag?
>
> On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>>
>> Hi David,
>>
>> thank you for picking this up. Now we are having a more meaningful discussion regarding the "waste".
>>
>>> Leo,
>>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>>
>> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario, at least). Otherwise, search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>>
>> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>>
>> Anyway, situations in which you get more than 65536 words per document are quite rare, and documents with 4 billion words (or more) are exotic. If you have such enormous documents, saving on document normalization factors won't be your first priority: you would probably think about ways of splitting such a huge document, which contains every possible keyword, into something more manageable.
>>
>> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this may disappear with new generations of CPUs.
>>
>>> To use your own norms encoding, see Codec.normsFormat. (Disclaimer: I haven't used this, but I know where to look.)
>>
>> Ok, thanks.
>>
>>> ~ David
>>>
>>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if this could be fixed so that I get replies as well.
>>> >
>>> > First of all, I don't buy the claim that the issue is well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>>> >
>>> > Neither do I find your arguments convincing. In particular, I don't think there is any serious waste of space. Please see my detailed comments below.
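
As a concrete illustration of the single-byte encoding being discussed: below is a minimal standalone sketch, assuming Lucene 6.x, where BM25Similarity (like ClassicSimilarity) squeezes boost / sqrt(fieldLength) into one byte via SmallFloat.floatToByte315 and decodes it back through a 256-entry table. The class name NormPrecisionDemo is invented for this example; it only shows which "effective" length the similarity ends up working with.

  import org.apache.lucene.util.SmallFloat;

  public class NormPrecisionDemo {
    public static void main(String[] args) {
      // Encode the length the way the one-byte norm does (boost assumed to be 1),
      // then decode it back to see the effective length BM25 actually uses.
      for (int len : new int[] {100, 110, 125, 1000, 1100, 5000, 70000}) {
        byte encoded = SmallFloat.floatToByte315(1f / (float) Math.sqrt(len));
        float f = SmallFloat.byte315ToFloat(encoded);
        float effectiveLen = 1f / (f * f);  // nearby lengths collapse into the same bucket
        System.out.printf("exact length = %6d   effective length = %10.2f%n", len, effectiveLen);
      }
    }
  }

With only 256 possible byte values, many distinct document lengths decode to the same effective length, which is the approximation error discussed below.
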
>>> > Please note that I definitely don't know all the internals well, so I would appreciate it if you could explain them better.
>>> >
>>> >> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what lucene does by default, and that is per-field.
>>> >
>>> > This is not true. First of all, the increase applies only to text fields. Simple fields like integers don't use normalization factors, so there is no increase for them.
>>> >
>>> > In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>>> >
>>> > If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields with more than 65536 words, and so on and so forth. *Compared to the field sizes, a several-byte* increase is simply *laughable*.
>>> >
>>> > If Lucene saves the normalization factor *without compression*, it should already use 8 bytes, so storing the full document length won't make a difference.
>>> >
>>> >> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive, it can happen just from a strange analysis chain (n-grams etc) that you get large values here.
>>> >
>>> > As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>>> >
>>> >> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
>>> >
>>> > Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against. People claim to beat BM25 while they actually beat something inferior.
>>> >
>>> >> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added, that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>>> >
>>> > Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? That is what the current BM25 similarity seems to use.
>>> >
>>> > I also don't quite understand what is a justification for what; please clarify.
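
To make the byte arithmetic above concrete, here is a hypothetical helper (not Lucene code; Lucene53NormsFormat has its own width-selection logic) that picks the smallest whole number of bytes able to hold an exact field length, mirroring the 256 / 65536 thresholds mentioned in the quoted text:

  public class ExactLengthWidth {
    // Smallest of 1, 2, 4, or 8 bytes that can hold the largest exact field
    // length in a segment.
    static int bytesPerValue(long maxFieldLength) {
      if (maxFieldLength < (1L << 8))  return 1;  // fewer than 256 terms
      if (maxFieldLength < (1L << 16)) return 2;  // fewer than 65536 terms
      if (maxFieldLength < (1L << 32)) return 4;  // fewer than ~4.3 billion terms
      return 8;
    }

    public static void main(String[] args) {
      for (long len : new long[] {200L, 50_000L, 10_000_000L, 5_000_000_000L}) {
        System.out.println(len + " terms -> " + bytesPerValue(len) + " byte(s) per document");
      }
    }
  }
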
>>> >> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because lucene supports all these features.
>>> >
>>> > Sorry, I don't get this again. Which features should Lucene support? If you like to use boosting in exactly the same way you used it before (though I won't recommend doing so), you can do this. In fact, my implementation tries to mimic this as much as possible. If you mean something else, please clarify.
>>> >
>>> > Also, how does one save the document length with more accuracy? Is there a special API or something?
>>> >
>>> > Thank you!
>>> >
>>> >> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>>> >> > Hi everybody,
>>> >> >
>>> >> > Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>>> >> >
>>> >> > Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again on some of my quirky data sets). The results are as follows:
>>> >> >
>>> >> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>>> >> > 2) The modified Lucene BM25 is also more accurate.
>>> >> > (I don't see a good reason why memoization and document-length approximation should result in any efficiency gain at all, but this is what seems to happen with current hardware.)
>>> >> >
>>> >> > If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>>> >> >
>>> >> > In any case, the reproducible example (which also links to a more detailed explanation) is in my repo:
>>> >> > https://github.com/searchivarius/AccurateLuceneBM25
>>> >> >
>>> >> > Many thanks!
>>> >> >
>>> >> > ---
>>> >> > Leo
>>> >
>>> > ---
>>> > Leo
>>> >
>>> --
>>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>>
>> ---
>> Leo
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
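
For completeness, here is a minimal, self-contained sketch of the speed/accuracy trade-off described in the quoted messages. It is not Lucene's source and not the code from the linked repository; the constants and names are made up. With one-byte norms there are only 256 possible decoded lengths, so the length-dependent part of the BM25 denominator can be precomputed once per query; with exact lengths it has to be recomputed for every scored document, which helps explain why the exact variant is reported to be only a few percent slower.

  public class Bm25LengthFactorSketch {
    static final float K1 = 1.2f;   // common BM25 defaults, chosen for illustration
    static final float B = 0.75f;

    // Memoized variant: precompute k1 * ((1 - b) + b * docLen / avgDocLen) for
    // each of the (typically 256) decodable lengths, then score via array lookup.
    static float[] buildCache(float[] decodedLengths, float avgDocLen) {
      float[] cache = new float[decodedLengths.length];
      for (int i = 0; i < cache.length; i++) {
        cache[i] = K1 * ((1 - B) + B * decodedLengths[i] / avgDocLen);
      }
      return cache;
    }

    // Exact variant: the same factor computed from the true document length,
    // at the cost of a multiply and a divide per scored document.
    static float exactFactor(long exactDocLen, float avgDocLen) {
      return K1 * ((1 - B) + B * exactDocLen / avgDocLen);
    }

    public static void main(String[] args) {
      System.out.println("factor for a 731-term doc (avg 500): " + exactFactor(731L, 500f));
    }
  }
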
