Hi David,
Submitting a patch wouldn't be a problem. But let me first do a couple more tests with more collections (this time I will try more standard and larger ones).
Thanks!
---
Leo

On Sat, Jul 9, 2016 at 10:20 AM, David Smiley <[email protected]> wrote:
> --ok; (they already have configuration parameters).
>
> Leo, if you can submit a patch to extend the BM25 similarity, I would welcome it.
>
> On Sat, Jul 9, 2016 at 7:11 AM Robert Muir <[email protected]> wrote:
>> Our similarities do not need a boolean flag. Instead we should focus on making them as simple as possible: there can always be alternative implementations.
>>
>> On Sat, Jul 9, 2016 at 1:08 AM, David Smiley <[email protected]> wrote:
>> > I agree that using one byte by default is questionable on modern machines and given common text field sizes as well. I think my understanding of how norms are encoded/accessed may be wrong from what I had said. Lucene53NormsFormat supports Long, I see, and it's clever about observing the max bytes-per-value needed. No need for some new format. It's the Similarity impls (BM25 is one, but others do this too) that choose to encode a smaller value. It would be nice to have this be toggle-able! Maybe just a boolean flag?
>> >
>> > On Thu, Jul 7, 2016 at 9:52 PM Leo Boytsov <[email protected]> wrote:
>> >> Hi David,
>> >> thank you for picking this up. Now we are having a more meaningful discussion regarding the "waste".
>> >>
>> >>> Leo,
>> >>> There may be confusion here as to where the space is wasted. 1 vs 8 bytes per doc on disk is peanuts, sure, but in RAM it is not, and that is the concern. AFAIK the norms are memory-mapped in, and we need to ensure it's trivial to know which offset to go to on disk based on a document id, which precludes compression, but maybe you have ideas to improve that.
>> >>
>> >> First, my understanding is that all essential parts of the Lucene index are memory-mapped, in particular the inverted index (in the most common scenario at least). Otherwise, the search performance is miserable. That said, memory-mapping a few extra bytes per document shouldn't make a noticeable difference.
>> >>
>> >> Also, judging by the code in the class Lucene53NormsProducer and a debug session, Lucene only maps a compressed segment containing norm values. Norms are stored using 1, 2, 4, or 8 bytes and are uncompressed into an 8-byte long, probably on a per-slice basis.
>> >>
>> >> Anyway, situations in which you get more than 65536 words per document are quite rare. Situations with documents having 4 billion words (or more) are exotic. If you have such enormous documents, saving on document normalization factors won't be your first priority; you would probably think about ways of splitting such a huge document, containing every possible keyword, into something more manageable.
>> >>
>> >> To sum up, for 99.999% of users, squeezing normalization factors into a single byte has absolutely no benefit. Memoization does seem to speed things up a bit, but I suspect this advantage may disappear with new generations of CPUs.
>> >>
>> >>> To use your own norms encoding, see Codec.normsFormat. (disclaimer: I haven't used this, but I know where to look)
>> >>
>> >> Ok, thanks.
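For readers following the norm-encoding discussion above, here is a minimal, self-contained illustration of the precision loss being debated. It is only a sketch, assuming the Lucene 6.x-era setup where the one-byte norm is derived from SmallFloat.floatToByte315 applied to 1/sqrt(fieldLength); if a given version encodes norms differently, the exact numbers will differ, but the idea is the same.

    import org.apache.lucene.util.SmallFloat;

    public class NormPrecisionDemo {
      public static void main(String[] args) {
        // Encode several document lengths the way the one-byte norm encoding
        // does (index-time boost assumed to be 1), then decode the byte and
        // recover the approximate length the scorer would actually see.
        for (int len : new int[] {250, 300, 1000, 1150, 1300, 5000, 70_000}) {
          byte norm = SmallFloat.floatToByte315(1f / (float) Math.sqrt(len));
          float decoded = SmallFloat.byte315ToFloat(norm);
          float approxLen = 1f / (decoded * decoded);
          System.out.printf("true length = %6d, norm byte = %4d, approx length = %10.1f%n",
              len, norm, approxLen);
        }
      }
    }

Running it shows that nearby lengths can collapse to the same byte, which is the document-length approximation the thread is about.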
>> >>> ~ David
>> >>>
>> >>> On Wed, Jul 6, 2016 at 5:31 PM Leo Boytsov <[email protected]> wrote:
>> >>> > Hi,
>> >>> > for some reason I didn't get a reply from the mailing list directly, so I have to send a new message. I would appreciate it if something could be fixed so that I get replies as well.
>> >>> >
>> >>> > First of all, I don't buy the claim that the issue is well known. I would actually argue that nobody except a few Lucene devs knows about it. There is also a bug in Lucene's tutorial example; this needs to be fixed as well.
>> >>> >
>> >>> > Neither do I find your arguments convincing. In particular, I don't think there is any serious waste of space. Please see my detailed comments below. Note that I definitely don't know all the internals well, so I would appreciate it if you could explain them better.
>> >>> >
>> >>> >> The downsides are documented and known. But I don't think you are fully documenting the tradeoffs here: by encoding up to a 64-bit long, you can use up to *8x more memory and disk space* than what Lucene does by default, and that is per-field.
>> >>> >
>> >>> > This is not true. First of all, the increase applies only to textual fields. Simple fields like Integers don't use normalization factors, so there is no increase for them.
>> >>> >
>> >>> > In the worst case, you will have 7 extra bytes for a *text* field. However, this is not an 8x increase.
>> >>> >
>> >>> > If you do *compress* the length of the text field, then its size will depend on the size of the text field. For example, one extra byte will be required for fields that contain more than 256 words, two extra bytes for fields with more than 65536 words, and so on and so forth. Compared to the field sizes, a *several-byte* increase is simply *laughable*.
>> >>> >
>> >>> > If Lucene stored the normalization factor *without compression*, it would already be using 8 bytes, so storing the full document length wouldn't make a difference.
>> >>> >
>> >>> >> So that is a real trap. Maybe throw an exception there instead if the boost != 1F (just don't support it), and add a guard for "supermassive" documents, so that at most only 16 bits are ever used instead of 64. The document need not really be massive; it can happen just from a strange analysis chain (n-grams etc.) that you get large values here.
>> >>> >
>> >>> > As mentioned above, storing a few extra bytes for supermassive documents doesn't affect the overall storage by more than a tiny fraction of a percent.
>> >>> >
>> >>> >> I have run comparisons in the past on standard collections to see what happens with this "feature" and differences were very small. I think in practice people do far more damage by sharding their document collections but not using a distributed interchange of IDF, causing results from different shards to be incomparable :)
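The byte counts quoted above (one extra byte past 256 words, two extra bytes past 65536, and so on) follow directly from how many bits an exact length needs. A tiny stand-alone sketch, not part of any patch, that computes them:

    public class LengthBytesDemo {
      // How many whole bytes are needed to store an exact field length?
      static int bytesNeeded(long fieldLength) {
        int bits = 64 - Long.numberOfLeadingZeros(Math.max(fieldLength, 1));
        return (bits + 7) / 8;
      }

      public static void main(String[] args) {
        // Print the exact storage cost for a range of field lengths.
        for (long len : new long[] {200, 300, 65_000, 70_000, 5_000_000, 4_000_000_000L, 5_000_000_000L}) {
          System.out.printf("field length = %,13d -> %d byte(s)%n", len, bytesNeeded(len));
        }
      }
    }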
>> >>> > Ok, this is not what I see on my data. I see *more than* a 10% degradation. This is not peanuts. Do we want to re-run experiments on standard collections? Don't forget that Lucene is now used as a baseline to compare against: people claim to beat BM25 while in fact they beat something inferior.
>> >>> >
>> >>> >> As far as the bm25 tableization, that is *not* the justification for using an 8 byte encoding. The 8 byte encoding was already there long ago (to save memory/space) when bm25 was added; that small optimization just takes advantage of it. The optimization exists just so that bm25 will have comparable performance to ClassicSimilarity.
>> >>> >
>> >>> > Sorry, I don't understand this comment. What kind of 8-byte encoding are you talking about? Do you mean a single-byte encoding? This is what the current BM25 similarity seems to use.
>> >>> >
>> >>> > I also don't quite understand what is a justification for what; please clarify.
>> >>> >
>> >>> >> Either way, the document's length can be stored with more accuracy, without wasting space, especially if you don't use index-time boosting. But the default encoding supports features like that because Lucene supports all these features.
>> >>> >
>> >>> > Sorry, I don't get this either. Which features should Lucene support? If you would like to use boosting in exactly the same way you used it before (though I wouldn't recommend doing so), you can do this. In fact, my implementation tries to mimic the existing behavior as much as possible. If you mean something else, please clarify.
>> >>> >
>> >>> > Also, how does one save the document length with more accuracy? Is there a special API or something?
>> >>> >
>> >>> > Thank you!
>> >>> >
>> >>> >> On Mon, Jul 4, 2016 at 1:53 AM, Leo Boytsov <[email protected]> wrote:
>> >>> >> > Hi everybody,
>> >>> >> >
>> >>> >> > Some time ago, I had to re-implement some Lucene similarities (in particular BM25 and the older cosine). I noticed that the re-implemented version (despite using the same formula) performed better on my data set. The main difference was that my version did not approximate the document length.
>> >>> >> >
>> >>> >> > Recently, I have implemented a modification of the current Lucene BM25 that doesn't use this approximation either. I compared the existing and the modified similarities (again, on some of my quirky data sets). The results are as follows:
>> >>> >> >
>> >>> >> > 1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my tests).
>> >>> >> > 2) The modified Lucene BM25 is also more accurate.
>> >>> >> > (I don't see a good reason why memoization and document-length approximation should result in any efficiency gain at all, but this is what seems to happen with the current hardware.)
>> >>> >> >
>> >>> >> > If this potential accuracy degradation concerns you, additional experiments using more standard collections can be done (e.g., some TREC collections).
>> >>> >> >
>> >>> >> > In any case, the reproducible example (which also links to a more detailed explanation) is in my repo: https://github.com/searchivarius/AccurateLuceneBM25
>> >>> >> >
>> >>> >> > Many thanks!
>> >>> >> >
>> >>> >> > ---
>> >>> >> > Leo
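For readers unfamiliar with the "tableization"/memoization mentioned above, here is a rough sketch of the idea. This is not the actual Lucene source: it only illustrates that a one-byte norm admits a 256-entry per-query cache of the BM25 length-normalization term, while an exact (long) length leaves no small table to index into. The names approxLengths, normByte, and the placeholder decoder are made up for the illustration.

    public class Bm25CacheSketch {
      public static void main(String[] args) {
        float k1 = 1.2f, b = 0.75f;
        float avgDocLen = 250f;

        // Pretend table of decoded lengths: with a one-byte norm there are only
        // 256 possible values, so whatever the decoder is, it has 256 outputs.
        float[] approxLengths = new float[256];
        for (int i = 0; i < approxLengths.length; i++) {
          approxLengths[i] = i + 1;  // made-up decoder; real decoders differ
        }

        // Memoization step: precompute k1 * (1 - b + b * docLen / avgDocLen)
        // once per query, one entry per possible norm byte.
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
          cache[i] = k1 * ((1 - b) + b * approxLengths[i] / avgDocLen);
        }

        // Per-hit scoring then needs only a table lookup instead of recomputing
        // the length-normalization term for every document.
        int normByte = 42;   // the stored one-byte norm of some document
        float freq = 3f;     // term frequency in that document
        float idf = 2.2f;    // pretend IDF weight
        float score = idf * freq * (k1 + 1) / (freq + cache[normByte & 0xFF]);
        System.out.println("score = " + score);
      }
    }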
>> >>> > ---
>> >>> > Leo
>> >>>
>> >>> --
>> >>> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> >>> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>> >>
>> >> ---
>> >> Leo
>> >
>> > --
>> > Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
>> > LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
>
> --
> Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
> LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
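One practical note for anyone reproducing the comparison: whichever BM25 variant is used (the stock one or an exact-length modification such as the one in the repository linked above), it has to be installed both at index time and at search time, because the similarity's computeNorm() runs while indexing and the scorer must decode the norms it expects. A minimal sketch, assuming stock Lucene 6.x APIs and using the built-in BM25Similarity as a stand-in for the modified class:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.similarities.BM25Similarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class SimilarityWiring {
      public static void main(String[] args) throws Exception {
        // Stand-in for the modified similarity; an exact-length variant
        // would be instantiated here instead.
        Similarity sim = new BM25Similarity(1.2f, 0.75f);

        // Index time: computeNorm() decides what gets written into the norms,
        // so the similarity must be set on the IndexWriterConfig.
        Directory dir = FSDirectory.open(Paths.get("test-index"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setSimilarity(sim);
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
          // ... add documents here ...
        }

        // Search time: set the same similarity on the IndexSearcher,
        // e.g. searcher.setSimilarity(sim).
      }
    }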
