- There are three searches logged for each "1999" and "2000" while I
   only searched once.

As this is the live search, the search action isn't triggered when you hit the enter key, but on a timer. This means a search can be triggered more than once under certain circumstances, which is why performance is crucial for this search action. See below.

Why do some searches get a "w10"?

The full text index stores multiple sets of data for every item. They are put in different columns, which are later weighted by their importance: w10 has the highest weight, w1 the lowest.

And here comes the lengthy explanation of what I've found investigating this very special case. Most likely (depending on your music collection) this really is a rather special case:

- the keyword is very precise: you know what you expect
- the keyword is very short
- the keyword likely is very popular

Now that popularity thing might be a bit irritating. You probably only have one single track with this name. Why would it be popular? Because we're dealing with a full text index, covering not only titles, but lots of other pieces of information, too: e.g. years, file paths, comments, even MusicBrainz IDs.

Digging into the 99 case in my collection, I found a lot of these:

Comment: ExactAudioCopy v0.99pb5

Yep. Or something like this:

UFID: [ http://musicbrainz.org, ebe13618-bbdd-4ef3-9a91-9981602e528f ]

That -9981602e528f at the end would match, too, as our search term is at the start of that "word".
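To illustrate why these obscure values match, here is a rough sketch of prefix matching on tokens. The tokenizer below (split on anything that isn't a letter or digit) is an assumption for illustration only, not the actual tokenizer of the full text index, and the function names are made up.

```python
import re

def tokens(text):
    # Assumption: split on every non-alphanumeric character and lowercase,
    # roughly what a simple full text tokenizer would do.
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]

def prefix_hits(term, text):
    # A token counts as a hit if it starts with the search term.
    return [t for t in tokens(text) if t.startswith(term.lower())]

samples = {
    "title": "99 Luftballons",
    "comment": "ExactAudioCopy v0.99pb5",
    "ufid": "http://musicbrainz.org, ebe13618-bbdd-4ef3-9a91-9981602e528f",
}

for field, value in samples.items():
    print(field, prefix_hits("99", value))
# title ['99']
# comment ['99pb5']
# ufid ['9981602e528f']
```

All three fields produce a hit for "99", even though only the title is what you were actually looking for.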

That would explain the popularity of the search term. But why would an obvious hit not show up, while some obscure, hidden data wins?

Now this is getting complicated. Many factors play a role: optimization for speed (which might penalize this particular case), the nature of full text search indexing not only the obvious data, but anything. And some poor, deliberate choices. And bugs. Wow. Searching for "99" brought quite a few issues to the light of day :-).

So there's some optimization going on because the search needs to be fast. One of these optimizations is to try to limit the result set when we risk dealing with a large number of hits, e.g. with short search terms or single terms. In this case we're limiting the results to hits in the highest priority column only (which explains the "w10:99").

If we know that we are still dealing with a large result set (>500 items found), the current implementation would only pick the top 500 items. And that's where I would say there is/was a bug: we pick the top items out of an unordered list... which means that even if the score of "99 Luftballons" was high, it would be cut off if it was far down the "randomly" ordered result list.
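A toy version of that cut-off problem, with made-up data and a limit of 3 instead of 500 to keep it readable. This is only a sketch of the general pitfall, not the plugin's actual code:

```python
import heapq

LIMIT = 3  # the real limit mentioned above is 500

# (item, score) pairs in arbitrary storage order; names and scores are invented
hits = [
    ("EAC comment A", 1),
    ("EAC comment B", 1),
    ("EAC comment C", 1),
    ("99 Luftballons", 10),
    ("EAC comment D", 1),
]

# Cutting first: we keep whatever happens to come first, the best hit is lost.
cut_first = hits[:LIMIT]

# Ranking first: pick the LIMIT highest scoring items, then cut.
ranked = heapq.nlargest(LIMIT, hits, key=lambda hit: hit[1])

print([name for name, _ in cut_first])
print([name for name, _ in ranked])
```

With the first approach "99 Luftballons" never makes it into the list, no matter how high its score is. With the second it comes out on top.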

When the search is run, it weighs the results based on the aforementioned columns. If Nena's album had one track called "99 Luftballons", but another album had ten tracks with the EAC version string in the comment, the latter might outweigh Nena, because the track title on an album has weight 5, but the comments add up to 10x weight 1.
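The arithmetic behind that example, as a small sketch. The weights (title 5, comment 1) are the ones mentioned above; the assumption that an album's score is a plain sum over all matching fields is mine, made for illustration, and the names are hypothetical.

```python
# Weights from the example above: a title hit counts 5, a comment hit counts 1.
WEIGHTS = {"title": 5, "comment": 1}

def album_score(hit_counts):
    # Assumption: the album score is the sum of weight * number of hits per field.
    return sum(WEIGHTS[field] * count for field, count in hit_counts.items())

nena = album_score({"title": 1})      # one track titled "99 Luftballons"
other = album_score({"comment": 10})  # ten tracks with the EAC string in the comment

print(nena, other)  # 5 10
```

Ten low-weight hits beat a single higher-weight hit, which is how the EAC album could end up ranked above Nena.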

This is where a stupid decision kicks in: for whatever reason I decided it was a good idea to put the MusicBrainz IDs in w10. Sure, it's a unique value for every item. But nothing else should contain that value, right? Therefore a search for an ID would always bring up exactly one track, even if the value were stored in the lowest priority column.

New builds are due out in a bit. Unfortunately my shiny new build system still isn't installed in a decent place. Therefore I have to upload from behind this super slow 10Mb connection... So please be patient.

Thanks for an interesting test/edge case! :-)

