[sqlite] Query takes 8 seconds on one machine but I abort after 40+ minutes on other machines

R.Smith Tue, 28 Jul 2015 23:43:24 +0200

On 2015-07-28 09:25 PM, rotaiv wrote:
> Not quite.  The older version brought it down to 8 seconds (as compared to
> 5 seconds) but still a whole lot better than 40+ minutes.
>
> It is very interesting to see an index can make such a remarkable
> difference in that particular scenario.  This is my first time creating
> indexes but I will definitely be doing that more in the future.
>
>  From what I read, it was recommended to create indexes for any field used
> in the SELECT statement.  Is that correct?


If you realize how an Index makes things faster, the question above will 
become easy to answer. You should learn more from web sources, but to 
get you rolling, here is the very condensed "Indexing 101" course:

TLDR: Some very basic Index info - please skip if you know Indexing well 
- or - if you are well versed, please note any ideas or caveats I may 
have missed.


I hope you have used one of the old type telephone books - the ones the 
phone company would send out with a long list with everyone's number in 
an area. It's rather easy to use if you know the person's full name and 
family name/surname because it is listed alphabetically by surname and 
shows the first name, address etc.

Imagine for a moment though, your boss for some nefarious reason tasks 
you with getting him a list of all the people in the phone book with the 
name "Jeremy". There is no real quick way to do this, you have to start 
at the start and read every line in the book seeing if perhaps that 
person is a Jeremy. It will take ages. (More than 40 minutes!)

Now imagine some friendly person in the office sneaks you a copy of a 
little black book they have made before, in this book is a list of all 
the first names in alphabetical order along with on which page and line 
in the phone-book they appear. You look into it, go straight to J, look 
for Jeremy, you see 12 Jeremy's and can quickly find them in the large 
phone-book and note their full details. You finish the task in under 5 
minutes and impress the boss.

This little black book is basically an Index and it makes every search 
faster IF you were looking for first names (The thing you were "looking 
for" I will refer to as a "search-term"). So now you have the 
Surname-ordered phone-book (we say the Surname is a Key of the 
phone-book, and the little black book is an Index on Name). Note that 
Surname cannot be a Primary Key because it won't be unique. The only 
value that will probably be completely Unique for every entry in the 
book is the Phone-Number itself, so this might be used in the 
Primary-Key, but isn't important to us when planning search-able Index 
creation, because it will rarely be a search-term).

A typical Schema might look something like this:

CREATE TABLE PhoneBook(
   Surname,  Name,
   City,  Suburb, Street, ApartmentNo,
   PhoneNo,
   PRIMARY KEY (Surname, PhoneNo)
)

and for the little black book:

INDEX ON PhoneBook(Name)

Your boss is however not satisfied that you have suffered enough and so 
tasks you with getting a list of people whom all live in "Church" street 
- and warns you that he will from now on need street-name related lists 
often.  This puts you back to square one, but you immediately realize 
that another little black book which listed all the street-names in 
alphabetical order and then linked to where in the phone-book they 
appear, would make your life much easier. You need an Index on Street, 
compiles the little black book list (which takes a long time) but 
afterward your life is much easier.

INDEX ON PhoneBook(Street)

One more complication happens - your boss reads your first result set, 
and says "no no, I don't want people from Church street in Downtown, I 
only want the people from Church street in Chinatown." The next step is 
to note in your little black book not just the street name, but also the 
Suburb - a combined Index.  Now you are faced with a choice, you can 
list the names as Suburb,Street - or - Street,Suburb. Which to use?

Well, turns out this depends on how you will do the most searching. If 
you will always have a Street-name to search for and sometimes need to 
limit it to a Suburb or two, then  this is better:

INDEX ON PhoneBook(Street, Suburb)

The Index above is also better if you will always have both a 
Street-name and a Suburb to search for because of the fact that there 
are many more Streets than Suburbs and the same Suburb will repeat a lot 
more than the Street - this is called Cardinality. We say the Suburb has 
low Cardinality because it repeats a lot and so there are very few 
Unique Suburb names. Street names have a higher Cardinality (more Unique 
names) and full addresses have very high Cardinality (Almost every 
person lives in a unique home/apartment).

Notice however that if ALL you get to search on is a Suburb, the Index 
above is pretty useless and will result in a full Index-scan which the 
query optimizer will probably forgo and cause a full Table-scan. 
(reading through the entire PhoneBook again is easier than reading an 
entire Index AND having to look up the PhoneBook entry for every item in 
the Index.)
If then sometimes you will have just the Suburb to search for, another 
little black book is needed:

INDEX ON PhoneBook(Suburb)

Another thing to note is that - You may think "An Index on Suburb is 
such a waste due to low Cardinality, how about I add the StreetName to 
this Index?" to make:

INDEX ON PhoneBook(Suburb, Street)

It's a great idea, but because we KNOW the cardinality of Street names 
will always be much higher than Suburbs, we would probably always grab 
the previous little black book when we have both Street and Suburb as 
search terms. However, in a next project, where the data isn't clear on 
Cardinality and we don't know before-hand, the best is to make both the 
Indices (or "Indexes" if you will) and let whomever will be doing the 
searching decide which little black book they need to use. Note however 
that there is a performance hit for having many Indices when Updating or 
Inserting - because simply the DB engine has to go through all the 
little black books and note the new information and sometimes reorganize 
and append structures to do so.

The general rule-of-thumb is if you have a DB that will have many many 
Updates/Inserts and fewer searches (like a logger): use the least 
possible Indices. If you have a DB that will very rarely be updated but 
support a LOT of searches (like a Phone-book): use many. For everything 
in-between, the best results will probably need a bit of experimentation.

In most SQL engines there is a function like ANALYZE which basically 
goes through the table and, for every Index that you have made, notes 
the Cardinality of the components so that the Query optimizer can better 
decide which is  the best to use for any given set of search-terms.

One final note: Imagine your first little black book had not only the 
Names of all the people in the Phone-book, but also included the rest of 
the detail:

INDEX ON PhoneBook(Name, Surname, Street, Suburb, City, ApartmentNo, 
PhoneNo)

then you would never need to even consult the Phone-book again, you can 
get all you need right from the Index! This is referred to as a 
"Covering Index" and quite a useful tool to make searching very fast. 
Note the order of the Indexed fields is simply what I feel are 
most-likely-to-be-searched-for (on the left) down to 
least-likely-to-be-searched-for.
Another useful Covering-Index might be:

INDEX ON PhoneBook(Surname, Name, Suburb, Street, City, ApartmentNo, 
PhoneNo)

If you know what will be the likely search terms, then a couple or more 
covering Indices will make searches very fast - but beware: they are 
really expensive to update, eats a lot of disk space, and adding really 
large data fields to them should be avoided.


>
> SELECT home.fpath
> FROM home
> LEFT JOIN work ON work.fpath = home.fpath
> WHERE work.fpath IS NULL
> AND home.ftype = 'f';
>
> What indexes should I create?  Seems like I only need an index on
> home.fpath  and work.fpath but I wanted to make sure.

Yes, these are all you need:
INDEX ON home(fpath);
INDEX ON work(fpath);

Or better yet, if your schema is the same for both tables, something 
like this:

CREATE TABLE %1 (
   fpath TEXT,
   ffile TEXT,
   ftype TEXT
   PRIMARY KEY (fpath, ffile)
);

[sqlite] Query takes 8 seconds on one machine but I abort after 40+ minutes on other machines

Reply via email to