On 2014/02/09 13:18, Constantine Yannakopoulos wrote:
Hello all,

I have a case where the user needs to perform a search in a text column of
a table with many rows. Typically the user enters the first n matching
characters as a search string and the application issues a SELECT statement
that uses the LIKE operator with the search string:

SELECT * FROM ATable WHERE AColumn LIKE :SearchString || '%'

According to the LIKE optimization this statement will use an index so it
will be fast.

This can't be true, there is no way a LIKE or GLOB operator can always use an Index (I mean, it can "use" the Indexed column, but it cannot always reduce the number of iterations via Indexing).

To explain more what I mean, consider a simple binary search, let's say we have an ordered list of names that we use the alphabet as an indexer:

0:Abraham
1:Ben
2:Constantine
3:Igor
4:James
5:John
6:Ryan
7:Simon

A binary search would start by hitting the middle item typically ( i = Length div 2 ) which is 4 in this case, then comparing it to the searched item, let's say it is "JOH" in this case. It sees that Idx 4 is James, which is smaller than "JOH" (no-case collation enabled), then looks to divide the remainder of items larger than James (Idx 5, 6 and 7) in two (3 div 2 = 1 ) and adds it to the bottom of them (5) so it checks Index 6, which is now Ryan and is higher than "JOH", it then divides into below Ryan and above James and obviously gets "John" which is a match.

A binary tree works similar with the difference it does not have to divide anything, the tree node children are already divisive so it just follows down the node closest to the target match until a definite match or matches is/are found (depending on search criteria).

The list above does however demonstrate why a LIKE operator cannot always use an Index, let's say I'm using a search for LIKE '%n', how on Earth would you be able to look for that by binary jumping through the list? ANY Indexed item might end on an n, indeed 4 of those above do, there is no way to tell and a full-table scan is inevitable.

Of course some clever DB systems, of which SQLite is one, can detect when you use LIKE "Jam%" and knows this is index-searchable and still use the Index, but it all depends on what you type and where those % signs are - something which is again negated if the search collation does not match the column collation, but is rather easy when standard text or binary collations are used.


store two text columns in the table.  The first is the text as entered.
  The second is your text reduced to its simplified searchable form,
probably all LATIN characters, perhaps using some sort of soundex.  Search
on the second column but return the text in the first.

This allows you to write your conversion routine in the language you're
using to do the rest of your programming, instead of having to express it
as a SQLite function.

        Thanks for your reply,


        Yes. I thought of that as well. I even have the greek soundex() function
        from a previous implementation. Problem is it will bloat the database
        considerably, keeping in mind that the users will typically demand that 
ALL
        searchable text columns in the application work that way, and who can 
blame
        them? And the project manager will not be very keen on accepting both 
this
        database size increase and the time needed to calculate the extra 
soundex
        column for every database row. It will be much easier to convince this
        person to accept the time-costly database upgrade needed in both cases
        (tables need to be recreated to change collation) but not both upgrade 
and
        bloat.


And I am not happy to accept the fact that I cannot fly, but the laws of the 
Universe demands I adhere to the deficit, and when I simply have to fly, employ 
the help of a very large costly winged tube with jet engines attached to it!

It will be a matter of finding the most acceptable deficit... Whether it be 
size, time waste, upgrade cost etc.  By the way, I don't think upgrading the 
table schemata need to be a real hard thing... some scripts can take care of 
that in minimum amount of time. (Test them thoroughly though). Also, another 
poster here had developed a full set of international collations and comparison 
mechanisms as a loadable extension to SQLite - Nunicode by Aleksey Tulinov I 
think... link here:

https://bitbucket.org/alekseyt/nunicode





_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to