Re: [sqlite] LIKE operator and collations

2019-02-15 Thread Shawn Wagner
Look into using the ICU extension. If you're compiling sqlite yourself,
just define SQLITE_ENABLE_ICU to 1 (And link with the ICU libraries),
otherwise you'll have to grab the source and compile it as a loadable
module.

https://www3.sqlite.org/cgi/src/dir?ci=03c4f00317233a34=ext/icu for
details. (Is there a way to link to the latest version of a file in the
repository and not a particular commit?)

On Fri, Feb 15, 2019 at 6:06 AM Aydin Ozgur Yagmur 
wrote:

> Hello,
>
> I want to use custom collations for "like" and "not equals" queries.
> *select * from tbl_internal where col_internal like 'ç%' collate
> TURKISH_CI;*
>
> it is ok for "equals" operator.
> *select * from tbl_internal where col_internal = 'çç' collate TURKISH_CI;*
>
> but not ok, for "like" and "not equals" operators
>
> How can i make a search like these? Could you give me any clues?
>
> Thanks for advance,
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2019-02-15 Thread Simon Slavin
You can write your own LIKE function and use that:



If you have the source for "collate TURKISH_CI" then you might be able to use 
it in your own function.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Constantine Yannakopoulos
On Wed, May 14, 2014 at 1:35 PM, Jan Slodicka  wrote:

> Simon Slavin-3 wrote
> > On 13 May 2014, at 5:21pm, Constantine Yannakopoulos wrote:
> >
> >> ​This is very interesting Jan. The only way this could fail is if the
> >> collation implementation does something funny if it encounters this
> >> character​, e.g. choose to ignore it when comparing.
> >
> > That cuts out a very large number of collations.  The solution works fine
> > for any collation which orders strings according to Unicode order.  But
> > the point of creating a correlation is that you don't want that order.
> >
> > Simon.
>
> Simon, I think that the most frequent point of making a collation is to get
> the Unicode order. At the bare minimum adding LIKE optimization to the ICU
> Sqlite extension would make sense, the savings are really huge.
>

There could be a flag in sqlite3_create_collation_v2()'s TextRep argument,
much like the flag SQLITE_DETERMINISTIC of sqlite3_create_function() that
will flag the collation as a "unicode text" collation. If this flag is set,
the engine can perform the LIKE optimization for these collations using the
U+10FFFD idea to construct an upper limit for the range as it has been
described in previous posts. Since this flag is not present in existing
calls to sqlite3_create_collation_v2() the change will be
backward-compatible.

Either this or the already mentioned idea of giving the ability to manually
specify lower and upper bounds for the LIKE optimization, perhaps by means
of a callback in a hypothetical sqlite3_create_collation_v3() variant.

And by the way, "unicode text" collations include all "strange" collations
like the one of accent insensitivity and mixed codepage I described in my
original post. And I would expect these to be about 95% of all custom coded
collations.

--Constantine
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Jan Slodicka
Simon Slavin-3 wrote
> On 13 May 2014, at 5:21pm, Constantine Yannakopoulos wrote:
> 
>> ​This is very interesting Jan. The only way this could fail is if the
>> collation implementation does something funny if it encounters this
>> character​, e.g. choose to ignore it when comparing.
> 
> That cuts out a very large number of collations.  The solution works fine
> for any collation which orders strings according to Unicode order.  But
> the point of creating a correlation is that you don't want that order. 
> 
> Simon.

Simon, I think that the most frequent point of making a collation is to get
the Unicode order. At the bare minimum adding LIKE optimization to the ICU
Sqlite extension would make sense, the savings are really huge.





--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75667.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Clemens Ladisch
Hick Gunter wrote:
> I was under the impression you wanted to achieve this:
>
> > select hex('abc' || X'10FFFD');
> 61626310FFFD
> > select length('abc' || X'10FFFD');
> 6

If you want to create characters through a blob, you have to use
the correct UTF-8 encoding:

 sqlite> select quote(cast(char(1114109) as blob));
 X'F48FBFBD'
 sqlite> select unicode(x'F48FBFBD');
 1114109
 sqlite> select hex('abc' || x'F48FBFBD');
 616263F48FBFBD
 sqlite> select length('abc' || x'F48FBFBD');
 4


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Jan Slodicka
Dominique Devienne wrote
> On Tue, May 13, 2014 at 5:50 PM, Jan Slodicka 

> jano@

>  wrote:
>> So one could replace "LIKE 'xxx%'" by "BETWEEN('xxx', 'xxx' +
>> '\uDBFF\uDFFD').
> 
> make that
> 
> BETWEEN('xxx', 'xxx' + char(1114109))
> 
> I don't think SQlite supports \u literals, nor does it support hex
> literals, so must use the decimal equivalent to U+10FFFD.

Dominique,

I just wanted to sketch the idea and did not care of the syntax. I am going
to apply this differently, anyway:-)




--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75665.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Hick Gunter
I was under the impression you wanted to achieve this:

asql> select hex('abc' || X'10FFFD');
hex('abc' || X'10FFFD')
---
61626310FFFD
asql> select length('abc' || X'10FFFD');
length('abc' || X'10FFFD')
--
6
asql> select typeof('abc' || X'10FFFD');
typeof('abc' || X'10FFFD')
--
text


-Ursprüngliche Nachricht-
Von: Dominique Devienne [mailto:ddevie...@gmail.com]
Gesendet: Mittwoch, 14. Mai 2014 09:28
An: General Discussion of SQLite Database
Betreff: Re: [sqlite] LIKE operator and collations

On Wed, May 14, 2014 at 8:30 AM, Hick Gunter <h...@scigames.at> wrote:
> Actually SQLite does support X'...' literals for creating blobs.

Note sure how that's relevant Hick. We don't need a blob, but a integer for 
char(). I was obviously talking about *number* literals (prefixed with 0b, 0, 
0x for binary / octal / hexa), not blob literals.

I'd be +1 to having char() also accept blobs (1 to 3 bytes long) instead of 
integer, since x'10FFFD' is closer to U+10FFFD than 1114109 is.

But Dr. Hipp would need to agree with that. Until then, 1114109 it must be. --DD

C:\Users\DDevienne>sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12 Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select char(x'10FFFD');

sqlite> select typeof(char(x'10FFFD'));
text
sqlite> select length(char(x'10FFFD'));
0
sqlite>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


---
Gunter Hick
Software Engineer

Scientific Games International GmbH
Klitschgasse 2 – 4, A - 1130 Vienna,
Austria
FN 157284 a, HG Wien
Tel: +43 1 80100 0
E-Mail: h...@scigames.at

This e-mail is confidential and may well also be legally privileged. If you 
have received it in error, you are on notice as to its status and accordingly 
please notify us immediately by reply e-mail and then
delete this message from your system. Please do not copy it or use it for any 
purposes, or disclose its contents to any person as to do so could be a breach 
of confidence. Thank you for your cooperation.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Dominique Devienne
On Wed, May 14, 2014 at 8:30 AM, Hick Gunter  wrote:
> Actually SQLite does support X'...' literals for creating blobs.

Note sure how that's relevant Hick. We don't need a blob, but a
integer for char(). I was obviously talking about *number* literals
(prefixed with 0b, 0, 0x for binary / octal / hexa), not blob
literals.

I'd be +1 to having char() also accept blobs (1 to 3 bytes long)
instead of integer, since x'10FFFD' is closer to U+10FFFD than 1114109
is.

But Dr. Hipp would need to agree with that. Until then, 1114109 it must be. --DD

C:\Users\DDevienne>sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select char(x'10FFFD');

sqlite> select typeof(char(x'10FFFD'));
text
sqlite> select length(char(x'10FFFD'));
0
sqlite>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-14 Thread Hick Gunter
Actually SQLite does support X'...' literals for creating blobs.

-Ursprüngliche Nachricht-
Von: Dominique Devienne [mailto:ddevie...@gmail.com]
Gesendet: Dienstag, 13. Mai 2014 18:19
An: General Discussion of SQLite Database
Betreff: Re: [sqlite] LIKE operator and collations

On Tue, May 13, 2014 at 5:50 PM, Jan Slodicka <j...@resco.net> wrote:
> So one could replace "LIKE 'xxx%'" by "BETWEEN('xxx', 'xxx' + '\uDBFF\uDFFD').

make that

BETWEEN('xxx', 'xxx' + char(1114109))

I don't think SQlite supports \u literals, nor does it support hex 
literals, so must use the decimal equivalent to U+10FFFD.

 C:\Users\DDevienne>sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12 Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select '\uDBFF\uDFFD';
\uDBFF\uDFFD
sqlite> select char(1114109);
􏿽
sqlite> select char(0x10FFFD);
Error: unrecognized token: "0x10FFFD"
sqlite> select typeof(char(1114109));
text
sqlite> select length(char(1114109));
1
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


---
Gunter Hick
Software Engineer

Scientific Games International GmbH
Klitschgasse 2 – 4, A - 1130 Vienna,
Austria
FN 157284 a, HG Wien
Tel: +43 1 80100 0
E-Mail: h...@scigames.at

This e-mail is confidential and may well also be legally privileged. If you 
have received it in error, you are on notice as to its status and accordingly 
please notify us immediately by reply e-mail and then
delete this message from your system. Please do not copy it or use it for any 
purposes, or disclose its contents to any person as to do so could be a breach 
of confidence. Thank you for your cooperation.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-13 Thread Simon Slavin

On 13 May 2014, at 5:21pm, Constantine Yannakopoulos wrote:

> ​This is very interesting Jan. The only way this could fail is if the
> collation implementation does something funny if it encounters this
> character​, e.g. choose to ignore it when comparing.

That cuts out a very large number of collations.  The solution works fine for 
any collation which orders strings according to Unicode order.  But the point 
of creating a correlation is that you don't want that order.  For instance, I 
have a correlation which is used for IP addresses and expects its input to look 
like

d.d.d.d

where each 'd' is one or more digits.  If you hand it a string which doesn't 
conform (octet missing, or octet > 255), that string evaluates the same as 
'0.0.0.0'.  If I was to use

... WHERE source LIKE '10.%'

and the char(1114109) solution was used I'd get incorrect results whereas

... WHERE source BETWEEN '10.0.0.0' AND '10.255.255.255'

works fine.

There isn't a solution to the problem unless SQLite has further information 
from whoever devised the collation.

Simon.

Refs: 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-13 Thread Dominique Devienne
On Tue, May 13, 2014 at 5:50 PM, Jan Slodicka  wrote:
> So one could replace "LIKE 'xxx%'" by "BETWEEN('xxx', 'xxx' + '\uDBFF\uDFFD').

make that

BETWEEN('xxx', 'xxx' + char(1114109))

I don't think SQlite supports \u literals, nor does it support hex
literals, so must use the decimal equivalent to U+10FFFD.

 C:\Users\DDevienne>sqlite3
SQLite version 3.8.4.3 2014-04-03 16:53:12
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select '\uDBFF\uDFFD';
\uDBFF\uDFFD
sqlite> select char(1114109);
􏿽
sqlite> select char(0x10FFFD);
Error: unrecognized token: "0x10FFFD"
sqlite> select typeof(char(1114109));
text
sqlite> select length(char(1114109));
1
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-13 Thread Constantine Yannakopoulos
On Tue, May 13, 2014 at 6:50 PM, Jan Slodicka  wrote:

> Any comments are welcome.


​This is very interesting Jan. The only way this could fail is if the
collation implementation does something funny if it encounters this
character​, e.g. choose to ignore it when comparing. Since most collations
end up calling the OS unicode API, and this handles U+10FFFD correctly,
this should be a very rare case.

This may be a reason for Dr Hipp to reject adding this to the LIKE
optimization, but anyone could choose to make the optimization manually
using BETWEEN instead of LIKE if they are sure that any custom collations
they have coded and are using handle U+10FFFD correctly. This is obviously
true in your case.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-13 Thread Jan Slodicka
Constantine Yannakopoulos wrote
> On Mon, May 12, 2014 at 4:46 PM, Jan Slodicka 

> jano@

>  wrote:
> I understand that it is difficult to find the least greater character of a
> given character if you are unaware of the inner workings of a collation,
> but maybe finding a consistent upper limit for all characters in all
> possible collations is not impossible?
> 
> -- Constantine
> ___
> sqlite-users mailing list

> sqlite-users@

> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

I could be wrong when I refused this idea. U+10FFFD (represented as
surrogate pair "\uDBFF\uDFFD") may be that character. So one could replace
"LIKE 'xxx%'" by "BETWEEN('xxx', 'xxx' + '\uDBFF\uDFFD').

U+10FFFD is the highest codepoint from the Unicode Private Use Area (PUA).
Unicode standards do not specify anything for this area. So vendors and/or
apps (MS Word for example) can define here their own characters incl. their
properties (font, sorting etc.). I saw somewhere that somebody placed here
Japanese Kanji characters, for example.

msdn statement for CompareString: "All characters in the PUA are sorted
after all other Unicode characters. Within the area, characters are sorted
in numerical order." Hence Windows is OK.

I could not find any such notion for UCI library, but when I tested their
interactive demo, character U+10FFFD was really sorted at the end. (Even
with Japanese, Hindi etc.) That might mean that at least the standard setup
of the UCI library sorts U+10FFFD in the expected way. Of course, this is no
proof - UCI library can be customized.

After all, it seems that U+10FFFD is a good choice, at least for a vast
majority of cases.

BTW, my tests showed that the speed improvement of the LIKE optimization is
in the range 10x-100x.

Any comments are welcome.



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75643.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-13 Thread Jan Slodicka
> ​It can be implemented if the definition of a collation is extended to be 
able to provide this information as Simon suggested.

Sure, this could be done. But I am not sure whether it would be that
usefull. For example I would not use it for my current application (C#).



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75638.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Constantine Yannakopoulos
On Mon, May 12, 2014 at 6:45 PM, Simon Slavin  wrote:

> They're not the same, Jan.  This optimization is for one very specific use
> of LIKE: where the match string has no wildcards except for a percent sign
> at the end


In theory a LIKE can be optimized if it has a fixed part at the beginning​
and any combination of wildcards after that. The fixed part is used to
define a range in an index (if there is one) and then the wildcard part -if
not a plain '%' taht defines 'everything'- is applied as a filter to each
record in the range to narrow down the result. Now, whether this is to be
preferred instead of a full table scan should be a job for the query
planner.

--Constantine
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Simon Slavin

On 12 May 2014, at 4:05pm, Jan Slodicka wrote:

> And whether we'll call it "LIKE optimization" or "using BETWEEN", the
> problem remains basically the same.

They're not the same, Jan.  This optimization is for one very specific use of 
LIKE: where the match string has no wildcards except for a percent sign at the 
end.  I don't think it's out of order to say "Wildcard matching is wildcard 
matching, which is slow.  If you want fast, use BETWEEN.".

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Constantine Yannakopoulos
On Mon, May 12, 2014 at 5:41 PM, Jan Slodicka  wrote:

> I think that the answer is as before: The LIKE optimization cannot be
> performed by the DB engine, but only by the collation author.
>

​It can be implemented if the definition of a collation is extended to be
able to provide this information as Simon suggested. For example, by
defining a second callback that provides the lower and upper bounds that
enclose the results of a LIKE prefix string. If this callback is not
implemented or if it returns null strings then no optimization should be
attempted.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Constantine Yannakopoulos
On Mon, May 12, 2014 at 5:22 PM, Simon Slavin  wrote:

> agree: it's not possible to deduce an optimization without understanding
> the collation.  It might be possible to introduce it in a future version of
> SQLite by requiring anyone who writes an optimization to supply routines
> which work out the proper strings for comparison.
>

​Indeed. And if these routines are not implemented (e.g. for Japanese that
do not have ordering)​ then no LIKE optimization should be attempted by the
engine.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Jan Slodicka
> However, I don't think it's necessary to solve this problem.  Just don't
try to optimize it.  Whoever is doing the programming knows that pattern
matching is slow.  If they want a fast solution they'll use BETWEEN instead. 
And that will make them have to provide their own comparison strings.
 
It looks like you are now talking to Dr. Hipp.

Because we - application programmers - will eventually have to do the
optimization; the users will tell us when they are frustrated from the
waiting.

And whether we'll call it "LIKE optimization" or "using BETWEEN", the
problem remains basically the same.




--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75624.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Jan Slodicka
> ​I understand that it is difficult to find the least greater character of a 
given character

If you understand this, then you must admit that it cannot be done by the DB
engine.

Secondly:

I pointed out to the problems when deriving the string "abd" from "abc".
However, you suggest now a different way: extending "abc" to "abcX". Here
are first problems that came to my mind:

- Some languages do not have any official ordering at all - take Japanese,
for example.

- CZE/SVK: Adding 'Z' may lead to "DZ". (specific sorting applies)

- Look at Unicode code charts. What would you select? Many (if not most)
characters with high codes would be probably ignored by the collation. In
such a case your optimization would lead to no results at all.

I think that the answer is as before: The LIKE optimization cannot be
performed by the DB engine, but only by the collation author.



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75623.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Simon Slavin

On 12 May 2014, at 3:05pm, Constantine Yannakopoulos wrote:

> ​I understand that it is difficult to find the least greater character of a
> given character if you are unaware of the inner workings of a collation,
> but maybe finding a consistent upper limit for all characters in all
> possible collations is not impossible?

I agree: it's not possible to deduce an optimization without understanding the 
collation.  It might be possible to introduce it in a future version of SQLite 
by requiring anyone who writes an optimization to supply routines which work 
out the proper strings for comparison.

However, I don't think it's necessary to solve this problem.  Just don't try to 
optimize it.  Whoever is doing the programming knows that pattern matching is 
slow.  If they want a fast solution they'll use BETWEEN instead.  And that will 
make them have to provide their own comparison strings.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Stephan Beal
On Mon, May 12, 2014 at 4:05 PM, Constantine Yannakopoulos <
alfasud...@gmail.com> wrote:

> ​I understand that it is difficult to find the least greater character of a
> given character if you are unaware of the inner workings of a collation,
> but maybe finding a consistent upper limit for all characters in all
> possible collations is not impossible?
>


i don't know if this helps, but i recently ran across a query both in the
Fossil SCM and in SVN (via a post on this list) which does something
similar for paths:

char const * zCollation = fsl_cx_filename_collation(f);
rc = fsl_db_prepare(db, ,
"SELECT id FROM vfile WHERE vid=%"FSL_ID_T_PFMT
" AND (pathname=%Q %s "
"OR (pathname>'%q/' %s AND pathname<'%q0' %s))",
(fsl_id_t)vid,
zName, zCollation, zName,
zCollation, zName, zCollation )

that zCollation part resolves to either an empty string or "COLLATE
nocase," depending on application-level settings. zPath is a filename or
dir name. the 'vid' part there is not relevant for you, but the pathname
conditions sound similar to what you are trying to achieve.


-- 
- stephan beal
http://wanderinghorse.net/home/stephan/
http://gplus.to/sgbeal
"Freedom is sloppy. But since tyranny's the only guaranteed byproduct of
those who insist on a perfect world, freedom will have to do." -- Bigby Wolf
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Jan Slodicka
Hi Clemens,

I know that link very well. I answered Constantine's statement who claimed
that Sqlite implementation could be less restrictive. I just tried to
explain why it is not possible.

Jan



--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75621.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Constantine Yannakopoulos
On Mon, May 12, 2014 at 4:46 PM, Jan Slodicka  wrote:

> Sqlite LIKE optimization is described as follows:
> A like pattern of the form "x LIKE 'abc%'" is changed into constraints
> "x>='abc' AND x<'abd' AND x LIKE 'abc%'"
>

Actually, I would do something ​like:

"x>='abc' AND x<'ab
​c​
'
​ || ​
AND x LIKE 'abc%'"

​where  is a string with a single character that is
guaranteed to be greater than -and NOT equal to- any other character. For
instance, if the encoding was single-byte ANSI​ it would be something like:

"x>='abc' AND x<'ab
​c​
'
​ || Char(255)
AND x LIKE 'abc%'"

​I understand that it is difficult to find the least greater character of a
given character if you are unaware of the inner workings of a collation,
but maybe finding a consistent upper limit for all characters in all
possible collations is not impossible?

-- Constantine
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Clemens Ladisch
Jan Slodicka wrote:
> Sqlite LIKE optimization is described as follows:
> A like pattern of the form "x LIKE 'abc%'" is changed into constraints
> "x>='abc' AND x<'abd' AND x LIKE 'abc%'"
>
> If you look into sqlite code, then the string "abd" is generated from "abc"
> using a trivial algebra.
>
> However, this algebra won't work on a custom collation.

 says:
| Terms that are composed of the LIKE or GLOB operator can sometimes be
| used to constrain indices. There are many conditions on this use:
| [...]
| 6. For the LIKE operator, if case_sensitive_like mode is enabled then
|the column must indexed using BINARY collating sequence, or if
|case_sensitive_like mode is disabled then the column must indexed
|using built-in NOCASE collating sequence.


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-05-12 Thread Jan Slodicka
Sqlite LIKE optimization is described as follows:
A like pattern of the form "x LIKE 'abc%'" is changed into constraints
"x>='abc' AND x<'abd' AND x LIKE 'abc%'"

If you look into sqlite code, then the string "abd" is generated from "abc"
using a trivial algebra.

However, this algebra won't work on a custom collation. Right now I am
investigating a similar issue (we use custom case-insensitive collations)
and come to the conclusion that the LIKE optimization cannot be done without
the specific knowledge of the collation used.

In other words, it cannot be done by the Sqlite engine.

Two extreme examples:

CZE/SVK: 
If you replace "x LIKE 'ch%'" by "ch < x AND x < 'ci'", you'll get no
results.

DAN/NOR:
How would you replace "aa%"? (Note this correct sorting: Zorro < Aaron)
Or what about this: "\u0061\u030A"? (Å written with combining diacritics.)

Look  here    for more
problems.





--
View this message in context: 
http://sqlite.1065341.n5.nabble.com/LIKE-operator-and-collations-tp73789p75617.html
Sent from the SQLite mailing list archive at Nabble.com.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-10 Thread Constantine Yannakopoulos
On Sun, Feb 9, 2014 at 8:27 PM, Simon Slavin  wrote:

> I know it's a hack.  But it's an elegant efficient hack that takes
> advantage of the things SQLite does well.  As long as that's the only way
> you were using LIKE.
>

Don't get me wrong, the solution is good. But apart from the specific
problem I also started the thread in order to prove that the implementation
of the LIKE optimization in SqLite is not all it could be. Apart from mixed
languages there are other, less extreme scenarios where a
collation-sensitive like optimization will come in handy. For instance, for
languages with accents it would be nice to be able to create a
case-insensitive accent-insensitive (CI_AI) collation and be able to use
LIKE on it, even if it doesn't use an index. And overloading the LIKE
operator globally is not a good idea because it will affect all LIKE
operations in a database, even in columns that are not CI_AI.

I was hoping to elicit a response from D. R. Hipp but he has chosen not to
respond.

--
Constantine
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Simon Slavin

On 9 Feb 2014, at 6:05pm, Constantine Yannakopoulos  
wrote:

> - You are assuming that 'z' is the higher order character that can appear
> in a value. This is not the case; for instance greek characters have higher
> order than 'z'. This can be fixed (only for latin/greek) by using the
> highest order greek character 'ώ' (accented omega) instead of 'z'; but I
> would prefer a very high-order non-printable one instead.

By all means replace the '' I typed with a string of 200 accented 
lower-case omegas.  Or with (char)0x7F or some other equivalent.  Since (if I 
remember my Greek) it's impossible for a word to contain three omegas in a row, 
I doubt you'll come across any occurrences of it from anyone who is using your 
program realistically.

Hmm.  I assume that some internal part of SQLite would actually be putting this 
string through your collation function.  It would presumably happen 
automatically as part of what SQLite does.

I know it's a hack.  But it's an elegant efficient hack that takes advantage of 
the things SQLite does well.  As long as that's the only way you were using 
LIKE.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Constantine Yannakopoulos
Thank you all for your
replies,

@RSmith:

My question is about the _specific_ case:

SELECT * FROM ATable WHERE AColumn LIKE
'FixedSearchPart' || 'SomeWildCard'

where the right side of the LIKE operator is a constant that has a fixed
part at the beginning and a wildcard after that. The optimization is that
the index is used to produce a rowset in which AColumn starts with
'FixedSearchPart' and then rows are filtered according to 'SomeWildCard'.
For instance, in

SELECT * FROM ATable WHERE AColumn LIKE
'Constantine%Yann%'

As I understand it, the index scan will use the string 'Constantine' as
argument and then the full string 'Constantine%Yann%' will be used to
further filter the rowset. Of course any other case that has no fixed part
at the start of the right-side string will have to fall back to a full scan
and filter.

Maybe the parameter notation was a little confusing but from the
description you can deduce that it will not contain a wildcard.
Essentially, the user will be asking for the rows where AColumn BEGINS WITH
a string.

@Jean-Christophe:

Thank you for the extension. I will certainly have a look at it, but I
already have the comparison algorithm (specific to latin/greek) from
another case. So the only thing I have to do is to tailor it inside a
custom collation.

@Yuriy:

Yes, the current interface for custom functions does not provide any
information on what the arguments are, so it is impossible to deduce the
collation of arguments inside the function body. That's why this has to be
implemented by the SqLite query optimizer itself. I was thinking about
something like:

- The query optimizer understands that the left-side of the LIKE argument
is a column with an index and a collation, and the right side is a literal
with a fixed left part (after parameter substitution).
- It takes the fixed part, appends a very high-order (preferably
non-printable) character to it and use the index to materialize this clause:

  AColumn BETWEEN 'FixedPart' AND 'FixedPart' || .

If we were using ANSI I would suggest something like:

  AColumn BETWEEN 'FixedPart' AND 'FixedPart' || (char)0xFF

but for UTF-8 I am not sure what to suggest. It would have to be a UTF8
character value that is guaranteed to be greater than any other usable
character in any code page. I am not sure whether there actually is one.

-Then it applies the whole right-side argument to this rowset using LIKE
pattern matching to produce the correct result.

@Simon:

Your proposal is very clever but it has two weaknesses:
- You are assuming that 'z' is the higher order character that can appear
in a value. This is not the case; for instance greek characters have higher
order than 'z'. This can be fixed (only for latin/greek) by using the
highest order greek character 'ώ' (accented omega) instead of 'z'; but I
would prefer a very high-order non-printable one instead.
- It assumes that the column has a maximum width. Since SQLite does not
have maximum widths for columns of TEXT affinity (although a max width can
be implied and imposed by the app itself), improbable as it may be, there
could be a row with the value :SearchString||''||'a' which will
erroneously not make it into the result set. Again, the very high-order
non-printable character would solve this.
So it can be a solution for a specific case with a given max value and a
known set of code pages but it cannot be a general solution.

--Constantine.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Simon Slavin

On 9 Feb 2014, at 4:21pm, Yuriy Kaminskiy  wrote:

> Unfortunately, builtin LIKE ignores collation, and if you override LIKE (as in
> ICU extension), sqlite won't use index for optimization.

However, this particular use of LIKE ...

> SELECT * FROM ATable WHERE AColumn LIKE :SearchString || '%'

is really just a call to BETWEEN:

SELECT * FROM ATable WHERE AColumn BETWEEN :SearchString AND 
:SearchString||''

And BETWEEN translates into >= and <=.  And those /will/ use the index.

So implementing a 'greek soundex' function as a collation would be useful, if 
this is the only use you need to make of LIKE.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Yuriy Kaminskiy
Constantine Yannakopoulos wrote:
> I have a case where the user needs to perform a search in a text column of
> a table with many rows. Typically the user enters the first n matching
> characters as a search string and the application issues a SELECT statement
> that uses the LIKE operator with the search string:
> 
> SELECT * FROM ATable WHERE AColumn LIKE :SearchString || '%'
> 
> According to the LIKE optimization this statement will use an index so it
> will be fast.

FWIW, sqlite only use index if pattern is string constant or placeholder, it
won't try to use index optimization with more complex constant expression.
I.e.
   SELECT * FROM ATable WHERE AColumn LIKE :SearchString
and append % to your search string outside of sql.
(Well, it does not matter, as I think your following question does not have
acceptable solution).

> The application is used by Greek users. The greek alphabet has some letters
> that are visually identical to corresponding latin letters when in
> capitals, (e.g. A, E, I, O, B, H, K, etc), some that are different but
> match deterministically to a single latin letter (e.g. Σ = S, Φ = F, Ω = O
> etc.) and some that don't have a deterministic match (e.g. Γ = Gh or Yi).

You may want to look at ICU extension, but then sqlite won't use index.

Another option may be fts extension (likely, with custom tokenizer function).

> The table contains strings that consist of words that can be written in
> either latin or greek characters; sometimes even mixed (the user changed
> input locale midword before typing the first non-common letter). I have a
> request that the search should match strings that are written with either
> latin or greek or mixed letters, e.g. "MIS" should match "MISKO" (all
> latin), "ΜΙΣΚΟ" (all greek) or "MIΣΚΟ" (first two letters latin, rest
> greek). I thought of using a custom collation that does this type of
> comparison, have the column use that collation and create an index on that
> column to speed up the search but I discovered that the LIKE operator
> either will not use collations other than BINARY and NOCASE (pragma
> case_sensitive_like) or (if overloaded to perform custom matching) will not

I think "will not use index" is a bug that was fixed in latest sqlite version;
however, "ignoring collation" is intended behavior, and not easy to change :-(.

> use an index, and, worse yet, its behaviour will be the same to all string
> comparisons regardless of collation. So, a full table scan seems inevitable.
> I was wondering whether it is realistic to ask for the LIKE operator to use
> by default the assigned collation of a column. I assume that an index on

From first look, it won't be easy. Probably impossible without changing current
sqlite interface for user-defined collation/function/etc. Besides, it will break
compatibility with existing code, so would require some new PRAGMA to enable.

> that column is using by default the specified collation of the column for
> comparisons, so a LIKE clause like the aforementioned can use the index and
> perform a fast search while using the "mixed" comparison I need. This would
> transparently solve my problem and make the case_sensitive_like pragma
> redundant, but for backward compatibility this behaviour could be activated
> by a new pragma.
> 
> Are there any details I am missing that prevent this from being implemented?
> 
> Thanks in advance.
> 
> --Constantine.


RSmith wrote:
[...]
> It will be a matter of finding the most acceptable deficit... Whether it
> be size, time waste, upgrade cost etc.  By the way, I don't think
> upgrading the table schemata need to be a real hard thing... some
> scripts can take care of that in minimum amount of time. (Test them
> thoroughly though). Also, another poster here had developed a full set
> of international collations and comparison mechanisms as a loadable
> extension to SQLite - Nunicode by Aleksey Tulinov I think... link here:
>
> https://bitbucket.org/alekseyt/nunicode

Unfortunately, builtin LIKE ignores collation, and if you override LIKE (as in
ICU extension), sqlite won't use index for optimization.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread RSmith


On 2014/02/09 13:18, Constantine Yannakopoulos wrote:

Hello all,

I have a case where the user needs to perform a search in a text column of
a table with many rows. Typically the user enters the first n matching
characters as a search string and the application issues a SELECT statement
that uses the LIKE operator with the search string:

SELECT * FROM ATable WHERE AColumn LIKE :SearchString || '%'

According to the LIKE optimization this statement will use an index so it
will be fast.


This can't be true, there is no way a LIKE or GLOB operator can always use an Index (I mean, it can "use" the Indexed column, but it 
cannot always reduce the number of iterations via Indexing).


To explain more what I mean, consider a simple binary search, let's say we have an ordered list of names that we use the alphabet as 
an indexer:


0:Abraham
1:Ben
2:Constantine
3:Igor
4:James
5:John
6:Ryan
7:Simon

A binary search would start by hitting the middle item typically ( i = Length div 2 ) which is 4 in this case, then comparing it to 
the searched item, let's say it is "JOH" in this case.  It sees that Idx 4 is James, which is smaller than "JOH" (no-case collation 
enabled), then looks to divide the remainder of items larger than James (Idx 5, 6 and 7) in two (3 div 2 = 1 ) and adds it to the 
bottom of them (5) so it checks Index 6, which is now Ryan and is higher than "JOH", it then divides into below Ryan and above James 
and obviously gets "John" which is a match.


A binary tree works similar with the difference it does not have to divide anything, the tree node children are already divisive so 
it just follows down the node closest to the target match until a definite match or matches is/are found (depending on search criteria).


The list above does however demonstrate why a LIKE operator cannot always use an Index, let's say I'm using a search for LIKE '%n', 
how on Earth would you be able to look for that by binary jumping through the list? ANY Indexed item might end on an n, indeed 4 of 
those above do, there is no way to tell and a full-table scan is inevitable.


Of course some clever DB systems, of which SQLite is one, can detect when you use LIKE "Jam%" and knows this is index-searchable and 
still use the Index, but it all depends on what you type and where those % signs are - something which is again negated if the 
search collation does not match the column collation, but is rather easy when standard text or binary collations are used.




store two text columns in the table.  The first is the text as entered.
  The second is your text reduced to its simplified searchable form,
probably all LATIN characters, perhaps using some sort of soundex.  Search
on the second column but return the text in the first.

This allows you to write your conversion routine in the language you're
using to do the rest of your programming, instead of having to express it
as a SQLite function.


Thanks for your reply,


Yes. I thought of that as well. I even have the greek soundex() function
from a previous implementation. Problem is it will bloat the database
considerably, keeping in mind that the users will typically demand that 
ALL
searchable text columns in the application work that way, and who can 
blame
them? And the project manager will not be very keen on accepting both 
this
database size increase and the time needed to calculate the extra 
soundex
column for every database row. It will be much easier to convince this
person to accept the time-costly database upgrade needed in both cases
(tables need to be recreated to change collation) but not both upgrade 
and
bloat.


And I am not happy to accept the fact that I cannot fly, but the laws of the 
Universe demands I adhere to the deficit, and when I simply have to fly, employ 
the help of a very large costly winged tube with jet engines attached to it!

It will be a matter of finding the most acceptable deficit... Whether it be 
size, time waste, upgrade cost etc.  By the way, I don't think upgrading the 
table schemata need to be a real hard thing... some scripts can take care of 
that in minimum amount of time. (Test them thoroughly though). Also, another 
poster here had developed a full set of international collations and comparison 
mechanisms as a loadable extension to SQLite - Nunicode by Aleksey Tulinov I 
think... link here:

https://bitbucket.org/alekseyt/nunicode





___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Constantine Yannakopoulos
On Sun, Feb 9, 2014 at 1:25 PM, Simon Slavin  wrote:

> store two text columns in the table.  The first is the text as entered.
>  The second is your text reduced to its simplified searchable form,
> probably all LATIN characters, perhaps using some sort of soundex.  Search
> on the second column but return the text in the first.
>
> This allows you to write your conversion routine in the language you're
> using to do the rest of your programming, instead of having to express it
> as a SQLite function.
>

Thanks for your reply,


Yes. I thought of that as well. I even have the greek soundex() function
from a previous implementation. Problem is it will bloat the database
considerably, keeping in mind that the users will typically demand that ALL
searchable text columns in the application work that way, and who can blame
them? And the project manager will not be very keen on accepting both this
database size increase and the time needed to calculate the extra soundex
column for every database row. It will be much easier to convince this
person to accept the time-costly database upgrade needed in both cases
(tables need to be recreated to change collation) but not both upgrade and
bloat.

--
Constantine.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] LIKE operator and collations

2014-02-09 Thread Simon Slavin

On 9 Feb 2014, at 11:18am, Constantine Yannakopoulos  
wrote:

> So, a full table scan seems inevitable.

I can't answer the specific question you asked, but I have a suggestion for 
your program: store two text columns in the table.  The first is the text as 
entered.  The second is your text reduced to its simplified searchable form, 
probably all LATIN characters, perhaps using some sort of soundex.  Search on 
the second column but return the text in the first.

This allows you to write your conversion routine in the language you're using 
to do the rest of your programming, instead of having to express it as a SQLite 
function.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users