Re: [sqlite] Help with sqlite3_value_text

2019-04-16 Thread x
Thanks Keith, that I understand. Thanks also to everyone who contributed to 
this thread. I’ve learned a lot from it.




From: sqlite-users  on behalf of 
Keith Medcalf 
Sent: Monday, April 15, 2019 4:09:02 PM
To: SQLite mailing list
Subject: Re: [sqlite] Help with sqlite3_value_text


sqlite3_value_text and sqlite3_value_bytes will cause the conversion (if 
required) to UTF-8, and then return the data requested.
sqlite3_value_text16 and sqlite3_value_bytes16 will cause the conversion (if 
required) to UTF-16 and then return the data requested.

So if you call sqlite3_value_text AND THEN sqlite3_value_bytes16 your original 
UTF-8 text pointer will be invalid.  (Because the data must be converted to 
UTF-16 so that you can get the bytes count of that, thus the original UTF-8 no 
longer exists).

If you call only "like for like" functions, then the conversion will only be 
carried out the first time it is required and not for the subsequent calls to 
the "other function" that does not require conversion ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of x
>Sent: Monday, 15 April, 2019 04:08
>To: SQLite mailing list
>Subject: Re: [sqlite] Help with sqlite3_value_text
>
>>As long as you use _value_bytes after _text you're fine... so if any
>>conversion did take place the value will be right of the last
>returned
>>string type.
>
>JD, Could you explain that to me? I’m not sure why any conversion
>takes place and, on reading the text below, I would’ve thought it
>would be better to call sqlite3_value_bytes first (if it’s called
>“subsequently” the pointer returned by sqlite3_value_text “can be
>invalidated”).
>
>Please pay particular attention to the fact that the pointer returned
>from sqlite3_value_blob()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or
>sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html> can
>be invalidated by a subsequent call to
>sqlite3_value_bytes()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_bytes16()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or
>sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html>.”
>
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-15 Thread Simon Slavin
I don't know about any of this, but it seems that someone needs to write a 
'Unicode' (or 'Multibyte charaacters') page for the SQLite documentation.

Simon.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-15 Thread Keith Medcalf

sqlite3_value_text and sqlite3_value_bytes will cause the conversion (if 
required) to UTF-8, and then return the data requested.
sqlite3_value_text16 and sqlite3_value_bytes16 will cause the conversion (if 
required) to UTF-16 and then return the data requested.

So if you call sqlite3_value_text AND THEN sqlite3_value_bytes16 your original 
UTF-8 text pointer will be invalid.  (Because the data must be converted to 
UTF-16 so that you can get the bytes count of that, thus the original UTF-8 no 
longer exists).

If you call only "like for like" functions, then the conversion will only be 
carried out the first time it is required and not for the subsequent calls to 
the "other function" that does not require conversion ...

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of x
>Sent: Monday, 15 April, 2019 04:08
>To: SQLite mailing list
>Subject: Re: [sqlite] Help with sqlite3_value_text
>
>>As long as you use _value_bytes after _text you're fine... so if any
>>conversion did take place the value will be right of the last
>returned
>>string type.
>
>JD, Could you explain that to me? I’m not sure why any conversion
>takes place and, on reading the text below, I would’ve thought it
>would be better to call sqlite3_value_bytes first (if it’s called
>“subsequently” the pointer returned by sqlite3_value_text “can be
>invalidated”).
>
>Please pay particular attention to the fact that the pointer returned
>from sqlite3_value_blob()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or
>sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html> can
>be invalidated by a subsequent call to
>sqlite3_value_bytes()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_bytes16()<https://sqlite.org/c3ref/value_blob.html>,
>sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or
>sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html>.”
>
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-15 Thread Clemens Ladisch
x wrote:
>> As long as you use _value_bytes after _text you're fine... so if any
>> conversion did take place the value will be right of the last returned
>> string type.
>
> Could you explain that to me? I’m not sure why any conversion takes place
> and, on reading the text below, I would’ve thought it would be better to
> call sqlite3_value_bytes first

As shown in the table, conversion from TEXT to BLOB does not change anything.
However, conversion from BLOB to TEXT might require appending a zero terminator.


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-15 Thread x
>As long as you use _value_bytes after _text you're fine... so if any
>conversion did take place the value will be right of the last returned
>string type.

JD, Could you explain that to me? I’m not sure why any conversion takes place 
and, on reading the text below, I would’ve thought it would be better to call 
sqlite3_value_bytes first (if it’s called “subsequently” the pointer returned 
by sqlite3_value_text “can be invalidated”).

Please pay particular attention to the fact that the pointer returned from 
sqlite3_value_blob(), 
sqlite3_value_text(), or 
sqlite3_value_text16() can be 
invalidated by a subsequent call to 
sqlite3_value_bytes(), 
sqlite3_value_bytes16(), 
sqlite3_value_text(), or 
sqlite3_value_text16().”


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-14 Thread J Decker
On Sun, Apr 14, 2019 at 5:40 AM x  wrote:

> On second thoughts JD, can’t use strlen or sqlite3_value_bytes in case
> values(1) contains more than a single unicode character. This looks OK.
>
> Bytes are what you need though; it doesn't matter how big the buffer is,
as long as you have all of it.

As long as you use _value_bytes after _text you're fine... so if any
conversion did take place the value will be right of the last returned
string type.


> # define CHARLEN(x) !(x & 128) ? 1 : (x & 16 ? 4 : (x & 32 ? 3 : 2))
>
> char *c = (char *)sqlite3_value_text(values[0]);
> char *Sep = (char *)sqlite3_value_text(values[1]);
> int Count=0, Len, SepLen = CHARLEN(*Sep);
>
> while (*c)
> {
>if ((Len = CHARLEN(*c)) == SepLen && memcmp(c, Sep, Len)==0)
> Count++; // at start of Sep
>c += Len;
> }
> sqlite3_result_int(ctx, Count);
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-14 Thread x
On second thoughts JD, can’t use strlen or sqlite3_value_bytes in case 
values(1) contains more than a single unicode character. This looks OK.

# define CHARLEN(x) !(x & 128) ? 1 : (x & 16 ? 4 : (x & 32 ? 3 : 2))

char *c = (char *)sqlite3_value_text(values[0]);
char *Sep = (char *)sqlite3_value_text(values[1]);
int Count=0, Len, SepLen = CHARLEN(*Sep);

while (*c)
{
   if ((Len = CHARLEN(*c)) == SepLen && memcmp(c, Sep, Len)==0) Count++; // 
at start of Sep
   c += Len;
}
sqlite3_result_int(ctx, Count);

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-14 Thread x


From: J Decker<mailto:d3c...@gmail.com>
Sent: 13 April 2019 20:05
To: SQLite mailing list<mailto:sqlite-users@mailinglists.sqlite.org>
Subject: Re: [sqlite] Help with sqlite3_value_text

>> char *c = (char *)sqlite3_value_text(values[0]);
>> char *Sep = (char *)sqlite3_value_text(values[1]);
>> int Byte1, Count=0, NrBytes, NrSepBytes = strlen(Sep);

>you could use sqlite3_value_bytes(values[1]); instead.

Hi JD, Would I have to worry about this

“Please pay particular attention to the fact that the pointer returned from 
sqlite3_value_blob()<https://sqlite.org/c3ref/value_blob.html>, 
sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or 
sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html> can be 
invalidated by a subsequent call to 
sqlite3_value_bytes()<https://sqlite.org/c3ref/value_blob.html>, 
sqlite3_value_bytes16()<https://sqlite.org/c3ref/value_blob.html>, 
sqlite3_value_text()<https://sqlite.org/c3ref/value_blob.html>, or 
sqlite3_value_text16()<https://sqlite.org/c3ref/value_blob.html>.”

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-13 Thread J Decker
On Sat, Apr 13, 2019 at 12:04 PM x  wrote:

> This seems to work OK as a sqlite function.
>
>
>
> // assume values[0] & [1] are supplied and not null
>
> // find Count of values[1] in values[0]
>
>
>
> char *c = (char *)sqlite3_value_text(values[0]);
>
> char *Sep = (char *)sqlite3_value_text(values[1]);
>
> int Byte1, Count=0, NrBytes, NrSepBytes = strlen(Sep);
>
you could use sqlite3_value_bytes(values[1]); instead.

>
>
>
> while (*c)
>
> {
>
>
>
>Byte1 = (*c) >> 4;
>
>if ((Byte1 & 8) == 0) NrBytes = 1;
>
>else if (Byte1 & 1) NrBytes = 4;
>
>else if (Byte1 & 2) NrBytes = 3;
>
>else NrBytes = 2; // (Byte1 & 4) == 4
>
>
>
>if (NrBytes == NrSepBytes && memcmp(c, Sep, NrBytes) == 0) Count++;
> // at first byte of Sep
>
>c += NrBytes;
>
> }
>
> sqlite3_result_int(ctx, Count);
>
>
>
> ____
> From: sqlite-users  on
> behalf of Scott Robison 
> Sent: Friday, April 12, 2019 8:40:19 PM
> To: SQLite mailing list
> Subject: Re: [sqlite] Help with sqlite3_value_text
>
> On Fri, Apr 12, 2019, 1:06 PM Keith Medcalf  wrote:
>
> >
> > Actually you would have to convert the strings to UCS-4.  UTF-16 is a
> > variable-length encoding.  An actual "unicode character" is (at this
> > present moment in time, though perhaps not tomorrow) 4 bytes (64-bits).
> >
>
> That is some impressive compression! :)
>
> Regardless, even if you use UCS-4, you still have the issue of combining
> characters. Unicode is complex as had been observed.
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-13 Thread x
This seems to work OK as a sqlite function.



// assume values[0] & [1] are supplied and not null

// find Count of values[1] in values[0]



char *c = (char *)sqlite3_value_text(values[0]);

char *Sep = (char *)sqlite3_value_text(values[1]);

int Byte1, Count=0, NrBytes, NrSepBytes = strlen(Sep);



while (*c)

{



   Byte1 = (*c) >> 4;

   if ((Byte1 & 8) == 0) NrBytes = 1;

   else if (Byte1 & 1) NrBytes = 4;

   else if (Byte1 & 2) NrBytes = 3;

   else NrBytes = 2; // (Byte1 & 4) == 4



   if (NrBytes == NrSepBytes && memcmp(c, Sep, NrBytes) == 0) Count++; // 
at first byte of Sep

   c += NrBytes;

}

sqlite3_result_int(ctx, Count);




From: sqlite-users  on behalf of 
Scott Robison 
Sent: Friday, April 12, 2019 8:40:19 PM
To: SQLite mailing list
Subject: Re: [sqlite] Help with sqlite3_value_text

On Fri, Apr 12, 2019, 1:06 PM Keith Medcalf  wrote:

>
> Actually you would have to convert the strings to UCS-4.  UTF-16 is a
> variable-length encoding.  An actual "unicode character" is (at this
> present moment in time, though perhaps not tomorrow) 4 bytes (64-bits).
>

That is some impressive compression! :)

Regardless, even if you use UCS-4, you still have the issue of combining
characters. Unicode is complex as had been observed.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-13 Thread x
Thanks for all the help. Things are much clearer now.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Warren Young
On Apr 12, 2019, at 1:06 PM, Keith Medcalf  wrote:
> 
> Actually you would have to convert the strings to UCS-4.

UTF-32 is the new name of that standard:

   https://en.wikipedia.org/wiki/UTF-32#History

> UTF-16 is a variable-length encoding.

Only if you’re outside the BMP, which is why I restricted my answer that way.

And if you add in combining characters and such, *all* Unicode encodings are 
variable-length.

We’re not going to replace the whole books necessary to fully cover Unicode 
here.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Leland Helgerson


-Original Message-
From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On 
Behalf Of Scott Robison
Sent: Friday, April 12, 2019 2:40 PM
To: SQLite mailing list
Subject: Re: [sqlite] Help with sqlite3_value_text

On Fri, Apr 12, 2019, 1:06 PM Keith Medcalf  wrote:

>
> Actually you would have to convert the strings to UCS-4.  UTF-16 is a 
> variable-length encoding.  An actual "unicode character" is (at this 
> present moment in time, though perhaps not tomorrow) 4 bytes (64-bits).
>

That is some impressive compression! :)

Regardless, even if you use UCS-4, you still have the issue of combining 
characters. Unicode is complex as had been observed.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Scott Robison
On Fri, Apr 12, 2019, 1:06 PM Keith Medcalf  wrote:

>
> Actually you would have to convert the strings to UCS-4.  UTF-16 is a
> variable-length encoding.  An actual "unicode character" is (at this
> present moment in time, though perhaps not tomorrow) 4 bytes (64-bits).
>

That is some impressive compression! :)

Regardless, even if you use UCS-4, you still have the issue of combining
characters. Unicode is complex as had been observed.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Keith Medcalf

Actually you would have to convert the strings to UCS-4.  UTF-16 is a 
variable-length encoding.  An actual "unicode character" is (at this present 
moment in time, though perhaps not tomorrow) 4 bytes (64-bits).

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users [mailto:sqlite-users-
>boun...@mailinglists.sqlite.org] On Behalf Of Warren Young
>Sent: Friday, 12 April, 2019 09:45
>To: SQLite mailing list
>Subject: Re: [sqlite] Help with sqlite3_value_text
>
>On Apr 12, 2019, at 8:51 AM, x  wrote:
>>
>> How do I do the same thing if the string param is a utf-8 or utf-16
>string and the SearchChar is a Unicode character?
>
>Convert the characters to 32-bit wide characters first, then iterate
>over the array of uint32_t or similar.
>
>One method is shown by the SQLite-internal function
>sqlite3Utf8Read().  It’s static in the amalgamation build, but since
>SQLite is public domain, you can just copy that function’s text out
>into your program and use it there or modify it to suit your
>purposes.
>
>Your platform libraries may have UTF-8 to UTF-32 or similar
>mechanisms.
>
>On POSIX platforms, the most common of these is iconv(3).
>
>On Windows, the legacy of UCS-2 and UTF-16 makes this difficult, but
>if you can stick to the Basic Multilingual Plane, converting UTF-8 to
>UCS-2 gives the same effect.  See MultiByteToWideChar(…, CP_UTF8, …):
>
>https://docs.microsoft.com/windows/desktop/api/stringapiset/nf-
>stringapiset-multibytetowidechar
>
>For all platforms, there’s ICU.  That’s of particular interest with
>SQLite since there’s an included ICU extension you can include to get
>more Unicode power in SQLite:
>
>   https://www.sqlite.org/compile.html#enable_icu
>
>There are complications that your questions don’t push into, but
>beware that if you start getting beyond simple “character in string”
>questions, you’ll eventually have to confront them: combining
>characters, etc.
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Richard Damon


> On Apr 12, 2019, at 12:58 PM, x  wrote:
> 
> I’ve been asking myself if I could have done the above more efficiently as 
> sqlite’s converting the original string then I’m converting it and copying 
> it. While thinking about that I started to wonder how c++ handled utf8/16. 
> E.g. To access the i’th character does it have to rattle through all previous 
> I-1 characters to find the start of character i, how pointer arithmetic was 
> handled when pointing to utf8/16 chars etc.
> 

Basically, if you are dealing with a variable width encoding (UTF-8/UTF-16), 
then finding the nth character requires scanning the string counting beginning 
of characters. If this is an important operation, you pay the cost of 
conversion and work in UCS-4. On the other hand, UTF-8 has a lot of nice 
properties such that it can be a fairly seamless upgrade for processing plain 
ASCII text, and if reasonably efficient for typical text. (There are a number 
of complications if you try to support ALL of Unicode, like the composed 
characters, where you use several code-point together to define a single 
character), where you need to decide how you want to normalize and need some 
big character tables for the instructions of how to do this.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread x
Thanks for the replies. There’s plenty for me to look at there.

I’ve been in poor health the last 5 years and after almost a year’s break I’m 
trying to get back into sqlite to preserve my sanity. I’m so rusty my opening 
post is riddled with errors.

I’ve just realised that, before my break, I was dealing with this as follows

#define VALTXT(i) String((wchar_t*)sqlite3_value_text16(values[i]))

and inside the function I’d access the parameter with

String S = VALTXT(0);

The String type is an embarcadero c++ builder typedef which I think is a wide 
string. I then accessed the i’th character using S[I]. The above involves 
copying the parameter to another string.

In an old post I made on this forum someone told me I should be using 
sqlite3_value_text( as that was sqlite’s default storage and would save sqlite 
having to convert it to utf16.

I’ve been asking myself if I could have done the above more efficiently as 
sqlite’s converting the original string then I’m converting it and copying it. 
While thinking about that I started to wonder how c++ handled utf8/16. E.g. To 
access the i’th character does it have to rattle through all previous I-1 
characters to find the start of character i, how pointer arithmetic was handled 
when pointing to utf8/16 chars etc.

___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Warren Young
On Apr 12, 2019, at 8:51 AM, x  wrote:
> 
> How do I do the same thing if the string param is a utf-8 or utf-16 string 
> and the SearchChar is a Unicode character?

Convert the characters to 32-bit wide characters first, then iterate over the 
array of uint32_t or similar.

One method is shown by the SQLite-internal function sqlite3Utf8Read().  It’s 
static in the amalgamation build, but since SQLite is public domain, you can 
just copy that function’s text out into your program and use it there or modify 
it to suit your purposes.

Your platform libraries may have UTF-8 to UTF-32 or similar mechanisms.

On POSIX platforms, the most common of these is iconv(3).

On Windows, the legacy of UCS-2 and UTF-16 makes this difficult, but if you can 
stick to the Basic Multilingual Plane, converting UTF-8 to UCS-2 gives the same 
effect.  See MultiByteToWideChar(…, CP_UTF8, …):


https://docs.microsoft.com/windows/desktop/api/stringapiset/nf-stringapiset-multibytetowidechar

For all platforms, there’s ICU.  That’s of particular interest with SQLite 
since there’s an included ICU extension you can include to get more Unicode 
power in SQLite:

   https://www.sqlite.org/compile.html#enable_icu

There are complications that your questions don’t push into, but beware that if 
you start getting beyond simple “character in string” questions, you’ll 
eventually have to confront them: combining characters, etc.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Shawn Wagner
Welcome to the wonderful world of multibyte encodings, and Unicode in
particular.

Unless you're looking for an ASCII character, you're looking for a
substring, not a character. And you're really looking for what's called a
codepoint (The entire concept of character gets kind of fuzzy with
Unicode). If you're not careful, looking for 'a' (U+0061 LATIN SMALL LETTER
A) will match the start of á, which is actually a two codepoint grapheme
(U+0061 and U+0301 COMBINING ACUTE ACCENT) that renders as a single entity.
And if you're okay with matching that, what about á (U+00E1 LATIN SMALL
LETTER A WITH ACUTE), the single codepoint composed version?

Unicode is hard. There are libraries like ICU and libunistring which help a
bit. I have a bunch of sqlite extensions at
https://github.com/shawnw/useful_sqlite_extensions (That I really need to
polish up for an actual release) including a string library that expands a
lot on the build in ICU extension to make working with graphemes and
unicode in general in sqlite a lot easier.


On Fri, Apr 12, 2019 at 7:51 AM x  wrote:

> I’m still confused by utf strings. For simplicity, suppose I set up an
> sqlite function that takes a single string parameter and I want to scan the
> string to count the number of occurrences of a certain character . If I
> knew the string was made up entirely of ascii chars I’d do this
>
> char *c = _value_text(0)[0];
> int count=0;
> while (*c) if (*c++ == SearchChar) count++;
>
> How do I do the same thing if the string param is a utf-8 or utf-16 string
> and the SearchChar is a Unicode character?
>
> I’m confused by the fact that Unicode characters are not a fixed number of
> bytes so if I do this e.g.
>
> wchar_t *c = (wchar_t*) sqlite3_value_text(0);
>
> does this mean a complete temporary copy of the value of
> sqlite3_value_text(0) has to be constructed by the compiler such that all
> characters of the newly constructed string are fixed width? If so, I’m just
> wanting to check if there’s a way  of avoiding this overhead.
>
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread J Decker
http://utf8everywhere.org/


On Fri, Apr 12, 2019 at 7:51 AM x  wrote:

> I’m still confused by utf strings. For simplicity, suppose I set up an
> sqlite function that takes a single string parameter and I want to scan the
> string to count the number of occurrences of a certain character . If I
> knew the string was made up entirely of ascii chars I’d do this
>
> char *c = _value_text(0)[0];
> int count=0;
> while (*c) if (*c++ == SearchChar) count++;
>
> How do I do the same thing if the string param is a utf-8 or utf-16 string
> and the SearchChar is a Unicode character?
>

You'd decide on the common operating mode of your system for 1... there are
of course middleware libraries that need to cater to both and consume
databases instead of just writing it.


>
> I’m confused by the fact that Unicode characters are not a fixed number of
> bytes so if I do this e.g.
>
> wchar_t *c = (wchar_t*) sqlite3_value_text(0);
>
>
https://www.sqlite.org/c3ref/value_blob.html

const void *sqlite3_value_text16(

returns wchar data, converting if nessecary from utf8 if it was stored as
utf8.

value_text only returns utf8 (char* really, but suffices for storage and
retreival of utf8 phrases)

for utf8 it would just be matter of matching 1,2,3, or 4 bytes (depending
on the codepoint you were matching).



does this mean a complete temporary copy of the value of
> sqlite3_value_text(0) has to be constructed by the compiler such that all
> characters of the newly constructed string are fixed width? If so, I’m just
> wanting to check if there’s a way  of avoiding this overhead.
>
>
If it is converted, a temporary value is allocated by the library (at
runtime, well beyond the compiler), ...


> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Dominique Devienne
On Fri, Apr 12, 2019 at 4:51 PM x  wrote:

> I’m still confused by utf strings. [...  I want to scan the string to
> count the number of occurrences of a certain character. [...]
> How do I do the same thing if the string param is a utf-8 or utf-16 string
> and the SearchChar is a Unicode character?
>
> I’m confused by the fact that Unicode characters are not a fixed number of
> bytes so if I do this e.g.
>
> wchar_t *c = (wchar_t*) sqlite3_value_text(0);


That's very wrong. _text() always returns UTF8. the _text16*() variants
return UTF16.

As to how many bytes a UTF8-encoded "codepoint" takes, it's well documented
on Wikipedia.
Based on the leading bits, one can know unambiguously whether this is the
1st, 2nd, 3rd, or 4th
byte of a 1 to 4 multi-byte sequence.

Even UTF16 can lead to "surrogate pairs" for codepoints beyond the
so-called "CMP".

And that's not even getting into the fact the encoding may not be "unique",
and Unicode "normalization".
This is not an easy subject...

You can play with the char() built-in SQL function to see how different
code point values are encoded in UTF8. --DD
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Help with sqlite3_value_text

2019-04-12 Thread Igor Tandetnik

On 4/12/2019 10:51 AM, x wrote:

I’m still confused by utf strings. For simplicity, suppose I set up an sqlite 
function that takes a single string parameter and I want to scan the string to 
count the number of occurrences of a certain character . If I knew the string 
was made up entirely of ascii chars I’d do this

char *c = _value_text(0)[0];
int count=0;
while (*c) if (*c++ == SearchChar) count++;

How do I do the same thing if the string param is a utf-8 or utf-16 string and 
the SearchChar is a Unicode character?


The problem you need to solve is "count occurrences of a substring in a 
string". The substring in question could consist of one byte representing a single 
ASCII character, or a sequence of bytes comprising a UTF-8 encoding of one Unicode 
character. This really has nothing to do with SQLite.


I’m confused by the fact that Unicode characters are not a fixed number of 
bytes so if I do this e.g.

wchar_t *c = (wchar_t*) sqlite3_value_text(0);


That's just wrong. sqlite3_value_text does *not* return a pointer to a sequence 
of wchar_t. Any attempt to actually use `c` pointer would exhibit undefined 
behavior.


does this mean a complete temporary copy of the value of sqlite3_value_text(0) 
has to be constructed by the compiler such that all characters of the newly 
constructed string are fixed width? If so, I’m just wanting to check if there’s 
a way  of avoiding this overhead.


You seem to ascribe some magical properties to a cast. Nothing is "constructed" by it - 
it simply tells the compiler "take this pointer to a memory block, and believe that it 
contains something different than what the type of the original pointer suggests; trust me, I know 
what I'm doing".

If you prefer UTF-16 encoding over UTF-8, there's sqlite3_value_text16 for that.

If you are unsure what UTF-8 and UTF-16 mean, see

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

--
Igor Tandetnik


___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users