Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Richard Damon

On 1/18/20 3:21 AM, Rocky Ji wrote:

Hi,

I am asked to highlight rows containing strange characters. All data were
ingested by a proprietary crawler.

By strange, I mean, question marks, boxes, little Christmas Trees,  solid
arrows, etc. kind of symbols; these appear suddenly in flow of normal ASCII
English letters.

How do I approach this?

Thanks.


The first thing that you are going to need to find to do this is how the 
data has actually been stored. The strange characters you are describing 
sound like somewhere some data obtained in one character encoding, but 
then interpreted as another.  This is a very old problem, without a 
solution in general except to always know the encoding of your data. It 
goes back to the development of the ASCII character set which is the 
base for most character sets in use today (another major branch is 
EBCDIC, but that has its own issues and you tend to know when you need 
to deal with that).


The A in ASCII is for American, the ASCII character set was designed for 
American (i.e. english) data, and when it when it was developed, memory 
and bandwidth were limited and expensive, so you didn't waste them. 
ASCII was a compact set, using only 7 bits per character, and was fine 
for english information, with the basic alphabet. It became the base 
standard because America was a core part in the initial computer 
development, and had a lot of influence in the initial standards.


While it worked well for American data, it didn't work so well for many 
other countries, so most other countries adopted their own character set 
for use in their country, normally based on ASCII as a base, but adding 
codes to extend the coding to an 8 bit code, keep (at least most of) 
ASCII as the first 128 values.


To exchange data between character machines, you needed to include what 
character set the data was in (or you see some funny words due to the 
mis-match).


Operating systems adopted the concept of Code Pages, which basically 
defined which of the many standard character sets was to be used, and 
some transfer formats actually included as part of the header 
information what character set the data that follows was in. One of 
these was the web pages that were on the Internet.


Later, to try and get out of this mess, a new character encoding was 
invented called Unicode, Unicode initially intended to provide a single 
universal encoding that would let any of the many standard encodings be 
converted to this universal encoding. It was first thought that this 
could be done with a 16 bit character set, but it later needed to be 
enlarged in size as they found out how many different characters there 
really were. While memory isn't quite as precious as it was when ASCII 
was designed, it isn't so abundant that we could just increase the size 
of out data by a factor, so a compact encoding was developed called 
UTF-8, which represents the ASCII characters exactly as the ASCII 
character set, and all the extra characters with multiple bytes.


Because it did make files with the extra characters longer, and it was 
somewhat complicated to work with, and most people only worked with 
documents that could all be encoded in a single character set, its 
adoption was slow, but it now is becoming a norm, but still many things 
are in the old legacy encodings.


If you try to interpret a file that is in one of the legacy encodings as 
Unicode UTF-8, then (if it uses extended characters) it almost certainly 
will create decoding errors (UTF-8 was intentionally designed with 
redundancy in the encoding to easy processing, so many 'random' 
sequences become invalid). If you interpret a file that is in UTF-8 and 
a legacy encoding, you will tend to get some strange out of place 
extended characters.


My first guess is that your proprietary crawler either didn't properly 
detect the page encoding and handle it, ideally it would have converted 
it to UTF-8, but in might also save the data in the original encoding 
and saved what that encoding was, or your read out program isn't 
detecting the character set the data was stored as and processing it 
right. I believe SQLite assumes that 'TEXT' data is UTF-8 encoded, but 
other encodings can be declared or data stored as BLOBs.


What likely should happen is someone (maybe you) needs to read out 
samples of the funny data as a blob, and figure out how the data is 
actually encoded, ideally comparing it to the original page crawled, and 
once you know what the problem was, you can perhaps work on fixing it 
and detecting the records with problems.


One possible issue is that some conversion routines take characters they 
don't know how to handle and replace them with the ASCII Question Mark, 
and if that is what has been stored in the database, it may be very hard 
to distinguish that from an actual question mark in the data.


--
Richard Damon

___
sqlite-users mailing list

Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Keith Medcalf

If we are talking the later case, and the 'text' field contains text in Windows 
MBCS then you can use, for example:

for row in db.execute('select cast(mbcsfield as blob) from table'):
   textfield = row[0].decode('mbcs')

to recover proper unicode text.  If the encoding is not 'mbcs' substitute the 
actual encoding.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.

>-Original Message-
>From: sqlite-users  On
>Behalf Of Keith Medcalf
>Sent: Saturday, 18 January, 2020 05:38
>To: SQLite mailing list 
>Subject: Re: [sqlite] How can I detect rows with non-ASCII values?
>
>
>On Saturday, 18 January, 2020 05:21, Rocky Ji 
>wrote:
>
>>>  > GLOB supports character classes
>
>>thanks for teaching new keyword and its use.
>
>>My first attempt was very similar to what you suggest, except I used
>>sqlite3 and re from inside Python.
>
>>But as you see, I can't reliably seprate 'interrogative' question marks
>>from question marks that get displayed due to 'encoding faults'.
>
>>Any suggestions?
>
>Ah.  So the real problem is that you stored non-text (text defined a UTF-
>8 encoded sequence of unicode codepoints with no zero/null byte except at
>the end) in a database text field, and now you are trying to access those
>text fields with something that expects them to contain properly
>formatted text strings?  Or do you mean that they *are* valid UTF-8
>encoded strings any you are trying to encode them as something else?
>
>If the former you can retrieve the raw bytes in python by retrieving the
>field as cast(x as blob) and then .decode the result from whatever
>encoding it is in into proper unicode.
>
>--
>The fact that there's a Highway to Hell but only a Stairway to Heaven
>says a lot about anticipated traffic volume.
>
>
>
>
>___
>sqlite-users mailing list
>sqlite-users@mailinglists.sqlite.org
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Simon Slavin
On 18 Jan 2020, at 12:12pm, Rocky Ji  wrote:

> By question marks, I meant- that some text, like Dutch programmers names, and 
> address in Nordic locations, have accents and umaults and other such 
> modifications done to English-alphabets. These get displayed as ? or box

SQLite doesn't display anything.  It's a database.  It stores things in a file 
and allows you to recall them.  It provides these facilities to programs other 
people write.

So you must be using a program which isn't SQLite to display those characters.  
That program will be choosing whether to display them as alphabet characters 
with accents, or strange things like boxes and question-marks.

What program are you using to display those characters ?
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Keith Medcalf

On Saturday, 18 January, 2020 05:21, Rocky Ji  wrote:

>>  > GLOB supports character classes

>thanks for teaching new keyword and its use.

>My first attempt was very similar to what you suggest, except I used
>sqlite3 and re from inside Python.

>But as you see, I can't reliably seprate 'interrogative' question marks
>from question marks that get displayed due to 'encoding faults'.

>Any suggestions?

Ah.  So the real problem is that you stored non-text (text defined a UTF-8 
encoded sequence of unicode codepoints with no zero/null byte except at the 
end) in a database text field, and now you are trying to access those text 
fields with something that expects them to contain properly formatted text 
strings?  Or do you mean that they *are* valid UTF-8 encoded strings any you 
are trying to encode them as something else?

If the former you can retrieve the raw bytes in python by retrieving the field 
as cast(x as blob) and then .decode the result from whatever encoding it is in 
into proper unicode.

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.




___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Keith Medcalf
On Saturday, 18 January, 2020 05:13, Rocky Ji  wrote:

>Sorry for lack of clarity.

>By question marks, I meant- that some text, like Dutch programmers names,
>and address in Nordic locations, have accents and umaults and other such
>modifications done to English-alphabets. These get displayed as ? or box

x GLOB '*[^ -~]*' will return true (1) if x contains any character that is not 
printable 7-bit ASCII character

-- 
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.



___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Rocky Ji
>
>  > GLOB supports character classes


thanks for teaching new keyword and its use.

My first attempt was very similar to what you suggest, except I used
sqlite3 and re from inside Python.

But as you see, I can't reliably seprate 'interrogative' question marks
from question marks that get displayed due to 'encoding faults'.

Any suggestions?

Thanks.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Rocky Ji
Sorry for lack of clarity.

By question marks, I meant- that some text, like Dutch programmers names,
and address in Nordic locations, have accents and umaults and other such
modifications done to English-alphabets. These get displayed as ? or box

On Sat, Jan 18, 2020, 16:34 Clemens Ladisch  wrote:

> Rocky Ji wrote:
> > I am asked to highlight rows containing strange characters. All data were
> > ingested by a proprietary crawler.
> >
> > By strange, I mean, question marks, boxes, little Christmas Trees,  solid
> > arrows, etc. kind of symbols; these appear suddenly in flow of normal
> ASCII
> > English letters.
>
> GLOB supports character classes:
>
>   SELECT *
>   FROM MyTable
>   WHERE DataField GLOB '*[^ -~]*';
>
> Question marks _are_ ASCII characters.  If you want to allow fewer
> characters,
> list them:  [^ A-Za-z0-9,.-]
>
>
> Regards,
> Clemens
> ___
> sqlite-users mailing list
> sqlite-users@mailinglists.sqlite.org
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Clemens Ladisch
Rocky Ji wrote:
> I am asked to highlight rows containing strange characters. All data were
> ingested by a proprietary crawler.
>
> By strange, I mean, question marks, boxes, little Christmas Trees,  solid
> arrows, etc. kind of symbols; these appear suddenly in flow of normal ASCII
> English letters.

GLOB supports character classes:

  SELECT *
  FROM MyTable
  WHERE DataField GLOB '*[^ -~]*';

Question marks _are_ ASCII characters.  If you want to allow fewer characters,
list them:  [^ A-Za-z0-9,.-]


Regards,
Clemens
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] How can I detect rows with non-ASCII values?

2020-01-18 Thread Rocky Ji
Hi,

I am asked to highlight rows containing strange characters. All data were
ingested by a proprietary crawler.

By strange, I mean, question marks, boxes, little Christmas Trees,  solid
arrows, etc. kind of symbols; these appear suddenly in flow of normal ASCII
English letters.

How do I approach this?

Thanks.
___
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users