Re: Strange issue with inserting varchar2()

2021-05-30 Thread Peter J. Holzer
On 2021-05-27 23:35:32 +, Bruce  Johnson wrote:
> The column in question is a varchar2(4000) so I am truncating the
> input to 4000 bytes.
> 
> (via use “bytes; $string=substr($orig,0,4000); ” in the section where
> I actually truncate the string.)
> 
> When I do the insert I get an  ORA-12899: value too large for column
> "AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054,
> maximum: 4000) error

Is $orig actually UTF-8 encoded at that time? If it only contains
codepoints <= U+00FF, it might be 1 byte/character, but when sent to
Oracle the codepoints between U+0080 and U+00FF will still be encoded as
two bytes. 

Something like:

my $bs = encode_utf8($orig);
my $sbs = substr($bs, 0, $len);
my $truncated = decode_utf8($sbs, FB_QUIET);

would be safer.

Also, waht encoding do you use in Oracle? UTF8 isn't really UTF-8: It's
a weird double-encoding which may take up to 6 six bytes for non-BMP
characters. To get real UTF-8, use AL32UTF8.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"


signature.asc
Description: PGP signature


Re: Strange issue with inserting varchar2()

2021-05-28 Thread Bruce Johnson
That would work, I think. Or, just changing the datatype in the table for the 
comments to a 4K CLOB in the *first* place, then the perl side truncate would 
work. This table is truncated each year, so there’s no real need to maintain 
any sort of historical data.


On May 28, 2021, at 3:55 AM, jurl...@urlwincc.com<mailto:jurl...@urlwincc.com> 
wrote:

What about truncating the data in PLSQL?  Eg an insert function?  It would have 
to input a CLOB I would guess…

Sent from my iPhone

On May 28, 2021, at 4:15 AM, Marcus Bergner 
mailto:marcus.berg...@vizrt.com>> wrote:


The way I have done it is to ensure that truncation only happens on valid UTF-8 
boundaries but length calculation is done in "use bytes" mode, and then try to 
guesstimate where to make the UTF-8 cut, and try that repeatedly until you 
manage to go under the max size you specified. Then you should have a valid 
UTF-8 string whos' byte encoded representation is less than the given number of 
bytes. There might be circumstances where this truncates slightly more but has 
worked well in practice for me for exactly this purpose of storing slightly too 
long text values in varchar columns.

sub truncByteString
{
my ($str, $bytes) = @_;
my ($len,$blen);
return undef unless defined $str;

while(1) {
do { use bytes; $blen = length $str };
last unless $blen > $bytes;
$len = length $str;
my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1;
$scaled_diff = 1 if $scaled_diff<1;
my $nlen = $len - $scaled_diff;
$str = substr $str,0,$nlen;
}

return $str;
}

/ Marcus
--
Marcus Bergner, M.Sc CSEE-mail: 
marcus.berg...@vizrt.com<mailto:marcus.berg...@vizrt.com>
Software Architect, Vizrt.Mobile: +46 (0)730-808025

From: Fennell, Brian mailto:fenne...@radial.com>>
Sent: Friday, May 28, 2021 03:11
To: Bruce Johnson 
mailto:john...@pharmacy.arizona.edu>>; dbi users 
mailto:dbi-users@perl.org>>
Subject: RE: Strange issue with inserting varchar2()

One more

https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0




The information contained in this electronic mail transmission is intended only 
for the use of the individual or entity named in this transmission. If you are 
not the intended recipient of this transmission, you are hereby notified that 
any disclosure, copying or distribution of the contents of this transmission is 
strictly prohibited and that you should delete the contents of this 
transmission from your system immediately. Any comments or statements contained 
in this transmission do not necessarily reflect the views or position of Radial 
or its subsidiaries and/or affiliates.



--
Bruce Johnson
University of Arizona
College of Pharmacy
Information Technology Group

Institutions do not have opinions, merely customs



Re: Strange issue with inserting varchar2()

2021-05-28 Thread jurlwin
What about truncating the data in PLSQL?  Eg an insert function?  It would have 
to input a CLOB I would guess…

Sent from my iPhone

> On May 28, 2021, at 4:15 AM, Marcus Bergner  wrote:
> 
> 
> The way I have done it is to ensure that truncation only happens on valid 
> UTF-8 boundaries but length calculation is done in "use bytes" mode, and then 
> try to guesstimate where to make the UTF-8 cut, and try that repeatedly until 
> you manage to go under the max size you specified. Then you should have a 
> valid UTF-8 string whos' byte encoded representation is less than the given 
> number of bytes. There might be circumstances where this truncates slightly 
> more but has worked well in practice for me for exactly this purpose of 
> storing slightly too long text values in varchar columns.
> 
> sub truncByteString
> {
> my ($str, $bytes) = @_;
> my ($len,$blen);
> return undef unless defined $str;
> 
> while(1) {
> do { use bytes; $blen = length $str };
> last unless $blen > $bytes;
> $len = length $str;
> my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1;
> $scaled_diff = 1 if $scaled_diff<1;
> my $nlen = $len - $scaled_diff;
> $str = substr $str,0,$nlen;
> }
> 
> return $str;
> }
> 
> / Marcus
> --
> Marcus Bergner, M.Sc CSEE-mail: marcus.berg...@vizrt.com
> Software Architect, Vizrt.Mobile: +46 (0)730-808025
> From: Fennell, Brian 
> Sent: Friday, May 28, 2021 03:11
> To: Bruce Johnson ; dbi users 
> 
> Subject: RE: Strange issue with inserting varchar2()
>  
> One more
> 
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0
> 
> 
> 
> 
> The information contained in this electronic mail transmission is intended 
> only for the use of the individual or entity named in this transmission. If 
> you are not the intended recipient of this transmission, you are hereby 
> notified that any disclosure, copying or distribution of the contents of this 
> transmission is strictly prohibited and that you should delete the contents 
> of this transmission from your system immediately. Any comments or statements 
> contained in this transmission do not necessarily reflect the views or 
> position of Radial or its subsidiaries and/or affiliates.
> 
> 


Re: Strange issue with inserting varchar2()

2021-05-28 Thread Marcus Bergner
The way I have done it is to ensure that truncation only happens on valid UTF-8 
boundaries but length calculation is done in "use bytes" mode, and then try to 
guesstimate where to make the UTF-8 cut, and try that repeatedly until you 
manage to go under the max size you specified. Then you should have a valid 
UTF-8 string whos' byte encoded representation is less than the given number of 
bytes. There might be circumstances where this truncates slightly more but has 
worked well in practice for me for exactly this purpose of storing slightly too 
long text values in varchar columns.

sub truncByteString
{
my ($str, $bytes) = @_;
my ($len,$blen);
return undef unless defined $str;

while(1) {
do { use bytes; $blen = length $str };
last unless $blen > $bytes;
$len = length $str;
my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1;
$scaled_diff = 1 if $scaled_diff<1;
my $nlen = $len - $scaled_diff;
$str = substr $str,0,$nlen;
}

return $str;
}

/ Marcus
--
Marcus Bergner, M.Sc CSEE-mail: marcus.berg...@vizrt.com
Software Architect, Vizrt.Mobile: +46 (0)730-808025

From: Fennell, Brian 
Sent: Friday, May 28, 2021 03:11
To: Bruce Johnson ; dbi users 
Subject: RE: Strange issue with inserting varchar2()

One more

https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0




The information contained in this electronic mail transmission is intended only 
for the use of the individual or entity named in this transmission. If you are 
not the intended recipient of this transmission, you are hereby notified that 
any disclosure, copying or distribution of the contents of this transmission is 
strictly prohibited and that you should delete the contents of this 
transmission from your system immediately. Any comments or statements contained 
in this transmission do not necessarily reflect the views or position of Radial 
or its subsidiaries and/or affiliates.




RE: Strange issue with inserting varchar2()

2021-05-27 Thread Fennell, Brian
One more

https://stackoverflow.com/questions/1454952/dummys-guide-to-unicode




The information contained in this electronic mail transmission is intended only 
for the use of the individual or entity named in this transmission. If you are 
not the intended recipient of this transmission, you are hereby notified that 
any disclosure, copying or distribution of the contents of this transmission is 
strictly prohibited and that you should delete the contents of this 
transmission from your system immediately. Any comments or statements contained 
in this transmission do not necessarily reflect the views or position of Radial 
or its subsidiaries and/or affiliates.




RE: Strange issue with inserting varchar2()

2021-05-27 Thread Fennell, Brian
Here are a few articles if you really want to dig in

https://stackoverflow.com/questions/5290182/how-many-bytes-does-one-unicode-character-take

https://docs.oracle.com/database/121/NLSPG/ch2charset.htm#NLSPG1037

https://metacpan.org/pod/Text::Iconv

https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod

https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod



-Original Message-
From: Fennell, Brian 
Sent: Thursday, May 27, 2021 8:10 PM
To: Bruce Johnson ; dbi users 
Subject: [EXTERNAL] RE: Strange issue with inserting varchar2()

UTF-8 has a variable number of bytes per character.  Some encodings have 1 byte 
per character always.  Some encodings have two bytes per character always.
Assuming that the number of bytes is the same as the number of characters 
sometimes leads to wrong answers.
When you are using perl as a client to oracle you may have characters in more 
than one encoding so you have to be very careful with the byte count and 
character count.
The perl libraries which access Oracle database are built with a specific 
Oracle C Library implementation, and the rules for character encoding and 
character byte count are sometimes different from one version to another.
In addition the perl rules for character counting and byte counting are 
different for different versions of perl.
In addition the Oracle Database has its own idea of character encoding and byte 
count.  And this also changes with version.
So you are threading more than one needle at the same time - it can get (and 
usually does get) very complicated very quickly.
The usual solution is to try to get all the layers to use the same encoding 
(and therefore the same number of bytes per character).  UTF-8 (with Unicode 
Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all 
layers usually works, but if you don't have control over all the layers you may 
not be able to choose this.
Another choice is to find out what character encoding the database is using and 
try to configure all other layers to use the same encoding.
It is also helpful to know that the English letters A-Z and a-z and the digits 
0-9 are often encoded the same way for many different encodings so sometimes 
you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping 
that all the layers will be "close enough".
The OS also may have its own rules about character encoding so even the same 
version of perl on Windows and Linux may have different encoding rules.

The last time I tried to figure this out for a real life example I simply gave 
up trying to understand it and had to roll back the version of perl AND the DBI 
/ DBD libraries (and therefore the version of Oracle C Client Libraries) to 
known working.

To be absolutely 100 percent sure you get it right you need to understand the 
OS, perl, oracle client and oracle server rules inside and out - and frankly 
this is very hard.
Even if you understand it all some layer may simply not be able to do what you 
want it to do.

When the perl programming language was first created the idea was that a 
character was a byte and a byte was a character - things have changes since 
then - more than once.
The same is true of the Oracle Database and client libraries - (and Windows and 
Linux) - the growing pains left a lot of strange artifacts behind in all.

Another trick is make all buffers 4 times are big as you think you really need 
them and then have the database concat the data together on the server side - 
do in the end you are only working with one layer.
Last time I checked 4 bytes per character was the worse case scenario.  It 
might be 8 by now.

There are also "lossless" ways of encoding "illegal characters", such as UTF-8 
encode first then Mime Q-Encode (quoted printable) or using XML / XHTML 
entities like this -  or UTF-8 encode then percent-encode.
This can be used work around special characters when you are dealing with 
mostly English with some European characters and an encoding that is either 
7-bit-ascii compatible or can easily be converted to and from (losslessly) 
7-bit-ascii (such as EBCIDIC with some creativity).
And there is always good old hexadecimal.

It all depends on your use-cases, how much control you have over your 
environment and how much time you can spend being clever.




-Original Message-
From: Bruce Johnson 
Sent: Thursday, May 27, 2021 7:36 PM
To: dbi users 
Subject: [EXTERNAL] Strange issue with inserting varchar2()

I'm working on an app that processes (among other thing) long sections of text, 
and I’m running into odd multi-byte characters in some of the entries.

The column in question is a varchar2(4000) so I am truncating the input to 4000 
bytes.

(via use “bytes; $string=substr($orig,0,4000); ” in the section where I 
actually truncate the string.)

When I do the insert I get an  ORA-12899: value too large for column 
"AWARD"."PRECEPT_NOMINA

Re: Strange issue with inserting varchar2()

2021-05-27 Thread Bruce Johnson
I’m beginning to think that just truncating the string to something well short 
of the column limit is going to be the simplest way to go. This has been a 
persistent issue in the past with earlier versions of this app that were 
handled by the previous programmer/dba by…manually editing the file to fit by 
trial and error.

It happens with just a few records every time (this is a once-a-year process) 
and it’s always like this: the input is too long by a relative handful of bytes.

(The actual data is supposed to be *short* comments in a review process, and 
some of the reviewers are really wordy. All I ever get is a csv of the 
responses, so I’m limited in my options. I could move to a BLOB but this is a 
lot of work to manage a few responses in this system. We retain the original 
data always so if anyone really wants the whole novella-length ’short comment’ 
for review we can get it. :-)

Thanks for the deep dive into the complications I'm facing! A lot of food for 
thought.

On May 27, 2021, at 5:09 PM, Fennell, Brian 
mailto:fenne...@radial.com>> wrote:

There are also "lossless" ways of encoding "illegal characters", such as UTF-8 
encode first then Mime Q-Encode (quoted printable) or using XML / XHTML 
entities like this -  or UTF-8 encode then percent-encode.
This can be used work around special characters when you are dealing with 
mostly English with some European characters and an encoding that is either 
7-bit-ascii compatible or can easily be converted to and from (losslessly) 
7-bit-ascii (such as EBCIDIC with some creativity).
And there is always good old hexadecimal.

It all depends on your use-cases, how much control you have over your 
environment and how much time you can spend being clever.

--
Bruce Johnson
University of Arizona
College of Pharmacy
Information Technology Group

Institutions do not have opinions, merely customs



RE: Strange issue with inserting varchar2()

2021-05-27 Thread Fennell, Brian
UTF-8 has a variable number of bytes per character.  Some encodings have 1 byte 
per character always.  Some encodings have two bytes per character always.
Assuming that the number of bytes is the same as the number of characters 
sometimes leads to wrong answers.
When you are using perl as a client to oracle you may have characters in more 
than one encoding so you have to be very careful with the byte count and 
character count.
The perl libraries which access Oracle database are built with a specific 
Oracle C Library implementation, and the rules for character encoding and 
character byte count are sometimes different from one version to another.
In addition the perl rules for character counting and byte counting are 
different for different versions of perl.
In addition the Oracle Database has its own idea of character encoding and byte 
count.  And this also changes with version.
So you are threading more than one needle at the same time - it can get (and 
usually does get) very complicated very quickly.
The usual solution is to try to get all the layers to use the same encoding 
(and therefore the same number of bytes per character).  UTF-8 (with Unicode 
Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all 
layers usually works,
but if you don't have control over all the layers you may not be able to choose 
this.
Another choice is to find out what character encoding the database is using and 
try to configure all other layers
to use the same encoding.
It is also helpful to know that the English letters A-Z and a-z and the digits 
0-9 are often encoded the same way for many different encodings so sometimes 
you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping 
that all the layers will be "close enough".
The OS also may have its own rules about character encoding so even the same 
version of perl on Windows and Linux may have different encoding rules.

The last time I tried to figure this out for a real life example I simply gave 
up trying to understand it and had to roll back the version of perl AND the DBI 
/ DBD libraries (and therefore the version of Oracle C Client Libraries) to 
known working.

To be absolutely 100 percent sure you get it right you need to understand the 
OS, perl, oracle client and oracle server rules inside and out - and frankly 
this is very hard.
Even if you understand it all some layer may simply not be able to do what you 
want it to do.

When the perl programming language was first created the idea was that a 
character was a byte and a byte was a character - things have changes since 
then - more than once.
The same is true of the Oracle Database and client libraries - (and Windows and 
Linux) - the growing pains left a lot of strange artifacts behind in all.

Another trick is make all buffers 4 times are big as you think you really need 
them and then have the database concat the data together on the server side - 
do in the end you are only working with one layer.
Last time I checked 4 bytes per character was the worse case scenario.  It 
might be 8 by now.

There are also "lossless" ways of encoding "illegal characters", such as UTF-8 
encode first then Mime Q-Encode (quoted printable) or using XML / XHTML 
entities like this -  or UTF-8 encode then percent-encode.
This can be used work around special characters when you are dealing with 
mostly English with some European characters and an encoding that is either 
7-bit-ascii compatible or can easily be converted to and from (losslessly) 
7-bit-ascii (such as EBCIDIC with some creativity).
And there is always good old hexadecimal.

It all depends on your use-cases, how much control you have over your 
environment and how much time you can spend being clever.




-Original Message-
From: Bruce Johnson 
Sent: Thursday, May 27, 2021 7:36 PM
To: dbi users 
Subject: [EXTERNAL] Strange issue with inserting varchar2()

I'm working on an app that processes (among other thing) long sections of text, 
and I’m running into odd multi-byte characters in some of the entries.

The column in question is a varchar2(4000) so I am truncating the input to 4000 
bytes.

(via use “bytes; $string=substr($orig,0,4000); ” in the section where I 
actually truncate the string.)

When I do the insert I get an  ORA-12899: value too large for column 
"AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054, maximum: 4000) 
error

When I check the actual length of the offending string in bytes on the perl 
side it is 4000 *bytes* long

Original Oncology comment is 4804 chars (4824 bytes) long and when truncated in 
character mode to 4000 chars it’s 4018 bytes long, when truncated in byte mode, 
it’s 4000 chars and bytes.

What’s confusing me is that the size of the string being reported by oracle in 
the error doesn’t match any of the values that I see in perl. I’m not sure this 
is a DBI issue or not.

--
Bruce Johnson
University of Arizona
College of Pharmacy
Information 

Re: Strange issue with inserting varchar2()

2021-05-27 Thread pali
On Thursday 27 May 2021 23:35:32 Bruce  Johnson wrote:
> use bytes; $string=substr($orig,0,4000);

Hello! This is really suspicious. See **BOLD** description of 'bytes' module:

  **Use of this module for anything other than debugging purposes is strongly 
discouraged.**

  **If you feel that the functions here within might be useful for your 
application, this possibly indicates a mismatch between your mental model of 
Perl Unicode and the current reality.**