Re: Strange issue with inserting varchar2()
On 2021-05-27 23:35:32 +, Bruce Johnson wrote: > The column in question is a varchar2(4000) so I am truncating the > input to 4000 bytes. > > (via use “bytes; $string=substr($orig,0,4000); ” in the section where > I actually truncate the string.) > > When I do the insert I get an ORA-12899: value too large for column > "AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054, > maximum: 4000) error Is $orig actually UTF-8 encoded at that time? If it only contains codepoints <= U+00FF, it might be 1 byte/character, but when sent to Oracle the codepoints between U+0080 and U+00FF will still be encoded as two bytes. Something like: my $bs = encode_utf8($orig); my $sbs = substr($bs, 0, $len); my $truncated = decode_utf8($sbs, FB_QUIET); would be safer. Also, waht encoding do you use in Oracle? UTF8 isn't really UTF-8: It's a weird double-encoding which may take up to 6 six bytes for non-BMP characters. To get real UTF-8, use AL32UTF8. hp -- _ | Peter J. Holzer| Story must make more sense than reality. |_|_) || | | | h...@hjp.at |-- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!" signature.asc Description: PGP signature
Re: Strange issue with inserting varchar2()
That would work, I think. Or, just changing the datatype in the table for the comments to a 4K CLOB in the *first* place, then the perl side truncate would work. This table is truncated each year, so there’s no real need to maintain any sort of historical data. On May 28, 2021, at 3:55 AM, jurl...@urlwincc.com<mailto:jurl...@urlwincc.com> wrote: What about truncating the data in PLSQL? Eg an insert function? It would have to input a CLOB I would guess… Sent from my iPhone On May 28, 2021, at 4:15 AM, Marcus Bergner mailto:marcus.berg...@vizrt.com>> wrote: The way I have done it is to ensure that truncation only happens on valid UTF-8 boundaries but length calculation is done in "use bytes" mode, and then try to guesstimate where to make the UTF-8 cut, and try that repeatedly until you manage to go under the max size you specified. Then you should have a valid UTF-8 string whos' byte encoded representation is less than the given number of bytes. There might be circumstances where this truncates slightly more but has worked well in practice for me for exactly this purpose of storing slightly too long text values in varchar columns. sub truncByteString { my ($str, $bytes) = @_; my ($len,$blen); return undef unless defined $str; while(1) { do { use bytes; $blen = length $str }; last unless $blen > $bytes; $len = length $str; my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1; $scaled_diff = 1 if $scaled_diff<1; my $nlen = $len - $scaled_diff; $str = substr $str,0,$nlen; } return $str; } / Marcus -- Marcus Bergner, M.Sc CSEE-mail: marcus.berg...@vizrt.com<mailto:marcus.berg...@vizrt.com> Software Architect, Vizrt.Mobile: +46 (0)730-808025 From: Fennell, Brian mailto:fenne...@radial.com>> Sent: Friday, May 28, 2021 03:11 To: Bruce Johnson mailto:john...@pharmacy.arizona.edu>>; dbi users mailto:dbi-users@perl.org>> Subject: RE: Strange issue with inserting varchar2() One more https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0 The information contained in this electronic mail transmission is intended only for the use of the individual or entity named in this transmission. If you are not the intended recipient of this transmission, you are hereby notified that any disclosure, copying or distribution of the contents of this transmission is strictly prohibited and that you should delete the contents of this transmission from your system immediately. Any comments or statements contained in this transmission do not necessarily reflect the views or position of Radial or its subsidiaries and/or affiliates. -- Bruce Johnson University of Arizona College of Pharmacy Information Technology Group Institutions do not have opinions, merely customs
Re: Strange issue with inserting varchar2()
What about truncating the data in PLSQL? Eg an insert function? It would have to input a CLOB I would guess… Sent from my iPhone > On May 28, 2021, at 4:15 AM, Marcus Bergner wrote: > > > The way I have done it is to ensure that truncation only happens on valid > UTF-8 boundaries but length calculation is done in "use bytes" mode, and then > try to guesstimate where to make the UTF-8 cut, and try that repeatedly until > you manage to go under the max size you specified. Then you should have a > valid UTF-8 string whos' byte encoded representation is less than the given > number of bytes. There might be circumstances where this truncates slightly > more but has worked well in practice for me for exactly this purpose of > storing slightly too long text values in varchar columns. > > sub truncByteString > { > my ($str, $bytes) = @_; > my ($len,$blen); > return undef unless defined $str; > > while(1) { > do { use bytes; $blen = length $str }; > last unless $blen > $bytes; > $len = length $str; > my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1; > $scaled_diff = 1 if $scaled_diff<1; > my $nlen = $len - $scaled_diff; > $str = substr $str,0,$nlen; > } > > return $str; > } > > / Marcus > -- > Marcus Bergner, M.Sc CSEE-mail: marcus.berg...@vizrt.com > Software Architect, Vizrt.Mobile: +46 (0)730-808025 > From: Fennell, Brian > Sent: Friday, May 28, 2021 03:11 > To: Bruce Johnson ; dbi users > > Subject: RE: Strange issue with inserting varchar2() > > One more > > https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0 > > > > > The information contained in this electronic mail transmission is intended > only for the use of the individual or entity named in this transmission. If > you are not the intended recipient of this transmission, you are hereby > notified that any disclosure, copying or distribution of the contents of this > transmission is strictly prohibited and that you should delete the contents > of this transmission from your system immediately. Any comments or statements > contained in this transmission do not necessarily reflect the views or > position of Radial or its subsidiaries and/or affiliates. > >
Re: Strange issue with inserting varchar2()
The way I have done it is to ensure that truncation only happens on valid UTF-8 boundaries but length calculation is done in "use bytes" mode, and then try to guesstimate where to make the UTF-8 cut, and try that repeatedly until you manage to go under the max size you specified. Then you should have a valid UTF-8 string whos' byte encoded representation is less than the given number of bytes. There might be circumstances where this truncates slightly more but has worked well in practice for me for exactly this purpose of storing slightly too long text values in varchar columns. sub truncByteString { my ($str, $bytes) = @_; my ($len,$blen); return undef unless defined $str; while(1) { do { use bytes; $blen = length $str }; last unless $blen > $bytes; $len = length $str; my $scaled_diff = int (($blen-$bytes)/($blen/$len))-1; $scaled_diff = 1 if $scaled_diff<1; my $nlen = $len - $scaled_diff; $str = substr $str,0,$nlen; } return $str; } / Marcus -- Marcus Bergner, M.Sc CSEE-mail: marcus.berg...@vizrt.com Software Architect, Vizrt.Mobile: +46 (0)730-808025 From: Fennell, Brian Sent: Friday, May 28, 2021 03:11 To: Bruce Johnson ; dbi users Subject: RE: Strange issue with inserting varchar2() One more https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fstackoverflow.com%2Fquestions%2F1454952%2Fdummys-guide-to-unicodedata=04%7C01%7CMarcus.Bergner%40vizrt.com%7C1767beb561424d97b19f08d92175a617%7Cc63c3ba740db460ebdf1e63a02a3ad59%7C0%7C0%7C637577611463282317%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=wqFX4kaJyubBeA2vuUo4U5bhggujKZcyukf1%2FN5%2BwEg%3Dreserved=0 The information contained in this electronic mail transmission is intended only for the use of the individual or entity named in this transmission. If you are not the intended recipient of this transmission, you are hereby notified that any disclosure, copying or distribution of the contents of this transmission is strictly prohibited and that you should delete the contents of this transmission from your system immediately. Any comments or statements contained in this transmission do not necessarily reflect the views or position of Radial or its subsidiaries and/or affiliates.
RE: Strange issue with inserting varchar2()
One more https://stackoverflow.com/questions/1454952/dummys-guide-to-unicode The information contained in this electronic mail transmission is intended only for the use of the individual or entity named in this transmission. If you are not the intended recipient of this transmission, you are hereby notified that any disclosure, copying or distribution of the contents of this transmission is strictly prohibited and that you should delete the contents of this transmission from your system immediately. Any comments or statements contained in this transmission do not necessarily reflect the views or position of Radial or its subsidiaries and/or affiliates.
RE: Strange issue with inserting varchar2()
Here are a few articles if you really want to dig in https://stackoverflow.com/questions/5290182/how-many-bytes-does-one-unicode-character-take https://docs.oracle.com/database/121/NLSPG/ch2charset.htm#NLSPG1037 https://metacpan.org/pod/Text::Iconv https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod https://metacpan.org/pod/distribution/perl/pod/perlunifaq.pod -Original Message- From: Fennell, Brian Sent: Thursday, May 27, 2021 8:10 PM To: Bruce Johnson ; dbi users Subject: [EXTERNAL] RE: Strange issue with inserting varchar2() UTF-8 has a variable number of bytes per character. Some encodings have 1 byte per character always. Some encodings have two bytes per character always. Assuming that the number of bytes is the same as the number of characters sometimes leads to wrong answers. When you are using perl as a client to oracle you may have characters in more than one encoding so you have to be very careful with the byte count and character count. The perl libraries which access Oracle database are built with a specific Oracle C Library implementation, and the rules for character encoding and character byte count are sometimes different from one version to another. In addition the perl rules for character counting and byte counting are different for different versions of perl. In addition the Oracle Database has its own idea of character encoding and byte count. And this also changes with version. So you are threading more than one needle at the same time - it can get (and usually does get) very complicated very quickly. The usual solution is to try to get all the layers to use the same encoding (and therefore the same number of bytes per character). UTF-8 (with Unicode Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all layers usually works, but if you don't have control over all the layers you may not be able to choose this. Another choice is to find out what character encoding the database is using and try to configure all other layers to use the same encoding. It is also helpful to know that the English letters A-Z and a-z and the digits 0-9 are often encoded the same way for many different encodings so sometimes you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping that all the layers will be "close enough". The OS also may have its own rules about character encoding so even the same version of perl on Windows and Linux may have different encoding rules. The last time I tried to figure this out for a real life example I simply gave up trying to understand it and had to roll back the version of perl AND the DBI / DBD libraries (and therefore the version of Oracle C Client Libraries) to known working. To be absolutely 100 percent sure you get it right you need to understand the OS, perl, oracle client and oracle server rules inside and out - and frankly this is very hard. Even if you understand it all some layer may simply not be able to do what you want it to do. When the perl programming language was first created the idea was that a character was a byte and a byte was a character - things have changes since then - more than once. The same is true of the Oracle Database and client libraries - (and Windows and Linux) - the growing pains left a lot of strange artifacts behind in all. Another trick is make all buffers 4 times are big as you think you really need them and then have the database concat the data together on the server side - do in the end you are only working with one layer. Last time I checked 4 bytes per character was the worse case scenario. It might be 8 by now. There are also "lossless" ways of encoding "illegal characters", such as UTF-8 encode first then Mime Q-Encode (quoted printable) or using XML / XHTML entities like this - or UTF-8 encode then percent-encode. This can be used work around special characters when you are dealing with mostly English with some European characters and an encoding that is either 7-bit-ascii compatible or can easily be converted to and from (losslessly) 7-bit-ascii (such as EBCIDIC with some creativity). And there is always good old hexadecimal. It all depends on your use-cases, how much control you have over your environment and how much time you can spend being clever. -Original Message- From: Bruce Johnson Sent: Thursday, May 27, 2021 7:36 PM To: dbi users Subject: [EXTERNAL] Strange issue with inserting varchar2() I'm working on an app that processes (among other thing) long sections of text, and I’m running into odd multi-byte characters in some of the entries. The column in question is a varchar2(4000) so I am truncating the input to 4000 bytes. (via use “bytes; $string=substr($orig,0,4000); ” in the section where I actually truncate the string.) When I do the insert I get an ORA-12899: value too large for column "AWARD"."PRECEPT_NOMINA
Re: Strange issue with inserting varchar2()
I’m beginning to think that just truncating the string to something well short of the column limit is going to be the simplest way to go. This has been a persistent issue in the past with earlier versions of this app that were handled by the previous programmer/dba by…manually editing the file to fit by trial and error. It happens with just a few records every time (this is a once-a-year process) and it’s always like this: the input is too long by a relative handful of bytes. (The actual data is supposed to be *short* comments in a review process, and some of the reviewers are really wordy. All I ever get is a csv of the responses, so I’m limited in my options. I could move to a BLOB but this is a lot of work to manage a few responses in this system. We retain the original data always so if anyone really wants the whole novella-length ’short comment’ for review we can get it. :-) Thanks for the deep dive into the complications I'm facing! A lot of food for thought. On May 27, 2021, at 5:09 PM, Fennell, Brian mailto:fenne...@radial.com>> wrote: There are also "lossless" ways of encoding "illegal characters", such as UTF-8 encode first then Mime Q-Encode (quoted printable) or using XML / XHTML entities like this - or UTF-8 encode then percent-encode. This can be used work around special characters when you are dealing with mostly English with some European characters and an encoding that is either 7-bit-ascii compatible or can easily be converted to and from (losslessly) 7-bit-ascii (such as EBCIDIC with some creativity). And there is always good old hexadecimal. It all depends on your use-cases, how much control you have over your environment and how much time you can spend being clever. -- Bruce Johnson University of Arizona College of Pharmacy Information Technology Group Institutions do not have opinions, merely customs
RE: Strange issue with inserting varchar2()
UTF-8 has a variable number of bytes per character. Some encodings have 1 byte per character always. Some encodings have two bytes per character always. Assuming that the number of bytes is the same as the number of characters sometimes leads to wrong answers. When you are using perl as a client to oracle you may have characters in more than one encoding so you have to be very careful with the byte count and character count. The perl libraries which access Oracle database are built with a specific Oracle C Library implementation, and the rules for character encoding and character byte count are sometimes different from one version to another. In addition the perl rules for character counting and byte counting are different for different versions of perl. In addition the Oracle Database has its own idea of character encoding and byte count. And this also changes with version. So you are threading more than one needle at the same time - it can get (and usually does get) very complicated very quickly. The usual solution is to try to get all the layers to use the same encoding (and therefore the same number of bytes per character). UTF-8 (with Unicode Integer to Character Mapping and UTF-8 Byte Encoding of the integers) for all layers usually works, but if you don't have control over all the layers you may not be able to choose this. Another choice is to find out what character encoding the database is using and try to configure all other layers to use the same encoding. It is also helpful to know that the English letters A-Z and a-z and the digits 0-9 are often encoded the same way for many different encodings so sometimes you can cheat by forcing all of your data to be "pure 7 bit ascii" and hoping that all the layers will be "close enough". The OS also may have its own rules about character encoding so even the same version of perl on Windows and Linux may have different encoding rules. The last time I tried to figure this out for a real life example I simply gave up trying to understand it and had to roll back the version of perl AND the DBI / DBD libraries (and therefore the version of Oracle C Client Libraries) to known working. To be absolutely 100 percent sure you get it right you need to understand the OS, perl, oracle client and oracle server rules inside and out - and frankly this is very hard. Even if you understand it all some layer may simply not be able to do what you want it to do. When the perl programming language was first created the idea was that a character was a byte and a byte was a character - things have changes since then - more than once. The same is true of the Oracle Database and client libraries - (and Windows and Linux) - the growing pains left a lot of strange artifacts behind in all. Another trick is make all buffers 4 times are big as you think you really need them and then have the database concat the data together on the server side - do in the end you are only working with one layer. Last time I checked 4 bytes per character was the worse case scenario. It might be 8 by now. There are also "lossless" ways of encoding "illegal characters", such as UTF-8 encode first then Mime Q-Encode (quoted printable) or using XML / XHTML entities like this - or UTF-8 encode then percent-encode. This can be used work around special characters when you are dealing with mostly English with some European characters and an encoding that is either 7-bit-ascii compatible or can easily be converted to and from (losslessly) 7-bit-ascii (such as EBCIDIC with some creativity). And there is always good old hexadecimal. It all depends on your use-cases, how much control you have over your environment and how much time you can spend being clever. -Original Message- From: Bruce Johnson Sent: Thursday, May 27, 2021 7:36 PM To: dbi users Subject: [EXTERNAL] Strange issue with inserting varchar2() I'm working on an app that processes (among other thing) long sections of text, and I’m running into odd multi-byte characters in some of the entries. The column in question is a varchar2(4000) so I am truncating the input to 4000 bytes. (via use “bytes; $string=substr($orig,0,4000); ” in the section where I actually truncate the string.) When I do the insert I get an ORA-12899: value too large for column "AWARD"."PRECEPT_NOMINATIONS"."ONCOLOGY_COMMENTS" (actual: 4054, maximum: 4000) error When I check the actual length of the offending string in bytes on the perl side it is 4000 *bytes* long Original Oncology comment is 4804 chars (4824 bytes) long and when truncated in character mode to 4000 chars it’s 4018 bytes long, when truncated in byte mode, it’s 4000 chars and bytes. What’s confusing me is that the size of the string being reported by oracle in the error doesn’t match any of the values that I see in perl. I’m not sure this is a DBI issue or not. -- Bruce Johnson University of Arizona College of Pharmacy Information
Re: Strange issue with inserting varchar2()
On Thursday 27 May 2021 23:35:32 Bruce Johnson wrote: > use bytes; $string=substr($orig,0,4000); Hello! This is really suspicious. See **BOLD** description of 'bytes' module: **Use of this module for anything other than debugging purposes is strongly discouraged.** **If you feel that the functions here within might be useful for your application, this possibly indicates a mismatch between your mental model of Perl Unicode and the current reality.**