> Yary has an issue posted regarding 'display-width' of UTF-16 encoded strings:

>  https://github.com/rakudo/rakudo/issues/3461

>  I know it might be far-fetched, but what if your UTF-8 issue and
>  Yary's UTF-16 issue were related?

Well, an issue with handling combining characters could easily affect
both; nothing about it is specific to one encoding. Yary's issue
doesn't have to do with reading from disk though; he's just looking at
the raw bytes the encoding generates.
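
As a rough sketch of why this isn't tied to one encoding: a single
character (grapheme) can be built from more than one codepoint, and
that holds however the codepoints are later turned into bytes. The
sample string here is my own made-up example (a "q" plus a combining
diaeresis, which has no precomposed form); it isn't taken from either
issue.

```raku
use v6;

# Hypothetical sample: "q" followed by COMBINING DIAERESIS (U+0308).
# No precomposed codepoint exists for this pair, so it stays two
# codepoints even after normalization.
my $str = "q\x[0308]";

say $str.chars;                    # graphemes: 1
say $str.codes;                    # codepoints: 2
say $str.encode('utf-8').bytes;    # UTF-8 bytes: 3 (1 for "q", 2 for U+0308)
say $str.encode('utf-16').bytes;   # UTF-16 byte count, printed for comparison
```

Either way the reader can't know a character is complete until it has
looked at the codepoint that follows, whatever the byte encoding is.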



On 4/24/20, William Michels <w...@caa.columbia.edu> wrote:
> Hi Joe,
>
> I was able to run the code you posted and reproduced the exact same
> result (Rakudo version 2020.02.1.0000.1 built on MoarVM version
> 2020.02.1 implementing Raku 6.d). I tried playing with file encodings a bit
> (e.g. UTF8-C8), but I didn't see any improvement.
>
> Yary has an issue posted regarding 'display-width' of UTF-16 encoded
> strings:
>
> https://github.com/rakudo/rakudo/issues/3461
>
> I know it might be far-fetched, but what if your UTF-8 issue and
> Yary's UTF-16 issue were related? It would be nice to kill two birds
> with one stone.
>
> Best Regards, Bill.
>
>
>
>
> On Fri, Apr 24, 2020 at 1:20 PM Joseph Brenner <doom...@gmail.com> wrote:
>>
>> Another version of my test code, checking .tell throughout:
>>
>> use v6;
>> use Test;
>>
>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> my $file = "$tmpdir/scratch_file.txt";
>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> my $ascii_str =   "ABCDEFGHI";
>>
>> test_read_and_read_again($unichar_str, $file, 3);
>> test_read_and_read_again($ascii_str,   $file, 0);
>>
>> # write given string to file, then read the third character twice and check
>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>>     spurt $file, $str;
>>     my $fh = $file.IO.open;
>>     printf "%d: just opened\n", $fh.tell;
>>     $fh.readchars(2);  # skip a few
>>     printf "%d: after skipping 2\n", $fh.tell;
>>     my $chr_1 =      $fh.readchars(1);
>>     printf "%d: after reading 3rd: %s\n", $fh.tell, $chr_1;
>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>>     my $step_back = $width + $nudge;
>>     $fh.seek: -$step_back, SeekFromCurrent;
>>     printf "%d: after seeking back %d\n", $fh.tell, $step_back;
>>     my $chr_2 =      $fh.readchars(1);
>>     printf "%d: after re-reading 3rd: %s\n", $fh.tell, $chr_2;
>>     is( $chr_1, $chr_2,
>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> }
>>
>>
>> The output looks like so:
>>
>> /home/doom/End/Cave/Perl6/bin/trial-seeking_inner_truth.pl6
>> 0: just opened
>> 9: after skipping 2
>> 12: after reading 3rd: ䷼
>> 6: after seeking back 6
>> 12: after re-reading 3rd: ䷼
>> ok 1 - read, seek back, and read again gets same char with nudge of 3
>> 0: just opened
>> 2: after skipping 2
>> 3: after reading 3rd: C
>> 2: after seeking back 1
>> 3: after re-reading 3rd: C
>> ok 2 - read, seek back, and read again gets same char with nudge of 0
>>
>> It's really hard to see what I should do if I really wanted to
>> intermix readchars and seeks like this... I'd need to check the range
>> of the codepoint to see how far I need to seek to get where I expect
>> to be.
>>
>>
>>
>> On 4/24/20, Joseph Brenner <doom...@gmail.com> wrote:
>> > Thanks, yes, I understand Unicode and UTF-8 reasonably well.
>> >
>> >> So Rakudo has to read the next codepoint to make sure that it isn't a
>> >> combining codepoint.
>> >
>> >> It is probably faking up the reads to look right when reading ASCII, but
>> >> failing to do that for wider codepoints.
>> >
>> > I think it'd be the other way around... the idea here would be it's
>> > doing an extra readchar behind the scenes just in case there's
>> > combining chars involved-- so you're figuring there's some confusion
>> > about the actual point in the file that's being read and the
>> > abstraction that readchars is supplying?
>> >
>> >
>> > On 4/24/20, Brad Gilbert <b2gi...@gmail.com> wrote:
>> >> In UTF-8, a codepoint can be 1 to 4 bytes long.
>> >>
>> >> UTF-8 was designed so that 7-bit ASCII is a subset of it.
>> >>
>> >> Any 8-bit byte that has its most significant bit set cannot be ASCII.
>> >> So multi-byte codepoints have the most significant bit set for all of
>> >> their bytes.
>> >> The first byte tells you the number of bytes that follow it.
>> >>
>> >> That is how a single codepoint is stored.
>> >>
>> >> A character can be made of several codepoints.
>> >>
>> >>     "\c[LATIN SMALL LETTER E]\c[COMBINING ACUTE ACCENT]"
>> >>     "é"
>> >>
>> >> So Rakudo has to read the next codepoint to make sure that it isn't a
>> >> combining codepoint.
>> >>
>> >> It is probably faking up the reads to look right when reading ASCII, but
>> >> failing to do that for wider codepoints.
>> >>
>> >> On Fri, Apr 24, 2020 at 1:34 PM Joseph Brenner <doom...@gmail.com>
>> >> wrote:
>> >>
>> >>> I thought that doing a readchars on a filehandle, seeking backwards
>> >>> the width of the char in bytes and then doing another read
>> >>> would always get the same character.  That works for ascii-range
>> >>> characters (1-byte in utf-8 encoding) but not multi-byte "wide"
>> >>> characters (commonly 3-bytes in utf-8).
>> >>>
>> >>> The question then, is why do I need a $nudge of 3 for wide chars, but
>> >>> not ascii-range ones?
>> >>>
>> >>> use v6;
>> >>> use Test;
>> >>>
>> >>> my $tmpdir = IO::Spec::Unix.tmpdir;
>> >>> my $file = "$tmpdir/scratch_file.txt";
>> >>> my $unichar_str = "\x[1200]\x[2D80]\x[4DFC]\x[AAAA]\x[2CA4]\x[2C8E]";  # ሀⶀ䷼ꪪⲤⲎ
>> >>> my $ascii_str =   "ABCDEFGHI";
>> >>>
>> >>> subtest {
>> >>>     my $nudge = 3;
>> >>>     test_read_and_read_again($unichar_str, $file, $nudge);
>> >>> }, "Wide unicode chars: $unichar_str";
>> >>>
>> >>> subtest {
>> >>>     my $nudge = 0;
>> >>>     test_read_and_read_again($ascii_str, $file, $nudge);
>> >>> }, "Ascii-range chars: $ascii_str";
>> >>>
>> >>> # write given string to file, then read the third character twice and check
>> >>> sub test_read_and_read_again($str, $file, $nudge = 0) {
>> >>>     spurt $file, $str;
>> >>>     my $fh = $file.IO.open;
>> >>>     $fh.readchars(2);  # skip a few
>> >>>     my $chr_1 =      $fh.readchars(1);
>> >>>     my $width = $chr_1.encode('UTF-8').bytes;  # for our purposes, always 1 or 3
>> >>>     my $step_back = $width + $nudge;
>> >>>     $fh.seek: -$step_back, SeekFromCurrent;
>> >>>     my $chr_2 =      $fh.readchars(1);
>> >>>     is( $chr_1, $chr_2,
>> >>>         "read, seek back, and read again gets same char with nudge of $nudge" );
>> >>> }
>> >>>
>> >>
>> >
>
