Re: Unicode characters counted as multiple characters

Deyan Ginev Fri, 23 May 2014 08:30:18 -0700

Interestingly enough, if you remove all calls to encode() you get a parse
that prints out 1, as you would expect.


Usually you only want to call encode() on strings that you don't type in
the source yourself.
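
To illustrate the difference, here is a minimal sketch that uses only the core Encode module and leaves Marpa out entirely. The variable names are mine, not from the original script. It assumes the file is saved as UTF-8, so that under "use utf8" the literal is a single character, and it shows that encode() turns that one character into three bytes, which would explain the 3 you are seeing:

```perl
use utf8;      # string literals in this file are decoded character strings
use strict;
use warnings;
use Encode qw(encode);

my $chars = '≠';                        # one character, U+2260
my $bytes = encode( 'UTF-8', $chars );  # byte string: E2 89 A0

# length() counts characters in a decoded string
# and bytes in an encoded one.
my $char_count = length $chars;
my $byte_count = length $bytes;

print "characters: $char_count, bytes: $byte_count\n";
```

If the recognizer is handed the encoded byte string, each byte is presumably treated as one character, so positions come out in bytes rather than characters.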

Hope that helps,
Deyan


On Fri, May 23, 2014 at 5:18 PM, Ion Toloaca <[email protected]> wrote:

> Hello everyone,
>
>       I have been trying for some time to get the start and end positions
> of the last matched rule, and I ran into trouble when I tried an example
> that contained Unicode. The simplified version below shows that the
> position (the return value of the read() method) is counted wrongly
> because of the Unicode character; the code works fine if it is replaced
> with a plain ASCII character, for example '='. (The start position and
> the length are, by the way, given in tokens, but I work around this by
> using my $last_expression = $recce->substring($start_rule, $length_rule);
> and taking its length.)
>      Is Marpa at fault here for not counting Unicode correctly, or did I
> just use "encode", "decode" or something else in the wrong way?
>
>
> #################################################################################################
> use utf8;
> use 5.014;
> use strict;
> use warnings;
> use Data::Dumper;
> use Marpa::R2;
> use Encode;
>
> my $dsl = encode("UTF-8",<<END_OF_DSL);
> :start ::= Start
> :default ::= action => do_print
> Start ::= Rule1
> Rule1 ::= '≠'
> event 'Start' = completed Start
> END_OF_DSL
>
>
> #Initialize grammar#
> my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
> my $recce = Marpa::R2::Scanless::R->new(
>     { grammar => $grammar, semantics_package => 'My_Actions' } );
>
>
> my $input = encode("UTF-8",'≠');
> my $pos = $recce->read( \$input);
> my ($start_rule, $length_rule) = $recce->last_completed("Start");
> print("$pos"); # $pos == 3 since the unicode symbol is counted as 3 symbols
>                # (for usual symbols, $pos == 1 as expected)
>
> ###############################################################################################
>
>     Thank you in advance for help regarding this issue
>
> Best regards,
> Toloaca Ion
>
>  --
> You received this message because you are subscribed to the Google Groups
> "marpa parser" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

