Unicode characters counted as multiple characters

Ion Toloaca Fri, 23 May 2014 08:18:44 -0700

Hello everyone,

I have been trying for some time to get the start and end positions of the
last matched rule, and I ran into trouble with an example that contains a
Unicode character. The simplified version below shows that the position (the
return value of the read() method) is counted wrongly because of the Unicode
character; the code works fine if the character is replaced with an ASCII
one, for example '='. (The start position and the length are, by the way,
given in tokens, but I work around this by using
my $last_expression = $recce->substring($start_rule, $length_rule);
and taking its length.)

Is Marpa at fault here for not counting Unicode correctly, or did I just use
"encode", "decode", or something else in the wrong way?


#################################################################################################
use utf8;
use 5.014;
use strict;
use warnings;
use Data::Dumper; 
use Marpa::R2;
use Encode;

my $dsl = encode("UTF-8",<<END_OF_DSL);
:start ::= Start
:default ::= action => do_print
Start ::= Rule1 
Rule1 ::= '≠'
event 'Start' = completed Start 
END_OF_DSL


#Initialize grammar#
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce = Marpa::R2::Scanless::R->new(
    { grammar => $grammar, semantics_package => 'My_Actions' } );


my $input = encode("UTF-8",'≠');
my $pos = $recce->read( \$input);
my ($start_rule, $length_rule) = $recce->last_completed("Start");
print("$pos"); # $pos == 3 since the Unicode symbol is counted as 3 symbols
               # (for ordinary symbols, $pos == 1 as expected)
###############################################################################################
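
As a sanity check that does not involve Marpa at all, here is a minimal sketch (the variable names are only for illustration and are not part of my real program) comparing the length of the character string with the length of the same string after encode("UTF-8", ...). It is only my assumption that read() reports positions in the same units that length() would report for the string it is given; if so, this would account for the 3.

```perl
use utf8;                      # the source file itself is UTF-8
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = '≠';                          # one character, U+2260
my $bytes = encode('UTF-8', $chars);      # byte string: E2 89 A0

my $char_len       = length $chars;                    # counts characters
my $byte_len       = length $bytes;                    # counts bytes
my $round_trip_len = length decode('UTF-8', $bytes);   # characters again

print "characters: $char_len, bytes: $byte_len, decoded: $round_trip_len\n";
```

On my understanding of Encode, this prints 1 character, 3 bytes, and 1 again after decoding, which matches the two values of $pos seen above.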

Thank you in advance for any help with this issue.

Best regards, 
Toloaca Ion
