[ngram] Re: Upper Half of ASCII Character Table

ted_pedersen Sat, 11 May 2013 10:29:54 -0700

Some users have reported that problems like this can be resolved by adding 
the statement

use locale;

in count.pl and statistic.pl. There's some discussion of that issue here, 
including a link to a bug report on SourceForge with more details:

http://tech.groups.yahoo.com/group/ngram/message/158

Adding use locale would be my suggestion as the first thing to try. The change 
you found for count.pl is from a very old version (v0.3), and I'm not sure how 
relevant it still is.
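
For reference, the old count.pl change quoted below boils down to widening the 
token pattern so that bytes 0x80 to 0xFF count as word characters. Here is a 
minimal standalone sketch of that idea, assuming Latin-1 (single-byte) input. 
The helper name tokenize_extended is made up for illustration and is not part 
of NSP, and current versions of count.pl may tokenize quite differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NSP): tokenize one line with the
# widened pattern from the old v0.3 suggestion.
sub tokenize_extended {
    my ($line) = @_;
    my @tokens;
    # A word is a run of \w characters or bytes 0x80-0xFF;
    # each listed punctuation mark is a token of its own.
    while ($line =~ /(([\w\x80-\xff]+)|[,.!?;:])/g) {
        push @tokens, $1;
    }
    return @tokens;
}

# Latin-1 encoded sample: \xe9 is e-acute, \xef is i-diaeresis.
my @tokens = tokenize_extended("caf\xe9 au lait, na\xefve!");
print scalar(@tokens), " tokens\n";
```

Note that this works on raw bytes, so it accepts every byte above 0x7F as part 
of a word, including Latin-1 symbols that are not letters (for example \xa9, 
the copyright sign).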

Here is some general Perl documentation about the use locale pragma, which 
might help explain what happens when you enable it:

http://perldoc.perl.org/perllocale.html#The-use-locale-pragma
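
As a rough sketch of what the pragma changes: with use locale in effect, \w in 
a regular expression follows the current LC_CTYPE locale, so accented letters 
in a single-byte encoding can count as word characters. The locale name below 
is an assumption (names vary from system to system, and it may not be 
installed on yours), and the words helper is made up for illustration.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use locale;    # \w and friends now follow the current LC_CTYPE locale

# Assumption: a Latin-1 locale with this name exists on the system.
my $wanted = "en_US.ISO-8859-1";
my $active = setlocale(LC_CTYPE, $wanted);
warn "locale $wanted not available; keeping the current locale\n"
    unless defined $active;

# Hypothetical helper: collect the runs of word characters in a line.
sub words {
    my ($line) = @_;
    my @found;
    while ($line =~ /(\w+)/g) {
        push @found, $1;
    }
    return @found;
}

# Latin-1 encoded sample; how many tokens you get depends on the locale.
my @w = words("caf\xe9 na\xefve");
printf "%d tokens\n", scalar @w;
```

If the accented words come out split into pieces, the locale was probably not 
set to one that treats those bytes as letters.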

But, please let us know what happens with this!

Cordially,
Ted

--- In ngram@yahoogroups.com, "ted_pedersen" <tpederse@...> wrote:
>
> There has been some previous discussion of encoding issues on the list, for 
> example the thread which starts here :
> 
> http://tech.groups.yahoo.com/group/ngram/message/211
> 
> I'll dig around a little more and see what else I can find.
> 
> More soon,
> Ted
> 
> --- In ngram@yahoogroups.com, "Dian Jia" <dianj_83@> wrote:
> >
> > Hi there,
> > 
> > I would like to add the upper half of ASCII Character Table in NSP. I found 
> > the following possible solution in the package. Any suggestions to add the 
> > following to the latest version, which has been modified?
> > 
> > "Here's an idea (courtesy of Michal Kren) - you can make the following 
> > modification to line 165 of count.pl (in v0.3):
> > 
> > while ( /(([\w\x80-\xff]+)|[,.!?;:])/g )
> > 
> > This will extend the "matching" for words to include ASCII characters
> > numbered 128 to 255 (the upper half of the table). This includes a
> > number of accented characters and other alphabets, so it might possibly
> > include the characters you are interested in."
> > 
> > Thanks a lot
> > Di
> >
>
