Some users have reported that problems like this can be resolved via the inclusion of the statement

use locale;

in count.pl and statistic.pl. There's some discussion of that issue here, which includes a link to a bug report on sourceforge with more details...

http://tech.groups.yahoo.com/group/ngram/message/158

I think that would be my suggestion as the first thing to try - the change you found for count.pl is from a very old version, and I'm not sure how relevant that is any more.

Here is some general Perl documentation about use locale which might be helpful in explaining what is happening when you use this.

http://perldoc.perl.org/perllocale.html#The-use-locale-pragma

But, please let us know what happens with this!

Cordially,
Ted

--- In ngram@yahoogroups.com, "ted_pedersen" <tpederse@...> wrote:
>
> There has been some previous discussion of encoding issues on the list, for
> example the thread which starts here :
>
> http://tech.groups.yahoo.com/group/ngram/message/211
>
> I'll dig around a little more and see what else I can find.
>
> More soon,
> Ted
>
> --- In ngram@yahoogroups.com, "Dian Jia" <dianj_83@> wrote:
> >
> > Hi there,
> >
> > I would like to add the upper half of ASCII Character Table in NSP. I found
> > the following possible solution in the package. Any suggestions to add the
> > following to the latest version, which has been modified?
> >
> > "Here's an idea (courtesy of Michal Kren) - you can make the following
> > modification to line 165 of count.pl (in v0.3):
> >
> > while ( /(([\w\x80-\xff]+)|[,.!?;:])/g )
> >
> > This will extend the "matching" for words to include ASCII characters
> > numbered 127 to 256 (the upper half of the table). This includes a
> > number of accented characters and other alphabets, so it might possibly
> > include the characters you are interested in."
> >
> > Thanks a lot
> > Di
> >
>
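
For anyone who wants to see what the quoted regex change does in isolation, here is a minimal sketch. It is not taken from count.pl: the tokenize helper and the sample string are hypothetical, and it assumes the input is Latin-1 text held as raw bytes (no UTF-8 decoding, no use locale). Only the extended pattern comes from the quoted message.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NSP): collect every token that the
# given pattern matches in one line of text.
sub tokenize {
    my ($line, $pattern) = @_;
    my @tokens;
    while ( $line =~ /$pattern/g ) {
        push @tokens, $1;    # $1 is the outermost capture group
    }
    return @tokens;
}

# Assumed sample: "cafe, naive!" with accented e and i, as Latin-1 bytes.
my $line = "caf\xe9, na\xefve!";

# Plain \w: on a byte string with no locale or Unicode rules in effect,
# bytes 0x80-0xff are not treated as word characters.
my $ascii_only = qr/((\w+)|[,.!?;:])/;

# Pattern from the quoted suggestion: also accept bytes 0x80-0xff in words.
my $extended = qr/(([\w\x80-\xff]+)|[,.!?;:])/;

my @plain = tokenize($line, $ascii_only);
my @wide  = tokenize($line, $extended);

printf "plain: %d tokens, extended: %d tokens\n", scalar @plain, scalar @wide;
```

With this sample, the plain pattern should split each accented word at the accented byte, while the extended pattern should keep each word whole. What use locale does instead depends on the locale settings of the machine, so it is not shown here.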