There has been some previous discussion of encoding issues on the list, for example the thread that starts here:
http://tech.groups.yahoo.com/group/ngram/message/211

I'll dig around a little more and see what else I can find.

More soon,
Ted

--- In ngram@yahoogroups.com, "Dian Jia" <dianj_83@...> wrote:
>
> Hi there,
>
> I would like to add the upper half of ASCII Character Table in NSP. I found
> the following possible solution in the package. Any suggestions to add the
> following to the latest version, which has been modified?
>
> "Here's an idea (courtesy of Michal Kren) - you can make the following
> modification to line 165 of count.pl (in v0.3):
>
> while ( /(([\w\x80-\xff]+)|[,.!?;:])/g )
>
> This will extend the "matching" for words to include ASCII characters
> numbered 127 to 256 (the upper half of the table). This includes a
> number of accented characters and other alphabets, so it might possibly
> include the characters you are interested in."
>
> Thanks a lot
> Di
>
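
For anyone who wants to try the quoted pattern outside of count.pl, here is a minimal standalone sketch. It only shows how the regular expression splits a line; the tokenize() helper and the sample string are made up for illustration and are not part of NSP, and the current count.pl may tokenize differently. The sketch assumes the input is a byte string in a single-byte encoding such as Latin-1, where the class \x80-\xff covers byte values 128 through 255.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NSP): split a byte string into tokens
# using the pattern quoted in the message above.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    # A token is either a run of word characters and/or bytes 0x80-0xFF,
    # or a single punctuation mark from the listed set.
    while ( $line =~ /(([\w\x80-\xff]+)|[,.!?;:])/g ) {
        push @tokens, $1;
    }
    return @tokens;
}

# Sample input written as Latin-1 bytes (assumed encoding).
my $sample = "caf\xe9 na\xefve, d\xe9j\xe0 vu!";
my @tokens = tokenize($sample);

# Print one token per line, showing high bytes as hex escapes so the
# output does not depend on the terminal encoding.
for my $tok (@tokens) {
    ( my $shown = $tok ) =~ s/([\x80-\xff])/sprintf("\\x%02x", ord $1)/ge;
    print "$shown\n";
}
```

With the high-byte range in the class, an accented word such as "caf\xe9" stays together as one token instead of being cut at the accented byte. If the text is decoded into Perl character strings (for example, read through a UTF-8 input layer), the behaviour of \w changes, so this sketch should be checked against the actual data before relying on it.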