Tokenization and the --token option are described here: http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens
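
In case it helps, here is a rough Python sketch (not NSP code) of what is probably happening. I am assuming the default token definitions are /\w+/ and /[\.,;:\?!]/, and that \w only matches ASCII letters and digits when the input is read as raw bytes, so every non-ASCII word is skipped. As I read that README section, the token file holds one Perl regex per line between slashes, so a file containing only /\S+/ (passed as --token token.txt, where token.txt is just a name I made up) should split on whitespace alone. The sample string and the character-skipping loop are illustrative assumptions, not the real count.pl internals.

```python
import re


def tokenize(text, token_patterns):
    """Rough model of regex-based tokenization: at each position try the
    patterns in order, keep the first match, otherwise skip one character."""
    # re.ASCII makes \w and \s ASCII-only, mimicking byte-oriented matching.
    compiled = [re.compile(p, re.ASCII) for p in token_patterns]
    tokens, pos = [], 0
    while pos < len(text):
        for rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos:
                tokens.append(m.group())
                pos = m.end()
                break
        else:
            pos += 1  # no token starts here, so the character is dropped
    return tokens


# Assumed defaults: word characters, plus a few punctuation marks.
DEFAULT_TOKENS = [r"\w+", r"[.,;:?!]"]
# Whitespace-only tokenization: every run of non-space characters is a token.
WHITESPACE_TOKENS = [r"\S+"]

sample = "γεια σου 2016 !"  # hypothetical non-ASCII input
print(tokenize(sample, DEFAULT_TOKENS))     # ['2016', '!']
print(tokenize(sample, WHITESPACE_TOKENS))  # ['γεια', 'σου', '2016', '!']
```

The first result has the same shape as your output (only digits and punctuation survive), and the second keeps the words, which is what a whitespace-only token definition should give you.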

On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] <ngram@yahoogroups.com> wrote:

> I'm trying to run count.pl on a directory of Unicode documents (a sample
> document has been attached) using Perl 5 (v5.18.2). The output is a list
> of digits and punctuation marks without any Unicode words:
>
> 2732
> .<>1589
> :<>626
> 2<>19
> !<>17
> 10<>16
> 4<>14
> 13<>13
> 12<>13
> 20<>12
> 9<>11
> 15<>11
> 3<>10
> 5<>10
>
> Is it possible to ask count.pl to tokenize the input file just by spaces?
>
> There is a --token option which may be useful, but I don't know how to use it.