Re: [ngram] count.pl for unicode documents

Ted Pedersen tpede...@d.umn.edu [ngram] Tue, 10 May 2016 08:23:34 -0700

Tokenization and the --token option are described here:

http://search.cpan.org/~tpederse/Text-NSP/doc/README.pod#2._Tokens
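
To make the tokenization question concrete, here is a minimal Python sketch (not NSP code) of why non-ASCII words can vanish from the counts while digits and punctuation survive. It assumes the default tokens behave like the two patterns `\w+` and `[.,;:?!]` applied without Unicode awareness; the sample text and the names `count_unigrams`, `DEFAULT_TOKENS`, and `WHITESPACE_TOKENS` are hypothetical.

```python
import re
from collections import Counter

# Assumption: these mimic ASCII-only default tokens (word characters or a
# single punctuation mark). re.ASCII restricts \w to [A-Za-z0-9_].
DEFAULT_TOKENS = re.compile(r"\w+|[.,;:?!]", re.ASCII)

# Whitespace-based tokenization: every run of non-space characters is a token.
WHITESPACE_TOKENS = re.compile(r"\S+")


def count_unigrams(text, pattern):
    """Count every non-overlapping match of `pattern` in `text`."""
    return Counter(pattern.findall(text))


# Hypothetical sample: two distinct non-ASCII words, a number, punctuation.
sample = "привет мир 10 . привет !"

default_counts = count_unigrams(sample, DEFAULT_TOKENS)
space_counts = count_unigrams(sample, WHITESPACE_TOKENS)

print(default_counts)  # only the number and the punctuation marks are counted
print(space_counts)    # the non-ASCII words are counted as well
```

Assuming the token file holds one slash-delimited Perl regex per line, as the README section linked above appears to describe, a file containing only `/\S+/` passed with --token would be the count.pl counterpart of `WHITESPACE_TOKENS`. Please check the README before relying on that format.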


On Tue, May 10, 2016 at 8:14 AM, amir.jad...@yahoo.com [ngram] <ngram@yahoogroups.com> wrote:

> I'm trying to run count.pl on a directory of Unicode documents (a sample
> document is attached) using Perl 5 (v5.18.2). The output is a list of
> digits and punctuation marks, without any Unicode words:
>
> 2732
> .<>1589
> :<>626
> 2<>19
> !<>17
> 10<>16
> 4<>14
> 13<>13
> 12<>13
> 20<>12
> 9<>11
> 15<>11
> 3<>10
> 5<>10
>
> Is it possible to ask count.pl to tokenize the input file on whitespace only?
>
> There is a --token option which may be useful, but I don't know how to use it.
>