Re: [ngram] formatting + punctuation removal

Ted Pedersen Wed, 17 Aug 2011 14:31:58 -0700

Hi Patrick,

One additional idea might be to use the --token option, and say that
you only want to consider alphanumerics as your tokens (which is what
you will count).


For example...

marengo(129): cat test
my friends, i have news!!!!!!!!
i like .... ngrams!!!!

Now without any token list, stop list, etc...

marengo(130): count.pl outa test

marengo(131): cat outa
24
!<>!<>10 11 12
.<>.<>3 4 4
news<>!<>1 1 12
have<>news<>1 1 1
.<>ngrams<>1 4 1
!<>i<>1 11 2
,<>i<>1 1 2
i<>have<>1 2 1
ngrams<>!<>1 1 12
like<>.<>1 1 4
friends<>,<>1 1 1
i<>like<>1 2 1
my<>friends<>1 1 1

Now I define a token file...

marengo(132): cat token.txt
/\w+/

marengo(133): count.pl out test --token token.txt

marengo(134): cat out
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1
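
To see why the punctuation drops out, here is a rough Python sketch of
what a /\w+/ token definition does to the test file. It is only an
illustration, not how count.pl is implemented; the function name
bigram_counts is made up for this example. Note that \w also matches
the underscore, so "alphanumerics" is slightly loose.

```python
import re
from collections import Counter

# Same pattern as in token.txt, without the surrounding / delimiters.
TOKEN = re.compile(r"\w+")

def bigram_counts(text):
    """Count adjacent token pairs, treating the text as one stream."""
    tokens = TOKEN.findall(text)  # punctuation never becomes a token
    return Counter(zip(tokens, tokens[1:]))

# Contents of the test file shown earlier in this message.
sample = "my friends, i have news!!!!!!!!\ni like .... ngrams!!!!\n"

counts = bigram_counts(sample)
print(sum(counts.values()))  # total number of bigrams
for (w1, w2), n in counts.items():
    print(f"{w1}<>{w2}<>{n}")
```

Because the text is treated as one stream, a pair such as news<>i
spans the line break, just as in the output above.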

Note that we only have alphanumerics...that might be the simplest
thing to try first...
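
For the other part of the question (a version of the list without the
statistics), a small post-processing step is probably enough. The
sketch below assumes lines in the layout shown in this thread (words
separated by <>, with the counts in the last field); strip_counts is a
hypothetical helper, not part of the package.

```python
def strip_counts(lines):
    """Turn lines such as 'i<>have<>1 2 1' into 'i have'."""
    pairs = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            continue  # skips the leading total line and any blank lines
        # Everything before the last <> is a word; the last field holds the numbers.
        *words, _numbers = line.split("<>")
        pairs.append(" ".join(words))
    return pairs

# A few lines copied from the output file shown above.
sample_lines = ["7", "i<>have<>1 2 1", "news<>i<>1 1 2", "my<>friends<>1 1 1"]

for pair in strip_counts(sample_lines):
    print(pair)
```

Since only the last field is dropped, the same helper should handle
trigram lines as well as bigram lines.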

Hope this helps...
Ted

On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu <liux0...@umn.edu> wrote:
> Hi Patrick,
>
> You need to pre-process the text (data cleaning) to remove
> punctuation before running count.pl. In the same way, you
> need to post-process the output to get the bigrams or trigrams
> into the format you want.
>
> Thanks,
> Ying
>
> semiotica24 wrote:
>>
>> Sorry for the basic questions:
>> 1. I need 2 versions of output for each list of bigrams and trigrams
>> that I create using the various measures in count.pl and statistic.pl:
>> one with the default statistics and one without. How do I format to
>> exclude the statistics?
>> e.g.:
>> mobile<>phones<>100 280 384
>> cellular<>phones<>96 214 384
>>
>> mobile phones
>> cellular phones
>>
>> 2. I need to remove the punctuation marks . and , and I've tried
>> doing so within my stopword list, but I don't have the tags quite
>> right. How should I enter them into my stop file?
>>
>> Thanks!
>>
>> Patrick
>>
>>
>
>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
