Re: [ngram] formatting + punctuation removal

bthomson Wed, 17 Aug 2011 14:45:55 -0700

Hi Patrick,

I thought I would throw my idea in as well :-) I tend to use the 
--nontoken option. It is the flip side of Ted's --token approach: instead of 
listing what counts as a token, you list what should be removed. For example, 
using Ted's example below:


bridget@cheshire:~/test$ cat test.txt
my friends, i have news!!!!!!!!
i like .... ngrams!!!!

and a nontoken file containing one regex per line for each punctuation mark 
that you want to remove:

bridget@cheshire:~/test$ cat nontokenfile
/\./
/\!/
/\,/


You can run count.pl with the --nontoken option as follows:

bridget@cheshire:~/test$ count.pl --ngram 2 --nontoken nontokenfile test.2 test.txt
bridget@cheshire:~/test$ cat test.2
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

This gives some control over what punctuation you want to remove and what 
punctuation you would like to keep - for example hyphens.
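
To make the effect of the nontoken file concrete, here is a rough Python sketch of the same idea. It is not how count.pl is implemented; it only mimics the behaviour for this small example. The function name count_bigrams is made up for illustration, and two things are assumptions: that each nontoken match is replaced by a space before tokenizing, and that only /\w+/ is used as the token definition.

```python
import re
from collections import Counter

def count_bigrams(text, nontoken_patterns):
    """Mimic count.pl-style bigram lines for a small text (illustrative only)."""
    # Assumption: every nontoken match is replaced by a space before tokenizing.
    for pattern in nontoken_patterns:
        text = re.sub(pattern, " ", text)
    # Assumption: tokens are runs of word characters, as in Ted's /\w+/ example.
    tokens = re.findall(r"\w+", text)
    # Bigrams run across line breaks, as in the output above.
    pairs = list(zip(tokens, tokens[1:]))
    joint = Counter(pairs)
    left = Counter(first for first, _ in pairs)     # word seen in first position
    right = Counter(second for _, second in pairs)  # word seen in second position
    lines = [str(len(pairs))]                       # first line: total bigrams
    for (w1, w2), n in joint.items():
        lines.append(f"{w1}<>{w2}<>{n} {left[w1]} {right[w2]}")
    return lines

sample = "my friends, i have news!!!!!!!!\ni like .... ngrams!!!!\n"
result = count_bigrams(sample, [r"\.", r"!", r","])
print("\n".join(result))
```

The sketch prints the bigrams in the order they are first seen, so the line order will not necessarily match what count.pl writes.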

For your first question on formatting, I didn't completely understand what 
you were asking. Do you not want the statistics in the output file after 
running statistic.pl? Or would you like a program to remove the 
statistics and the <> markers after running statistic.pl?
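
If it is the second case, a small post-processing script may be enough. Below is a minimal Python sketch, assuming every n-gram line has the form word<>word<>numbers, as in the mobile<>phones example from your message. The function name strip_stats is hypothetical and not part of the package.

```python
def strip_stats(line):
    """Return the words of an n-gram line joined by spaces, or None."""
    line = line.strip()
    # Lines without the <> marker (e.g. the total on the first line) are skipped.
    if "<>" not in line:
        return None
    # Everything after the last <> is the numeric part; drop it.
    words = line.split("<>")[:-1]
    return " ".join(words)

sample_lines = [
    "384",
    "mobile<>phones<>100 280 384",
    "cellular<>phones<>96 214 384",
]
cleaned = [w for w in map(strip_stats, sample_lines) if w is not None]
print("\n".join(cleaned))
```

To run it on a real file, you could read the lines of the output file instead of sample_lines and write the cleaned list to a second file, which keeps the original with the statistics intact.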

Thanks,

Bridget


On Wed, 17 Aug 2011, Ted Pedersen wrote:

> Hi Patrick,
>
> One additional idea might be to use the --token option, and say that
> you only want to consider alphanumerics as your tokens (which is what
> you will count).
>
> For example...
>
> marengo(129): cat test
> my friends, i have news!!!!!!!!
> i like .... ngrams!!!!
>
> Now without any token list, stop list, etc...
>
> marengo(130): count.pl outa test
>
> marengo(131): cat outa
> 24
> !<>!<>10 11 12
> .<>.<>3 4 4
> news<>!<>1 1 12
> have<>news<>1 1 1
> .<>ngrams<>1 4 1
> !<>i<>1 11 2
> ,<>i<>1 1 2
> i<>have<>1 2 1
> ngrams<>!<>1 1 12
> like<>.<>1 1 4
> friends<>,<>1 1 1
> i<>like<>1 2 1
> my<>friends<>1 1 1
>
> Now I define a token file...
>
> marengo(132): cat token.txt
> /\w+/
>
> marengo(133): count.pl out test --token token.txt
>
> marengo(134): cat out
> 7
> i<>have<>1 2 1
> news<>i<>1 1 2
> have<>news<>1 1 1
> like<>ngrams<>1 1 1
> i<>like<>1 2 1
> friends<>i<>1 1 2
> my<>friends<>1 1 1
>
> Note that we only have alphanumerics...that might be the simplest
> thing to try first...
>
> Hope this helps...
> Ted
>
> On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu <liux0...@umn.edu> wrote:
>> Hi Patrick,
>>
>> You need to pre-process the text (data cleaning) to remove
>> punctuation before running count.pl. In the same way, you
>> need to post-process the output to get the bigrams or trigrams
>> in the format you want.
>>
>> Thanks,
>> Ying
>>
>> semiotica24 wrote:
>>>
>>> Sorry for the basic questions:
>>> 1. I need 2 versions of output for each list of bigrams and trigrams
>>> that I create using the various measures in count.pl and statistic.pl:
>>> one with the default statistics and one without. How do I format to
>>> exclude the statistics?
>>> e.g.:
>>> mobile<>phones<>100 280 384
>>> cellular<>phones<>96 214 384
>>>
>>> mobile phones
>>> cellular phones
>>>
>>> 2. I need to remove the punctuation marks . and , and have tried
>>> doing so within my stopword list, but I don't have the tags quite
>>> right. How should I enter them into my stop file?
>>>
>>> Thanks!
>>>
>>> Patrick
>>>
>>>
>
>
>
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
