Hi Patrick,

I thought I would throw my idea in as well :-) I tend to use the --nontoken option. It is kind of the flip side of Ted's. For example, using Ted's example below:

bridget@cheshire:~/test$ cat test.txt
my friends, i have news!!!!!!!!
i like .... ngrams!!!!

and a nontoken file containing a regex for each punctuation mark that you want to remove:

bridget@cheshire:~/test$ cat nontokenfile
/\./
/\!/
/\,/

You can run count.pl with the --nontoken option as follows:

bridget@cheshire:~/test$ count.pl --ngram 2 --nontoken nontokenfile test.2 test.txt

bridget@cheshire:~/test$ cat test.2
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

This gives some control over what punctuation you want to remove and what punctuation you would like to keep - for example, hyphens.

For your first question on formatting, I didn't completely understand what you were asking. Do you not want the statistics in the output file after running statistic.pl? Or would you like a program to remove the statistics and the <> markers after running statistic.pl?

Thanks,

Bridget

On Wed, 17 Aug 2011, Ted Pedersen wrote:

> Hi Patrick,
>
> One additional idea might be to use the --token option, and say that
> you only want to consider alphanumerics as your tokens (which is what
> you will count).
>
> For example...
>
> marengo(129): cat test
> my friends, i have news!!!!!!!!
> i like .... ngrams!!!!
>
> Now without any token list, stop list, etc...
>
> marengo(130): count.pl outa test
>
> marengo(131): cat outa
> 24
> !<>!<>10 11 12
> .<>.<>3 4 4
> news<>!<>1 1 12
> have<>news<>1 1 1
> .<>ngrams<>1 4 1
> !<>i<>1 11 2
> ,<>i<>1 1 2
> i<>have<>1 2 1
> ngrams<>!<>1 1 12
> like<>.<>1 1 4
> friends<>,<>1 1 1
> i<>like<>1 2 1
> my<>friends<>1 1 1
>
> Now I define a token file...
>
> marengo(132): cat token.txt
> /\w+/
>
> marengo(133): count.pl out test --token token.txt
>
> marengo(134): cat out
> 7
> i<>have<>1 2 1
> news<>i<>1 1 2
> have<>news<>1 1 1
> like<>ngrams<>1 1 1
> i<>like<>1 2 1
> friends<>i<>1 1 2
> my<>friends<>1 1 1
>
> Note that we only have alphanumerics...that might be the simplest
> thing to try first...
>
> Hope this helps...
> Ted
>
> On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu <liux0...@umn.edu> wrote:
>> Hi Patrick,
>>
>> You need to pre-process the text (data cleaning) to remove
>> punctuation before running count.pl. In the same way, you
>> need to post-process the output to get the bigrams or trigrams
>> in the format you want.
>>
>> Thanks,
>> Ying
>>
>> semiotica24 wrote:
>>>
>>> Sorry for the basic questions:
>>> 1. I need 2 versions of output for each list of bigrams and trigrams
>>> that I create using the various measures in count.pl and statistic.pl:
>>> one with the default statistics and one without. How do I format to
>>> exclude the statistics?
>>> e.g.:
>>> mobile<>phones<>100 280 384
>>> cellular<>phones<>96 214 384
>>>
>>> mobile phones
>>> cellular phones
>>>
>>> 2. I need to remove punctuation . and , I've tried within my stopword
>>> list, but I don't have the tags quite right. How should I enter them
>>> into my stop file?
>>>
>>> Thanks!
>>>
>>> Patrick
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
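
For the post-processing side of Patrick's first question (turning mobile<>phones<>100 280 384 into mobile phones), a small script outside of NSP may be enough. The following is only a rough sketch in Python, not something that ships with the package: the function name strip_stats is made up for illustration, and it assumes every n-gram line has its tokens separated by <> with the numbers in the last field after the final <>. Lines without <> (such as the total count on the first line) are skipped.

```python
def strip_stats(lines):
    """Return n-grams as plain space-separated strings.

    Hypothetical helper (not part of NSP). Assumes each n-gram line looks
    like 'mobile<>phones<>100 280 384': tokens separated by '<>', with the
    numbers in the field after the final '<>'.
    """
    ngrams = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            # Skip blank lines and the leading line with the total n-gram count.
            continue
        # Everything after the last '<>' is the numeric field; drop it.
        tokens = line.split("<>")[:-1]
        ngrams.append(" ".join(tokens))
    return ngrams


if __name__ == "__main__":
    sample = [
        "7",
        "mobile<>phones<>100 280 384",
        "cellular<>phones<>96 214 384",
    ]
    for ngram in strip_stats(sample):
        print(ngram)
```

To use it on a real file, the lines could be read with open("test.2") and passed to strip_stats in place of the sample list. The same approach should carry over to trigrams, since only the final field is dropped, but it would need adjusting if a token itself could contain <>.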