Hi Patrick,

One additional idea might be to use the --token option, and say that you only want to consider alphanumerics as your tokens (which is what you will count).
For example...

marengo(129): cat test
my friends, i have news!!!!!!!! i like .... ngrams!!!!

Now without any token list, stop list, etc...

marengo(130): count.pl outa test
marengo(131): cat outa
24
!<>!<>10 11 12
.<>.<>3 4 4
news<>!<>1 1 12
have<>news<>1 1 1
.<>ngrams<>1 4 1
!<>i<>1 11 2
,<>i<>1 1 2
i<>have<>1 2 1
ngrams<>!<>1 1 12
like<>.<>1 1 4
friends<>,<>1 1 1
i<>like<>1 2 1
my<>friends<>1 1 1

Now I define a token file...

marengo(132): cat token.txt
/\w+/
marengo(133): count.pl out test --token token.txt
marengo(134): cat out
7
i<>have<>1 2 1
news<>i<>1 1 2
have<>news<>1 1 1
like<>ngrams<>1 1 1
i<>like<>1 2 1
friends<>i<>1 1 2
my<>friends<>1 1 1

Note that we now only have alphanumerics: the punctuation is never tokenized, so it never shows up in the counts. That might be the simplest thing to try first...

Hope this helps...
Ted

On Wed, Aug 17, 2011 at 4:05 PM, Ying Liu <liux0...@umn.edu> wrote:
> Hi Patrick,
>
> You need to pre-process the text (data cleaning) to remove
> punctuations before run by count.pl. The same idea, you
> need to post-process to get the format you want of the bigrams
> or trigrams.
>
> Thanks,
> Ying
>
> semiotica24 wrote:
>>
>> Sorry for the basic questions:
>> 1. I need 2 versions of output for each list of bigrams and trigrams
>> that I create using the various measures in count.pl and statistic.pl:
>> one with the default statistics and one without. How do I format to
>> exclude the statistics?
>> e.g.:
>> mobile<>phones<>100 280 384
>> cellular<>phones<>96 214 384
>>
>> mobile phones
>> cellular phones
>>
>> 2. I need to remove punctuation . and , I've tried within my stopword
>> list, but I don't have the tags quite right. How should I enter into
>> my stop file?
>>
>> Thanks!
>>
>> Patrick
>>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
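
P.S. On the first question (a version of the list without the statistics), the post-processing Ying mentions can be quite small. Below is a minimal sketch in Python, assuming the output format shown in this thread: a first line holding only the total, then one n-gram per line with tokens separated by <> and the numbers after the last <>. The function name strip_statistics is my own choice for illustration and is not part of count.pl or statistic.pl.

```python
def strip_statistics(lines):
    """Turn lines such as 'mobile<>phones<>100 280 384'
    into plain n-grams such as 'mobile phones'."""
    ngrams = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the leading total-count line,
        # neither of which contains the '<>' separator.
        if "<>" not in line:
            continue
        # The field after the final '<>' holds the numbers; drop it
        # and keep the tokens that come before it.
        tokens = line.split("<>")[:-1]
        ngrams.append(" ".join(tokens))
    return ngrams


if __name__ == "__main__":
    # Sample lines copied from the 'cat out' listing in this thread.
    sample = [
        "7",
        "i<>have<>1 2 1",
        "news<>i<>1 1 2",
        "my<>friends<>1 1 1",
    ]
    for ngram in strip_statistics(sample):
        print(ngram)
```

To run it over a real file, read the lines of the count.pl output file (for example with open("out") in Python) and pass them to strip_statistics. Because everything before the last <> is kept, the same sketch should also handle trigram lines.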