Hi Patrick,

NSP makes no real distinction between punctuation and words. If you do nothing about tokenization, whether via --token, --nontoken, or your own preprocessing, punctuation marks are treated just like words and will affect your results. --token and --nontoken essentially remove them from the data, which changes both the bigrams you find and the total sample size.
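
To make the effect concrete, here is a rough Python sketch (not NSP itself). It assumes a simple tokenization in which words and the marks . , ; : ? ! each count as one token, and it compares bigram counts with and without the punctuation tokens. The function name, regular expressions, and sample sentence are illustrative assumptions only.

```python
import re
from collections import Counter

def bigram_counts(text, keep_punctuation):
    """Count adjacent token pairs; return (counts, total number of bigrams)."""
    # Assumed tokenization: runs of word characters, plus (optionally)
    # each punctuation mark as its own token.
    pattern = r"\w+|[.,;:?!]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams, sum(bigrams.values())

text = "I like mobile phones. Mobile phones, however, are expensive."

with_punct, total_with = bigram_counts(text, keep_punctuation=True)
without_punct, total_without = bigram_counts(text, keep_punctuation=False)

print(total_with, total_without)
# Bigrams that exist only once the punctuation has been removed:
print(sorted(set(without_punct) - set(with_punct)))
```

With punctuation kept, pairs such as "phones" followed by "." are counted as bigrams. With it removed, new pairs such as "phones" followed by "Mobile" appear and the total number of bigrams drops, so any measure computed from these counts changes too.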
Hope this helps!
Ted

On Wed, Aug 17, 2011 at 4:30 PM, semiotica24 <semiotic...@yahoo.com> wrote:
> So in other words punctuation such as . and , are not used at all by the
> algorithms/measures and I should get the same results if I remove them
> before I run count.pl and stat.pl, correct?
>
> --- In ngram@yahoogroups.com, Ying Liu <liux0395@...> wrote:
> >
> > Hi Patrick,
> >
> > You need to pre-process the text (data cleaning) to remove
> > punctuations before run by count.pl. The same idea, you
> > need to post-process to get the format you want of the bigrams
> > or trigrams.
> >
> > Thanks,
> > Ying
> >
> > semiotica24 wrote:
> > >
> > > Sorry for the basic questions:
> > > 1. I need 2 versions of output for each list of bigrams and trigrams
> > > that I create using the various measures in count.pl and statistic.pl:
> > > one with the default statistics and one without. How do I format to
> > > exclude the statistics?
> > > e.g.:
> > > mobile<>phones<>100 280 384
> > > cellular<>phones<>96 214 384
> > >
> > > mobile phones
> > > cellular phones
> > >
> > > 2. I need to remove punctuation . and , I've tried within my stopword
> > > list, but I don't have the tags quite right. How should I enter into
> > > my stop file?
> > >
> > > Thanks!
> > >
> > > Patrick
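
For the first question in the quoted thread (a version of the output without statistics), the post-processing step Ying mentions could look roughly like the Python sketch below. It assumes each n-gram line has the form shown in the example, with tokens separated by <> and the counts after the last <>. The function name is hypothetical, and lines without <> (for instance a leading total-count line, if your output has one) are simply skipped.

```python
def strip_statistics(lines):
    """Turn 'mobile<>phones<>100 280 384' style lines into 'mobile phones'."""
    phrases = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            continue  # not an n-gram line (e.g. a total count or a blank line)
        # Everything after the last <> is assumed to be the statistics.
        *tokens, _statistics = line.split("<>")
        phrases.append(" ".join(tokens))
    return phrases

sample = [
    "mobile<>phones<>100 280 384",
    "cellular<>phones<>96 214 384",
]

for phrase in strip_statistics(sample):
    print(phrase)
```

The same function works for trigram lines, since it keeps every field before the last <> regardless of how many there are.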