Re: [ngram] Re: formatting + punctuation removal

Ted Pedersen Wed, 17 Aug 2011 15:12:29 -0700

Hi Patrick,

NSP makes no real distinction between punctuation and words. If you do
nothing about tokenization, either via --token or --nontoken or by
preprocessing, the punctuation marks will be treated just like words and
will affect your results. Using --token or --nontoken essentially removes
them from the data, which changes both the bigrams you find and the total
sample size.
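
To make that concrete, here is a rough sketch in Python (not NSP itself)
that counts bigrams from the same text twice, once keeping punctuation as
tokens and once dropping it. The two regular expressions are assumptions
chosen for illustration, and bigram_counts is a made-up helper name, not
anything that ships with NSP.

```python
import re
from collections import Counter

def bigram_counts(text, keep_punctuation=True):
    """Count adjacent token pairs in text (illustrative helper, not part of NSP)."""
    # Assumed token definitions: runs of word characters, plus single
    # punctuation marks when keep_punctuation is True.
    pattern = r"\w+|[.,;:?!]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text.lower())
    # Pair each token with its successor to form bigrams.
    return Counter(zip(tokens, tokens[1:]))

text = "Mobile phones, cellular phones. Mobile phones!"

with_punct = bigram_counts(text, keep_punctuation=True)
without_punct = bigram_counts(text, keep_punctuation=False)

# The total number of bigrams (the sample size) differs between the two runs.
total_with = sum(with_punct.values())
total_without = sum(without_punct.values())

# Dropping punctuation also creates bigrams that did not exist before,
# because words on either side of a removed mark become adjacent.
new_bigrams = set(without_punct) - set(with_punct)
```

With punctuation kept, the comma and period sit between "phones" and the
following word, so pairs like ("phones", "cellular") never occur. Once the
marks are gone those words become neighbours, and the total count shrinks
at the same time, so any measure computed from those counts can change.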


Hope this helps!
Ted
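
P.S. On the formatting question further down the thread, the
post-processing Ying mentions can be quite small. Below is a minimal
sketch in Python, assuming the output lines look like the two examples in
your message (tokens separated by <>, with the numbers in the last field).
The function name strip_statistics is made up for illustration, and lines
that contain no <> at all are simply skipped.

```python
def strip_statistics(lines):
    """Return plain n-grams from lines shaped like 'mobile<>phones<>100 280 384'.

    Illustrative helper: assumes '<>' separates the fields and that the
    final field holds the counts or scores.
    """
    plain = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and any line without the '<>' separator.
        if "<>" not in line:
            continue
        fields = line.split("<>")
        # Drop the last field (the numbers) and join the tokens with spaces.
        plain.append(" ".join(fields[:-1]))
    return plain

sample = [
    "mobile<>phones<>100 280 384",
    "cellular<>phones<>96 214 384",
]
result = strip_statistics(sample)
```

The same function should work for trigrams, since it keeps every field
except the last one rather than assuming exactly two tokens.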

On Wed, Aug 17, 2011 at 4:30 PM, semiotica24 <semiotic...@yahoo.com> wrote:

> So, in other words, punctuation such as . and , is not used at all by the
> algorithms/measures, and I should get the same results if I remove it
> before I run count.pl and statistic.pl, correct?
>
>
> --- In ngram@yahoogroups.com, Ying Liu <liux0395@...> wrote:
> >
> > Hi Patrick,
> >
> > You need to pre-process the text (data cleaning) to remove
> > punctuation before running count.pl. In the same way, you
> > need to post-process the output to get the bigrams or trigrams
> > in the format you want.
> >
> > Thanks,
> > Ying
> >
> > semiotica24 wrote:
> > >
> > > Sorry for the basic questions:
> > > 1. I need 2 versions of output for each list of bigrams and trigrams
> > > that I create using the various measures in count.pl and statistic.pl:
> > > one with the default statistics and one without. How do I format to
> > > exclude the statistics?
> > > e.g.:
> > > mobile<>phones<>100 280 384
> > > cellular<>phones<>96 214 384
> > >
> > > mobile phones
> > > cellular phones
> > >
> > > 2. I need to remove the punctuation marks . and , as well. I've tried
> > > adding them to my stopword list, but I don't have the tags quite right.
> > > How should I enter them in my stop file?
> > >
> > > Thanks!
> > >
> > > Patrick
