Re: [ngram] formatting + punctuation removal

Ted Pedersen Wed, 17 Aug 2011 14:43:08 -0700

Hi Patrick,

One other thing to keep in mind is that the stoplist is designed to work on
bigrams: it is really intended to remove whole bigrams after the text has
been chopped up into bigrams, and not so much to remove individual words
beforehand.
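
A rough Python sketch of the idea (this is not NSP code, just an illustration). It assumes the two stop modes described in the README section linked below: AND drops a bigram only when every word in it is a stop word, and OR drops it when any word is. The names STOP_WORDS and filter_bigrams and the sample sentence are made up for the example.

```python
# Illustrative sketch only; this is not how Text-NSP is implemented.
# STOP_WORDS and filter_bigrams are hypothetical names.
STOP_WORDS = {"the", "of", "a"}

def filter_bigrams(tokens, mode="AND"):
    """Form the bigrams first, then drop the stopped ones."""
    bigrams = list(zip(tokens, tokens[1:]))
    if mode == "AND":
        # drop a bigram only if every word in it is a stop word
        def stopped(bigram):
            return all(word in STOP_WORDS for word in bigram)
    else:
        # OR mode: drop a bigram if any word in it is a stop word
        def stopped(bigram):
            return any(word in STOP_WORDS for word in bigram)
    return [bigram for bigram in bigrams if not stopped(bigram)]

tokens = "the price of the mobile phones".split()
print(filter_bigrams(tokens, "AND"))
print(filter_bigrams(tokens, "OR"))
```

In AND mode only "of the" disappears, so a stop word such as "the" still shows up inside other bigrams. That is why a stoplist is not a reliable way to get rid of individual words or punctuation.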


More on stoplists here...
http://search.cpan.org/~tpederse/Text-NSP-1.23/doc/README.pod#5.6._"Stopping"_the_Ngrams:

Also, in addition to my --token suggestion, you could consider using
--nontoken. The --token option excludes anything not matched by your token
regex, whereas --nontoken excludes anything that is matched by its regex (so
they are two sides of the same coin, I suppose).

http://search.cpan.org/~tpederse/Text-NSP-1.23/doc/README.pod#5.4_Removing_character_strings_via_--nontoken_option:
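
To illustrate the difference, here is a small Python sketch that mimics the two approaches with ordinary regular expressions. It does not call NSP. The regexes, the sample text, and the whitespace split in the second half are assumptions chosen for the example, not NSP's actual defaults.

```python
import re

text = "Mobile phones, cellular phones. Cheap!"

# --token style: keep only what the token regex matches
# (assumed token definition: runs of word characters)
token_re = re.compile(r"\w+")
kept_by_token = token_re.findall(text)

# --nontoken style: first delete whatever the nontoken regex matches,
# then (a simplification for this sketch) split the rest on whitespace
nontoken_re = re.compile(r"[.,]")
kept_by_nontoken = nontoken_re.sub("", text).split()

print(kept_by_token)
print(kept_by_nontoken)
```

The first approach drops all punctuation because none of it matches the token regex. The second removes only the period and comma that were listed, so the "!" stays attached to "Cheap".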

Hope this helps...

Ted

On Wed, Aug 17, 2011 at 1:30 PM, semiotica24 <semiotic...@yahoo.com> wrote:

> Sorry for the basic questions:
> 1. I need 2 versions of output for each list of bigrams and trigrams that I
> create using the various measures in count.pl and statistic.pl: one with
> the default statistics and one without. How do I format to exclude the
> statistics?
> e.g.:
> mobile<>phones<>100 280 384
> cellular<>phones<>96 214 384
>
> mobile phones
> cellular phones
>
> 2. I need to remove the punctuation marks . and , from the output. I've
> tried doing this within my stopword list, but I don't have the tags quite
> right. How should I enter them into my stop file?
>
> Thanks!
>
> Patrick
>
>  
>
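
For the first question quoted above (a version of the output without the statistics), one option is to post-process the output file. The following Python sketch assumes every n-gram line has the shape shown in the example, with words separated by <> and the numbers after the final <>. The function name strip_statistics is made up for illustration, and any line without <> (for example a line holding only a total count) is skipped.

```python
def strip_statistics(lines):
    """Turn 'mobile<>phones<>100 280 384' into 'mobile phones'."""
    ngrams = []
    for line in lines:
        line = line.strip()
        if "<>" not in line:
            # not an n-gram line (assumed: e.g. a line with only a total)
            continue
        # everything after the final '<>' is the numeric part, so drop it
        words = line.split("<>")[:-1]
        ngrams.append(" ".join(words))
    return ngrams

sample = [
    "mobile<>phones<>100 280 384",
    "cellular<>phones<>96 214 384",
]
print("\n".join(strip_statistics(sample)))
```

Because only the last <>-separated field is dropped, the same function should work for trigram lines as long as they follow the same layout.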



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
