[ngram] Re: negate stoplist

ftwaroch Mon, 23 Feb 2009 08:03:07 -0800

Dear all,

  Thanks, everyone, for the fast help and the very thorough input. In
the end I kept it simple and wrote a Perl script to extract the
XTOPOX ngrams myself.
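
For readers who want to do something similar, here is a minimal sketch
of such a filter. It is written in Python rather than Perl and is not
the script mentioned above. It assumes count.pl-style output, where the
first line holds the total and each ngram line has the form
token<>token<>counts. The function name filter_ngrams and the sample
lines are made up for illustration.

```python
def filter_ngrams(lines, word="XTOPOX"):
    """Keep count.pl-style ngram lines in which `word` is one of the tokens."""
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        # Ngram lines contain '<>' separators; the leading total line does not.
        if "<>" not in line:
            continue
        # Everything before the last '<>' is a token; the rest is the counts.
        *tokens, _counts = line.split("<>")
        if word in tokens:
            kept.append(line)
    return kept


# Hypothetical sample, not real count.pl output from this thread.
sample = [
    "12",
    "to<>XTOPOX<>2 3 4",
    "went<>to<>1 1 3",
    "XTOPOX<>is<>1 4 2",
]

for line in filter_ngrams(sample):
    print(line)
```

Unlike a plain substring match, this keeps a line only when XTOPOX is a
whole token of the ngram, so a token that merely contains the string
would not be selected.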


thanks again,

Florian


--- In ngram@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Greetings all,
> 
> On Mon, Feb 23, 2009 at 8:11 AM, robsteranium <robsteran...@...> wrote:
> > I would run count.pl without a stop list (or perhaps just a normal
> > stop list) and then process the output in another program (e.g. sed).
> > This one liner would do the trick:
> >
> > sed -ne '/XTOPOX/p' count-output.cnt
> >
> 
> I think this is a great suggestion. One could also use grep or egrep
> from the command line, but sed is quite powerful and useful to
> know about.
> 
> I don't know if I recommend the following, but it does make some
> useful points about using regular expressions in NSP, so I thought I
> would pass it along.
> 
> This is my input:
> 
> marimba(51): cat testin
> I went to XTOPOX and then left XTOPOX.
> I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
> Where is XTOPOX XTOPOX is in Namibia
> 
> I would like to find all the bigrams that include XTOPOX. Another
> option would be to define the tokens such that they are two words long
> and must include XTOPOX as either the first or second word. Note that
> we will need to treat tokens of this form as unigrams in order to get
> the desired result.
> 
> I define my tokens as follows:
> 
> marimba(52): cat token.txt
> /\b\w+\b\s+\bXTOPOX\b/
> /\bXTOPOX\b\s+\b\w+\b/
> 
> And then run count.pl using that as the tokenization file....
> 
> marimba(54): count.pl --ngram 1 testout testin --token token.txt
> 
> My output is as follows...
> 
> marimba(55): cat testout
> 6
> like XTOPOX<>2
> XTOPOX is<>1
> is XTOPOX<>1
> left XTOPOX<>1
> to XTOPOX<>1
> 
> Note that these counts tell us how many times each two-word string
> has occurred, and that these two-word strings are indivisible
> (these are the tokens, the atomic units of our counting...)
> 
> If you were to run this without --ngram 1, you would get --ngram
> 2 by default, and the results would look rather odd...
> 
> marimba(111): count.pl --ngram 2 test2out testin --token token.txt
> 
> marimba(112): more test2out
> 5
> is XTOPOX<>XTOPOX is<>1 1 1
> like XTOPOX<>like XTOPOX<>1 2 2
> left XTOPOX<>like XTOPOX<>1 1 2
> to XTOPOX<>left XTOPOX<>1 1 1
> like XTOPOX<>is XTOPOX<>1 2 1
> 
> However, it makes sense in that it's giving bigram counts over these
> two-word tokens (each pair of tokens is separated by the <>).
> 
> Again, I don't know that I recommend this as a solution, but I think
> it makes some useful points about using --token to represent the data
> in a somewhat unexpected way.
> 
> I hope this all helps!
> 
> Cordially,
> Ted
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
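
The two-word tokenization quoted above can be emulated outside NSP to
see where the counts come from. The following is only a sketch in
Python, not NSP's own code: it assumes the two token regexes are tried
in the order listed, that the leftmost match wins, that characters
matching neither regex are skipped, and that tokens do not span lines.
The names TOKEN_RE, TEXT and tokenize are made up for illustration.

```python
import re
from collections import Counter

# The two definitions from token.txt, combined in the order given.
TOKEN_RE = re.compile(r"\b\w+\b\s+\bXTOPOX\b|\bXTOPOX\b\s+\b\w+\b")

# The contents of testin from the quoted message.
TEXT = """I went to XTOPOX and then left XTOPOX.
I like XTOPOX pretty well, but I don't think you like XTOPOX at all.
Where is XTOPOX XTOPOX is in Namibia
"""


def tokenize(text):
    """Return the two-word tokens in order, scanning one line at a time."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(m.group(0) for m in TOKEN_RE.finditer(line))
    return tokens


tokens = tokenize(TEXT)

# With --ngram 1, each two-word token is counted as a single unit.
unigrams = Counter(tokens)
print(sum(unigrams.values()))
for token, count in unigrams.most_common():
    print(f"{token}<>{count}")

# With --ngram 2, adjacent tokens are paired, which gives the odd-looking
# "bigrams of two-word tokens".
bigrams = Counter(zip(tokens, tokens[1:]))
print(sum(bigrams.values()))
for (first, second), count in bigrams.items():
    print(f"{first}<>{second}<>{count}")
```

Under those assumptions the sketch finds 6 tokens and 5 adjacent token
pairs, in line with the totals in testout and test2out. It prints only
the joint frequency of each pair, not the two marginal counts that
count.pl also reports.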



