Hi,

Identifying sentences is indeed non-trivial. I use TreeTagger to do the job. But if the input has no sentence markers such as punctuation, it will fail spectacularly ;-)

Cheers,
Christos

________________________________
From: Ted Pedersen <tpede...@d.umn.edu>
To: ngram@yahoogroups.com
Sent: Friday, February 6, 2009 3:05:19 PM
Subject: Re: [ngram] No ngram over sentence

Hi Jayaram,

Yes, in order to restrict ngrams to individual sentences you would need to use the --newLine option, and make sure that you have one sentence per line, one line per sentence.

Identifying sentence boundaries is a non-trivial problem, but we have some simple code available as a part of our WordNet::SenseRelate::AllWords package that could be a useful starting point for a sentence boundary detector.

http://cpansearch.perl.org/src/TPEDERSE/WordNet-SenseRelate-AllWords-0.13/utils/sentence_split.pl

This is not intended to "solve" the problem, but it will do a reasonable approximation of sentence boundary detection.

I hope this helps!
Cordially,
Ted

On Fri, Feb 6, 2009 at 1:01 AM, jayaram raji <jayaram_raji2002@yahoo.com> wrote:
> Dear Ted,
>
> In order to achieve what Christos has asked, is it necessary to arrange the
> data in such a way that there is only one sentence per line? If it is
> running text, how does it identify the end of the sentence?
>
> Thanks
> Jayaram
>
> --- On Thu, 2/5/09, Ted Pedersen <duluth...@gmail.com> wrote:
>
> From: Ted Pedersen <duluth...@gmail.com>
> Subject: Re: [ngram] No ngram over sentence
> To: ng...@yahoogroups.com
> Date: Thursday, February 5, 2009, 9:41 PM
>
> Hi Christos,
>
> In order to count as you describe, you just need to use the --newLine
> option.
>
> If you run
>
> count.pl --help
>
> you can see all the command line options. Among them is ...
>
> --newLine Prevents n-grams from spanning across the
> new-line character.
>
> which should do exactly as you wish!
>
> Happy Counting, :)
> Ted
>
> On Thu, Feb 5, 2009 at 8:29 AM, christos.braeunle
> <christos.braeunle@yahoo.com> wrote:
>> Hello
>>
>> I started using the NSP package and I am really impressed by its power.
>> First of all, thanks for that great tool!
>>
>> Now I have run into a problem when building ngrams. I want to tell count.pl
>> not to create ngrams across the end of a sentence.
>>
>> For example, I have two sentences:
>>
>> Vincent loves Honey Bunny
>> A women snorts
>>
>> Now when building bigrams I would like to get:
>>
>> Vincent<>loves
>> loves<>Honey
>> Honey<>Bunny
>> A<>women
>> women<>snorts
>>
>> so I want the bigram Bunny<>A not to be created (and not counted).
>>
>> Is there a way to achieve this?
>>
>> I hope my question is understandable and has not been asked before.
>>
>> If I missed some relevant documentation, I would be glad to be pointed
>> to it.
>>
>> Thanks a lot
>>
>> Christos Bräunle
>>
>>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
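
For readers who want to see the idea outside of NSP, here is a minimal Python sketch of what the thread describes: put one sentence per line, then count bigrams within each line only, so that no bigram spans a line break. This is not how count.pl is implemented. The function names split_sentences and count_bigrams are made up for illustration, tokens are split on whitespace only (an assumption that differs from NSP's own tokenization), and the splitter is a deliberately naive regular expression that, as noted above, fails on text without punctuation.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (illustrative only): break after '.', '!' or '?'
    when followed by whitespace. Abbreviations, missing punctuation,
    and similar cases are not handled."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_bigrams(lines):
    """Count bigrams inside each line only, so that no bigram crosses
    a line boundary (the behaviour --newLine is described as giving)."""
    counts = Counter()
    for line in lines:
        tokens = line.split()  # whitespace tokenization: an assumption
        for left, right in zip(tokens, tokens[1:]):
            counts[f"{left}<>{right}"] += 1
    return counts


# The example from the original question, one sentence per line.
lines = ["Vincent loves Honey Bunny", "A women snorts"]
bigrams = count_bigrams(lines)
for bigram, n in bigrams.items():
    print(bigram, n)
```

With the two example lines, Bunny<>A never appears because the pairing restarts on every line. For running text, split_sentences can be applied first to get one sentence per line, but only as a rough approximation.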