Re: [ngram] No ngram over sentence

Christos Bräunle Fri, 06 Feb 2009 06:57:29 -0800

Hi

Identifying sentences is indeed non-trivial. I use TreeTagger to do the job.
But if the input has no sentence markers such as punctuation, it will fail
spectacularly ;-)


Cheers,

Christos




________________________________
From: Ted Pedersen <tpede...@d.umn.edu>
To: ngram@yahoogroups.com
Sent: Friday, February 6, 2009 3:05:19 PM
Subject: Re: [ngram] No ngram over sentence


Hi Jayaram,

Yes, in order to restrict ngrams to individual sentences you would
need to use the --newLine option, and make sure that you had one
sentence per line, one line per sentence. Identifying sentence
boundaries is a non-trivial problem, but we have some simple code
available as a part of our WordNet::SenseRelate::AllWords package that
could be a useful starting point for a sentence boundary detector.

http://cpansearch.perl.org/src/TPEDERSE/WordNet-SenseRelate-AllWords-0.13/utils/sentence_split.pl

This is not intended to "solve" the problem, but it will do a
reasonable approximation of sentence boundary detection.
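
To make the two steps above concrete, here is a minimal Python sketch. It is
not the logic of sentence_split.pl and not how count.pl is implemented; it
only imitates the effect of putting one sentence per line and then refusing
to build bigrams across a line break. The function names, the splitting rule
(break after ".", "!" or "?" followed by whitespace) and the \w+ tokenization
are all assumptions made for illustration. A rule this simple will split
wrongly on abbreviations such as "Dr." and will find no boundaries at all in
text without punctuation.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def bigrams_per_line(lines):
    """Count bigrams inside each line only, so none spans a line break."""
    counts = Counter()
    for line in lines:
        # Assumed tokenization: runs of word characters, punctuation dropped.
        tokens = re.findall(r"\w+", line)
        for left, right in zip(tokens, tokens[1:]):
            counts[f"{left}<>{right}"] += 1
    return counts


# Example sentences from this thread, with periods added so the
# naive splitter has something to split on.
text = "Vincent loves Honey Bunny. A women snorts."
lines = split_sentences(text)      # one sentence per list entry
counts = bigrams_per_line(lines)

for bigram, freq in counts.items():
    print(bigram, freq)
```

With the sentences kept on separate lines, the pair Bunny<>A is never formed,
which is the behaviour the --newLine option is described as giving.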

I hope this helps!

Cordially,
Ted

On Fri, Feb 6, 2009 at 1:01 AM, jayaram raji <jayaram_raji2002@yahoo.com> wrote:
> Dear Ted,
>
> In order to achieve what Christos has asked, is it necessary to arrange the
> data so that there is only one sentence per line? If it is
> running text, how does it identify the end of a sentence?
>
> Thanks
> Jayaram
>
> --- On Thu, 2/5/09, Ted Pedersen <duluth...@gmail.com> wrote:
>
> From: Ted Pedersen <duluth...@gmail.com>
> Subject: Re: [ngram] No ngram over sentence
> To: ng...@yahoogroups.com
> Date: Thursday, February 5, 2009, 9:41 PM
>
> Hi Christos,
>
> In order to count as you describe, you just need to use the --newLine
> option.
>
> If you run
>
> count.pl --help
>
> you can see all the command line options. Among them is ...
>
> --newLine Prevents n-grams from spanning across the
> new-line character.
>
> which should do exactly as you wish!
>
> Happy Counting, :)
> Ted
>
> On Thu, Feb 5, 2009 at 8:29 AM, christos.braeunle
> <christos.braeunle@yahoo.com> wrote:
>> Hello
>>
>> I started using the NSP package and I am really impressed by its power.
>> First of all, thanks for that great tool!
>>
>> Now I have run into a problem when building ngrams. I want to tell count.pl
>> not to create ngrams across the end of a sentence.
>>
>> For example: I have two sentences.
>>
>> Vincent loves Honey Bunny
>> A women snorts
>>
>> Now when building bigrams I would like to get:
>>
>> Vincent<>loves
>> loves<>Honey
>> Honey<>Bunny
>> A<>women
>> women<>snorts
>>
>> So I want the bigram Bunny<>A not to be created (and not to be counted).
>>
>> Is there a way to achieve this?
>>
>> I hope my question is understandable and has not been asked before.
>>
>> If I missed some relevant documentation, I would be glad to be pointed
>> to it.
>>
>> Thanks a lot
>>
>> Christos Bräunle
>>
>>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
> 

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse
