Of course, I'd be remiss if I didn't mention that there's a Perl module,
Lingua::EN::Sentence, that does sentence boundary detection reasonably well...

http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm

marengo(18): cat test.txt
Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital. Mr.
Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
What a day - I'm glad it's over. My boss ... what a mean son-of-a-gun.

marengo(19): cat sentence.pl

        use strict;
        use warnings;
        use Lingua::EN::Sentence qw( get_sentences add_acronyms );

        add_acronyms('lt','gen');               ## adding support for 'Lt. Gen.'

        # read entire input into one string (join takes a string, not a regex)
        my @lines = <>;
        my $text = join(' ', @lines);
        $text =~ s/\s+/ /g;                     ## collapse newlines, tabs, extra spaces

        my $sentences = get_sentences($text);   ## Get the sentences.
        foreach my $sentence (@$sentences) {
                print "$sentence\n";
        }

marengo(21): perl sentence.pl test.txt
Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital.
Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
What a day - I'm glad it's over.
My boss ... what a mean son-of-a-gun.
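
For anyone curious why "Dr." and "St." don't trip things up, here is a
rough Python sketch of the general abbreviation-list idea. To be clear,
this is not the actual algorithm of Lingua::EN::Sentence or of Splitta;
the abbreviation set, the function name split_sentences, and the
"period, whitespace, capital letter" rule are all my own assumptions,
picked just to cover the test text above.

```python
import re

# Hypothetical abbreviation list, chosen only to cover the test text.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "st", "nov", "lt", "gen"}


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace and a
    capital letter, unless the word before a '.' is a known abbreviation."""
    text = " ".join(text.split())  # collapse newlines and runs of whitespace
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Word immediately before the punctuation mark.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue  # e.g. "Dr." or "St.", not a sentence boundary
        sentences.append(text[start:match.start() + 1])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # whatever is left is the last sentence
    return sentences


if __name__ == "__main__":
    import sys
    for sentence in split_sentences(sys.stdin.read()):
        print(sentence)
```

Saved under some name of your choosing and fed test.txt on standard
input, this should print the same four sentences as above. It will
happily break on anything outside its tiny abbreviation list, which is
exactly why a maintained module or a trained model is the better choice.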

Enjoy,
Ted

--- In nlpatumd@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Season's Greetings to one and all,
> 
> Splitta is an easy-to-use sentence boundary detector that was
> presented at NAACL 2009 in Boulder. It has a clean implementation in
> Python that actually works well out of the box, and I was able to run
> it using both a model based on Naive Bayes and SVM within about 5
> minutes of download. I know I always bark about making code available,
> and that's the first step. But the truth is that's not enough; we also
> have to make sure that people can (easily) install and use whatever
> code we make available.
> 
> I'd suggest setting a value of X as in "able to install and start a
> run that will lead to useful results in less than X minutes", where X
> should always be an integer smaller than 10. Note that I don't say we
> should expect the results in less than X minutes, just because some
> computations do take a bit of time. But of course we should work on
> making things run fast too, as well as available and easy to install
> and use. Do all that, well, then you've got something. :)
> 
> Anyway, if you need a nice sentence boundary detector, Splitta seems
> like a good choice. License is GPL, and source code is available along
> with some pre-built models and instructions on how to set up and use.
> 
> http://code.google.com/p/splitta/
> 
> Below is a test case that I made up. This is just on one line of a file.
> 
> marengo(115): cat test.txt
> Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital.  Mr.
> Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
> What a day - I'm glad it's over. My boss ... what a mean son-of-a-gun.
> 
> And here's the output...
> 
> marengo(116): python sbd.py -m model_nb/ test.txt
> loading model from [model_nb/]... done!
> reading [test.txt]
> featurizing... done!
> NB classifying... done!
> Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital.
> Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
> What a day - I'm glad it's over.
> My boss ... what a mean son-of-a-gun.
> 
> Enjoy,
> Ted
> -- 
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>

