Of course I'd be remiss if I didn't mention there's a Perl module that does sentence boundary detection reasonably well...
http://search.cpan.org/~shlomoy/Lingua-EN-Sentence-0.25/lib/Lingua/EN/Sentence.pm

marengo(18): cat test.txt
Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital. Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember. What a day - I'm glad it's over. My boss ... what a mean son-of-a-gun.

marengo(19): cat sentence.pl
use Lingua::EN::Sentence qw( get_sentences add_acronyms );

add_acronyms('lt','gen');    ## adding support for 'Lt. Gen.'

# read entire file into array, then join the lines with a single space
my @lines = <>;
my $text = join(' ', @lines);
$text =~ s/[\n\t\r]/ /g;

my $sentences = get_sentences($text);    ## Get the sentences.
foreach my $sentence (@$sentences) {
    print "$sentence\n";
}

marengo(21): perl sentence.pl test.txt
Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital.
Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
What a day - I'm glad it's over.
My boss ... what a mean son-of-a-gun.

Enjoy,
Ted

--- In nlpatumd@yahoogroups.com, Ted Pedersen <duluth...@...> wrote:
>
> Season's Greetings to one and all,
>
> Splitta is an easy to use sentence boundary detector that was
> presented at NAACL 2009 in Boulder. It has a clean implementation in
> python that actually works well out of the box, and I was able to run
> it using both a model based on Naive Bayes and SVM within about 5
> minutes of download. I know I always bark about making code available,
> and that's the first step. But the truth is that's not enough, we also
> have to make sure that people can (easily) install and use whatever
> code we make available.
>
> I'd suggest setting a value of X as in "able to install and start a
> run that will lead to useful results in less than X minutes", where X
> should always be an integer smaller than 10. Note that I don't say we
> should expect the results in less than X minutes, just because some
> computations do take a bit of time. But of course we should work on
> making things run fast too, as well as available and easy to install
> and use. Do all that, well, then you've got something. :)
>
> Anyway, if you need a nice sentence boundary detector, Splitta seems
> like a good choice. License is GPL, and source code is available along
> with some pre-built models and instructions on how to set up and use.
>
> http://code.google.com/p/splitta/
>
> Below is a test case that I made up. This is just on one line of a file.
>
> marengo(115): cat test.txt
> Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital. Mr.
> Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
> What a day - I'm glad it's over. My boss ... what a mean son-of-a-gun.
>
> And here's the output...
>
> marengo(116): python sbd.py -m model_nb/ test.txt
> loading model from [model_nb/]... done!
> reading [test.txt]
> featurizing... done!
> NB classifying... done!
> Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital.
> Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember.
> What a day - I'm glad it's over.
> My boss ... what a mean son-of-a-gun.
>
> Enjoy,
> Ted
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>
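
P.S. For anyone wondering why abbreviations like "Dr." and "St." are the hard part of the test case, here is a rough Python sketch of abbreviation-aware splitting. It is not how Lingua::EN::Sentence or Splitta work internally. The ABBREVIATIONS set, the split_sentences function, and the rule "a terminator followed by whitespace and a capital letter is a candidate boundary" are all assumptions made up for illustration.

```python
import re

# Hypothetical abbreviation list, for illustration only (not taken from any library).
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "st", "nov", "lt", "gen"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split at '.', '!' or '?' followed by whitespace and an uppercase letter,
    unless the word ending in '.' is a known abbreviation."""
    text = " ".join(text.split())  # collapse newlines and tabs into single spaces
    sentences = []
    start = 0
    # Candidate boundary: terminator, whitespace, then an uppercase letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.start() + 1  # index just past the terminator
        # Last whitespace-delimited word before the boundary, minus trailing periods.
        word = text[start:end].split()[-1].rstrip(".").lower()
        if text[match.start()] == "." and word in abbreviations:
            continue  # e.g. "Dr. Smith": the period belongs to the abbreviation
        sentences.append(text[start:end])
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = (
        "Dr. Smith decided to refer Jimmy St. John to St. Jude's Hospital. "
        "Mr. Jones was born on Nov. 19, 1933 or possibly 1934, he can't remember. "
        "What a day - I'm glad it's over. My boss ... what a mean son-of-a-gun."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

On the test text this prints the same four sentences, one per line. A fixed list like this breaks as soon as an abbreviation is missing or a sentence really does end in "St.", which is presumably why a maintained module or a trained model is the better choice for real data.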