It's not very hard to collect the abbreviations. It may be less work than coding what's in the paper.
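
For what it's worth, here is a rough sketch of the abbreviation-list approach in Python, mostly to show how little code it takes. The names `ABBREVIATIONS` and `split_sentences` and the tiny list itself are made up for illustration; none of this comes from the paper or from any existing code. It assumes English-style text where a sentence starts with an ASCII capital.

```python
import re

# Hypothetical, deliberately tiny exception list. A real list would be
# collected per language from reference text.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "st.", "vs.", "etc.", "e.g.", "i.e."}

# Candidate boundary: '.', '!' or '?' followed by whitespace and a capital.
_CANDIDATE = re.compile(r"[.!?](?=\s+[A-Z])")


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in _CANDIDATE.finditer(text):
        end = match.end()
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:end].split()[-1].lower()
        if token in abbreviations:
            # Terminal full stop belongs to an abbreviation: not a boundary.
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Dr. Smith met Mr. Jones at 5 p.m. on Friday. They talked about NLP."
    for sentence in split_sentences(sample):
        print(sentence)
```

The obvious weakness is that an abbreviation which really does end a sentence ("... and so on, etc. The next one ...") is never split, which is the kind of case the statistical approach is meant to handle.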

On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <[email protected]> wrote:

> The exceptions are a list of abbreviations that have a terminal full stop,
> but which are customarily terminated by a capitalized word which is not the
> start of a new sentence.
>
> It looks to me like machine learning has come a long way in this regard.
> This is the best paper on the subject that I have seen in a quick search.
>
> Unsupervised Multilingual Sentence Boundary Detection
> <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
> Kiss and Strunk.
>
> It doesn't require any lexical resources and can improve performance on the
> fly by adapting to the language that it is working against.
>
> The fundamental insight is that abbreviations are situations where a stem is
> very commonly followed by a full stop and a sentence start marker is
> something that is very commonly preceded by a full stop or other sentence
> marker. From these and a few other intuitions, they build a system that is
> pretty darned accurate. One major component of its accuracy is due to the
> ability to adapt on the fly to the corpus in use.
>
> A deficiency in our use case would be the requirement for training text, but
> that could be solved with a few moderate sized resources that are the result
> of training on reference texts for different languages.
>
> On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]> wrote:
>
>> I've found abbrevs, various identifiers etc are sort of a typical case
>> where these things fall flat. I'll see how it performs viz writing
>> something from scratch and see what I can come up with.
>>
>> > Right, although just slightly ironic that we are using a rule-based
>> > system for a machine learning project.
>>
>> Heh, indeed, but it seems entirely appropriate in this case. Of
>> course, now I need to go read about statistical approaches to sentence
>> boundary detection.
>>
>
> --
> Ted Dunning, CTO
> DeepDyve
