One of their points was that the set of abbreviations is productive and
using a static set is sub-optimal.

Whether we need those last few percent is a matter of discussion, I would
say.
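
For anyone skimming, here is a rough sketch of the first half of the insight quoted below: a stem that is almost always followed by a full stop is probably an abbreviation. This is not the scoring used in the paper; it is a plain frequency ratio, and the function name and both thresholds are made up for illustration.

```python
from collections import Counter


def find_abbreviation_candidates(text, min_count=2, min_ratio=0.9):
    """Return lowercased stems that are almost always followed by a full stop.

    Simplified illustration only: a raw ratio with hypothetical thresholds,
    not the statistical test from the Kiss and Strunk paper.
    """
    with_stop = Counter()  # times a stem was seen with a trailing full stop
    total = Counter()      # times a stem was seen at all

    # Split on whitespace so a trailing full stop stays attached to its token.
    for raw in text.split():
        token = raw.strip("\"'()[],;:")
        stem = token.rstrip(".").lower()
        # Skip empty tokens, numbers, and anything that is not letters
        # (internal full stops as in "e.g" are allowed).
        if not stem or not stem.replace(".", "").isalpha():
            continue
        total[stem] += 1
        if token.endswith("."):
            with_stop[stem] += 1

    return {
        stem
        for stem, n in total.items()
        if n >= min_count and with_stop[stem] / n >= min_ratio
    }


if __name__ == "__main__":
    sample = ("Dr. Smith met Dr. Jones. Mr. Brown saw Mr. Green. "
              "The dog ran. A dog sat.")
    print(sorted(find_abbreviation_candidates(sample)))
```

Because the counts come from whatever text is fed in, the candidate set grows with the corpus, which is the point about a static list being sub-optimal.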

On Fri, Jan 15, 2010 at 6:41 PM, Benson Margulies <[email protected]> wrote:

> It's not very hard to collect the abbreviations. It may be less work
> than coding what's in the paper.
>
> On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <[email protected]>
> wrote:
> > The exceptions are a list of abbreviations that have a terminal full stop,
> > but which are customarily terminated by a capitalized word which is not the
> > start of a new sentence.
> >
> > It looks to me like machine learning has come a long way in this regard.
> > This is the best paper on the subject that I have seen in a quick search.
> >
> > Unsupervised Multilingual Sentence Boundary Detection
> > <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
> > Kiss and Strunk.
> >
> > It doesn't require any lexical resources and can improve performance on the
> > fly by adapting to the language that it is working against.
> >
> > The fundamental insight is that abbreviations are situations where a stem is
> > very commonly followed by a full stop and a sentence start marker is
> > something that is very commonly preceded by a full stop or other sentence
> > marker.  From these and a few other intuitions, they build a system that is
> > pretty darned accurate.  One major component of its accuracy is due to the
> > ability to adapt on the fly to the corpus in use.
> >
> > A deficiency in our use case would be the requirement for training text, but
> > that could be solved with a few moderate sized resources that are the result
> > of training on reference texts for different languages.
> >
> > On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]>
> > wrote:
> >
> >> I've found abbrevs, various identifiers etc are sort of a typical case
> >> where these things fall flat. I'll see how it performs vs. writing
> >> something from scratch and see what I can come up with.
> >>
> >> > Right, although just slightly ironic that we are using a rule-based
> >> > system for a machine learning project.
> >>
> >> Heh, indeed, but it seems entirely appropriate in this case. Of
> >> course, now I need to go read about statistical approaches to sentence
> >> boundary detection.
> >>
> >
> >
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
> >
>



-- 
Ted Dunning, CTO
DeepDyve
