On Wed, Jun 15, 2011 at 2:07 PM, [email protected] < [email protected]> wrote:
> Hi,
>
> I have a few questions about abbreviation in sentence detector. I'd like to
> understand how it is working and improve it if possible.
>
> 1) How is the sentence detector using the abbreviation dictionary? All train
> methods in SentenceDetectorME take an abbreviation dictionary as argument,
> but is only saving it to the model. It is not using the dictionary to create
> the context generator, but it should, shouldn't it?

I thought it did, though I haven't looked at that bit of code for a while.

> 2) The command line trainer does not allow to pass an abbreviation
> dictionary. Maybe it should allow to pass a file name that contains the
> dictionary.

+1

> 3) Maybe we should include tools to extract the abbreviation dictionary from
> the train corpus. Optionally this could be executed during training too.

Doing that extraction actually requires a bit of work to figure out what is
an abbreviation. Something of interest here is PUNKT, an unsupervised method
for detecting sentences/abbreviations:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf

Implementation in NLTK:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html

-Jason

--
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
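
To make point 3 a bit more concrete, here is a rough sketch of one naive way to pull abbreviation candidates out of a training corpus. It is not PUNKT and not anything that exists in OpenNLP. It assumes the corpus has one sentence per line, so any token ending in a period that is not the last token of its line is unlikely to be a sentence terminator. The function name `abbreviation_candidates` and the `min_count` parameter are made up for illustration.

```python
from collections import Counter


def abbreviation_candidates(sentences, min_count=1):
    """Return tokens ending in '.' that occur before the end of a sentence.

    Assumes each item in `sentences` is exactly one sentence and that
    tokens are separated by whitespace (both are assumptions).
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the final token: its period is most likely the sentence end.
        for token in tokens[:-1]:
            # Require at least one letter so bare "." or "..." is ignored.
            if token.endswith(".") and any(ch.isalpha() for ch in token):
                counts[token] += 1
    # Keep only candidates seen at least `min_count` times.
    return {token for token, n in counts.items() if n >= min_count}


corpus = [
    "Dr. Smith arrived at 10 a.m. on Monday.",
    "He met Mr. Jones and Dr. Brown.",
]
print(sorted(abbreviation_candidates(corpus)))
```

This misses abbreviations that only ever appear at the end of a sentence, and it trusts the sentence boundaries in the corpus completely, which is part of why a method like PUNKT takes more work.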
