On Wed, Jun 15, 2011 at 2:07 PM, [email protected] <
[email protected]> wrote:

> Hi,
>
> I have a few questions about abbreviations in the sentence detector. I'd like
> to understand how it works and improve it if possible.
>
> 1) How does the sentence detector use the abbreviation dictionary? All train
> methods in SentenceDetectorME take an abbreviation dictionary as an argument,
> but only save it to the model. The dictionary is not used to create the
> context generator, but it should be, shouldn't it?
>
>
I thought it did, though I haven't looked at that bit of code for a while.


> 2) The command line trainer does not accept an abbreviation dictionary.
> Maybe it should accept the name of a file that contains the dictionary.
>
>
+1


> 3) Maybe we should include tools to extract the abbreviation dictionary from
> the training corpus. Optionally, this could be run during training too.
>
>
Doing that extraction actually requires a bit of work to figure out what is
an abbreviation.  Something of interest here is PUNKT, an unsupervised
method for detecting sentences/abbreviations:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.5017&rep=rep1&type=pdf

Implementation in NLTK:
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
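
To make the difficulty concrete, here is a rough sketch of the most naive
extraction one could try. It is neither PUNKT nor anything in the OpenNLP API.
It assumes training data with one sentence per line and treats any token that
ends in a period but is not the last token of its sentence as a candidate
abbreviation. The function name and the min_count threshold are made up for
illustration.

```python
from collections import Counter


def candidate_abbreviations(sentences, min_count=2):
    """Collect tokens ending in '.' that occur before the end of a sentence.

    Assumption: each input string is exactly one sentence, so a period on
    any token other than the last one cannot be a sentence boundary.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the last token: its period most likely ends the sentence.
        for token in tokens[:-1]:
            if token.endswith(".") and any(ch.isalpha() for ch in token):
                counts[token.lower()] += 1
    # Keep only candidates seen at least min_count times to reduce noise.
    return {token for token, n in counts.items() if n >= min_count}


corpus = [
    "Dr. Smith arrived at 9 a.m. on Monday.",
    "She met Dr. Jones and Mr. Lee.",
    "Mr. Lee left early.",
]
abbrevs = candidate_abbreviations(corpus)
```

A heuristic like this misses abbreviations that only occur at the end of a
sentence and depends on the corpus being segmented correctly, which is part
of why the extraction takes real work.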

-Jason

-- 
Jason Baldridge
Assistant Professor, Department of Linguistics
The University of Texas at Austin
http://www.jasonbaldridge.com
http://twitter.com/jasonbaldridge
