Re: Problems training my own sentence splitter, with dictionary

Riccardo Tasso Wed, 28 Sep 2011 03:12:10 -0700

On 28/09/2011 11:46, Jörn Kottmann wrote:

On 9/28/11 11:34 AM, Riccardo Tasso wrote:
This isn't a bug, but why can I load a POSDictionary from an XML format which is undocumented?
We previously had a plain-text format, which was replaced by this XML format because of encoding issues. I think we will do some refactoring and redesign of the POS Tagger, and then again improve the POS Dictionary and the other dictionaries we currently have.
There are a couple of things which can be done better, e.g. when the dictionary only allows one tag we do not need to call the classifier to make a decision, the dictionary should also support token sequences, etc.
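
The single-tag shortcut mentioned above could look roughly like this. It is a minimal sketch, not OpenNLP code: the class `SingleTagShortcut`, the method `tag`, and the stand-in classifier are all hypothetical names introduced for illustration, and the dictionary is modelled as a plain `Map`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch, not OpenNLP code: every name here is illustrative.
final class SingleTagShortcut {

    // Counts classifier invocations so the effect of the shortcut is visible.
    static int classifierCalls = 0;

    /**
     * Tags one word. allowedTags maps a word to the tags the dictionary permits;
     * classifier picks one tag out of a list of candidates.
     */
    static String tag(String word,
                      Map<String, List<String>> allowedTags,
                      List<String> allTags,
                      BiFunction<String, List<String>, String> classifier) {
        List<String> candidates = allowedTags.get(word);
        if (candidates == null || candidates.isEmpty()) {
            // Word not in the dictionary: the classifier chooses among all tags.
            candidates = allTags;
        } else if (candidates.size() == 1) {
            // The dictionary allows exactly one tag: no classifier call needed.
            return candidates.get(0);
        }
        classifierCalls++;
        return classifier.apply(word, candidates);
    }

    public static void main(String[] args) {
        Map<String, List<String>> dict = new HashMap<>();
        dict.put("the", Arrays.asList("DT"));
        dict.put("run", Arrays.asList("NN", "VB"));
        List<String> allTags = Arrays.asList("DT", "NN", "VB");

        // Stand-in classifier that simply returns the first candidate.
        BiFunction<String, List<String>, String> first = (w, c) -> c.get(0);

        System.out.println(tag("the", dict, allTags, first)); // shortcut path
        System.out.println(tag("run", dict, allTags, first)); // classifier path
        System.out.println("classifier calls: " + classifierCalls);
    }
}
```

In this sketch the dictionary still filters the candidate list for ambiguous words; only the unambiguous case bypasses the classifier.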

So at the moment the only purpose of the POSDictionary is to filter out invalid tags?

You are welcome to submit a patch to document our pos dict xml format.

I'll look into the XML when I have enough time, and I would be happy to contribute. For now I'm just trying to extend it, because I really can't keep a full dictionary of possible tags in memory.

A first improvement, from my point of view, would be to make the fields of the class protected, so that extending it is cleaner.

I would prefer a String[] get(String word) and a void put(String word, String[] tags) method.
For safety and thread-safety reasons, all our resources used during tagging should be immutable. Well, that doesn't mean that we should not have an easy way to create these resources.
We have the get method, but it is called getTags.

Jörn
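
One way to reconcile the two wishes above (an easy `put` while building, immutability while tagging) is a builder that produces a read-only dictionary. The following is only a sketch under that assumption: `ImmutableTagDictionary` and its `Builder` are hypothetical names, not part of OpenNLP; only the method name `getTags` is taken from the discussion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the OpenNLP POSDictionary: names are illustrative.
final class ImmutableTagDictionary {

    // Never modified after construction, so concurrent reads are safe.
    private final Map<String, String[]> tagsByWord;

    private ImmutableTagDictionary(Map<String, String[]> source) {
        // Deep copy so later changes to the builder cannot leak in.
        Map<String, String[]> copy = new HashMap<>();
        for (Map.Entry<String, String[]> e : source.entrySet()) {
            copy.put(e.getKey(), e.getValue().clone());
        }
        this.tagsByWord = copy;
    }

    /** Returns a copy of the tags allowed for the word, or null if unknown. */
    String[] getTags(String word) {
        String[] tags = tagsByWord.get(word);
        // Defensive copy: callers cannot alter the stored array.
        return tags == null ? null : tags.clone();
    }

    /** Mutable helper used only while the dictionary is being assembled. */
    static final class Builder {
        private final Map<String, String[]> entries = new HashMap<>();

        Builder put(String word, String... tags) {
            entries.put(word, tags.clone());
            return this;
        }

        ImmutableTagDictionary build() {
            return new ImmutableTagDictionary(entries);
        }
    }
}
```

A dictionary would then be created with something like `new ImmutableTagDictionary.Builder().put("run", "NN", "VB").build()` and shared freely between tagging threads.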

What do you think about this structure for the next version of POSDictionary?

Map<List<String>, Set<String>>
or the lightweight version:
Map<String[], String[]>

My two cents,
    Riccardo
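
A note on the two map structures proposed above: Java arrays inherit `equals` and `hashCode` from `Object`, so a hash-based `Map<String[], String[]>` compares keys by identity rather than by content, while `List<String>` keys compare element by element. The sketch below illustrates the difference; `SequenceKeyDemo` and its methods are hypothetical names, and `HashMap` is assumed as the map implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch comparing the two proposed key types.
final class SequenceKeyDemo {

    // List keys: equals/hashCode are content-based, so a fresh list matches.
    static Set<String> lookupByList(Map<List<String>, Set<String>> dict, String... tokens) {
        return dict.get(Arrays.asList(tokens));
    }

    // Array keys: equals/hashCode are identity-based, so only the very same
    // array object that was used in put() will be found in a HashMap.
    static String[] lookupByArray(Map<String[], String[]> dict, String... tokens) {
        return dict.get(tokens);
    }

    public static void main(String[] args) {
        Map<List<String>, Set<String>> byList = new HashMap<>();
        byList.put(Arrays.asList("New", "York"), new HashSet<>(Arrays.asList("NNP")));

        Map<String[], String[]> byArray = new HashMap<>();
        byArray.put(new String[] {"New", "York"}, new String[] {"NNP"});

        System.out.println(lookupByList(byList, "New", "York"));   // finds the entry
        System.out.println(lookupByArray(byArray, "New", "York")); // null: different array object
    }
}
```

So the array-keyed variant would need a wrapper key or a different map implementation to support lookups by token sequence.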
