On 28/09/2011 11:46, Jörn Kottmann wrote:
On 9/28/11 11:34 AM, Riccardo Tasso wrote:
This isn't a bug, but why can I load a POSDictionary from an XML format which is undocumented?

We previously had a plain-text format, which was replaced by this XML format because of encoding issues. I think we will do some refactoring and redesign of the POS Tagger, and then improve the POS Dictionary and the other dictionaries we currently have.

There are a couple of things which can be done better. For example, when the dictionary only allows one tag for a word, we do not need to call the classifier to make a decision. The dictionary should also support token sequences, etc.
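
The single-tag shortcut described above could look roughly like the sketch below. It is only an illustration under stated assumptions: DictionaryShortcutSketch, its put and tag methods, and the Function used as a classifier are hypothetical names, not the actual POS Tagger classes, and the filtering step is a crude simplification.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from the real tagger.
final class DictionaryShortcutSketch {

    // Stand-in for the tag dictionary: word -> tags allowed for that word.
    private final Map<String, String[]> allowedTags = new HashMap<>();

    // Stand-in for the statistical classifier: word -> predicted tag.
    private final Function<String, String> classifier;

    // Counts classifier invocations so the saving is observable.
    int classifierCalls = 0;

    DictionaryShortcutSketch(Function<String, String> classifier) {
        this.classifier = classifier;
    }

    void put(String word, String... tags) {
        allowedTags.put(word, tags.clone());
    }

    String tag(String word) {
        String[] tags = allowedTags.get(word);

        // Shortcut: exactly one allowed tag means there is nothing to decide.
        if (tags != null && tags.length == 1) {
            return tags[0];
        }

        classifierCalls++;
        String predicted = classifier.apply(word);

        // Crude stand-in for filtering: reject a prediction the dictionary
        // does not allow and fall back to the first allowed tag.
        if (tags != null && tags.length > 0
                && !Arrays.asList(tags).contains(predicted)) {
            return tags[0];
        }
        return predicted;
    }
}
```

With a word that has one allowed tag, tag() returns that tag and classifierCalls stays unchanged; the classifier only runs for ambiguous or unknown words.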

So at the moment, the only purpose of the POSDictionary is to filter out invalid tags?

You are welcome to submit a patch to document our POS dictionary XML format.

I'll look into the XML format when I have enough time, and I would be happy to contribute. For now I'm just trying to extend POSDictionary, because I really can't keep a full dictionary of possible tags in memory.

A first improvement, from my point of view, would be to make the fields of the class protected, so that extending it is cleaner.



I would prefer a String[] get(String word) method and a void put(String word, String[] tags) method.

For safety and thread-safety reasons, all our resources used during tagging should be immutable. That doesn't mean we should not have an easy way to create these resources, though.

We have the get method, but it is called getTags.
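
One way to reconcile an immutable resource with easy creation is a separate mutable builder. The following is a minimal sketch under that assumption; ImmutableTagDictionarySketch and its Builder are hypothetical names, and only the method name getTags is taken from the discussion above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the existing POSDictionary implementation.
final class ImmutableTagDictionarySketch {

    // Never modified after construction.
    private final Map<String, String[]> tagsByWord;

    private ImmutableTagDictionarySketch(Map<String, String[]> tagsByWord) {
        this.tagsByWord = tagsByWord;
    }

    // Returns a copy of the tags, or null for an unknown word, so callers
    // cannot change the dictionary through the returned array.
    String[] getTags(String word) {
        String[] tags = tagsByWord.get(word);
        return tags == null ? null : tags.clone();
    }

    // Mutable helper used only while the dictionary is being assembled.
    static final class Builder {
        private final Map<String, String[]> entries = new HashMap<>();

        Builder put(String word, String... tags) {
            entries.put(word, tags.clone());
            return this;
        }

        ImmutableTagDictionarySketch build() {
            // Copy the map so later put() calls do not affect the result.
            return new ImmutableTagDictionarySketch(new HashMap<>(entries));
        }
    }
}
```

A dictionary would then be created with something like new ImmutableTagDictionarySketch.Builder().put("run", "NN", "VB").build() and shared between threads without further synchronization, since nothing can modify it afterwards.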

Jörn

What do you think about this structure for the next version of POSDictionary?
Map<List<String>, Set<String>>
or the lightweight version:
Map<String[], String[]>
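
One caveat with the lightweight variant: Java arrays inherit identity-based equals() and hashCode() from Object, so a String[] key built at lookup time does not match an equal-looking key stored earlier. The sketch below illustrates the difference; SequenceKeySketch and the "New York" / "NNP" entry are purely illustrative.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch comparing the two proposed key types.
final class SequenceKeySketch {

    // Array keys are compared by identity, so a fresh array with the same
    // contents is treated as a different key and the lookup fails.
    static boolean arrayKeyFound() {
        Map<String[], String[]> dict = new HashMap<>();
        dict.put(new String[] {"New", "York"}, new String[] {"NNP"});
        return dict.get(new String[] {"New", "York"}) != null;
    }

    // List keys are compared element by element, so the lookup succeeds.
    static boolean listKeyFound() {
        Map<List<String>, Set<String>> dict = new HashMap<>();
        dict.put(Arrays.asList("New", "York"), Collections.singleton("NNP"));
        return dict.get(Arrays.asList("New", "York")) != null;
    }
}
```

Here arrayKeyFound() returns false and listKeyFound() returns true. If arrays are wanted as keys for memory reasons, they would need a small wrapper class that implements equals() and hashCode() with Arrays.equals and Arrays.hashCode.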

My two cents,
    Riccardo
