On 28/09/2011 11:46, Jörn Kottmann wrote:
On 9/28/11 11:34 AM, Riccardo Tasso wrote:
This isn't a bug, but why can I load a POSDictionary from an XML
format which is undocumented?
We previously had a plain-text format, which was replaced by this XML
format because of encoding issues. I think we will do some refactoring
and redesign of the POS Tagger, and then improve the POS Dictionary
and the other dictionaries we currently have.
A couple of things could be done better. For example, when the
dictionary allows only one tag for a token, we do not need to call the
classifier to make a decision. The dictionary should also support
token sequences, etc.
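To make the single-tag shortcut concrete, here is a minimal sketch in
Java. It is not the OpenNLP tagger code: the class DictionaryShortcut,
the method tag and the BiFunction standing in for the classifier are
all hypothetical names, and the dictionary is assumed to be a plain
Map from token to allowed tags.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch; none of these names come from OpenNLP.
final class DictionaryShortcut {

    // Returns a tag for the token. The classifier stand-in is consulted
    // only when the dictionary does not pin the token to exactly one tag.
    static String tag(String token,
                      Map<String, String[]> dictionary,
                      BiFunction<String, String[], String> classifier) {
        String[] allowed = dictionary.get(token);
        if (allowed != null && allowed.length == 1) {
            // Only one valid tag: no decision left to make.
            return allowed[0];
        }
        // Zero or several candidates: let the classifier decide,
        // passing the allowed tags (possibly null) as a filter.
        return classifier.apply(token, allowed);
    }

    public static void main(String[] args) {
        Map<String, String[]> dictionary = new HashMap<>();
        dictionary.put("the", new String[] {"DT"});
        dictionary.put("run", new String[] {"NN", "VB"});

        // Toy classifier: picks the first allowed tag, or "NN" if unrestricted.
        BiFunction<String, String[], String> classifier =
            (token, allowed) -> allowed != null ? allowed[0] : "NN";

        System.out.println(tag("the", dictionary, classifier));
        System.out.println(tag("run", dictionary, classifier));
    }
}
```

In this sketch the lookup for "the" returns without touching the
classifier, while "run" still goes through it.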
So at the moment the only purpose of the POSDictionary is to filter
out invalid tags?
You are welcome to submit a patch to document our POS dictionary XML format.
I'll look at the XML when I have enough time, and I would be happy to
contribute. For now I'm just trying to extend it, because I really
can't keep a full dictionary of possible tags in memory.
A first improvement, from my point of view, would be to make the
fields of the class protected, so that extending it is cleaner.
I would prefer a String[] get(String word) method and a
void put(String word, String[] tags) method.
For safety and thread-safety reasons, all the resources we use during
tagging should be immutable.
That doesn't mean we should not have an easy way to create these
resources.
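One way to reconcile the two points (immutable during tagging, easy to
create) is a builder that collects entries and then produces a
read-only dictionary. The sketch below is only an illustration under
that assumption: ImmutableTagDictionary and its Builder are
hypothetical names, not existing OpenNLP classes; only the getTags
name is borrowed from this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not an existing OpenNLP class.
final class ImmutableTagDictionary {

    // Never modified after construction, so concurrent reads are safe.
    private final Map<String, String[]> tags;

    private ImmutableTagDictionary(Map<String, String[]> tags) {
        this.tags = tags;
    }

    // Returns a copy of the allowed tags, or null for an unknown word.
    // Copying keeps callers from altering the stored array.
    String[] getTags(String word) {
        String[] found = tags.get(word);
        return found == null ? null : found.clone();
    }

    // Mutable helper used only while the dictionary is being assembled.
    static final class Builder {
        private final Map<String, String[]> entries = new HashMap<>();

        Builder put(String word, String... tags) {
            entries.put(word, tags.clone());
            return this;
        }

        ImmutableTagDictionary build() {
            // Copy the map so later put calls do not leak into the result.
            return new ImmutableTagDictionary(new HashMap<>(entries));
        }
    }

    public static void main(String[] args) {
        ImmutableTagDictionary dict = new Builder()
            .put("run", "NN", "VB")
            .put("the", "DT")
            .build();
        System.out.println(String.join(",", dict.getTags("run")));
    }
}
```

The put method lives on the builder only, so a dictionary that is
already in use by the tagger cannot change underneath it.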
We have the get method, but it is called getTags.
Jörn
What do you think about this structure for the next version of
POSDictionary?
Map<List<String>, Set<String>>
or the lightweight version:
Map<String[], String[]>
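One caveat with the array-keyed variant: Java arrays inherit equals
and hashCode from Object, so a HashMap keyed by String[] compares keys
by identity and a lookup with a freshly built array misses. A List
key compares by content. The small sketch below shows the difference;
SequenceKeyDemo and its methods are hypothetical names used only for
illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical demo class; not part of OpenNLP.
final class SequenceKeyDemo {

    // List keys: equals/hashCode are based on the elements.
    static Map<List<String>, Set<String>> listKeyed() {
        Map<List<String>, Set<String>> map = new HashMap<>();
        map.put(Arrays.asList("New", "York"), Collections.singleton("NNP"));
        return map;
    }

    // Array keys: equals/hashCode are based on object identity.
    static Map<String[], String[]> arrayKeyed() {
        Map<String[], String[]> map = new HashMap<>();
        map.put(new String[] {"New", "York"}, new String[] {"NNP"});
        return map;
    }

    public static void main(String[] args) {
        // Found: an equal list yields the same hash code and equals() is true.
        System.out.println(listKeyed().get(Arrays.asList("New", "York")));
        // Not found: a new array is a different object from the stored key.
        System.out.println(arrayKeyed().get(new String[] {"New", "York"}));
    }
}
```

If arrays are preferred for memory reasons, the key would need a
wrapper that defines equals and hashCode over the array contents, or
the token sequence could be joined into a single String key.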
My two cents,
Riccardo