James, please also look at the earlier thread we had here about this
issue.
There we decided to add a case flag to our dictionaries.
On 7/28/11 5:23 AM, James Kosin wrote:
The case sensitivity flag was either a bad idea or somewhere it got
lost in the usage.
In the POS Tagger dictionary it should be supported as it was in 1.3 or
1.4. I simply missed this flag when I added the new training, evaluation
and model package code.
Questions & Discussion Points:
-----------------------------------------
(a) When building the dictionary, if case-sensitivity is set to false,
an entry only needs to be added once, and only if it is not already
there. 'a' and 'A' in a case-insensitive dictionary are really the same
and only one will match. If we impose this assumption, then the false
setting has to mean that we always compare without regard to case, even
when comparing to an entry that wants case sensitivity and is set to
true.
The behavior of our dictionary is currently not really defined if it
contains duplicate entries. I guess our current implementation just
adds every entry and overwrites existing ones, so the last specified
entry wins. We could change this and make the dictionary fail fast, so
that a user can fix any issues.
That may be annoying, though, because the user might then need to fix
a couple of issues manually.
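
To make the fail-fast idea concrete, here is a minimal sketch. It is not
the current Dictionary code; the class name FailFastTagDictionary and its
methods are made up for illustration, and it assumes the case flag is
fixed when the dictionary is created.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not the OpenNLP Dictionary or POSDictionary class.
final class FailFastTagDictionary {
    private final boolean caseSensitive;
    private final Map<String, String[]> entries = new HashMap<>();

    FailFastTagDictionary(boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
    }

    // Case-insensitive dictionaries fold every key to lower case,
    // so 'a' and 'A' map to the same entry.
    private String key(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    // Rejects a duplicate instead of silently overwriting the earlier entry.
    void add(String word, String... tags) {
        String k = key(word);
        if (entries.containsKey(k)) {
            throw new IllegalArgumentException("Duplicate entry: " + word);
        }
        entries.put(k, tags.clone());
    }

    // Returns a copy of the tags for the word, or null if there is no entry.
    String[] getTags(String word) {
        String[] tags = entries.get(key(word));
        return tags == null ? null : tags.clone();
    }
}
```

With caseSensitive set to false, adding "a" and then "A" throws, so the
user sees the conflict at build time instead of getting "last entry
wins".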
(b) When using the dictionary: since the caseSensitivity flag is not
final, the dictionary default can be changed, but ONLY for new entries;
the change doesn't affect items already added to the dictionary.
This is both a good and a bad thing. Good in that we could change the
default for the comparisons; bad in that, if we allow the change, new
entries could be added with a flag that differs from the creation
setting. It isn't a problem now, but if we allow the user to change
the flag without forcing it at creation, we could end up with issues.
The dictionary should be immutable, because it can be accessed from more
than one thread, and we encourage our users to do so.
I know Dictionary is not, but it really should be. POS Dictionary can
only be changed if extended, right?
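
Roughly what an immutable dictionary with a fixed case flag could look
like is sketched below. The class ImmutableCaseDictionary is hypothetical
and only illustrates the idea; it assumes one tag per word to keep the
example short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of an immutable dictionary; not existing OpenNLP code.
final class ImmutableCaseDictionary {
    private final boolean caseSensitive;       // fixed at creation, never changed
    private final Map<String, String> entries; // unmodifiable after construction

    ImmutableCaseDictionary(Map<String, String> source, boolean caseSensitive) {
        this.caseSensitive = caseSensitive;
        // Defensive copy: later changes to 'source' do not leak in.
        // If 'source' holds keys that differ only by case and the dictionary
        // is case-insensitive, which one survives depends on iteration order.
        Map<String, String> copy = new HashMap<>();
        for (Map.Entry<String, String> e : source.entrySet()) {
            copy.put(normalize(e.getKey()), e.getValue());
        }
        this.entries = Collections.unmodifiableMap(copy);
    }

    private String normalize(String word) {
        return caseSensitive ? word : word.toLowerCase(Locale.ROOT);
    }

    boolean isCaseSensitive() {
        return caseSensitive;
    }

    // Returns the tag for the word, or null if there is no entry.
    String getTag(String word) {
        return entries.get(normalize(word));
    }
}
```

Because all fields are final and the map is never modified after the
constructor returns, an instance can be shared between threads without
extra locking, and there is no way to end up with entries that use a
different case setting than the dictionary itself.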
(c) Coming to usage. The change I talked about for the
isCaseSensitive test for the other entry doesn't really make sense
since the dictionary object itself will create a new string list with
a caseSensitive flag for the dictionary. There really isn't any way
to change this without creating a new dictionary with the flag set to
true/false.
We no longer do this once we make the Dictionary immutable, right?
(d) The case-sensitivity setting needs to be saved with the
dictionary to the file. This is one place where we really need to be
careful. I've looked somewhat at the problem and unfortunately there
isn't an easy fix. Saving is okay; the problem is getting the setting
from the file. The reason is that, due to the way some of it works, we
could append dictionaries, causing a mixed case-sensitivity setting.
That is really bad news, since the dictionary has one flag and each
entry has another copy of the flag for the StringListWrapper class.
Another way would be adding the settings to the properties for the
model and saving the dictionary inside the model as well.
When we load a dictionary (or create one) we need to know the case flag.
When we serialize it, the case flag should be written in the
dictionary, but it is not important for the way the entries are written.
Let's refactor the Dictionary class a bit to get rid of this
StringListWrapper badness.
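
As a sketch of what writing the flag once at the dictionary level could
mean: the line-based format, the case_sensitive header name and the class
CaseFlagDictionaryIo below are all assumptions made for illustration, not
our actual serialization format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the flag is stored once in a header line,
// and the entries are written verbatim regardless of the flag.
final class CaseFlagDictionaryIo {
    private static final String HEADER = "case_sensitive=";

    // Serializes the flag as the first line, followed by one entry per line.
    static String write(boolean caseSensitive, List<String> entries) {
        StringBuilder out = new StringBuilder();
        out.append(HEADER).append(caseSensitive).append('\n');
        for (String entry : entries) {
            out.append(entry).append('\n');
        }
        return out.toString();
    }

    // Reads the flag back; fails fast if the header is missing.
    static boolean readCaseFlag(String serialized) {
        String first = serialized.split("\n")[0];
        if (!first.startsWith(HEADER)) {
            throw new IllegalArgumentException("Missing case_sensitive header");
        }
        return Boolean.parseBoolean(first.substring(HEADER.length()));
    }

    // Reads the entries exactly as they were written, skipping the header line.
    static List<String> readEntries(String serialized) {
        String[] lines = serialized.split("\n");
        List<String> entries = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            entries.add(lines[i]);
        }
        return entries;
    }
}
```

The point is that the loader learns the flag before it sees any entry,
and no entry carries its own copy of it.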
Jörn