Hi Jörn,
Thanks for your quick response.
The language is primarily English, probably more American than
European.
Domain-wise, the NER is 'date' related; otherwise, the input data is
domain-independent. The current implementation/model for NER date
detection is very good; it is the odd edge case, such as lower-case day
names, that causes problems.
I could go to the lengths of writing a regex for it, but it would be
better to have an NLP solution, as the NLP tools are already scanning
the input texts.
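
For illustration only, the regex fallback mentioned above could look
roughly like the sketch below. It is plain Python rather than anything
from OpenNLP, and the function name normalise_day_names and the list of
day names are assumptions made up for this example. It capitalises
lower-case English day names before the text reaches the name finder, so
a case-sensitive model would see the form it was trained on.

```python
import re

# Full English day names only (an assumption for this sketch).
# Abbreviations such as "sun" or "sat" are left out because they
# collide with ordinary words.
DAY_NAMES = ("monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday")

# \b restricts matches to whole words, so "sundays" is not touched.
# No IGNORECASE flag: only all-lower-case spellings are rewritten.
_DAY_RE = re.compile(r"\b(" + "|".join(DAY_NAMES) + r")\b")


def normalise_day_names(text: str) -> str:
    """Capitalise lower-case day names, e.g. 'monday' -> 'Monday'."""
    return _DAY_RE.sub(lambda m: m.group(1).capitalize(), text)


if __name__ == "__main__":
    print(normalise_day_names("see you on monday or next friday"))
```

This only covers the day-name edge case. Month names such as "may" or
"march" are ordinary words as well, so they would need context that a
plain regex does not have.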
Your UIMA-based annotation tooling sounds interesting and worth a look.
Thanks
Mark
On 18/01/2012 21:05, Jörn Kottmann wrote:
On 1/18/12 8:35 PM, mark meiklejohn wrote:
James,
I agree the correct way is to ensure upper-case. But when you have no
control over input it makes things a little more difficult.
So, I may look at a training set. What is the recommended size of a
training set?
In an annotation project I did recently, our models started to work
after a couple of hundred news articles. Of course, it depends on your
language, domain and the entities you want to detect.
To make training easier, I started working on UIMA-based annotation
tooling. Let me know if you would like to try it; any feedback is very
welcome.
Jörn