On 3/9/11 3:33 PM, Grant Ingersoll wrote:
On Mar 7, 2011, at 7:16 AM, Jörn Kottmann wrote:
On 3/6/11 1:37 PM, Grant Ingersoll wrote:
On Mar 5, 2011, at 2:13 PM, Jörn Kottmann wrote:
I actually tried to ask how you would do that. I don't think it is super
simple. Can you please briefly explain what you have in mind?
From the looks of it, we'd just need to return the bestSequence object (or
some larger containing object) to the user instead of keeping it (or other
pieces that may change) as a member variable. Granted, I'm still learning the
code, so I may be misreading some things. From the looks of it, though, simply
changing the tag method to return the bestSequence would let the user make the
appropriate calls to get the best outcome and the probabilities (or the
probs() method could take the bestSequence object as a parameter if you wanted
to keep that convenience).
I suppose I should just work up a patch; it would be a lot easier than
discussing it in the abstract.
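A minimal sketch of the idea, with simplified stand-ins for the OpenNLP classes (the class names, fields, and method signatures below are illustrative, not the actual opennlp.tools API):

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a best-sequence result: outcomes plus their probabilities.
final class BestSequence {
    private final List<String> outcomes;
    private final double[] probs;

    BestSequence(List<String> outcomes, double[] probs) {
        this.outcomes = outcomes;
        this.probs = probs.clone();
    }

    List<String> getOutcomes() { return outcomes; }
    double[] getProbs() { return probs.clone(); }
}

// Instead of storing bestSequence in a field (shared mutable state),
// tag(...) returns it, so concurrent calls cannot interfere with each other.
final class SketchPosTagger {
    BestSequence tag(String[] sentence) {
        // Real beam-search decoding omitted; produce a dummy result per token.
        String[] tags = new String[sentence.length];
        double[] probs = new double[sentence.length];
        for (int i = 0; i < sentence.length; i++) {
            tags[i] = "NN";
            probs[i] = 1.0;
        }
        return new BestSequence(Arrays.asList(tags), probs);
    }
}
```

The point of the sketch is only the shape of the API: all per-call results travel in the return value, so the tagger instance itself holds nothing mutable between calls.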
There is also a cache which would then have to be created per call; we need to
measure how expensive that is compared to the current solution.
The POS Tagger should also use the new feature generation code we made
for the name finder, but that is not thread safe by design, because it has
state. The state is necessary to support per-document features like the ones
we have in the name finder.
Do you think making the name finder and other components thread safe in the
same way is also possible?
Not sure. I only noticed it in the POS tagger.
Right now we have the same thread-safety convention for all components, which
I like because it is easy for someone new to learn. If it is mixed, e.g. the
POS Tagger thread safe and the name finder not, people will get confused.
It is no doubt a hard problem. There always seems to be a tradeoff between
easy to learn and fast. In my experience, most programmers aren't good at
concurrent programming (and I certainly don't claim to be either), so it is
hard to get right. I think one of the big wins for us could be to make
OpenNLP really fast, which would increase its viability and attract others.
Making OpenNLP much faster is of course good. When we discuss performance
changes we also need to know how much a change would actually speed things up.
In my eyes, the most to gain currently lies in optimizing the feature
generation, making the caching more efficient, etc. How much faster do you
think the POS Tagger will be with your proposed change?
Jörn