+1 to make SentenceDetectorME and TokenizerME thread safe, and everything else where it works out for us.
Making it thread safe only makes sense if throughput then scales almost linearly with the number of cores. This works with the current model.

For the POSTagger we would have to change the API a bit to make this work nicely, and that might conflict with OPENNLP-125. The idea there is to make the POS Tagger use the same feature generation code as we use for the name finder. The POS Tagger also uses caching internally, which probably doesn't play well with multiple threads.

An idea for the name finder from a long time ago is to add another layer which accepts an entire document, or a much larger part than a sentence. This would allow us to make it stateless as well. Maybe we should do the same for the POS Tagger?

Jörn

On Wed, Jan 11, 2017 at 2:38 PM, Thilo Goetz <[email protected]> wrote:

> Correct me if I'm wrong, but that approach only works if you control the
> thread creation yourself. In my case, for example, I was using Scala's
> parallel collection API, and had no control over the threading. I will
> usually want to create one service that does tokenization or POS tagging or
> whatever, which can be accessed by many threads. I don't want to have to
> mess around with an object pool, or thread locals, or anything like that.
> Especially since there is really no good reason IMHO. You could very easily
> just return the probabilities together with the spans, and whoever doesn't
> need them can ignore them. Or have two methods, one with probabilities, one
> without. Maybe it's just where I'm coming from, but I fail to see the
> advantages of the current approach.
>
> --Thilo
>
> On 11/01/2017 13:58, Joern Kottmann wrote:
>
>> Hello Thilo,
>>
>> I am interested in your opinion about how this is done currently.
>> We say: "Share the model between threads and create one instance of the
>> component per thread".
>>
>> Wouldn't that work well in your use case?
>>
>> Jörn
>>
>> On Wed, Jan 11, 2017 at 11:05 AM, Thilo Goetz <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> in a recent project, I was using SentenceDetectorME, TokenizerME and
>>> POSTaggerME. It turns out that none of those is thread safe. This is
>>> because the classification probabilities for the last tag() call (for
>>> example) are stored in a member variable and can be retrieved by a
>>> separate API call.
>>>
>>> I'm planning to build thread safe versions for myself, and I'd be happy
>>> to contribute a patch if there is interest. This could be done as a
>>> conservative extension with an additional method such as tagReentrant,
>>> where the old API calls would continue to work as before and would still
>>> not be thread safe. Alternatively, one could remodel the API so that
>>> everything was thread safe, but that would break backwards compatibility.
>>>
>>> Final question: if I do this for the classes mentioned above, are there
>>> other tools that should be made thread safe while we're at it?
>>>
>>> Opinions?
>>>
>>> --Thilo
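
To make the suggestion in the thread concrete (return the probabilities together with the result instead of keeping them in a member variable), here is a minimal Java sketch. It does not use the OpenNLP API: `ReentrantTaggerSketch`, `Tagged`, `StatelessTagger` and the toy capitalization rule are all hypothetical stand-ins, chosen only to show a tagger with no mutable fields being shared across threads.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.IntStream;

public class ReentrantTaggerSketch {

    /** Immutable result: tags and their probabilities travel together. */
    public static final class Tagged {
        public final List<String> tags;
        public final List<Double> probs;

        Tagged(List<String> tags, List<Double> probs) {
            this.tags = Collections.unmodifiableList(tags);
            this.probs = Collections.unmodifiableList(probs);
        }
    }

    /**
     * Hypothetical tagger with no mutable fields. All working state lives in
     * local variables, so one instance can be called from many threads.
     */
    public static final class StatelessTagger {
        public Tagged tag(String[] tokens) {
            List<String> tags = new ArrayList<>();
            List<Double> probs = new ArrayList<>();
            for (String token : tokens) {
                // Toy rule standing in for a real model: capitalized -> NNP.
                boolean capitalized =
                    !token.isEmpty() && Character.isUpperCase(token.charAt(0));
                tags.add(capitalized ? "NNP" : "NN");
                probs.add(capitalized ? 0.9 : 0.6);
            }
            return new Tagged(tags, probs);
        }
    }

    public static void main(String[] args) {
        StatelessTagger shared = new StatelessTagger();
        String[] sentence = {"Thilo", "writes", "code"};
        List<String> expected = Arrays.asList("NNP", "NN", "NN");

        // Call the single shared instance from a parallel stream, where the
        // caller has no control over thread creation, and count wrong results.
        long mismatches = IntStream.range(0, 10_000).parallel()
            .mapToObj(i -> shared.tag(sentence))
            .filter(r -> !r.tags.equals(expected) || r.probs.size() != 3)
            .count();

        System.out.println("mismatches: " + mismatches);
    }
}
```

Callers that do not need the probabilities simply ignore `probs`. A real implementation would still have to deal with the internal caching mentioned above, which this sketch leaves out.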
