[ https://issues.apache.org/jira/browse/OPENNLP-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867704#comment-16867704 ]
ASF GitHub Bot commented on OPENNLP-1267:
-----------------------------------------
tballison commented on issue #357: OPENNLP-1267 -- add a
ProbingLanguageDetector that can stop early.
URL: https://github.com/apache/opennlp/pull/357#issuecomment-503584545
I refactored the probing logic into LanguageDetectorME. For simplicity, and so
that the code relies only on the current {{ContextGenerator}} API, I removed the
optimization that allowed early stopping within a chunk.
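For illustration only, here is a minimal sketch of what stopping at chunk boundaries (rather than within a chunk) could look like. It is not the code in this PR and does not use any OpenNLP API: the class `ChunkedProbeSketch`, the method `probe`, the parameters `chunkSize`, `maxLength` and `minConfidence`, and the `scorer` callback are all hypothetical names, and the stopping rule (best confidence above a threshold) is an assumption.

```java
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of chunk-wise probing; NOT the OpenNLP implementation. */
public class ChunkedProbeSketch {

    /** How many characters were examined and the confidence at the point of stopping. */
    public static final class ProbeResult {
        public final int charsConsumed;
        public final double confidence;

        ProbeResult(int charsConsumed, double confidence) {
            this.charsConsumed = charsConsumed;
            this.confidence = confidence;
        }
    }

    /**
     * Grows the examined prefix one chunk at a time and checks the stopping
     * condition only after each full chunk, never inside a chunk.
     *
     * @param scorer assumed stand-in for a model: returns the confidence of the
     *               best language for the given prefix
     */
    public static ProbeResult probe(CharSequence text, int chunkSize, int maxLength,
                                    double minConfidence,
                                    ToDoubleFunction<CharSequence> scorer) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Never look past maxLength, even if the text is longer.
        int limit = Math.min(text.length(), maxLength);
        int end = 0;
        double confidence = 0.0;
        while (end < limit) {
            end = Math.min(end + chunkSize, limit);
            confidence = scorer.applyAsDouble(text.subSequence(0, end));
            if (confidence >= minConfidence) {
                break; // confident enough: skip the rest of the text
            }
        }
        return new ProbeResult(end, confidence);
    }
}
```

With a `chunkSize` of 100 and a scorer that becomes confident once it has seen 300 characters, `probe` stops after the third chunk and never touches the remainder of the input.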
I haven't squashed the commits yet because I want to be able to revert if I
misunderstood any of the guidance above. I can squash before the final commit.
Let me know what you think.
> Allow the LanguageDetector to stop before processing the full string
> --------------------------------------------------------------------
>
> Key: OPENNLP-1267
> URL: https://issues.apache.org/jira/browse/OPENNLP-1267
> Project: OpenNLP
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Major
>
> On TIKA-2790, I found that Yalder is stopping after computing character
> ngrams on roughly the first 60 characters. That _likely_ explains its
> impressive speed. Let's make this "stopping short" feature available in
> OpenNLP.
>
> Ideally, the language detector wouldn't copy the full String, it wouldn't
> normalize the full String, and it wouldn't compute ngrams on the full String.