[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899 ]
Doğacan Güney commented on NUTCH-894:
-------------------------------------
+1 from me.
If there are no objections in the next couple of days or so, I would like to
commit this patch.
> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
> Key: NUTCH-894
> URL: https://issues.apache.org/jira/browse/NUTCH-894
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.0
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the
> indexing step, whereas the detection based on the HTTP header and HTML code is
> done during parsing.
> We could keep the same logic, i.e. do the statistical detection only if
> nothing has been found with the previous methods, but as part of the parsing.
> This would be useful for ParseFilters which need the language information, or
> for use with ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika we should probably rely
> on them instead of maintaining our own.
> Any thoughts on this?
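
For illustration only, the fallback order described in the issue (HTTP header first, then HTML metadata, then statistical detection) could be sketched roughly as below. This is not taken from NUTCH-894.patch: the class LanguageResolver, its resolve method and the detector parameter are hypothetical names, and the statistical detector is passed in as a plain function so that no particular Tika or Nutch API is assumed.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper, not part of Nutch or Tika; names are illustrative. */
final class LanguageResolver {

    private LanguageResolver() {}

    /**
     * Returns the declared language if the HTTP header or the HTML metadata
     * provides one; otherwise falls back to the supplied statistical detector.
     */
    static Optional<String> resolve(String httpHeaderLang,
                                    String htmlLang,
                                    String text,
                                    Function<String, Optional<String>> statisticalDetector) {
        // 1. Cheap, declared sources first: HTTP header, then HTML metadata.
        Optional<String> declared = firstNonBlank(httpHeaderLang, htmlLang);
        if (declared.isPresent()) {
            return declared;
        }
        // 2. Nothing declared: run statistical detection, but only if there is text.
        if (text == null || text.trim().isEmpty()) {
            return Optional.empty();
        }
        return statisticalDetector.apply(text);
    }

    /** Returns the first candidate that is neither null nor blank, normalised to lower case. */
    private static Optional<String> firstNonBlank(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return Optional.of(candidate.trim().toLowerCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }
}
```

In a parse-time plugin, the detector argument would presumably wrap whatever statistical model is chosen (for example the Tika one mentioned above), and the result would be stored with the parse metadata so that later ParseFilters and ScoringFilters can read it.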