[ 
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-894.
-------------------------------

      Assignee: Doğacan Güney  (was: Julien Nioche)
    Resolution: Fixed

Committed as of rev. 1003608.

> Move statistical language identification from indexing to parsing step
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-894
>                 URL: https://issues.apache.org/jira/browse/NUTCH-894
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 2.0
>            Reporter: Julien Nioche
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>         Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done as part of the
> indexing step, whereas the detection based on the HTTP header and HTML code is
> done during parsing.
> We could keep the same logic, i.e. do the statistical detection only if
> nothing has been found with the previous methods, but as part of parsing.
> This would be useful for ParseFilters which need the language information, or
> for use with ScoringFilters, e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika, we should probably rely
> on them instead of maintaining our own.
> Any thoughts on this?
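
For illustration only, the fallback order described in the issue could be sketched roughly as follows. This is not the committed patch: the class name, method names and the pluggable detector function are hypothetical, and checking the HTTP header before the HTML value is an assumption. In a real parse plugin the detector would presumably be backed by the Tika language models; a stub is used here so the sketch stays self-contained.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the "declared language first, statistics last" logic.
public class LanguageFallbackSketch {

    /** Returns the first non-blank candidate, trimmed and lower-cased. */
    static Optional<String> firstDeclared(String... candidates) {
        for (String c : candidates) {
            if (c != null && !c.trim().isEmpty()) {
                return Optional.of(c.trim().toLowerCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    /**
     * Resolves the language at parse time.
     * The statistical detector is only invoked when neither the HTTP header
     * nor the HTML markup declared a language.
     */
    static String detect(String httpHeaderLang, String htmlLang, String text,
                         Function<String, String> statisticalDetector) {
        return firstDeclared(httpHeaderLang, htmlLang)
                .orElseGet(() -> statisticalDetector.apply(text));
    }

    public static void main(String[] args) {
        // Stub standing in for a real statistical model (assumption).
        Function<String, String> stub = t -> t.contains("bonjour") ? "fr" : "unknown";

        // A declared value wins, so the detector is never consulted.
        System.out.println(detect("en", null, "bonjour", stub));
        // Nothing declared, so the statistical detector decides.
        System.out.println(detect(null, " ", "bonjour tout le monde", stub));
    }
}
```

Because the result is available once parsing finishes, a ParseFilter or ScoringFilter could read it without waiting for the indexing step.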

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
