[ 
https://issues.apache.org/jira/browse/NUTCH-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

İlhami KALKAN updated NUTCH-1663:
---------------------------------

    Attachment: README.txt
                language-filter.patch

I added a language-filter plugin that filters pages by language while 
crawling. The language-identifier plugin must run before this plugin. 
The language-filter plugin reads the page's language from its metadata and 
removes or keeps the page's outlinks according to "language.filter.type" in 
nutch-site.xml. If this parameter is set to accept, the plugin allows only 
pages whose language is listed in "language.filter.languages" (entries must 
be ISO 639 language codes) and removes the outlinks of pages in other 
languages. If it is set to filter, the plugin removes the outlinks of pages 
whose language matches one of the "language.filter.languages" entries.
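
The accept/filter rule described above can be sketched roughly as follows. 
This is only an illustration of the decision logic, not the code in 
language-filter.patch: the class name LanguageFilterSketch, the method 
keepOutlinks, and the handling of pages with no detected language are all 
assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of the decision rule; not taken from language-filter.patch. */
final class LanguageFilterSketch {

    /**
     * Returns true if the outlinks of a page should be kept.
     *
     * @param filterType value of "language.filter.type": "accept" or "filter"
     * @param languages  lower-case ISO 639 codes from "language.filter.languages"
     * @param pageLang   language found in the page metadata, or null if unknown
     */
    static boolean keepOutlinks(String filterType, Set<String> languages, String pageLang) {
        // Assumption: a page with no detected language counts as "not listed".
        boolean listed = pageLang != null
                && languages.contains(pageLang.toLowerCase(Locale.ROOT));

        if ("accept".equalsIgnoreCase(filterType)) {
            // accept: keep outlinks only for pages in a listed language
            return listed;
        }
        if ("filter".equalsIgnoreCase(filterType)) {
            // filter: drop outlinks for pages in a listed language
            return !listed;
        }
        throw new IllegalArgumentException("Unknown language.filter.type: " + filterType);
    }
}
```

With "language.filter.languages" set to tr, accept mode would keep the 
outlinks of Turkish pages only, while filter mode would keep the outlinks of 
every page except Turkish ones.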

> Crawl page with specified language
> ----------------------------------
>
>                 Key: NUTCH-1663
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1663
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2.1
>            Reporter: İlhami KALKAN
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: README.txt, language-filter.patch
>
>
> Users should be able to crawl only pages in a specified language. For 
> example, we want to crawl only pages whose language is Turkish.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
