[ 
https://issues.apache.org/jira/browse/NUTCH-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13813805#comment-13813805
 ] 

İlhami KALKAN edited comment on NUTCH-1663 at 11/5/13 1:35 PM:
---------------------------------------------------------------

I added a language-filter plugin that filters pages by language while crawling. 
The language-identifier plugin must run before this plugin. The language-filter 
plugin reads the language of a URL from its metadata and removes or accepts its 
outlinks according to "language.filter.type" in nutch-site.xml. If this 
parameter is set to accept, the plugin allows only the languages listed in 
"language.filter.languages", which must be ISO-639 language codes, and removes 
outlinks to pages in other languages. If it is set to filter, the plugin 
removes the outlinks of pages whose language matches a 
"language.filter.languages" entry.

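The accept/filter rule described above can be sketched roughly as follows. This 
is only an illustration and not the code in language-filter.patch: the class 
name LanguageFilterSketch and the method keepOutlinks are hypothetical, the 
comma-separated format of "language.filter.languages" is an assumption, and so 
is the handling of pages whose language could not be identified.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the plugin's decision rule; not the attached patch. */
public class LanguageFilterSketch {

    // true when language.filter.type is "accept", false when it is "filter"
    private final boolean acceptMode;
    // ISO-639 codes from language.filter.languages, lower-cased
    private final Set<String> languages = new HashSet<>();

    public LanguageFilterSketch(String filterType, String languagesCsv) {
        this.acceptMode = "accept".equalsIgnoreCase(filterType.trim());
        // Assumption: the property holds a comma-separated list such as "tr,en".
        for (String code : languagesCsv.split(",")) {
            String normalized = code.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                languages.add(normalized);
            }
        }
    }

    /** Returns true if the outlinks of a page with the given language are kept. */
    public boolean keepOutlinks(String pageLang) {
        if (pageLang == null) {
            // Assumption: an unidentified language is treated as "not listed",
            // so it is dropped in accept mode and kept in filter mode.
            return !acceptMode;
        }
        boolean listed = languages.contains(pageLang.trim().toLowerCase(Locale.ROOT));
        return acceptMode ? listed : !listed;
    }
}
```

With type "accept" and languages "tr", keepOutlinks("tr") returns true and 
keepOutlinks("de") returns false; with type "filter" the two results are 
reversed.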

was (Author: ilhamikalkan):
I added a language-filter plugin that filters pages by language while crawling. 
The language-identifier plugin must run before this plugin. The 
language-identifier plugin reads the language of a URL from its metadata and 
removes or accepts its outlinks according to "language.filter.type" in 
nutch-site.xml. If this parameter is set to accept, the plugin allows only the 
languages listed in "language.filter.languages", which must be ISO-639 language 
codes, and removes outlinks to pages in other languages. If it is set to 
filter, the plugin removes the outlinks of pages whose language matches a 
"language.filter.languages" entry.

> Crawl page with specified language
> ----------------------------------
>
>                 Key: NUTCH-1663
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1663
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.2.1
>            Reporter: İlhami KALKAN
>            Priority: Minor
>             Fix For: 2.3
>
>         Attachments: README.txt, language-filter.patch
>
>
> Users can crawl only pages in a specified language. For example, we want to 
> crawl only pages whose language is Turkish.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
