[ 
https://issues.apache.org/jira/browse/NUTCH-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144237#comment-16144237
 ] 

Markus Jelsma commented on NUTCH-2414:
--------------------------------------

Although filtering on the lang field is a good idea, I think we can take this 
a step further if we provide a filter that supports a whole variety of JEXL 
expressions. If an expression evaluates to true, pass the document; otherwise 
discard it.

This would solve your problem, and give anyone a flexible means of 
discarding documents any way they want.
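
To make the idea concrete, here is a rough sketch of the pass/discard rule. 
It is not Nutch or JEXL code: the class and method names are hypothetical, a 
plain Map stands in for the document's fields, and a java.util.function 
Predicate stands in for a compiled JEXL expression such as 
lang == 'en' || lang == 'nl'. Returning null to signal "discard" is likewise 
an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch only; none of these names come from Nutch or JEXL.
final class ExpressionFilterSketch {

    // Stand-in for a compiled expression like: lang == 'en' || lang == 'nl'
    static final Predicate<Map<String, Object>> EN_OR_NL =
        fields -> "en".equals(fields.get("lang")) || "nl".equals(fields.get("lang"));

    // Returns the document unchanged if the expression evaluates to true,
    // or null (assumed here to mean "discard") otherwise.
    static Map<String, Object> filter(Map<String, Object> doc,
                                      Predicate<Map<String, Object>> expression) {
        return expression.test(doc) ? doc : null;
    }

    // Helper that builds a minimal document carrying only a lang field.
    static Map<String, Object> docWithLang(String lang) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("lang", lang);
        return doc;
    }

    public static void main(String[] args) {
        // An English document passes; a French one is discarded.
        System.out.println(filter(docWithLang("en"), EN_OR_NL) != null);
        System.out.println(filter(docWithLang("fr"), EN_OR_NL) != null);
    }
}
```

Since the expression sees every field of the document, the same mechanism 
would cover the language case from this issue as well as filtering on any 
other field.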

> Allow LanguageIndexingFilter to actually filter documents by language.
> ----------------------------------------------------------------------
>
>                 Key: NUTCH-2414
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2414
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.13
>            Reporter: Yossi Tamari
>            Priority: Minor
>
> It is often useful to only index pages in select languages (e.g. only those 
> languages that we intend to search in). At first glance it seems that this is 
> done by LanguageIndexingFilter, but currently all the filter does is add the 
> language as a field to the index.
> We can add a configuration property to LanguageIndexingFilter that will allow 
> it to only index languages specified in this property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
