[ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma reopened NUTCH-2231:
----------------------------------

If no expression is set, an error is logged, which should not happen.

> Jexl support in generator job
> -----------------------------
>
>                 Key: NUTCH-2231
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2231
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2231.patch, NUTCH-2231.patch
>
>
> The generator should support Jexl expressions. This would make it much easier
> to implement focused crawlers that rely on information stored in the CrawlDB.
> With the HostDB it is possible to restrict the generator to select only
> interesting records, but it is very cumbersome and involves
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers, so hyphens are
> transformed to underscores
> * string literals must be in quotes; only the surrounding quote character
> needs to be escaped with a backslash
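
The hyphen-to-underscore rule above can be illustrated with a minimal sketch. It assumes the transformation is a plain character replacement on the metadata key; the class and method names (JexlFieldNames, toJexlIdentifier, toJexlVariables) are hypothetical and are not taken from the Nutch source or the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JexlFieldNames {

    // Hypothetical helper: replace hyphens so the key can be written
    // as an identifier in an expression (Content-Type -> Content_Type).
    static String toJexlIdentifier(String metadataKey) {
        return metadataKey.replace('-', '_');
    }

    // Build the variable map that an expression such as
    // (Content_Type == 'text/html') could be evaluated against.
    static Map<String, Object> toJexlVariables(Map<String, String> metadata) {
        Map<String, Object> vars = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            vars.put(toJexlIdentifier(entry.getKey()), entry.getValue());
        }
        return vars;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        metadata.put("lang", "en");
        // prints {Content_Type=text/html, lang=en}
        System.out.println(toJexlVariables(metadata));
    }
}
```

This is why the second example above writes Content_Type rather than Content-Type: with a hyphen, the expression would presumably be parsed as a subtraction of two identifiers.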



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
