[
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma reopened NUTCH-2231:
----------------------------------
If no expression is set, an error is logged, which should not happen.
> Jexl support in generator job
> -----------------------------
>
> Key: NUTCH-2231
> URL: https://issues.apache.org/jira/browse/NUTCH-2231
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.11
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2231.patch, NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to
> implement focused crawlers that rely on information stored in the CrawlDB.
> With the HostDB it is possible to restrict the generator to select only
> interesting records, but it is very cumbersome and involves
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type ==
> 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; hyphens are
> transformed to underscores
> * string literals must be in quotes; inside a literal, only the surrounding
> quote character needs to be escaped with a backslash
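
To illustrate the hyphen note above, here is a minimal sketch of how field names such as Content-Type could be mapped to Jexl-safe identifiers such as Content_Type before an expression is evaluated. The class JexlFieldNames and its methods are hypothetical and are not taken from the attached patch; the sketch assumes a plain replacement of every '-' with '_'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Nutch or the patch.
final class JexlFieldNames {

    // Assumption: every hyphen in a field name becomes an underscore.
    static String toJexlIdentifier(String fieldName) {
        return fieldName.replace('-', '_');
    }

    // Copies the metadata into a new map whose keys are usable as Jexl identifiers.
    static Map<String, String> toJexlContext(Map<String, String> metadata) {
        Map<String, String> context = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            context.put(toJexlIdentifier(entry.getKey()), entry.getValue());
        }
        return context;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "text/html");
        metadata.put("lang", "en");
        // The expression would then refer to Content_Type, not Content-Type.
        System.out.println(toJexlContext(metadata));
    }
}
```

This is why the HTML example in the description is written with Content_Type even though the stored field name contains a hyphen.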