[ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2231:
---------------------------------
    Description: 
Generator should support Jexl expressions. This would make it much easier to 
implement focused crawlers that rely on information stored in the CrawlDB. 
With the HostDB it is possible to restrict the generator to select only 
interesting records, but it is very cumbersome and involves 
domainblacklist-urlfiltering.

With Jexl support, it is no hassle!

Crawl only English records:
{code}
bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
{code}

Crawl only HTML records:
{code}
bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
{code}

Keep in mind:
* Jexl doesn't allow a hyphen/minus in field identifiers, so hyphens are 
transformed to underscores (Content-Type becomes Content_Type)
* string literals must be in quotes; inside a literal, only the surrounding 
quote character needs to be escaped by a backslash
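
The identifier rule above can be sketched in shell. This is only an illustration: to_jexl_field is a hypothetical helper, not part of Nutch, and it assumes that replacing every hyphen with an underscore is the only transformation applied to field names.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): turn a metadata key such as
# "Content-Type" into the form usable in an expression, assuming hyphens
# are simply replaced by underscores.
to_jexl_field() {
  printf '%s\n' "$1" | sed 's/-/_/g'
}

# Rebuild the HTML-only expression from the example above.
field=$(to_jexl_field "Content-Type")
expr="($field == 'text/html' || $field == 'application/xhtml+xml')"

# Print the expression so it can be checked before it is used.
printf '%s\n' "$expr"
```

The resulting string could then be passed along as -expr "$expr"; the outer double quotes keep the single-quoted string literals intact for the shell.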


  was:
Generator should support Jexl expressions. This would make it much easier to 
implement focussing crawlers that rely on information stored in the CrawlDB. 
With the HostDB it is possible to restrict the generator to select only 
interesting records but it is very cumbersome and involves 
domainblacklist-urlfiltering.

With Jexl support, it is no hassle!


> Jexl support in generator job
> -----------------------------
>
>                 Key: NUTCH-2231
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2231
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records, but it is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers, so hyphens are 
> transformed to underscores (Content-Type becomes Content_Type)
> * string literals must be in quotes; inside a literal, only the surrounding 
> quote character needs to be escaped by a backslash



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
