[ 
https://issues.apache.org/jira/browse/NUTCH-2231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163337#comment-15163337
 ] 

Hudson commented on NUTCH-2231:
-------------------------------

FAILURE: Integrated in Nutch-trunk #3355 (See 
[https://builds.apache.org/job/Nutch-trunk/3355/])
NUTCH-2231 Jexl support in generator job (markus: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1732177])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/crawl/CrawlDatum.java
* trunk/src/java/org/apache/nutch/crawl/CrawlDbReader.java
* trunk/src/java/org/apache/nutch/crawl/Generator.java
* trunk/src/java/org/apache/nutch/util/JexlUtil.java


> Jexl support in generator job
> -----------------------------
>
>                 Key: NUTCH-2231
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2231
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.11
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.12
>
>         Attachments: NUTCH-2231.patch, NUTCH-2231.patch
>
>
> Generator should support Jexl expressions. This would make it much easier to 
> implement focused crawlers that rely on information stored in the CrawlDB. 
> With the HostDB it is possible to restrict the generator to select only 
> interesting records, but doing so is very cumbersome and involves 
> domainblacklist-urlfiltering.
> With Jexl support, it is no hassle!
> Crawl only English records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(lang == 'en')"
> {code}
> Crawl only HTML records:
> {code}
> bin/nutch generate crawl/crawldb/ crawl/segments/ -expr "(Content_Type == 'text/html' || Content_Type == 'application/xhtml+xml')"
> {code}
> Keep in mind:
> * Jexl doesn't allow a hyphen/minus in field identifiers; they are 
> transformed to underscores
> * string literals must be in quotes; only the surrounding quote needs to be 
> escaped by backslash
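
The two notes above can be illustrated with a small shell sketch. The helper name `to_jexl_field` is hypothetical and is not part of Nutch; it only mimics the hyphen-to-underscore rule described in the issue so that an expression can be built from a metadata key such as `Content-Type`. How Nutch performs the transformation internally is not shown here.

```shell
# Hypothetical helper (not part of Nutch): replace hyphens with underscores,
# following the identifier rule described in the issue text.
to_jexl_field() {
  printf '%s' "$1" | tr '-' '_'
}

# Build an expression for the -expr option. The expression is wrapped in
# double quotes for the shell, so the single quotes around the string
# literal need no escaping.
field=$(to_jexl_field "Content-Type")
expr="(${field} == 'text/html')"

printf '%s\n' "$expr"
```

The resulting string could then be passed as the `-expr` argument in the same way as in the examples from the issue description.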



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
