[
https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392129#comment-14392129
]
Jorge Luis Betancourt Gonzalez commented on NUTCH-1980:
-------------------------------------------------------
+1, this looks awesome. Can't wait to test it.
> Jexl expressions for CrawlDbReader
> ----------------------------------
>
> Key: NUTCH-1980
> URL: https://issues.apache.org/jira/browse/NUTCH-1980
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.11
>
>
> We already use Jexl expressions to filter records from HostDb dumps, which
> is really helpful when your CrawlDb is stuffed with metadata generated by
> parser filters, in our case mostly scores generated by classification plugins
> that run on text or structure.
> Because the HostDb operates on hosts only, it is easy to collect a set of
> sites that mostly host a specific language, pornographic content, or topics
> that your classifiers are trained for.
> Adding the same expression filtering to the CrawlDbReader lets you get lists
> of the actual records that match what you are looking for.
> Most of the work is already in the HostDb patch, so it is easy to translate
> to individual records. Patch tomorrow, probably...