[ https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609671#comment-14609671 ]
Markus Jelsma commented on NUTCH-1980:
--------------------------------------
Committed to trunk in revision 1688569.
> Jexl expressions for CrawlDbReader
> ----------------------------------
>
> Key: NUTCH-1980
> URL: https://issues.apache.org/jira/browse/NUTCH-1980
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch,
> NUTCH-1980-1.9.patch, NUTCH-1980.patch
>
>
> We already use Jexl expressions to filter records from HostDb dumps. This
> kind of filtering is really helpful when your CrawlDb is full of metadata
> generated by parser filters, in our case mostly scores generated by
> classification plugins that run on text or structure.
> For the HostDb, the filter operates on hosts only, so it is easy to
> collect a set of sites that mostly host a specific language, pornographic
> content, or topics that your classifiers are trained for.
> Adding the same expression support to the CrawlDbReader lets you list the
> actual records that match what you are looking for.
> Most of the work is already done in the HostDb patch, so it is easy to carry
> over to individual records. Patch tomorrow, probably...
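
The filtering idea described above can be sketched in plain Java. This is only an illustration, not the patch: it does not use Nutch or the Jexl library, and a `java.util.function.Predicate` stands in for a compiled expression. The class `CrawlDbFilterSketch`, the `Record` type, and the `sportsScore` metadata key are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class CrawlDbFilterSketch {

    /** Hypothetical stand-in for a CrawlDb entry: a URL plus its metadata. */
    public static final class Record {
        final String url;
        final Map<String, Object> metadata;

        public Record(String url, Map<String, Object> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Returns the URLs of the records whose metadata satisfies the expression. */
    public static List<String> filter(List<Record> records,
                                      Predicate<Map<String, Object>> expression) {
        List<String> kept = new ArrayList<>();
        for (Record record : records) {
            if (expression.test(record.metadata)) {
                kept.add(record.url);
            }
        }
        return kept;
    }

    /** Builds two sample records carrying an invented classifier score. */
    public static List<Record> sampleRecords() {
        Map<String, Object> a = new HashMap<>();
        a.put("sportsScore", 0.93);
        Map<String, Object> b = new HashMap<>();
        b.put("sportsScore", 0.12);

        List<Record> records = new ArrayList<>();
        records.add(new Record("http://example.org/a", a));
        records.add(new Record("http://example.org/b", b));
        return records;
    }

    public static void main(String[] args) {
        // Plain-Java stand-in for an expression such as "sportsScore > 0.8".
        Predicate<Map<String, Object>> expression =
            m -> ((Double) m.getOrDefault("sportsScore", 0.0)) > 0.8;
        System.out.println(filter(sampleRecords(), expression));
    }
}
```

With the sample data, only `http://example.org/a` passes the filter. In the real feature the predicate would presumably be a Jexl expression evaluated against each record's fields and metadata.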