[
https://issues.apache.org/jira/browse/NUTCH-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1980:
---------------------------------
Attachment: NUTCH-1980-1.9.patch
The new patch also works when numeric values have been added to the metadata
as strings.
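
To illustrate the idea only: the sketch below is not the patch and does not use Jexl or any Nutch class. It is a small Python stand-in that shows why string-typed numbers matter when filtering records by metadata. The names `coerce` and `filter_records`, the metadata keys `score` and `lang`, and the example URLs are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Record = Tuple[str, Dict[str, object]]  # (url, metadata), a made-up shape


def coerce(value: object) -> object:
    """Return an int or float when value is a numeric string, else value unchanged."""
    if isinstance(value, str):
        text = value.strip()
        for cast in (int, float):
            try:
                return cast(text)
            except ValueError:
                continue
    return value


def filter_records(
    records: Iterable[Record],
    predicate: Callable[[Dict[str, object]], bool],
) -> Iterator[str]:
    """Yield the URLs whose coerced metadata satisfies the predicate."""
    for url, metadata in records:
        # Coerce first, so "0.9" and 0.9 compare the same way in the predicate.
        context = {key: coerce(val) for key, val in metadata.items()}
        if predicate(context):
            yield url


records = [
    ("http://example.org/a", {"score": "0.9", "lang": "nl"}),   # score as string
    ("http://example.org/b", {"score": 0.2, "lang": "nl"}),     # score as number
    ("http://example.org/c", {"score": "0.75", "lang": "en"}),
]

# Stand-in for an expression along the lines of: lang == 'nl' && score > 0.5
selected = list(
    filter_records(records, lambda m: m["lang"] == "nl" and m["score"] > 0.5)
)
```

Without the coercion step, comparing the string "0.9" against 0.5 would fail or silently mismatch, depending on the expression engine, which is the case the new patch is meant to cover.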
> Jexl expressions for CrawlDbReader
> ----------------------------------
>
> Key: NUTCH-1980
> URL: https://issues.apache.org/jira/browse/NUTCH-1980
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1980-1.9.patch, NUTCH-1980-1.9.patch,
> NUTCH-1980-1.9.patch
>
>
> We already use Jexl expressions to filter records from HostDb dumps, which
> is really helpful when your CrawlDb is stuffed with metadata generated by
> parser filters, in our case mostly scores generated by classification plugins
> that run on text or structure.
> The HostDb filter operates on hosts only, so it is easy to collect a set of
> sites that mostly host a specific language, pornographic content, or topics
> that your classifiers are trained for.
> By adding the same filtering to the CrawlDbReader, you can get lists of the
> actual records that match what you are looking for.
> Most of the work is already in the HostDb patch, so it is easy to translate
> to individual records. Patch tomorrow, probably...