Markus Jelsma created NUTCH-1980:
------------------------------------

             Summary: Jexl expressions for CrawlDbReader
                 Key: NUTCH-1980
                 URL: https://issues.apache.org/jira/browse/NUTCH-1980
             Project: Nutch
          Issue Type: New Feature
          Components: crawldb
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
             Fix For: 1.11


We already use Jexl expressions to filter records from HostDb dumps. The same 
filtering would be really helpful when your CrawlDb is stuffed with metadata 
generated by parser filters, in our case mostly scores generated by 
classification plugins that run on text or structure.

The HostDb operates on hosts only, so it is easy to collect a set of sites 
that mostly host a specific language, pornographic content, or topics that 
your classifiers are trained for.

By adding Jexl expression support to the CrawlDbReader, you can get lists of 
the individual CrawlDb records whose metadata matches what you are looking for.
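
To illustrate the idea, here is a minimal sketch of per-record filtering. It 
is not the patch and does not use the Jexl or Nutch APIs: a plain Java 
Predicate stands in for the expression, and the Record class, the metadata 
keys "lang" and "topicScore", and the URLs are all made up for the example. 
The predicate corresponds roughly to what an expression along the lines of 
lang == 'nl' && topicScore > 0.8 might express.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

public class RecordFilterSketch {

    // Hypothetical stand-in for a CrawlDb record: a URL plus its metadata.
    static final class Record {
        final String url;
        final Map<String, Object> metadata;

        Record(String url, Map<String, Object> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    // Builds a metadata map with the two made-up keys used in this sketch.
    static Map<String, Object> meta(String lang, double topicScore) {
        Map<String, Object> m = new HashMap<>();
        m.put("lang", lang);
        m.put("topicScore", topicScore);
        return m;
    }

    // Illustrative sample data, not taken from any real CrawlDb.
    static List<Record> sampleRecords() {
        List<Record> records = new ArrayList<>();
        records.add(new Record("http://example.org/a", meta("nl", 0.93)));
        records.add(new Record("http://example.org/b", meta("en", 0.95)));
        records.add(new Record("http://example.org/c", meta("nl", 0.40)));
        return records;
    }

    // Returns the URLs of all records whose metadata satisfies the expression.
    static List<String> select(List<Record> records,
                               Predicate<Map<String, Object>> expr) {
        List<String> urls = new ArrayList<>();
        for (Record r : records) {
            if (expr.test(r.metadata)) {
                urls.add(r.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for the expression: Dutch pages with a high topic score.
        Predicate<Map<String, Object>> expr = m ->
            "nl".equals(m.get("lang"))
                && ((Number) m.getOrDefault("topicScore", 0.0)).doubleValue() > 0.8;

        System.out.println(select(sampleRecords(), expr));
    }
}
```

The difference from the HostDb case is only the unit being tested: the 
expression is evaluated once per record instead of once per host.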

Most work is already in the HostDb patch so it is easy to translate to 
individual records. Patch tomorrow, probably...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
