I'd like to use Julien's approach because I found the scoring filter complex to understand.
My use case is the following:

1. During scoring after parsing, I want to tag the pages that are interesting to me, say meta="HIT".
2. In the next step (to be created), I would like to prune the NON-HIT content from the segment in order to save segment space (I use Nutch caching). I typically need to ditch 90% of the segment data.

I am also considering focusing recrawls on HIT pages and their outlinks.

Today I don't really know if and how one can retrieve this metadata. I have managed to avoid storing "text" content for NON-HIT pages, but it is a dirty trick.

2010/1/19 Andrzej Bialecki (JIRA) <j...@apache.org>

> [ https://issues.apache.org/jira/browse/NUTCH-779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802175#action_12802175]
>
> Andrzej Bialecki commented on NUTCH-779:
> -----------------------------------------
>
> Personally I would use ScoringFilters because I'm familiar with the API,
> but the approach that you propose is certainly more user friendly especially
> for novice users.
>
> > Mechanism for passing metadata from parse to crawldb
> > ----------------------------------------------------
> >
> > Key: NUTCH-779
> > URL: https://issues.apache.org/jira/browse/NUTCH-779
> > Project: Nutch
> > Issue Type: New Feature
> > Reporter: Julien Nioche
> > Attachments: NUTCH-779
> >
> > The patch attached allows to pass parse metadata to the corresponding
> > entry of the crawldb.
> > Comments are welcome
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.

--
-MilleBii-