[
https://issues.apache.org/jira/browse/NUTCH-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105994#comment-13105994
]
Markus Jelsma commented on NUTCH-1005:
--------------------------------------
{quote}
You are right. I'd read your comments too quickly.
'db.parsemeta.to.crawldb' could be used to copy the values extracted by your
parser into the crawldb and from there reuse URLMetaIndexingFilter which will
index any metadata stored in the crawldb and listed in urlmeta.tags.
This means using the crawldb as a temporary storage, which probably does not
make too much sense.
{quote}
Indeed. The less data there is in the CrawlDB, the better.
{quote}
What we should probably do is rename url-meta to something more meaningful
and make it more generic. We should have an indexer able to index anything
stored as crawldb, fetch or parse metadata via configuration. Then people would
only have to define custom parsers; the indexing part should be doable in a
generic way.
I seem to remember that I had filed a patch for parsing / indexing description
and keywords from HTML docs, which is quite close to what you are proposing.
Why not have it all in one parser, or at least in one plugin?
{quote}
I believe this is what you're looking for:
https://issues.apache.org/jira/browse/NUTCH-809
I agree it would be nice to have such a mechanism, but does that mean that, in
your opinion, this plugin should not be included?
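
For illustration only, the generic, configuration-driven indexing idea quoted
above could be sketched roughly as follows. This is a standalone sketch, not
Nutch code: the class GenericMetadataIndexer and the methods parseKeys and
copyConfigured are hypothetical names, plain Maps stand in for the crawldb,
fetch or parse metadata, and the comma separated key list is passed in as a
String instead of being read from a real configuration property.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Nutch.
public class GenericMetadataIndexer {

    /** Splits a comma separated config value into trimmed, non-empty keys. */
    public static List<String> parseKeys(String configValue) {
        List<String> keys = new ArrayList<>();
        if (configValue == null) {
            return keys;
        }
        for (String part : configValue.split(",")) {
            String key = part.trim();
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    /**
     * Copies only the configured keys from one metadata source (assumed to
     * stand in for crawldb, fetch or parse metadata) into the index document.
     * Keys that are configured but absent from the source are skipped.
     */
    public static void copyConfigured(Map<String, String> source,
                                      String configValue,
                                      Map<String, String> doc) {
        for (String key : parseKeys(configValue)) {
            String value = source.get(key);
            if (value != null) {
                doc.put(key, value);
            }
        }
    }

    public static void main(String[] args) {
        // Assumed output of a custom parser, e.g. extracted headings.
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("h1", "Welcome");
        parseMeta.put("h2", "Getting started");
        parseMeta.put("generator", "not configured, so not indexed");

        Map<String, String> doc = new LinkedHashMap<>();
        copyConfigured(parseMeta, "h1, h2, h3", doc);
        System.out.println(doc);
    }
}
```

With something along these lines, a custom parser would only need to put its
values into the metadata; which of them end up in the index would be decided
by configuration alone, without a dedicated indexing filter per plugin.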
> Index headings plugin
> ---------------------
>
> Key: NUTCH-1005
> URL: https://issues.apache.org/jira/browse/NUTCH-1005
> Project: Nutch
> Issue Type: New Feature
> Components: indexer, parser
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.4
>
> Attachments: HeadingsIndexingFilter.java, HeadingsParseFilter.java,
> NUTCH-1005-1.4-2.patch, NUTCH-1005-1.4-3.patch
>
>
> Very simple plugin for extracting and indexing a comma-separated list of
> headings via the headings configuration directive.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira