[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-809:
--------------------------------

    Attachment: NUTCH-809.patch

Modified version of the plugin which is compatible with parse-tika

> Parse-metatags plugin
> ---------------------
>
>                 Key: NUTCH-809
>                 URL: https://issues.apache.org/jira/browse/NUTCH-809
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see
> [TIKA-379]).*
> To use the legacy HTML parser, specify in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter which takes as
> parameter a list of metatag names, with '*' as the default value. The values
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you
> must specify in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields,
> 'keywords' and 'description'. Note that keywords is multivalued.
> The MetaTagsQueryFilter allows the fields above to be included in Nutch
> queries.
> This code has been developed by DigitalPebble Ltd and offered to the
> community by ANT.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.