[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061962#comment-13061962
 ] 

Julien Nioche commented on NUTCH-809:
-------------------------------------

It's been a long time and I'd forgotten about this one :-)

Obviously we don't need the QueryFilter anymore. I am not entirely happy with 
the indexing part of it though, as we handle only two values (description and 
keywords) whereas the parsing step is open to any values specified by the user.

We also have the urlmeta plugin, which tracks metadata specified in the seed 
lists and indexes it. The name of this plugin should be improved, BTW.

(thinking aloud) Why don't we have a generic indexing implementation which 
could index any metadata specified by the user, be it from the crawldb or the 
parse metadata? The parse-metatags plugin would then only deal with the parsing 
step and leave the indexing to this indexer, which could also be used by the 
existing urlmeta plugin (which would then only help with the transfer of the 
metadata from a root page to its outlinks).
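
To make the generic indexer idea concrete, here is a minimal sketch. It 
deliberately uses plain Java maps rather than any Nutch classes: the class name 
GenericMetadataIndexerSketch, the method index() and the idea of a configured 
key list are all assumptions for illustration, not an existing API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not a Nutch IndexingFilter, just the core copy logic.
class GenericMetadataIndexerSketch {

    /**
     * For each key the user configured, collect the values found in the
     * crawldb metadata and in the parse metadata and expose them as one
     * (possibly multivalued) index field. Keys with no value are skipped.
     */
    static Map<String, List<String>> index(
            List<String> configuredKeys,
            Map<String, List<String>> crawlDbMetadata,
            Map<String, List<String>> parseMetadata) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String key : configuredKeys) {
            List<String> values = new ArrayList<>();
            values.addAll(crawlDbMetadata.getOrDefault(key, Collections.emptyList()));
            values.addAll(parseMetadata.getOrDefault(key, Collections.emptyList()));
            if (!values.isEmpty()) {
                fields.put(key, values);
            }
        }
        return fields;
    }
}
```

With something like this, parse-metatags and urlmeta would only have to put 
their values into the metadata, and the choice of what gets indexed would live 
in a single place.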

We can also leave things as they are and just rename urlmeta to something 
like seed-metadata-propagation (or anything better) and keep the possibility of 
doing specific things in the indexing part of the metadata, for instance 
splitting the keywords into multiple fields.
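
As an example of such a specific step, the keyword splitting could look roughly 
like the sketch below. The comma delimiter is an assumption (it is the usual 
convention for keywords metatags, but nothing here confirms what the patch 
does), and KeywordSplitterSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns one raw keywords string into multiple values.
class KeywordSplitterSketch {

    /**
     * Splits on commas (assumed delimiter), trims each entry and drops
     * empty ones, so that each keyword can be added as its own value of
     * a multivalued field.
     */
    static List<String> split(String rawKeywords) {
        List<String> keywords = new ArrayList<>();
        if (rawKeywords == null) {
            return keywords;
        }
        for (String part : rawKeywords.split(",")) {
            String keyword = part.trim();
            if (!keyword.isEmpty()) {
                keywords.add(keyword);
            }
        }
        return keywords;
    }
}
```

This is the kind of field-specific handling a fully generic indexer would not 
do by default.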

 

> Parse-metatags plugin
> ---------------------
>
>                 Key: NUTCH-809
>                 URL: https://issues.apache.org/jira/browse/NUTCH-809
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>         Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of an HTMLParserFilter which takes as 
> parameter a list of metatag names, with '*' as the default value. The names 
> are separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The query-basic plugin is used to include these fields in the search, e.g. in 
> nutch-site.xml
> {code:xml}
> <property>
>   <name>query.basic.description.boost</name>
>   <value>2.0</value>
> </property>
> <property>
>   <name>query.basic.keywords.boost</name>
>   <value>2.0</value>
> </property>
> {code}
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com
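
For reference, the metatags.names format described in the quoted issue (names 
separated by ';', '*' as default) could be read along these lines. This is only 
a sketch: treating '*' as "accept every metatag" and lowercasing the names are 
assumptions, and MetatagNamesSketch is not a class from the patch.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical reader for a metatags.names style value.
class MetatagNamesSketch {

    /** Parses a ';'-separated list; a null value falls back to "*". */
    static Set<String> parse(String value) {
        Set<String> names = new LinkedHashSet<>();
        String effective = (value == null) ? "*" : value;
        for (String part : effective.split(";")) {
            // Lowercasing is an assumption, to compare names case-insensitively.
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    /** Assumes '*' means that every metatag should be extracted. */
    static boolean isWanted(Set<String> names, String metatagName) {
        return names.contains("*")
                || names.contains(metatagName.toLowerCase(Locale.ROOT));
    }
}
```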

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
