[ 
https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066494#comment-13066494
 ] 

Lewis John McGibbney commented on NUTCH-62:
-------------------------------------------

Several of the comments above create some confusion about what to do 
to resolve this issue... or in fact what exactly the issue is that needs to be 
resolved!

Is there a requirement to rework the HTMLMetaProcessor class to incorporate the 
suggestions above, e.g. "consistent schema in both cases..."?

Protocol.metadata aside, what we are essentially talking about is picking up 
all Parsedata.metadata included within meta tags, which I assume we would wish 
to index at a later stage. Focussing on the HTMLMetaProcessor class, we already 
acquire the name, http-equiv and content attributes from meta tags. Would an 
improvement be to configure the class to pick up other attributes not already 
mentioned?
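
To make the question concrete, here is a rough, standalone sketch of the kind 
of name/http-equiv/content collection described above. It is not the actual 
HTMLMetaProcessor code: the class name MetaTagSketch, the collect method and 
the regex-based approach are my own assumptions for illustration only, and a 
real implementation would more likely walk the parsed DOM.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: maps each <meta> tag's
// name or http-equiv value (lower-cased) to its content value.
public class MetaTagSketch {

    // Matches a single <meta ...> tag and captures its attribute list.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches attr="value" or attr='value' inside a tag.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*(\"([^\"]*)\"|'([^']*)')");

    public static Map<String, String> collect(String html) {
        Map<String, String> metadata = new LinkedHashMap<>();
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String key = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String attrName = attr.group(1).toLowerCase(Locale.ROOT);
                // Group 3 holds a double-quoted value, group 4 a single-quoted one.
                String value = attr.group(3) != null ? attr.group(3) : attr.group(4);
                if (attrName.equals("name") || attrName.equals("http-equiv")) {
                    key = value.toLowerCase(Locale.ROOT);
                } else if (attrName.equals("content")) {
                    content = value;
                }
            }
            // Only keep tags that carry both a key and a content value.
            if (key != null && content != null) {
                metadata.put(key, content);
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
            + "<meta name=\"keywords\" content=\"nutch, search\">"
            + "<META HTTP-EQUIV=\"Content-Language\" CONTENT=\"en\">"
            + "</head></html>";
        System.out.println(collect(html));
    }
}
```

Picking up further attributes would amount to adding more branches next to 
the name/http-equiv/content checks, which is why a configurable list of 
attribute names might be worth considering.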

> Add html META tag information into metaData in index-more plugin
> ----------------------------------------------------------------
>
>                 Key: NUTCH-62
>                 URL: https://issues.apache.org/jira/browse/NUTCH-62
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jack Tang
>            Priority: Trivial
>         Attachments: index-more.patch.zip
>
>
> Now (version dev-0.7), only some metadata in the HTTP response, such as type, 
> date and content-length, is available in the index-more plugin. We cannot 
> index/store the metadata in the HTML header (<META>, to be exact).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
