[ 
http://issues.apache.org/jira/browse/NUTCH-62?page=comments#action_12312857 ] 

Andrzej Bialecki  commented on NUTCH-62:
----------------------------------------

The latest SVN version already contains similar code (see 
parse-html/..../HTMLMetaProcessor.java). The only thing that is missing is to 
put the content meta tags into ParseData.metadata.

As you know, we actually have two places to put metadata into: one is 
Protocol.metadata, where all protocol-related metadata should be stored, and 
the other is ParseData.metadata, where parse-related metadata should be stored, 
which is the case here.

However, this could overwrite other properties coming from protocol 
handlers, or discovered by other plugins or other parts of the code. The 
"lang" tag is one such example; "content-encoding" and "charset" are others. 
The language identifier plugin works around this by using an 
"X-meta-lang" property name. (BTW: it could be cleaned up to avoid traversing 
the node tree once again, and instead make use of the already discovered meta 
tags, which are now passed as an argument to HtmlParseFilters.)

I suggest reworking this to use a consistent schema in both cases (i.e. 
Content.metadata and ParseData.metadata): let's put them under an 
"X-nutch-<name>" key (where <name> is e.g. the value of the key retrieved from 
HtmlMetaTags.getGeneralTags()), or an "X-nutch-http-equiv-<name>" key (where 
<name> is the value of the key retrieved from HtmlMetaTags.getHttpEquivTags()), 
and so on. So, this would be e.g. "X-nutch-robots", "X-nutch-base", 
"X-nutch-http-equiv-pragma", "X-nutch-http-equiv-refresh".

This way we can store all <meta> information, without any danger of overwriting 
the original values.
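
To make the proposed naming concrete, here is a rough sketch of how the keys could be built. It is only an illustration: MetaKeySchema and its methods are hypothetical names, not existing Nutch code, plain java.util.Map stands in for the real metadata containers, and lowercasing the tag names is my assumption, based on the lowercase examples above.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Nutch: builds the prefixed property names.
final class MetaKeySchema {
    static final String GENERAL_PREFIX = "X-nutch-";
    static final String HTTP_EQUIV_PREFIX = "X-nutch-http-equiv-";

    private MetaKeySchema() {}

    // <meta name="robots" ...> becomes "X-nutch-robots".
    static String generalKey(String name) {
        return GENERAL_PREFIX + normalize(name);
    }

    // <meta http-equiv="refresh" ...> becomes "X-nutch-http-equiv-refresh".
    static String httpEquivKey(String name) {
        return HTTP_EQUIV_PREFIX + normalize(name);
    }

    // Assumption: names are trimmed and lowercased so lookups are predictable.
    private static String normalize(String name) {
        return name.trim().toLowerCase(Locale.ROOT);
    }

    // Copies both groups of meta tags into target under prefixed keys.
    // Entries already stored under unprefixed keys are left as they are.
    static void addTo(Map<String, String> target,
                      Map<String, String> generalTags,
                      Map<String, String> httpEquivTags) {
        for (Map.Entry<String, String> e : generalTags.entrySet()) {
            target.put(generalKey(e.getKey()), e.getValue());
        }
        for (Map.Entry<String, String> e : httpEquivTags.entrySet()) {
            target.put(httpEquivKey(e.getKey()), e.getValue());
        }
    }
}
```

With this, a "content-encoding" value set by a protocol handler and an http-equiv "Content-Encoding" meta tag end up under different keys ("content-encoding" and "X-nutch-http-equiv-content-encoding"). One caveat of the sketch: a general tag whose name itself starts with "http-equiv-" would collide with the second prefix.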

> Add html META tag information into metaData in index-more plugin
> ----------------------------------------------------------------
>
>          Key: NUTCH-62
>          URL: http://issues.apache.org/jira/browse/NUTCH-62
>      Project: Nutch
>         Type: Improvement
>   Components: indexer
>     Reporter: Jack Tang
>     Priority: Trivial
>  Attachments: index-more.patch.zip
>
> Now (version dev-0.7), only some metadata from the HTTP response, such as 
> type, date, and content-length, is available in the index-more plugin. We 
> cannot index/store the metadata in the HTML header (the <META> tags).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


