[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360906 ] 

Jerome Charron commented on NUTCH-139:
--------------------------------------

Andrzej,

Here are more comments about my doubts, and how to handle metadata names.

if for instance a protocol plugin doesn't have any Content-Length information 
(no header like in FTP), then is should compute the content-length and add it 
in X-nutch-content-length attribute.
But what do you suggest if a protocol have a Content-Length header (HTTP may 
provide one)?
My feeling is adding the two metadata:
1. One for the Content-Length header in the Content-Length attribute
2. One for the real Content-Length (computed) in the X-nutch-content-length 
attribute.

In other words and more generally:
* When adding a native protocol header, if an equivalent x-nutch attribute 
exists in MetadataNames, then it must be added too with the same value, or with 
a more precise value.
* If no header information is available, tries to fill the more x-nutch 
attribute the protocol level can.

Do you agree with that?



> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the name of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"); Of course, this wouldn't preclude users from doing what they are 
> currently doing, it would just provide a standard method of obtaining some of 
> the more common, critical metadata without pouring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of the this week, or beg. of next week 
> that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to