[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] 

Doug Cutting commented on NUTCH-139:
------------------------------------

Jerome,

Some HTTP headers have multiple values.  Correctly reflecting that was, I 
thought, the primary motivation for adding multiple values, not recording 
historical values.

I still don't see a reason why the derived content type needs to be stored 
anywhere but in the contentType field of the Content.  And if a derived value 
ever needs to go into the metadata, it should always use an x-nutch key, so 
that it can be clearly distinguished from original values.

Chris,

The content length is not expensive to compute, it's simply the length of the 
content byte array.  Are there uses of content length where this is 
impractical?  If so, then perhaps we could, for performance, cache a 
protocol-independent, derived content length in an x-nutch header. 
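
A minimal sketch of that caching idea, for illustration only.  The class 
name, the plain Map used for metadata, and the exact key 
"x-nutch-content-length" are assumptions, not existing Nutch code:

```java
import java.util.Map;

/** Hypothetical helper, not part of Nutch. */
class DerivedContentLength {
    // Assumed key name; the point is only that it carries an x-nutch prefix.
    static final String KEY = "x-nutch-content-length";

    /** The derived length is simply the size of the content byte array. */
    static int compute(byte[] content) {
        return content.length;
    }

    /** Caches the derived value under the x-nutch key; original headers are left untouched. */
    static void cache(Map<String, String> metadata, byte[] content) {
        metadata.put(KEY, Integer.toString(compute(content)));
    }
}
```

Because the cached value lives under its own key, a protocol-supplied 
"Content-Length" header, if present, keeps its original value.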

Alternately, we could prefix all protocol headers with the protocol name, so 
that the HTTP "Content-Language" header could be stored as something like 
"http:Content-Language".  Then Nutch could avoid using the x-nutch prefix, and 
instead store the derived, protocol-independent value as simply "language".

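The prefixing alternative could look roughly like the sketch below.  The 
class and method names are hypothetical and the storage is a plain map; only 
the "http:Content-Language" and "language" keys come from the text above:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical metadata holder, not part of Nutch. */
class PrefixedMetadata {
    private final Map<String, String> values = new LinkedHashMap<>();

    /** Stores an original protocol header under "<protocol>:<name>". */
    void putProtocolHeader(String protocol, String name, String value) {
        values.put(protocol + ":" + name, value);
    }

    /** Stores a derived, protocol-independent value under a bare key. */
    void putDerived(String name, String value) {
        values.put(name, value);
    }

    /** Returns the stored value, or null if the key is absent. */
    String get(String key) {
        return values.get(key);
    }
}
```

With this layout, putProtocolHeader("http", "Content-Language", "en-US") and 
putDerived("language", "en") can coexist without either overwriting the 
other.
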
Yes, these are issues of policy, but this patch violates my ideas about the 
correct policy.  We should not confuse protocol-specific HTTP headers with 
protocol-independent derived values.  And multiple values should be the 
exception, used in cases where multiple values are really sensible (like email 
"Received" headers), not to store historic values.

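To illustrate the distinction, here is a rough sketch of a multi-valued 
header store in which add() appends (for headers that genuinely repeat) and 
set() replaces (so a single-valued header never accumulates history).  The 
class and its methods are hypothetical, and key case-folding is deliberately 
left out:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued header store, not part of Nutch. */
class MultiValuedHeaders {
    private final Map<String, List<String>> headers = new LinkedHashMap<>();

    /** Appends a value; intended for headers that legitimately repeat. */
    void add(String name, String value) {
        headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces any existing values, so no historic values are kept. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        headers.put(name, single);
    }

    /** Returns all values for a header, or an empty list if there are none. */
    List<String> getAll(String name) {
        return Collections.unmodifiableList(
            headers.getOrDefault(name, Collections.emptyList()));
    }
}
```
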
> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> they want, so names such as "Content-type", "content-TyPe", and 
> "CONTENT_TYPE" all carry the same meaning. Stefan G., I believe, proposed a 
> solution in which all property names are converted to lower case, but in 
> essence this fixes only half the problem (identifying that "CONTENT_TYPE" 
> and "conTeNT_TyPE" and all the other case permutations are really the 
> same). What if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the names of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
