Standard metadata property names in the ParseData metadata 
-----------------------------------------------------------

         Key: NUTCH-139
         URL: http://issues.apache.org/jira/browse/NUTCH-139
     Project: Nutch
        Type: Improvement
  Components: fetcher  
    Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev    
 Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, 
although bug is independent of environment
    Reporter: Chris A. Mattmann
 Assigned to: Chris A. Mattmann 
     Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6


Currently, people are free to name their string-based properties anything 
they want, so names such as "Content-type", "content-TyPe", and 
"CONTENT_TYPE" can all carry the same meaning. I believe Stefan G. proposed a 
solution in which all property names are converted to lower case, but that 
really only fixes half the problem: it identifies "CONTENT_TYPE", 
"conTeNT_TyPE", and all the other case permutations as the same name. What 
about a property named "Content     Type" or "ContentType"?
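
To make the limitation concrete, here is a minimal sketch of what lower-casing 
alone does to the names mentioned above. The class LowerCaseOnlyDemo and its 
helper methods are hypothetical and exist only for this illustration; they are 
not part of Nutch.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class, not part of Nutch.
class LowerCaseOnlyDemo {

    // The proposed normalization on its own: lower-case the name, nothing else.
    static String lowerCaseOnly(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Collects the distinct keys that remain after lower-casing every name.
    static Set<String> distinctKeys(String... names) {
        Set<String> keys = new LinkedHashSet<>();
        for (String name : names) {
            keys.add(lowerCaseOnly(name));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Case permutations of one spelling collapse to a single key.
        System.out.println(distinctKeys("CONTENT_TYPE", "conTeNT_TyPE"));

        // Different separators or no separator still yield different keys.
        System.out.println(distinctKeys(
                "Content-type", "CONTENT_TYPE", "Content     Type", "ContentType"));
    }
}
```

The first call leaves one key, while the second still leaves four, which is 
the half of the problem that lower-casing does not address.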

I propose correcting this by creating a standard set of named String 
constants in the ParseData class that the protocol framework and the parsing 
framework could use to identify common properties such as "Content-type", 
"Creator", "Language", etc.

 The properties would be defined at the top of the ParseData class, something 
like:

 public class ParseData {

    .....

     // Standard metadata property names
     public static final String CONTENT_TYPE = "content-type";
     public static final String CREATOR = "creator";

    ....

 }


In this fashion, users would at least know the names of the standard 
properties that they can obtain from the ParseData, for example by calling 
ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content type, 
or ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") to set it. 
Of course, this wouldn't preclude users from doing what they are currently 
doing; it would just provide a standard way of obtaining some of the more 
common, critical metadata without poring over the code base to figure out 
what the properties are named.
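
As a rough, self-contained sketch of the idea (not the actual patch), the 
following stand-in class shows the constants being used for both writing and 
reading metadata. ParseDataSketch is a hypothetical name, and backing the 
metadata with java.util.Properties is an assumption made only so the example 
compiles on its own; the real ParseData class may store and expose its 
metadata differently.

```java
import java.util.Properties;

// Hypothetical stand-in for ParseData, used only for illustration.
class ParseDataSketch {

    // Standard property names, as proposed above.
    public static final String CONTENT_TYPE = "content-type";
    public static final String CREATOR = "creator";

    // Assumption: metadata is held in a java.util.Properties instance.
    private final Properties metadata = new Properties();

    Properties getMetadata() {
        return metadata;
    }

    public static void main(String[] args) {
        ParseDataSketch parseData = new ParseDataSketch();

        // A parser or protocol plugin stores the value under the shared constant...
        parseData.getMetadata().setProperty(ParseDataSketch.CONTENT_TYPE, "text/xml");

        // ...and a consumer reads it back without guessing how the key is spelled.
        System.out.println(parseData.getMetadata().getProperty(ParseDataSketch.CONTENT_TYPE));
    }
}
```

Because both sides refer to the same constant, a misspelled key becomes a 
compile-time error rather than a silent lookup miss.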

I'll contribute a patch that addresses this issue near the end of this week 
or the beginning of next week.


