[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364218 ]
Jerome Charron commented on NUTCH-139:
--------------------------------------

> I think we're near agreement here.

I really hope ... ;-)

> We should add an add() method to Metadata, and change set() to replace all
> values rather than add a new value.

I'm not sure we are looking at the same piece of code, since this is how the
add() and set() methods work in the last attached patch
(http://issues.apache.org/jira/secure/attachment/12321740/NUTCH-139.060105.patch)

> MetadataNames belongs in the protocol package, not util

+1
(but in my mind there is no more MetadataNames... only Metadata,
ContentProperties and ParseProperties, no?)

> We should rename ContentProperties to Metadata

What about having a generic Metadata container extended by ContentProperties
and ParseProperties?
(as described in a previous comment:
http://issues.apache.org/jira/browse/NUTCH-139#action_12362618)
By having two separate maps (one for Content and one for Parse in ParseData),
we easily handle the problem of original value / final value, and we avoid
copying the Content metadata map to the Parse metadata map in all parsers:

    ContentProperties metadata = new ContentProperties();
    metadata.putAll(content.getMetadata()); // copy through


> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM,
>               although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
>               NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything
> that they want, such as having names of "Content-type", "content-TyPe",
> "CONTENT_TYPE" all having the same meaning. Stefan G., I believe, proposed a
> solution in which all property names are converted to lower case, but in
> essence this really only fixes half the problem (identifying that
> "CONTENT_TYPE" and "conTeNT_TyPE" and all the permutations are really the
> same). What if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of
> named Strings in the ParseData class that the protocol framework and the
> parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something
> like:
> public class ParseData{
>   .....
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ....
> }
> In this fashion, users could at least know the names of the standard
> properties that they can obtain from the ParseData, for example by making
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE,
> "text/xml"); Of course, this wouldn't preclude users from doing what they are
> currently doing, it would just provide a standard method of obtaining some of
> the more common, critical metadata without poring over the code base to
> figure out what they are named.
> I'll contribute a patch near the end of this week, or beg. of next week
> that addresses this issue.
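
For readers following the thread, the add()/set() semantics and the "generic Metadata container extended by ContentProperties and ParseProperties" idea from the comment above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code in NUTCH-139.060105.patch: the internal map, the getValues() accessor, and the empty subclasses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical multi-valued metadata container (not the class from the patch). */
class Metadata {
    // Assumption: each property name maps to an ordered list of values.
    private final Map<String, List<String>> values = new HashMap<>();

    /** Appends a value, keeping any values already stored under the name. */
    public void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces all values stored under the name with the single given value. */
    public void set(String name, String value) {
        List<String> replaced = new ArrayList<>();
        replaced.add(value);
        values.put(name, replaced);
    }

    /** Returns the first value stored under the name, or null if there is none. */
    public String get(String name) {
        List<String> list = values.get(name);
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    /** Returns a read-only view of all values stored under the name. */
    public List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(list);
    }
}

/** Hypothetical: metadata as received from the protocol layer (original values). */
class ContentProperties extends Metadata { }

/** Hypothetical: metadata produced by a parser (final values). */
class ParseProperties extends Metadata { }
```

With two separate containers, a parser can record its own value in a ParseProperties instance while the ContentProperties instance keeps the original, so no copy-through is needed.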