[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360681 ]
Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------
Hi Doug, Jerome,
>I'm confused as to why all of the constant names have "X_nutch" in them. I'd
>expect to see something like that in their string values, but their
> names are already qualified by org.apache.nutch.ParseData, no?
Err, whoops. My fault. I misinterpreted what Andrzej was saying in his comment:
"I agree, too. Perhaps we should use the names as they appear in the Dublin
Core for those properties that are defined there - just prepended them with
"X-nutch-" in order to avoid name-clashes with other properties (e.g. blindly
copied from the protocol headers). "
I'll fix this right quick.
>Also, it would be easier if these were all defined in an interface, something
>like MetadataNames. That way a class can "implement" that interface
>and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.
Yuppers, I agree on this one too. In fact, while I was making the patch, I was
thinking to myself ("hey, this would probably be a good idea to have in its own
interface class..."), but since no one objected to my initial proposal on
the dev list to put them into ParseData, I just put them there. So, yeah, I'll fix
this right quick as well.
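For illustration, the interface approach suggested above could look roughly like
the sketch below. This is only a sketch, not the contents of the patch: the
MetadataNames name comes from the suggestion, CONTENT_TYPE and CREATOR reuse the
values from the issue description, and LANGUAGE, ExampleParser, and the plain
Map standing in for the ParseData metadata are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface holding the standard property names.
// Interface fields are implicitly public static final.
interface MetadataNames {
    String CONTENT_TYPE = "content-type";
    String CREATOR = "creator";
    String LANGUAGE = "language"; // assumed value, not taken from the patch
}

// A class that "implements" the interface can use the short names directly.
// ExampleParser is a made-up class; a plain Map stands in for the metadata.
class ExampleParser implements MetadataNames {
    Map<String, String> extract() {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, "text/xml");
        metadata.put(LANGUAGE, "en");
        return metadata;
    }
}

class MetadataNamesDemo {
    public static void main(String[] args) {
        Map<String, String> metadata = new ExampleParser().extract();
        // Code that does not implement the interface qualifies the name instead.
        System.out.println(metadata.get(MetadataNames.CONTENT_TYPE));
    }
}
```

Classes that do not implement the interface can still refer to the constants in
qualified form, e.g. MetadataNames.CONTENT_TYPE.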
Updated patch...on its way!
> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
> Key: NUTCH-139
> URL: http://issues.apache.org/jira/browse/NUTCH-139
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM,
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything
> they want, so names such as "Content-type", "content-TyPe", and
> "CONTENT_TYPE" can all have the same meaning. Stefan G., I believe, proposed a
> solution in which all property names are converted to lower case, but in
> essence this only fixes half the problem (identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all their case permutations are really the same). What
> if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of
> named Strings in the ParseData class that the protocol framework and the
> parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something
> like:
> public class ParseData {
>   .....
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ....
> }
> In this fashion, users would at least know the names of the standard
> properties that they can obtain from the ParseData, for example by
> calling ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> content type, or ParseData.getMetadata().set(ParseData.CONTENT_TYPE,
> "text/xml") to set it. Of course, this wouldn't preclude users from doing
> what they are currently doing; it would just provide a standard method of
> obtaining some of the more common, critical metadata without poring over the
> code base to figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next
> week, that addresses this issue.
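
The "half the problem" point in the description can be seen in a small sketch.
The LowerCaseOnlyDemo class and its lowerCase helper are made up for this
example and are not part of Nutch; they only model a normalization step that
lower-cases a property name and does nothing else.

```java
import java.util.Locale;

class LowerCaseOnlyDemo {
    // Normalizes a property name by lower-casing it and nothing else.
    static String lowerCase(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Case variants collapse to the same key, so this prints true.
        System.out.println(
            lowerCase("CONTENT_TYPE").equals(lowerCase("conTeNT_TyPE")));

        // A different separator survives lower-casing, so this prints false.
        System.out.println(
            lowerCase("Content-type").equals(lowerCase("Content Type")));

        // A missing separator also survives, so this prints false as well.
        System.out.println(
            lowerCase("Content-type").equals(lowerCase("ContentType")));
    }
}
```

Shared constants avoid the question altogether, because callers never spell
the key themselves.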