[
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360681 ]
Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------
Hi Doug, Jerome,
>I'm confused as to why all of the constant names have "X_nutch" in them. I'd
>expect to see something like that in their string values, but their
> names are already qualified by org.apache.nutch.ParseData, no?
Err, whoops. My fault. I misinterpreted what Andrzej was saying in his comment:
"I agree, too. Perhaps we should use the names as they appear in the Dublin
Core for those properties that are defined there - just prepended them with
"X-nutch-" in order to avoid name-clashes with other properties (e.g. blindly
copied from the protocol headers). "
I'll fix this right quick.
>Also, it would be easier if these were all defined in an interface, something
>like MetadataNames. That way a class can "implement" that interface
>and then simply use the short names in code, e.g. CONTENT_TYPE, AUTHOR, etc.
Yuppers, I agree on this one too. In fact, while I was making the patch, I was
thinking ("hey, this would probably be a good idea to have in its own
interface class..."), but since no one objected to my initial proposal on
the dev list to put them into ParseData, I just put them there. So, yeah, I'll
fix this right quick as well.
Updated patch...on its way!
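For reference, a minimal sketch of the interface idea discussed above. It is not the actual patch: the interface name MetadataNames comes from Doug's "something like MetadataNames", the two string values are the ones in the issue description, and the demo class, its describe method, and the plain Map standing in for the ParseData metadata are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface, named after the "something like MetadataNames" suggestion.
// Fields declared in an interface are implicitly public static final.
interface MetadataNames {
    String CONTENT_TYPE = "content-type"; // value taken from the issue description
    String CREATOR = "creator";           // value taken from the issue description
}

// Hypothetical class: by "implementing" the interface it can use the short names directly.
class MetadataNamesDemo implements MetadataNames {

    // A plain Map stands in for the ParseData metadata here (an assumption).
    static Map<String, String> describe(String contentType, String creator) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(CONTENT_TYPE, contentType); // no "ParseData." qualifier needed
        metadata.put(CREATOR, creator);
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = describe("text/xml", "someone");
        System.out.println(metadata.get(CONTENT_TYPE));
    }
}
```

Code that does not implement the interface can still refer to the qualified name, e.g. MetadataNames.CONTENT_TYPE.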
> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
> Key: NUTCH-139
> URL: http://issues.apache.org/jira/browse/NUTCH-139
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM,
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything
> that they want, such as having names of "Content-type", "content-TyPe",
> "CONTENT_TYPE" all having the same meaning. Stefan G., I believe, proposed a
> solution in which all property names are converted to lower case, but in
> essence this really only fixes half the problem (identifying that
> "CONTENT_TYPE" and "conTeNT_TyPE" and all the case permutations are really
> the same). What if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of
> named Strings in the ParseData class that the protocol framework and the
> parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something
> like:
> public class ParseData {
>   .....
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ....
> }
> In this fashion, users could at least know what the names of the standard
> properties that they can obtain from the ParseData are, for example by making
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE,
> "text/xml"). Of course, this wouldn't preclude users from doing what they are
> currently doing; it would just provide a standard method of obtaining some of
> the more common, critical metadata without poring over the code base to
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next
> week, that addresses this issue.
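To illustrate the point in the description that lower-casing fixes only half the problem, here is a small sketch. The class and method names are hypothetical, and the lowerCaseOnly helper only approximates the lower-casing proposal attributed to Stefan G.; it is not code from Nutch.

```java
import java.util.Locale;

// Hypothetical demo class, not part of Nutch.
class NameNormalizationDemo {

    // Lower-casing alone: an approximation of the proposal mentioned in the description.
    static String lowerCaseOnly(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // True if two property names become the same key after lower-casing.
    static boolean sameKey(String a, String b) {
        return lowerCaseOnly(a).equals(lowerCaseOnly(b));
    }

    public static void main(String[] args) {
        // Case variants collapse to one key.
        System.out.println(sameKey("CONTENT_TYPE", "conTeNT_TyPE")); // true
        // Different separators still yield distinct keys.
        System.out.println(sameKey("Content Type", "ContentType"));  // false
        System.out.println(sameKey("Content-type", "CONTENT_TYPE")); // false
    }
}
```

Shared constants sidestep the second case because producers and consumers never spell the name themselves.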
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
http://www.atlassian.com/software/jira
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers