[ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364116 ]
Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------
Just to add to Jerome's last comment, I think the key here is simplicity. As a
software developer, and ultimately as an end user of Nutch, I identified the
issue that there were several places where a developer has to remember the exact
string used in a particular piece of code hidden under layers of OO
abstraction, etc., just to get the value of a metadata property returned from
the protocol layer. For example, did you know that in order to get the content
encoding at the protocol level, you have to use the EXACT string
"Content-Encoding", not "ContentEncoding", or "ContenT-ENcodING", etc., but
"Content-Encoding"? There are numerous other examples at the protocol level,
such as "Content-Length" and "Content-Type" (even though Doug by now I'm sure
hates that example :-) ). The whole point is that if you go look at the
protocol level plugins, they all share the fact that they are reading these
properties, and in some cases writing them to a metadata map. The whole issue
is, why, as a writer of a protocol layer plugin, should I have to worry about
the exact format of the String to get the "Content-Encoding" from the protocol
layer? Wouldn't it be nice to standardize public static final Strings and then
reference them instead of replicating them at the protocol plugin levels?
So, instead of having within protocol-http/HttpResponse.java:
    String contentLengthString = (String)headers.get("Content-Length");
and then in protocol-file/FileResponse.java:
    hdrs.put("Content-Length", new Long(size).toString());
wouldn't it be nice to have a public static final String CONTENT_LENGTH =
"Content-Length", and then replace the hard-coded strings in the protocol
plugin code with it? So the above becomes:
protocol-http/HttpResponse.java:
    String contentLengthString = (String)headers.get(CONTENT_LENGTH);
protocol-file/FileResponse.java:
    hdrs.put(CONTENT_LENGTH, new Long(size).toString());
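
To make that concrete, here is a rough, self-contained sketch of the idea. The
class names ProtocolMetadataSketch and Names, and the helper methods, are made
up for illustration and are not taken from the attached patches; only the
header strings themselves come from the examples above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these class or method names exist in Nutch.
final class ProtocolMetadataSketch {

    // Assumed holder for the shared header names, defined exactly once.
    static final class Names {
        static final String CONTENT_ENCODING = "Content-Encoding";
        static final String CONTENT_LENGTH = "Content-Length";
        static final String CONTENT_TYPE = "Content-Type";

        private Names() {
            // constants only, never instantiated
        }
    }

    // Writer side (think protocol-file): store the length under the shared key.
    static Map<String, String> fileHeaders(long size) {
        Map<String, String> hdrs = new HashMap<>();
        hdrs.put(Names.CONTENT_LENGTH, Long.toString(size));
        return hdrs;
    }

    // Reader side (think protocol-http): look the value up with the same key.
    // Returns -1 when the header is absent.
    static long contentLength(Map<String, String> headers) {
        String value = headers.get(Names.CONTENT_LENGTH);
        return value == null ? -1L : Long.parseLong(value.trim());
    }

    public static void main(String[] args) {
        Map<String, String> hdrs = fileHeaders(2048L);
        System.out.println(Names.CONTENT_LENGTH + ": " + contentLength(hdrs));
    }
}
```

Since both sides compile against the same constant, a misspelled name shows up
as a compile error instead of a silently missing header at runtime.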
Of course, that's just one layer of the issue. As we've all identified, these
so-called "magic" strings exist at the parsing layers too. For example, in the
RTF parser, there are * 17 * of these so-called magic strings, ranging from
"Security" to "Last-Save-Date" to "Last-Printed". Of course it would be naive
to put every single metadata string that is written or read from a Map in the
parsing and protocol layers of Nutch into a single monolithic metadata class,
but in the end, there are several standard metadata properties (* cough cough
Dublin Core *) that deserve such first class status, along with certain other
commonly used metadata properties at each respective layer, protocol and
parsing. I believe that the purpose of this patch should be not only to provide
an extensible Metadata class, but also to cover the simple stuff. And
also, let's not turn this issue into 993939393 different things that need to be
done. It should be phased into several capabilities, and the first phase would
be providing a standard metadata names container at the protocol and parsing
layers, which Jerome and I are working towards. I guess what I'm just trying to
advocate is not to bury this issue by adding a million things to it and making
it so difficult to complete that it never gets completed and
accepted. Let's just keep it focused and simple, because in the end, as a user
of Nutch, and as a software developer, I think it is very time-saving and
helpful to have common Strings defined in one place, or a few places, rather
than spread out across 20 or 30 classes, where you have to inspect each class
to find out the exact way to read/write a String to make stuff work. That's all
I'm saying.
> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
> Key: NUTCH-139
> URL: http://issues.apache.org/jira/browse/NUTCH-139
> Project: Nutch
> Type: Improvement
> Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
> Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM,
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
> Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt,
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything
> that they want, such as having names of "Content-type", "content-TyPe",
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a
> solution in which all property names be converted to lower case, but in
> essence this really only fixes half the problem right (the case of
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of
> named Strings in the ParseData class that the protocol framework and the
> parsing framework could use to identify common properties such as
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something
> like:
> public class ParseData{
> .....
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
> ....
> }
> In this fashion, users could at least know what the names of the standard
> properties that they can obtain from the ParseData are, for example by making
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE,
> "text/xml"); Of course, this wouldn't preclude users from doing what they are
> currently doing, it would just provide a standard method of obtaining some of
> the more common, critical metadata without poring over the code base to
> figure out what they are named.
> I'll contribute a patch near the end of this week, or beg. of next week
> that addresses this issue.
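
As a small illustration of the "half the problem" point in the description
above, here is a sketch showing that lower-casing makes case variants match but
does nothing for names that differ in spacing or punctuation. The class and
method names are invented for this example only.

```java
import java.util.Locale;

// Hypothetical sketch: shows what lower-casing alone can and cannot unify.
final class NameNormalizationSketch {

    // The lower-casing approach: fold a property name to lower case.
    static String lowerCased(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Case variants collapse to the same key.
        System.out.println(
            lowerCased("CONTENT_TYPE").equals(lowerCased("conTeNT_TyPE"))); // true

        // Different spellings of the same idea still do not match.
        System.out.println(
            lowerCased("Content Type").equals(lowerCased("ContentType"))); // false
        System.out.println(
            lowerCased("Content-type").equals(lowerCased("CONTENT_TYPE"))); // false
    }
}
```

A shared constant sidesteps both cases, because nobody types the name by hand.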
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers