[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

Chris A. Mattmann (JIRA) Tue, 20 Dec 2005 07:13:54 -0800

    [ http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360929 ]


Chris A. Mattmann commented on NUTCH-139:
-----------------------------------------

Hi Andrzej,

> I have an objection; in fact, I think the patches miss the main point of
> using prefixed property names.

D'oh!

> In this patch only some of the property names, specifically those 
> corresponding to the Dublin Core, are prefixed with PREFIX. Why? 

Well, the reasoning went like this. I wanted the metadata property names to be
reusable across the protocol-level code, the parser code, and pretty much
anywhere else that uses what I would call "common" metadata properties in
Nutch. Now, at the protocol level especially, there were bits and pieces of
code like readHeaders("Content-type") or String someValue =
getHeader("Content-length"), where the code was reading properties that had
already been written to an object by something Nutch has no control over. In
these cases, in order to make all the calls synonymous, e.g., so that a call to
readHeaders("Content-type") gets replaced by readHeaders(CONTENT_TYPE), I
couldn't use the "_X_nutch" prefix on the names, because I didn't write the
values into those objects originally.

On the other hand, wherever I was adding metadata properties that were under
our control (at the protocol level, the parsing level, etc.), in particular
all of the DC properties, we controlled both how they were added to the
properties object being passed around and where they were read, so we could
use the X_nutch prefix.

So, in my mind I saw two distinct types of standard metadata properties: those
for which we control both the input and the output data flow, and those for
which we really control only the output data flow.
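
To make the two categories concrete, here is a minimal sketch. The class name
MetadataNamesSketch, the literal prefix value "X-nutch-", and the sample
values are assumptions for illustration only (the prefix is spelled several
ways in this thread); only the LANGUAGE line mirrors what is in the patch.

```java
import java.util.Properties;

public class MetadataNamesSketch {
    // Assumed prefix value, for illustration only.
    public static final String NUTCH_PREFIX = "X-nutch-";

    // Category 1: Nutch both writes and reads this property,
    // so the name can safely carry the prefix.
    public static final String LANGUAGE = NUTCH_PREFIX + "language";

    // Category 2: written by the remote server and only read by Nutch,
    // so the constant has to match the raw header name.
    public static final String CONTENT_TYPE = "Content-type";

    // Simulates a headers object populated by code outside Nutch's control.
    public static Properties serverHeaders() {
        Properties headers = new Properties();
        headers.setProperty("Content-type", "text/html");
        return headers;
    }

    public static void main(String[] args) {
        // Reading through the constant behaves like the old string-literal call.
        Properties headers = serverHeaders();
        System.out.println(headers.getProperty(CONTENT_TYPE));

        // A Nutch-controlled property is written and read under the prefixed name.
        Properties parseMetadata = new Properties();
        parseMetadata.setProperty(LANGUAGE, "en");
        System.out.println(parseMetadata.getProperty(LANGUAGE));
    }
}
```

The point of the sketch is only that a constant for a category 2 property is
a shared spelling of somebody else's key, while a constant for a category 1
property can be whatever we decide.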

> The original reason for introducing the prefix was this: as Nutch processes 
> the raw data, it extracts certain metadata, either directly or using 
> heuristics (like with LANG or content type). In order to distinguish these 
> values from the original raw values, the metadata 
> processed by Nutch was to be prefixed by "X-nutch-", and all other metadata 
> that we don't use was to be left alone as it was.

This was followed to the T, except for the case I described above and which
you pointed out. For example, what would have happened if I had set
CONTENT_TYPE="X_nutch_content_type" and then called getHeaders(CONTENT_TYPE)
at the protocol level? Since we never put anything under that prefixed name
into the headers properties object, the lookup would find nothing, and
everywhere we read CONTENT_TYPE the value would be empty.
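
A minimal sketch of that failure mode, assuming the headers live in a plain
map; PrefixedLookupSketch, responseHeaders() and the sample value are
hypothetical names for illustration, not code from the patch:

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixedLookupSketch {
    // Hypothetical: what CONTENT_TYPE would look like with the prefix applied.
    public static final String PREFIXED_CONTENT_TYPE = "X_nutch_content_type";

    // The raw name that the server actually used.
    public static final String RAW_CONTENT_TYPE = "Content-type";

    // Stand-in for a headers object filled in from the HTTP response,
    // i.e. by code that knows nothing about Nutch's prefix.
    public static Map<String, String> responseHeaders() {
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-type", "text/html");
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = responseHeaders();
        // Nothing was ever stored under the prefixed key, so this finds nothing.
        System.out.println(headers.get(PREFIXED_CONTENT_TYPE)); // prints "null"
        // The unprefixed constant matches what the server wrote.
        System.out.println(headers.get(RAW_CONTENT_TYPE));      // prints "text/html"
    }
}
```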

> So, e.g. the Content-Type metadata is sometimes wrong. Nutch checks this with 
> e.g. the mime-type detection plugin, and it should 
> put the final value of Content-Type in metadata - but under the name of 
> "X-nutch-Content-Type", in order to avoid overwriting the 
> original value (Chris's comment in MSWordParser.java reflects this doubt - 
> that's the reason for prefixing).

Yup, exactly. Good job catching that comment!
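
For reference, the non-overwriting behaviour could look roughly like the
sketch below. DetectedTypeSketch, recordDetectedType() and the sample MIME
types are hypothetical, and the prefix uses the "X-nutch-" spelling from the
quoted comment; this is not the actual plugin code.

```java
import java.util.HashMap;
import java.util.Map;

public class DetectedTypeSketch {
    // Prefix spelling taken from the quoted comment; an assumption here.
    public static final String NUTCH_PREFIX = "X-nutch-";

    public static final String ORIGINAL_CONTENT_TYPE = "Content-Type";
    public static final String DETECTED_CONTENT_TYPE = NUTCH_PREFIX + "Content-Type";

    // Records the detected type next to the server-reported one, not over it.
    public static void recordDetectedType(Map<String, String> metadata, String detected) {
        metadata.put(DETECTED_CONTENT_TYPE, detected);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        // Value as reported by the server, possibly wrong.
        metadata.put(ORIGINAL_CONTENT_TYPE, "text/plain");
        // Hypothetical result of MIME-type detection.
        recordDetectedType(metadata, "application/msword");

        // Both values remain available to later stages.
        System.out.println(metadata.get(ORIGINAL_CONTENT_TYPE));
        System.out.println(metadata.get(DETECTED_CONTENT_TYPE));
    }
}
```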

> Now, this convention is not followed in the patches. E.g. LANG is missing 
> (should be PREFIX + "lang"). 

Not sure I follow this one. In the patch, there's a line:

 public static final String LANGUAGE = NUTCH_PREFIX + "language";

?



> CharEncodingForConversion 
> doesn't have a prefix either. Properties extracted in plugins (e.g. msword, 
> zip, file, etc) are put under the standard, non-prefixed 
> names, thus overwriting the original values.

This isn't really true. I didn't overwrite any of the original values; in
fact, no values are overwritten at all. There are really only two cases:

1. Places where I standardized only how the names are read: you see these at
the bottom of MetadataNames.java. These are properties whose writing into the
properties object we don't control, or where I at least couldn't figure out
how they got placed into the properties objects at their particular layers.
For these, I've omitted the NUTCH_PREFIX so that reading the properties after
they have been written keeps working.

2. Places where I standardized how the names are both read and written. These
are at the top of MetadataNames.java. I could find the entire data flow in and
out of the properties objects at the respective layers for all of these
properties, and that's why they have the X-nutch prefix. Make sense?





> Standard metadata property names in the ParseData metadata
> ----------------------------------------------------------
>
>          Key: NUTCH-139
>          URL: http://issues.apache.org/jira/browse/NUTCH-139
>      Project: Nutch
>         Type: Improvement
>   Components: fetcher
>     Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, 
> although bug is independent of environment
>     Reporter: Chris A. Mattmann
>     Assignee: Chris A. Mattmann
>     Priority: Minor
>      Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt, NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content     Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>    .....
>     public static final String CONTENT_TYPE = "content-type";
>     public static final String CREATOR = "creator";
>    ....
> }
> In this fashion, users could at least know what the names of the standard 
> properties that they can obtain from the ParseData are, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or beg. of next week, 
> that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
