Believe it or not, I don't think meta tags are currently stored. I looked through the HTML parsing code and didn't see anywhere that could be storing them except in HTML filters. Meta tags are parsed and passed to the HTML filters, but I didn't see any default filter that stores them.

If there is no reason not to store meta tags, if we really aren't storing them already (I could be missing where this happens :) ), and if this is something people want, then I can create an HTML filter that stores the meta tags in the Parse MetaData.
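
To make the proposal concrete, here is a rough, self-contained sketch of what such a filter would do: take the name/content pairs from the meta tags and copy them into a metadata map. It deliberately does not use the Nutch plugin API. The class name MetaTagStoreSketch, the "meta." key prefix, and the regex-based extraction are all assumptions made for illustration; a real filter would be handed the already-parsed meta tags rather than scanning raw HTML.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy stand-in for a meta-tag-storing filter; does NOT use the Nutch plugin API. */
final class MetaTagStoreSketch {
    // Matches <meta name="..." content="..."> with the attributes in that order only.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"([^\"]*)\"\\s+content=\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    /** Copies each name/content pair into the map under a hypothetical "meta." prefix. */
    static Map<String, String> storeMetaTags(String html, Map<String, String> parseMetadata) {
        Matcher m = META.matcher(html);
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            parseMetadata.put("meta." + name, m.group(2));
        }
        return parseMetadata;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>t</title>"
            + "<meta name=\"Description\" content=\"A test page\">"
            + "<meta name=\"keywords\" content=\"nutch, hadoop\">"
            + "</head><body></body></html>";
        // The map stands in for the parse metadata that a checker tool would print.
        Map<String, String> metadata = new LinkedHashMap<>();
        storeMetaTags(html, metadata);
        metadata.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

The prefix keeps the stored tags from colliding with other metadata keys; whether a real filter should prefix, and with what, is an open design choice.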

Dennis Kubes

rubdabadub wrote:
Super thanks! Nice explanation. I finally got it :-) I mean how things
load and why! Thank you! I do have one question though; however, it's a
bit different. But if you do have time, a lengthy answer is always
welcome :-)

Question:
When the content is parsed by, let's say, parse-html, or in order to
parse meta data, for example:
src/java/org/apache/nutch/parse/HTMLMetaTags.java

Now when I run the ParserChecker.java main method, I don't see the
extracted data parsed the way it shows in parse/HTMLMetaTags. I see
only content, outlink, title, etc., but no meta tags. How does that
happen? I am trying my best to read the code, but I can't get beyond
parse. I started at crawl :-)

After looking through it

I don't want to hijack the thread; I just thought you answered the
question so clearly.
Regards

On 3/2/07, Dennis Kubes <[EMAIL PROTECTED]> wrote:
Here goes with the short answer. :)

Configuration has two levels, default and final.  It is supplied by the
org.apache.hadoop.conf.Configuration class and extended in Nutch by the
  org.apache.nutch.util.NutchConfiguration class.

Although it is configurable, by default hadoop-default.xml and
nutch-default.xml are default resources and hadoop-site.xml and
nutch-site.xml are final resources.  Resources (i.e. resource files) can
be added by filename to either the default or final resource set and in
fact this is how Nutch extends the Configuration class, by adding
nutch-default.xml and nutch-site.xml.

Final resource values overwrite default resource values, and final
resource values added later will overwrite final resource values added
earlier.  When I say values, I am talking about the individual
properties, not the resource files.  Resource files are found by name in
the classpath, and the nutch and hadoop scripts put HADOOP_CONF_DIR or
NUTCH_CONF_DIR first in the classpath.  You can change the conf dir to
pull configuration files from different directories, and many tools in
nutch and hadoop now provide a -conf option on the command line to set
the conf directory.

So for your example if you define the property in hadoop-default.xml or
nutch-default.xml and it is not defined in either hadoop-site.xml or
nutch-site.xml then the property will stand.  If you define the property
in either nutch-site.xml or hadoop-site.xml then it will override
nutch-default.xml and hadoop-default.xml settings.  And if you define it
in both hadoop-site.xml and nutch-site.xml then the nutch-site.xml will
override the hadoop-site.xml settings because nutch-site.xml is added
after hadoop-site.xml.  And remember, only individual properties are
overridden, not the entire file.
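
The override order described above can be sketched with a small toy model. This is not the Hadoop Configuration class, only an illustration of per-property merging: default resources are applied first, final resources afterwards, and a later value replaces an earlier one. The class name LayeredConfSketch, the property example.hadoop.only, and all the values are made up; letting a later default resource override an earlier default is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of layered resources; this is NOT Hadoop's Configuration class. */
final class LayeredConfSketch {
    private final List<Map<String, String>> defaultResources = new ArrayList<>();
    private final List<Map<String, String>> finalResources = new ArrayList<>();

    void addDefaultResource(Map<String, String> props) { defaultResources.add(props); }
    void addFinalResource(Map<String, String> props) { finalResources.add(props); }

    /** Merges property by property: all defaults first, then all finals, so later puts win. */
    Map<String, String> resolve() {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> r : defaultResources) merged.putAll(r);
        for (Map<String, String> r : finalResources) merged.putAll(r);
        return merged;
    }

    /** Builds one resource from alternating key/value strings. */
    static Map<String, String> props(String... kv) {
        Map<String, String> m = new LinkedHashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) m.put(kv[i], kv[i + 1]);
        return m;
    }

    /** Adds the four files in the order described above, with made-up values. */
    static Map<String, String> demo() {
        LayeredConfSketch conf = new LayeredConfSketch();
        // hadoop-default.xml
        conf.addDefaultResource(props("plugin.folders", "plugins", "example.hadoop.only", "a"));
        // nutch-default.xml
        conf.addDefaultResource(props("plugin.folders", "plugins"));
        // hadoop-site.xml
        conf.addFinalResource(props("plugin.folders", "/from/hadoop-site"));
        // nutch-site.xml, added last
        conf.addFinalResource(props("plugin.folders", "/from/nutch-site"));
        return conf.resolve();
    }

    public static void main(String[] args) {
        demo().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

In this model plugin.folders ends up with the nutch-site.xml value, while example.hadoop.only keeps its default because no later resource redefines it.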

Practically, you should define properties having to do with Hadoop (e.g.
the DFS, MapReduce, etc.) in hadoop-site.xml and properties having to do
with Nutch (e.g. fetcher, url-normalizers, etc.) in nutch-site.xml.

Dennis Kubes

Ricardo J. Méndez wrote:
> Hi Gal,
>
> Thanks for the reply.
>
> What has me wondering is that several other plugins _are_ being loaded
> when I define it on hadoop-site.xml, and actually that defining
> plugin.folders on that file is the only way I've found so far of getting
> plugins loaded at all when testing from Eclipse.
>
> Moreover, I get this problem even if I define it in both nutch-site and
> hadoop-site, which would make it seem that the definition in
> hadoop-site.xml does have an effect.  I was assuming they overrode the
> options from nutch-site.xml - am I mistaken?
>
>
> Ricardo J. Méndez
> http://ricardo.strangevistas.net/
>
> Gal Nitzan wrote:
>> Hi,
>>
>> Nutch loads its configuration from nutch-site and nutch-default.xml and not
>> from hadoop conf files so the behavior is correct.
>>
>> HTH,
>>
>> Gal.
>>
>>
>> On 3/1/07, "Ricardo J. Méndez" <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>> I'm using nutch-0.9, from the trunk.  I've noticed a behavior
>>> difference on a plugin unit test if I set the plugin.folders property on
>>> nutch-site.xml vs. hadoop-site.xml. If I set it on nutch-site.xml, the
>>> unit test works well, but an error is raised if it's on hadoop-site.xml.
>>>
>>> The error is:
>>>
>>>    [junit]  WARN [main] (ParserFactory.java:196) - Canno initialize
>>> parser parse-html (cause:
>>> org.apache.nutch.plugin.PluginRuntimeException:
>>> java.lang.ClassNotFoundException: org.apache.nutch.parse.html.HtmlParser
>>>
>>>
>>> Is there a reason why the HtmlParser wouldn't be loaded when the
>>> directory is specified on hadoop-site.xml?
>>>
>>> Thanks in advance,
>>>
>>>
>>>
>>>
>>> Ricardo J. Méndez
>>> http://ricardo.strangevistas.net/
>>>
