Re: [Nutch-general] Nutch doesn't dive deeper

Chris Mattmann Sun, 27 Aug 2006 10:51:55 -0700

Hi Sami,

  I'm not sure that I agree that the entire set of mime types that you list
below should be removed from the parse-plugins.xml default mapping. For
instance, if you look at the current mapping file, many of the types below
would have no other option for parsing them besides the TextParser. I think
it makes a lot of sense to parse some of the below documents with the
TextParser because, in fact, they are text documents. A LaTeX document is a
plain text document. A text/css file is essentially plain text. An rfc822
message is indeed (stripped of headers) a plain text document.


   There's a careful tradeoff to be made between a default config file that
covers as many mime types as possible, handling each of them with at least
*one* parser, and one that leaves a particular mime type with no parser at
all. I
struggled with this very issue when I initially created that file and what
you see in there now represents a "best guess" of mime types mapped to the
available parsers that exist in Nutch. The other option is that people can
modify the file on their own. For instance, in a domain-specific deployment,
a user can add or remove whatever mime-type-to-plugin mappings she wants in
the parse-plugins.xml file: it was never meant to be something that was
"set in stone" per se. It would be good to see some experiments that
determine what the best default mapping set for parse-plugins.xml is.
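
As a rough illustration of the per-deployment customization described above,
here is a minimal Java sketch of a mime-type-to-parser table with a "*"
fallback. It is not Nutch's actual ParserFactory or its parse-plugins.xml
loader: the class ParserMappingSketch and its methods map, unmap and
parsersFor are hypothetical names, and the ids "parse-text" and "parse-html"
are used only as labels.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory stand-in for a mime-type-to-parser table.
// This is NOT Nutch's ParserFactory or its parse-plugins.xml reader.
public class ParserMappingSketch {
    private final Map<String, List<String>> mapping = new LinkedHashMap<>();

    /** Maps a mime type (or "*" for the default) to an ordered list of parser ids. */
    public void map(String mimeType, String... parserIds) {
        mapping.put(mimeType, List.of(parserIds));
    }

    /** Removes the explicit mapping so the type falls back to the default entry. */
    public void unmap(String mimeType) {
        mapping.remove(mimeType);
    }

    /** Returns the parsers for a type, falling back to "*", or an empty list. */
    public List<String> parsersFor(String mimeType) {
        List<String> exact = mapping.get(mimeType);
        if (exact != null) {
            return exact;
        }
        return mapping.getOrDefault("*", List.of());
    }

    public static void main(String[] args) {
        ParserMappingSketch m = new ParserMappingSketch();
        m.map("*", "parse-text");                      // default: try the text parser
        m.map("text/css", "parse-text");               // plain text in practice
        m.map("application/xhtml+xml", "parse-html");  // assumed site-specific override
        System.out.println(m.parsersFor("application/xhtml+xml"));
        System.out.println(m.parsersFor("application/x-netcdf"));
    }
}
```

The point of the sketch is only that removing a type's explicit entry does
not leave it unparsed as long as a default entry exists; dropping the default
as well is what leaves a type with no parser at all.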

Thanks!

Cheers,
  Chris



On 8/27/06 12:30 AM, "sami siren" <[EMAIL PROTECTED]> wrote:

> This is yet another side effect of applying TextParser to non-plain-text
> documents, and in this particular case it falls short on namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
> 
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
> 
> I would guess that handling of the text/xhtml+xml mimetype should be done
> with the html parser anyway.
> 
> --
>  Sami Siren
> 
> 2006/8/25, Michael Wechner <[EMAIL PROTECTED]>:
>> 
>> I think the problem is as follows with XHTML files:
>> 
>> 2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>>         at java.net.URL.<init>(URL.java:544)
>>         at java.net.URL.<init>(URL.java:434)
>>         at java.net.URL.<init>(URL.java:383)
>> 
>> 
>> whereas maybe this could be resolved with
>> 
>> http://issues.apache.org/jira/browse/NUTCH-359
>> 
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>> 
>> Thanks
>> 
>> Michi
>> 
>> Ken Gregoire wrote:
>> 
>>> look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>>> 
>>> User-agent: *
>>> Disallow: /foo/bar.html
>>> 
>>> User-agent: lenya
>>> Disallow: /foo/bar.html
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Michael Wechner wrote:
>>> 
>>>> Hi
>>>> 
>>>> I am trying to index http://ulysses.wyona.org/ but somehow it just
>>>> indexes the homepage but doesn't seem to follow
>>>> any links. I have set "depth 3" and other sites are being crawled
>>>> deeper without a problem but not the Ulysses page.
>>>> 
>>>> Has anyone had similar experiences?
>>>> 
>>>> Is it possible that Nutch has a problem with well-formed XHTML
>>>> (application/xhtml+xml)?
>>>> 
>>>> Thanks
>>>> 
>>>> Michi
>>>> 
>>> 
>> 
>> 
>> --
>> Michael Wechner
>> Wyona      -   Open Source Content Management   -    Apache Lenya
>> http://www.wyona.com                      http://lenya.apache.org
>> [EMAIL PROTECTED]                        [EMAIL PROTECTED]
>> +41 44 272 91 61
>> 
>> 
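
The MalformedURLException in the quoted log can be reproduced in isolation.
The sketch below assumes (this is not confirmed in the thread) that the
text-based outlink extraction picked up a candidate string beginning with
"xmlns:" from an XHTML namespace declaration and handed it to java.net.URL.
The class XmlnsUrlSketch, the method tryParse, and the candidate strings are
made up for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch: shows how java.net.URL treats an "xmlns:" prefix as an
// unknown protocol. It does not reproduce Nutch's OutlinkExtractor itself.
public class XmlnsUrlSketch {

    /** Returns null if the candidate parses as a URL, else the exception message. */
    public static String tryParse(String candidate) {
        try {
            new URL(candidate);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed candidate: text before the first ':' is read as the protocol.
        System.out.println(tryParse("xmlns:xhtml"));
        // A real link with a known protocol parses without an exception.
        System.out.println(tryParse("http://www.w3.org/1999/xhtml"));
    }
}
```

Under that assumption, the error is a symptom of scanning XHTML markup as if
it were plain text, which is the concern raised about the default mapping.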


