Sami Siren wrote:

> This is yet another side effect of applying the TextParser to non-plain-text
> documents, and in this particular case it falls short with namespace
> declarations. I propose that we remove the PlainText parser from at least
> the following mime types:
>
> * (default)
> application/rss+xml
> application/vnd.wap.wbxml
> application/vnd.wap.wmlc
> application/vnd.wap.wmlscriptc
> application/xhtml+xml
> application/x-latex
> application/x-netcdf
> application/x-tex
> application/x-texinfo
> application/x-troff
> application/x-troff-man
> application/x-troff-me
> application/x-troff-ms
> message/news
> message/rfc822
> text/css
> text/sgml
> text/vnd.wap.wml
> text/xml
> text/x-setext
>
> I would guess that handling of text/xhtml+xml 


I guess you mean application/xhtml+xml (as you actually note above)

> mime type should be done with the
> html parser anyway.


Yes, I would say so.
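
Something along these lines in conf/parse-plugins.xml should route it to the
HTML parser (just a sketch; the "parse-html" plugin id and the extension id
ought to be checked against the shipped file):

  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>

  <aliases>
    <alias name="parse-html"
           extension-id="org.apache.nutch.parse.html.HtmlParser" />
  </aliases>

and the parse-text mappings for the mime types listed above would simply be
removed.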

Thanks

Michi

>
> -- 
> Sami Siren
>
> 2006/8/25, Michael Wechner <[EMAIL PROTECTED]>:
>
>>
>> I think the problem is as follows with XHTML files:
>>
>> 2006-08-25 16:06:11,925 WARN  parse.ParserFactory -
>> ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser mapped to
>> contentType application/xhtml+xml via parse-plugins.xml, but its
>> plugin.xml file does not claim to support contentType:
>> application/xhtml+xml
>> 2006-08-25 16:06:11,965 ERROR parse.OutlinkExtractor - getOutlinks
>> java.net.MalformedURLException: unknown protocol: xmlns
>>         at java.net.URL.<init>(URL.java:544)
>>         at java.net.URL.<init>(URL.java:434)
>>         at java.net.URL.<init>(URL.java:383)
>>
>>
>> though maybe this could be resolved with
>>
>> http://issues.apache.org/jira/browse/NUTCH-359
>>
>> I am kind of surprised that nobody else is having this problem with
>> proper XHTML ;-)
>>
>> Thanks
>>
>> Michi
>>
>> Ken Gregoire wrote:
>>
>> > look here, it is blocking robots: http://ulysses.wyona.org/robots.txt
>> >
>> > User-agent: *
>> > Disallow: /foo/bar.html
>> >
>> > User-agent: lenya
>> > Disallow: /foo/bar.html
>> >
>> >
>> >
>> >
>> >
>> > Michael Wechner wrote:
>> >
>> >> Hi
>> >>
>> >> I am trying to index http://ulysses.wyona.org/ but somehow it only
>> >> indexes the homepage and doesn't seem to follow
>> >> any links. I have set "depth 3", and other sites are being crawled
>> >> deeper without a problem, but not the Ulysses page.
>> >>
>> >> Has anyone had similar experiences?
>> >>
>> >> Is it possible that Nutch has problems with well-formed XHTML
>> >> (application/xhtml+xml)?
>> >>
>> >> Thanks
>> >>
>> >> Michi
>> >>
>> >
>>
>>
>> -- 
>> Michael Wechner
>> Wyona      -   Open Source Content Management   -    Apache Lenya
>> http://www.wyona.com                      http://lenya.apache.org
>> [EMAIL PROTECTED]                        [EMAIL PROTECTED]
>> +41 44 272 91 61
>>
>>
>

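As an aside, the "unknown protocol: xmlns" trace quoted above is what
java.net.URL produces whenever the text before the first ':' in the string it
is given is "xmlns"; presumably that is what the outlink extractor ends up
with when it runs its plain-text URL scan over raw XHTML and hits a namespace
declaration. A minimal sketch that reproduces just the exception (not the
actual OutlinkExtractor code):

  import java.net.MalformedURLException;
  import java.net.URL;

  public class XmlnsProtocolDemo {
      public static void main(String[] args) {
          // a token as it might be picked out of raw XHTML source, e.g. from
          // xmlns:xlink="http://www.w3.org/1999/xlink"
          String candidate = "xmlns:xlink";
          try {
              new URL(candidate); // "xmlns" is parsed as the protocol
          } catch (MalformedURLException e) {
              System.out.println(e.getMessage()); // unknown protocol: xmlns
          }
      }
  }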

-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
[EMAIL PROTECTED]                        [EMAIL PROTECTED]
+41 44 272 91 61

