Hi,

I got what you mean. Look at the source for the
http://svn.macosforge.org/repository/macports/ site. It contains SVN's
internal XML markup and is not HTML. When a browser downloads content from
this page, it automatically applies the XSL stylesheet referenced from the
XML, which produces the HTML.

Nutch cannot do this by default. When it downloads the content, it tries to
parse it with the HTML parser and, of course, doesn't see any <a> tags, so it
doesn't produce new links.

I am afraid you will have to develop a special plugin which applies the XSL
stylesheet before the HTML parser runs.
I haven't heard of such a plugin; maybe other folks know if one exists.
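For what it's worth, the core step such a plugin would perform is a plain XSLT transformation, which the JDK already ships (javax.xml.transform). The sketch below is not Nutch code; the XML listing and the stylesheet are made-up stand-ins for whatever the SVN server actually serves and references, just to show XML-with-no-links going in and HTML-with-<a>-links coming out:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XsltDemo {
    // Made-up stand-in for an SVN directory listing (the real markup differs).
    static final String XML =
        "<svn><index><dir name=\"trunk\"/><file name=\"README\"/></index></svn>";

    // Tiny stand-in stylesheet that emits <a> links, playing the role of the
    // XSL the repository references for browsers.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/\"><html><body>"
        + "<xsl:for-each select=\"//dir|//file\">"
        + "<a href=\"{@name}\"><xsl:value-of select=\"@name\"/></a>"
        + "</xsl:for-each>"
        + "</body></html></xsl:template></xsl:stylesheet>";

    // Apply the stylesheet to the XML and return the resulting HTML, whose
    // <a> tags an HTML parser could then extract as outlinks.
    static String transform() throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform());
    }
}
```

A real plugin would fetch the stylesheet URL named in the page's xml-stylesheet processing instruction instead of embedding one, and then hand the resulting HTML to the normal HTML parser.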

You may want to start a new thread with this question.

Alexander

2008/11/12 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
> I really appreciate your help. Thanks.
>
> Without adding that new entry for plugin application/xml, the error message
> is:
> fetching http://svn.macosforge.org/repository/macports/
> fetching http://svn.collab.net/repos/svn/
> Error parsing: http://svn.macosforge.org/repository/macports/:
> org.apache.nutch.parse.ParseException: parser not found for
> contentType=application/xml
> url=http://svn.macosforge.org/repository/macports/
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
>        at
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)
>
>
> I did the following things:
> 1. Added the entry to parse-plugins.xml:
>        <mimeType name="application/xml">
>                <plugin id="parse-html" />
>                <plugin id="parse-rss" />
>                <plugin id="feed" />
>        </mimeType>
> 2. rm -rf crawl
> 3. bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > crawl.log
>
> The result is:
> 1. No "parser not found for application/xml" message;
> 2. still no URLs other than http://svn.macosforge.org/repository/macports/
> itself being fetched from that site;
> 3. all other URLs being fetched are under http://svn.collab.net/repos/svn/.
>
>
> crawl-urlfilter.txt:
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*macosforge.org/
> +^http://([a-z0-9]*\.)*collab.net/
> +^https://([a-z0-9]*\.)*smartlabs.com.au/
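Those `+` lines are regular expressions that the URL filter checks against each candidate URL. Assuming the usual prefix-match semantics of Nutch's regex URL filter (a pattern accepts a URL if it matches anywhere, here anchored with `^`), the patterns can be sanity-checked outside Nutch with plain java.util.regex:

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // The include patterns from crawl-urlfilter.txt, without the leading '+'.
    static final Pattern[] ACCEPT = {
        Pattern.compile("^http://([a-z0-9]*\\.)*macosforge.org/"),
        Pattern.compile("^http://([a-z0-9]*\\.)*collab.net/"),
        Pattern.compile("^https://([a-z0-9]*\\.)*smartlabs.com.au/"),
    };

    // True if any include pattern matches somewhere in the URL
    // (the '^' anchor makes this an effective prefix match).
    static boolean accepted(String url) {
        for (Pattern p : ACCEPT) {
            if (p.matcher(url).find()) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepted("http://svn.macosforge.org/repository/macports/"));
        System.out.println(accepted("http://example.com/"));
    }
}
```

If both seed URLs pass a check like this, the filter is not what is stopping the crawl, which is consistent with the symptom above being a parsing problem rather than a filtering one.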
>
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 12 November 2008 10:37 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> You have the following plugins defined for parsing the text/xml MIME type:
> <mimeType name="text/xml">
>                <plugin id="parse-html" />
>                <plugin id="parse-rss" />
>                <plugin id="feed" />
> </mimeType>
>
>
> You can add another entry to the parse-plugins file to support the
> application/xml type:
>
> <mimeType name="application/xml">
>                <plugin id="parse-html" />
>                <plugin id="parse-rss" />
>                <plugin id="feed" />
> </mimeType>
>
> The actual implementations of parse-html and parse-rss are:
>
> org.apache.nutch.parse.html.HtmlParser
> org.apache.nutch.parse.rss.RSSParser
>
> Alex
>
> 2008/11/12 Windflying <[EMAIL PROTECTED]>
>
> > Hi Alex,
> >
> > Thanks for your try.
> > I just downloaded the latest nightly build, nutch-2008-11-11_04-01-21,
> > and copied the property configuration from
> > http://zillionics.com/resources/articles/NutchGuideForDummies.htm
> > into my nutch-site.xml, and changed the crawl-urlfilter.txt.
> >
> > I tried it with those two websites.
> > For http://svn.collab.net/repos/svn/, it works.
> > For http://svn.macosforge.org/repository/macports/, it showed an error:
> > Parser not found for contentType=application/xml
> > url=http://svn.macosforge.org/repository/macports/
> >
> > Also, I didn't find application/xml in my parse-plugins.xml.
> > Could you please tell me how to add it?
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, 12 November 2008 5:43 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of
> website?
> >
> > Hi,
> >
> > I have just tried to crawl the sites with my server - no problems; it
> > works as expected.
> >
> > I used the crawl command with params from the Nutch how-to page.
> >
> > bin/nutch crawl urls -dir crawl -depth 3 -topN 50
> >
> >
> > Did you clean the previously crawled data from the disk? The generator
> > might not produce links to re-fetch already-fetched resources. There is a
> > special policy that it won't recrawl recently crawled data until some
> > time passes (a configurable parameter).
> >
> > And so the generator produces no more links to fetch.
> >
> > Alexander
> >
> >
> > 2008/11/12 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi Alex,
> > >
> > > Good day. Sorry to interrupt you again.
> > >
> > > I found two websites:
> > > http://svn.macosforge.org/repository/macports/
> > > http://svn.collab.net/repos/svn/
> > >
> > > When I use Nutch to crawl them, I get:
> > > Generator: 0 records selected for fetching, exiting ...
> > > Stopping at depth=0 - no more URLs to fetch.
> > >
> > > I have configured nutch-site.xml and crawl-urlfilter.txt.
> > > As I can crawl http://svn.apache.org/repos/asf/lucene/nutch/ , I assume
> > > my configuration is OK. Do you think so?
> > > I just want to make sure there is no problem with my Nutch configuration.
> > >
> > > Thanks.
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 11:07 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > website?
> > >
> > > No, you do not. Forget about it then; Nutch should crawl such sites
> > > without any problems. So the problem lies with something else.
> > >
> > > Alexander
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > No, it returns "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> > > > Do I need to add one? Sorry for my silly questions.
> > > >
> > > > Thanks.
> > > >
> > > > -----Original Message-----
> > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, 11 November 2008 10:41 PM
> > > > To: [email protected]
> > > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > website?
> > > >
> > > > The robots.txt file is available at this address:
> > > >
> > > > http://your_host/robots.txt
> > > >
> > > > for example : http://svn.apache.org/robots.txt
> > > >
> > > > Check it, and if the file is as you wrote, then it is not surprising
> > > > that Nutch doesn't crawl your svn.
> > > >
> > > > Alexander
> > > >
> > > >
> > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > >
> > > > > I guess we don't have a robots.txt in svn. I only found this file in
> > > > > the folder /usr/share/Nagios/, with the following content:
> > > > >   "User-agent: *
> > > > >    Disallow: /"
> > > > >
> > > > > What's this file for?
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > > Sent: Tuesday, 11 November 2008 4:50 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > > > website?
> > > > >
> > > > > I don't know how to configure your svn and add XSLT. But if your
> > > > > svn can be viewed from a browser, then it should always be
> > > > > crawlable by Nutch. One note: does your svn have a robots.txt file?
> > > > > Nutch is polite to public resources and respects their rules. Check
> > > > > whether the file exists and allows robots.
> > > > >
> > > > > Are you using intranet crawling or internet? There are differences
> > > > > in the configuration.
> > > > >
> > > > > Alexander
> > > > >
> > > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > > >
> > > > > > Hi Alex,
> > > > > > Thanks for your reply. :)
> > > > > >
> > > > > > Yes, you are right. I just tried to search
> > > > > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > > > > >
> > > > > > But I still can not search my own svn repository site.
> > > > > > Generator: 0 records selected for fetching, exiting...
> > > > > > Stopping at depth=0 - no more URLs to fetch.
> > > > > > Authentication is not a problem. I already used the https-client
> > > > > > plugin.
> > > > > > Some resources stored in this svn repository are also referenced
> > > > > > by another intranet website, and they can all be searched and
> > > > > > indexed from that website.
> > > > > >
> > > > > > I am new here. What I was told is that in the case of my company
> > > > > > svn, the XML files are just file/folder names; most of the useful
> > > > > > stuff in the svn is just referenced by the XML. What the XSL
> > > > > > stylesheet does is turn the XML into HTML so that browsers can
> > > > > > follow the links.
> > > > > >
> > > > > > I guess there must be some difference between the Nutch SVN and
> > > > > > my company SVN, which I do not know yet.
> > > > > >
> > > > > > Thanks & best regards.
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Does anybody know how to let nutch crawl this kind
> of
> > > > > website?
> > > > > >
> > > > > > This should work in the same way as for other sites. Folders are
> > > > > > regular links. If you are talking about parsing content (files in
> > > > > > the repository), then you need the necessary parsers, for example
> > > > > > the text parser, the XML parser, etc.
> > > > > >
> > > > > > And you should give anonymous access to svn, or configure Nutch
> > > > > > to sign in.
> > > > > >
> > > > > > Alexander
> > > > > >
> > > > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > My company intranet website is an svn repository, similar to:
> > > > > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > > > > >
> > > > > > > Does anybody have an idea how to make Nutch search it?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Bryan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards
> > > > > > Alexander Aristov
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards
> > > > > Alexander Aristov
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best Regards
> > > > Alexander Aristov
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
