Hi Alex,
I really appreciate your help. Thanks.

Without adding the new parse-plugins entry for application/xml, the error
message is:
fetching http://svn.macosforge.org/repository/macports/
fetching http://svn.collab.net/repos/svn/
Error parsing: http://svn.macosforge.org/repository/macports/:
org.apache.nutch.parse.ParseException: parser not found for
contentType=application/xml url=http://svn.macosforge.org/repository/macports/
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:336)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:178)


I did the following things:
1. Added the entry to parse-plugins.xml:
        <mimeType name="application/xml">
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
                <plugin id="feed" />
        </mimeType>
2. rm -rf crawl
3. bin/nutch crawl urls -dir crawl -depth 3 -topN 50 > crawl.log
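
For completeness, the plugins mapped in parse-plugins.xml must also be
enabled in nutch-site.xml. A minimal sketch of the plugin.includes property,
assuming the stock plugin set from nutch-default.xml plus the rss and feed
parsers (adjust the value to your build):

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|rss)|feed|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>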

The result:
1. No more "parser not found for application/xml" messages.
2. Still, no URLs under http://svn.macosforge.org/repository/macports/ other
than the seed itself are being fetched.
3. All the other fetched URLs are under http://svn.collab.net/repos/svn/ .
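
To see what actually ended up in the crawl database, the readdb tool can
help; assuming the crawl directory layout above (crawl/crawldb):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readdb crawl/crawldb -dump dumpdir

The -stats output shows how many URLs sit in each status, and the dump lists
per-URL status, which should show whether any macosforge outlinks were ever
discovered by the parser.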


crawl-urlfilter.txt:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*macosforge\.org/
+^http://([a-z0-9]*\.)*collab\.net/
+^https://([a-z0-9]*\.)*smartlabs\.com\.au/
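
Note that the stock crawl-urlfilter.txt template ends with a catch-all
reject rule, so the accept lines above must come before it; the expected
tail of the file looks like:

# skip everything else
-.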


-----Original Message-----
From: Alexander Aristov [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 12 November 2008 10:37 PM
To: [email protected]
Subject: Re: Does anybody know how to let nutch crawl this kind of website?

You have the following plugins defined for parsing the text/xml MIME type:

<mimeType name="text/xml">
        <plugin id="parse-html" />
        <plugin id="parse-rss" />
        <plugin id="feed" />
</mimeType>



You can add another entry to the parse-plugins.xml file to support the
application/xml type:

<mimeType name="application/xml">
        <plugin id="parse-html" />
        <plugin id="parse-rss" />
        <plugin id="feed" />
</mimeType>

The actual parser implementations behind parse-html and parse-rss are:

org.apache.nutch.parse.html.HtmlParser
org.apache.nutch.parse.rss.RSSParser
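
These class names are mapped to plugin ids in the <aliases> section of
parse-plugins.xml; a sketch of what those entries typically look like (the
feed alias is an assumption, check your own file):

<aliases>
        <alias name="parse-html" extension-id="org.apache.nutch.parse.html.HtmlParser" />
        <alias name="parse-rss" extension-id="org.apache.nutch.parse.rss.RSSParser" />
        <alias name="feed" extension-id="org.apache.nutch.parse.feed.FeedParser" />
</aliases>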


Alex

2008/11/12 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Thanks for your try.
> I just downloaded the latest nightly build, nutch-2008-11-11_04-01-21,
> copied the property configuration from
> http://zillionics.com/resources/articles/NutchGuideForDummies.htm
> into my nutch-site.xml, and changed the crawl-urlfilter.txt.
>
> It partly worked when crawling those two websites.
> For http://svn.collab.net/repos/svn/, it works.
> For http://svn.macosforge.org/repository/macports/, it showed an error:
> Parser not found for contentType=application/xml
> url=http://svn.macosforge.org/repository/macports/
>
> Also, I didn't find application/xml in my parse-plugins.xml.
> Could you please tell me how to add it?
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 12 November 2008 5:43 PM
> To: [email protected]
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> Hi,
>
> I have just tried to crawl the sites with my server - no problems, it works
> as expected.
>
> I used the crawl command with params from the Nutch how-to page.
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
>
> Did you clean the previously crawled data from the disk? The generator might
> not produce links to re-fetch already-fetched resources. There is a policy
> that it won't recrawl recently crawled data until some time passes (a
> configurable parameter).
>
> And so generator produces no more links to fetch.
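>
> For reference, that interval is set in nutch-site.xml. A sketch, assuming a
> recent build where the property is db.fetch.interval.default in seconds
> (older releases use db.default.fetch.interval in days):
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>2592000</value> <!-- 30 days; lower it while testing recrawls -->
> </property>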
>
> Alexander
>
>
> 2008/11/12 Windflying <[EMAIL PROTECTED]>
>
> > Hi Alex,
> >
> > Good day. Sorry to interrupt you again.
> >
> > I found two websites,
> > http://svn.macosforge.org/repository/macports/
> > http://svn.collab.net/repos/svn/
> >
> > When I used Nutch to crawl them, I got:
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> >
> > I have configured the nutch-site.xml and crawl-urlfilter.txt.
> > Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/, I assume
> > my configuration is OK. Do you think so?
> > I just want to make sure my Nutch configuration needs no further work.
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 11:07 PM
> > To: [email protected]
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > No, you do not. Forget about it then; Nutch should crawl such sites without
> > any problems. So the problem lies with something else.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> > > Do I need to add one? Sorry for my silly questions.
> > >
> > > Thanks.
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 10:41 PM
> > > To: [email protected]
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > >
> > > The robots.txt file is available at this address:
> > >
> > > http://your_host/robots.txt
> > >
> > > for example : http://svn.apache.org/robots.txt
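> > >
> > > For comparison, a robots.txt that permits all crawlers looks like this
> > > (an illustration, not your actual file):
> > >
> > > User-agent: *
> > > Disallow: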
> > >
> > > Check it; if the file is as you wrote, then it's not surprising that
> > > Nutch doesn't crawl your svn.
> > >
> > > Alexander
> > >
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > I guess we don't have a robots.txt in svn. I only found this file in
> > > > the folder /usr/share/Nagios/, as follows:
> > > >   "User-agent: *
> > > >    Disallow: /"
> > > >
> > > > What's this file for?
> > > >
> > > > -----Original Message-----
> > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, 11 November 2008 4:50 PM
> > > > To: [email protected]
> > > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > > >
> > > > I don't know how to configure your svn and add XSLT. But if your svn
> > > > can be viewed from a browser, then it should always be crawlable by
> > > > Nutch. One note: does your svn have a robots.txt file? Nutch is polite
> > > > to public resources and respects their rules. Check whether the file
> > > > exists and allows robots.
> > > >
> > > > Are you using intranet crawling or internet crawling? There are
> > > > differences in configuration.
> > > >
> > > > Alexander
> > > >
> > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > >
> > > > > Hi Alex,
> > > > > Thanks for your reply. :)
> > > > >
> > > > > Yes, you are right. I just tried to search
> > > > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > > > >
> > > > > But I still cannot crawl my own svn repository site:
> > > > > Generator: 0 records selected for fetching, exiting...
> > > > > Stopping at depth=0 - no more URLs to fetch.
> > > > > Authentication is not a problem; I already use the https-client plugin.
> > > > > Some resources stored in this svn repository are also referenced by
> > > > > another intranet website, and they can all be searched and indexed
> > > > > from that website.
> > > > >
> > > > > I am new here. What I was told is that, in the case of my company svn,
> > > > > the XML files are just file/folder names; most of the useful content in
> > > > > the svn is only referenced by the XML. What the XML stylesheet does is
> > > > > turn the XML into HTML so that browsers can follow the links.
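> > > > >
> > > > > For context: Subversion's mod_dav_svn advertises that stylesheet via an
> > > > > xml-stylesheet processing instruction at the top of its XML directory
> > > > > listings; a sketch, where the href is an assumption (it is set by the
> > > > > SVNIndexXSLT directive):
> > > > >
> > > > > <?xml version="1.0" encoding="utf-8"?>
> > > > > <?xml-stylesheet type="text/xsl" href="/svnindex.xsl"?>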
> > > > >
> > > > > I guess there must be some difference between the Nutch SVN and my
> > > > > company SVN that I do not know about yet.
> > > > >
> > > > > Thanks & best regards.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > > > >
> > > > > This should work in the same way as for other sites. Folders are
> > > > > regular links. If you are talking about parsing content (files in the
> > > > > repository), then you should have the necessary parsers, for example
> > > > > the text parser, the XML parser, etc.
> > > > >
> > > > > And you should give anonymous access to svn or configure Nutch to
> > > > > sign in.
> > > > >
> > > > > Alexander
> > > > >
> > > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > My company intranet website is an svn repository, similar to:
> > > > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > > > >
> > > > > > Does anybody have an idea of how to get Nutch to search it?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Bryan
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards
> > > > > Alexander Aristov
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best Regards
> > > > Alexander Aristov
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
