Hi Alex,

I did the following things:
1. Added this entry to parse-plugins.xml:
        <mimeType name="application/xml">
                <plugin id="parse-html" />
                <plugin id="parse-rss" />
                <plugin id="feed" />
        </mimeType>
2. rm -rf crawl
3. bin/nutch crawl urls -dir crawl -depth -topN 50 > crawl.log
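
I also double-checked what Content-Type the server actually reports for the macosforge seed page, to confirm that the new application/xml mapping is the one that applies (assuming curl is available on this box and the server answers HEAD requests; the earlier parse error already reported application/xml):

curl -sI http://svn.macosforge.org/repository/macports/ | grep -i content-type
# expecting: Content-Type: application/xml, i.e. what the earlier error reported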

The result is:
1. no more "parser not found for application/xml" message;
2. still no URLs under http://svn.macosforge.org/repository/macports/ other
than the top page itself being fetched (I will check the parsed outlinks with
readseg, see below);
3. all the other URLs being fetched are under http://svn.collab.net/repos/svn/ .
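
To see whether that page is now really being parsed and whether any outlinks come out of it, my next step is to dump the fetched segment with the readseg tool and look for outlink entries. This is only a sketch: <segment_dir> stands for whatever segment directory the crawl created, and I am assuming the SegmentReader options in this nightly match the documented ones.

bin/nutch readseg -dump crawl/segments/<segment_dir> segdump -nocontent -nofetch -nogenerate -noparsetext
# the dump is written to segdump/dump; ParseData records list the extracted outlinks
grep -i "outlink" segdump/dump | head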


crawl-urlfilter.txt (accept rules):
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*macosforge.org/
+^http://([a-z0-9]*\.)*collab.net/
+^https://([a-z0-9]*\.)*smartlabs.com.au/
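
As a rough sanity check of the accept rules (grep -E is not the exact regex engine Nutch uses, but these patterns are simple enough that it should agree), the seed URLs can be piped through the same expressions:

echo "http://svn.macosforge.org/repository/macports/" | grep -E '^http://([a-z0-9]*\.)*macosforge.org/'
echo "http://svn.collab.net/repos/svn/" | grep -E '^http://([a-z0-9]*\.)*collab.net/'

Both URLs should be echoed back, so the filter itself does not look like the culprit.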


-----Original Message-----
From: Alexander Aristov [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 12 November 2008 10:37 PM
To: nutch-user@lucene.apache.org
Subject: Re: Does anybody know how to let nutch crawl this kind of website?

You have the following plugins defined for parsing the text/xml mime type:

<mimeType name="text/xml">
        <plugin id="parse-html" />
        <plugin id="parse-rss" />
        <plugin id="feed" />
</mimeType>



You can add another entry to the parse-plugins file to support the
application/xml type:

<mimeType name="application/xml">
        <plugin id="parse-html" />
        <plugin id="parse-rss" />
        <plugin id="feed" />
</mimeType>

The actual implementations behind parse-html and parse-rss are:

org.apache.nutch.parse.html.HtmlParser
org.apache.nutch.parse.rss.RSSParser


Alex

2008/11/12 Windflying <[EMAIL PROTECTED]>

> Hi Alex,
>
> Thanks for your try.
> I just downloaded the latest nightly build, nutch-2008-11-11_04-01-21,
> copied the property configuration from
> http://zillionics.com/resources/articles/NutchGuideForDummies.htm
> into my nutch-site.xml, and changed the crawl-urlfilter.txt.
>
> It did work when searching those two websites.
> For http://svn.collab.net/repos/svn/, it works.
> For http://svn.macosforge.org/repository/macports/, it showed an error:
> Parser not found for contentType=application/xml
> url=http://svn.macosforge.org/repository/macports/
>
> Also, I didn't find application/xml in my parse-plugins.xml.
> Could you please tell me how to add it?
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 12 November 2008 5:43 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Does anybody know how to let nutch crawl this kind of website?
>
> Hi,
>
> I have just tried to crawl the sites with my server - no problems, it works
> as expected.
>
> I used the crawl command with params from the Nutch how-to page.
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>
>
> Do you clean previously crawled data from the disk? The generator might not
> produce links to re-fetch already fetched resources. There is a special
> policy that it won't recrawl recently crawled data until some time passes
> (a configurable parameter).
>
> And so the generator produces no more links to fetch.
>
> Alexander
>
>
> 2008/11/12 Windflying <[EMAIL PROTECTED]>
>
> > Hi Alex,
> >
> > Good day. Sorry to interrupt you again.
> >
> > I found two websites,
> > http://svn.macosforge.org/repository/macports/
> > http://svn.collab.net/repos/svn/
> >
> > When I use my nutch to crawl them, I got:
> > Generator: 0 records selected for fetching, exiting ...
> > Stopping at depth=0 - no more URLs to fetch.
> >
> > I have configured the nutch-site.xml and crawl-urlfilter.txt.
> > Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/ , I assume
> > my configuration is OK. Do you think so?
> > I just want to make sure there is nothing more to fix in my Nutch
> > configuration.
> >
> > Thanks.
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 11:07 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> >
> > No, you do not. Forget about it then; Nutch should crawl such sites without
> > any problems. So you have a problem with something else.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> > > Do I need to add one? Sorry for my silly questions.
> > >
> > > Thanks.
> > >
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 10:41 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > >
> > > The robots.txt file is available by this address
> > >
> > > http://your_host/robots.txt
> > >
> > > for example : http://svn.apache.org/robots.txt
> > >
> > > Check it, and if the file is like you wrote, then it's not surprising
> > > that Nutch doesn't crawl your svn.
> > >
> > > Alexander
> > >
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > I guess we don't have robots.txt in svn. I only found this file in the
> > > > folder /usr/share/Nagios/, as follows:
> > > >   "User-agent: *
> > > >    Disallow: /"
> > > >
> > > > What's this file for?
> > > >
> > > > -----Original Message-----
> > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > Sent: Tuesday, 11 November 2008 4:50 PM
> > > > To: nutch-user@lucene.apache.org
> > > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > > >
> > > > I don't know how to configure your svn and add XSLT. But if your svn
> > > > can be viewed from a browser then it should always be crawled by Nutch.
> > > > One note: does your svn have a robots.txt file? Nutch is polite to public
> > > > resources and respects their rules. Check whether the file exists and
> > > > allows robots.
> > > >
> > > > Are you using intranet crawling or internet? There are differences in
> > > > configuration.
> > > >
> > > > Alexander
> > > >
> > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > >
> > > > > Hi Alex,
> > > > > Thanks for your reply. :)
> > > > >
> > > > > Yes, you are right. I just tried to search
> > > > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > > > >
> > > > > But I still cannot search my own svn repository site:
> > > > > Generator: 0 records selected for fetching, exiting...
> > > > > Stopping at depth=0 - no more URLs to fetch.
> > > > > Authentication is not a problem; I already use the https-client plugin.
> > > > > Some resources stored in this svn repository are also referenced by
> > > > > another intranet website, and they all can be searched and indexed from
> > > > > that website.
> > > > >
> > > > > I am new here. What I was told is that in the case of my company svn,
> > > > > the xml files are just file/folder names; most of the useful stuff in
> > > > > the svn is just referenced by the xml. What the XML stylesheet does is
> > > > > turn the XML into HTML so the browsers can follow the links.
> > > > >
> > > > > I guess there must be some difference between the Nutch SVN and my
> > > > > company SVN, which I do not know yet.
> > > > >
> > > > > Thanks & best regards.
> > > > >
> > > > > -----Original Message-----
> > > > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > > > To: nutch-user@lucene.apache.org
> > > > > Subject: Re: Does anybody know how to let nutch crawl this kind of website?
> > > > >
> > > > > This should work in the same way as for other sites. Folders are
> > > > > regular links. If you are talking about parsing content (files in the
> > > > > repository) then you should have the necessary parsers, for example
> > > > > the text parser, xml parser ...
> > > > >
> > > > > And you should give anonymous access to svn or configure Nutch to
> > > > > sign in.
> > > > >
> > > > > Alexander
> > > > >
> > > > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > My company intranet website is an svn repository, similar to:
> > > > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > > > >
> > > > > > Does anybody have an idea of how to let Nutch search it?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Bryan
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards
> > > > > Alexander Aristov
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best Regards
> > > > Alexander Aristov
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>Peter Wang</value>
  <description>Peter Pu Wang</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch spiderman</value>
  <description>Nutch spiderman</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://peterpuwang.googlepages.com</value>
  <description>http://peterpuwang.googlepages.com</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>MyEmail</value>
  <description>[EMAIL PROTECTED]</description>
</property>

</configuration>

<?xml version="1.0" encoding="UTF-8"?>
<!--
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements.  See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to You under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License.  You may obtain a copy of the License at
	
	http://www.apache.org/licenses/LICENSE-2.0
	
	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	
	Author     : mattmann 
	Description: This xml file represents a natural ordering for which parsing 
	plugin should get called for a particular mimeType. 
-->

<parse-plugins>

	<mimeType name="application/msword">
		<plugin id="parse-msword" />
	</mimeType>

	<mimeType name="application/pdf">
		<plugin id="parse-pdf" />
	</mimeType>

	<mimeType name="application/postscript">
		<plugin id="parse-pdf" />
	</mimeType>

	<mimeType name="application/rss+xml">
	    <plugin id="parse-rss" />
	    <plugin id="feed" />
	</mimeType>

	<mimeType name="application/vnd.ms-excel">
		<plugin id="parse-msexcel" />
	</mimeType>

	<mimeType name="application/vnd.ms-powerpoint">
		<plugin id="parse-mspowerpoint" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.text">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.text-template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.text-master">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.text-web">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.presentation">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.presentation-template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.spreadsheet">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.oasis.opendocument.spreadsheet-template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.calc">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.calc.template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.impress">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.impress.template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.writer">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/vnd.sun.xml.writer.template">
		<plugin id="parse-oo" />
	</mimeType>

	<mimeType name="application/xhtml+xml">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="application/x-bzip2">
		<!--  try and parse it with the zip parser -->
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-csh">
		<plugin id="parse-text" />
	</mimeType>

	<mimeType name="application/x-gzip">
		<!--  try and parse it with the zip parser -->
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="application/x-javascript">
		<plugin id="parse-js" />
	</mimeType>

	<mimeType name="application/x-kword">
		<!--  try and parse it with the word parser -->
		<plugin id="parse-msword" />
	</mimeType>

	<mimeType name="application/x-kspread">
		<!--  try and parse it with the msexcel parser -->
		<plugin id="parse-msexcel" />
	</mimeType>

	<mimeType name="application/x-shockwave-flash">
		<plugin id="parse-swf" />
	</mimeType>

	<mimeType name="application/zip">
		<plugin id="parse-zip" />
	</mimeType>

	<mimeType name="text/html">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="text/plain">
		<plugin id="parse-text" />
	</mimeType>

	<mimeType name="text/richtext">
		<plugin id="parse-rtf" />
		<plugin id="parse-msword" />
	</mimeType>

	<mimeType name="text/rtf">
		<plugin id="parse-rtf" />
		<plugin id="parse-msword" />
	</mimeType>

	<mimeType name="text/sgml">
		<plugin id="parse-html" />
	</mimeType>

	<mimeType name="text/tab-separated-values">
		<plugin id="parse-msexcel" />
	</mimeType>

      <mimeType name="text/xml">
		<plugin id="parse-html" />
		<plugin id="parse-rss" />
        <plugin id="feed" />
	</mimeType>

      <mimeType name="application/xml">
		<plugin id="parse-html" />
		<plugin id="parse-rss" />
        <plugin id="feed" />
	</mimeType>

       <!-- Types for parse-ext plugin: required for unit tests to pass. -->

	<mimeType name="application/vnd.nutch.example.cat">
		<plugin id="parse-ext" />
	</mimeType>

	<mimeType name="application/vnd.nutch.example.md5sum">
		<plugin id="parse-ext" />
	</mimeType>

	<!--  alias mappings for parse-xxx names to the actual extension implementation 
	ids described in each plugin's plugin.xml file -->
	<aliases>
		<alias name="parse-ext" extension-id="ExtParser" />
		<alias name="parse-html"
			extension-id="org.apache.nutch.parse.html.HtmlParser" />
		<alias name="parse-js" extension-id="JSParser" />
		<alias name="parse-mp3"
			extension-id="org.apache.nutch.parse.mp3.MP3Parser" />
		<alias name="parse-msexcel"
			extension-id="org.apache.nutch.parse.msexcel.MSExcelParser" />
		<alias name="parse-mspowerpoint"
			extension-id="org.apache.nutch.parse.mspowerpoint.MSPowerPointParser" />
		<alias name="parse-msword"
			extension-id="org.apache.nutch.parse.msword.MSWordParser" />
		<alias name="parse-oo"
			extension-id="org.apache.nutch.parse.oo.OpenDocument.Text" />
		<alias name="parse-pdf"
			extension-id="org.apache.nutch.parse.pdf.PdfParser" />
		<alias name="parse-rss"
			extension-id="org.apache.nutch.parse.rss.RSSParser" />
        <alias name="feed"
            extension-id="org.apache.nutch.parse.feed.FeedParser" />
		<alias name="parse-rtf"
			extension-id="org.apache.nutch.parse.rtf.RTFParseFactory" />
		<alias name="parse-swf"
			extension-id="org.apache.nutch.parse.swf.SWFParser" />
		<alias name="parse-text"
			extension-id="org.apache.nutch.parse.text.TextParser" />
		<alias name="parse-zip"
			extension-id="org.apache.nutch.parse.zip.ZipParser" />
	</aliases>
	
</parse-plugins>
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*macosforge.org/
+^http://([a-z0-9]*\.)*collab.net/
+^https://([a-z0-9]*\.)*smartlabs.com.au/

# skip everything else
-.
