Thank you, Vishal.
This part is working well now. I'm still figuring out why the URLs have
not been properly categorized, though.

Dima.

----- Original Message -----
From: "Vishal Shah" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, September 04, 2006 5:23 AM
Subject: RE: adding new URLs to nutch index


> Hi Dima,
>
>   Which version of Nutch are you using? From 0.8 onwards, the name of
> the urls file has to be urls.txt, and its parent dir has to be passed
> to inject. For example, if your urls.txt is in a dir called NewUrls, then
> your inject cmd would be:
>
> bin/nutch inject crawl/crawldb NewUrls
>
> Also, check your crawl-urlfilter.txt to make sure that these new URLs
> won't be filtered.
>
> Regards,
>
> -vishal.
>
> -----Original Message-----
> From: Dima Gritsenko [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 04, 2006 3:36 PM
> To: [email protected]
> Subject: adding new URLs to nutch index
>
> Hi,
>
> We are indexing DMOZ, and we want to add two other URLs for indexing, but
> we seem to have a problem searching those 2 newly added URLs (no results
> returned).
> Here's what we do to add new URLs to the nutch index:
> 1) Created a dir  /url with a "url" file that contains these two URLs:
>     http://www.newsvine.com/_feeds/rss2/index
>     http://www.technorati.com/blogs/
>
> 2) Then the following command is run (it should be adding our extra URLs
> to nutch DB/index)
>     bin/nutch inject crawl/crawldb urls
>
> 3) Then start recrawl
>     bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0
>
> We are also using the index-url-category plugin, which assigns URLs to
> categories for later filtered search.
> Here's what we do:
>
> Patterns added to regex-urlfilter.txt:
>
> # accept anything else
> +^http:\/\/www\.technorati\.com\/blogs.*
> +.*rss.*
>
> -.
>
> Patterns added to crawl-urlfilter.txt:
>
> # accept hosts in MY.DOMAIN.NAME
> +^http:\/\/www\.technorati\.com\/blogs.*
> +.*rss.*
>
>
> # skip everything else
> -.
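
One way to check whether the two new URLs survive the filter rules quoted above is to replay them outside Nutch. The following is a minimal Python sketch, not Nutch code. It assumes the filter tries the rules top to bottom, lets the first pattern found anywhere in the URL decide ('+' accepts, '-' rejects), and rejects a URL that matches no rule. The names load_rules, accepts and FILTER_TEXT are hypothetical, and Python's re module stands in for Java regex, which behaves the same way for patterns this simple.

```python
import re

# Rules copied from the crawl-urlfilter.txt excerpt quoted above.
FILTER_TEXT = r"""
# accept hosts in MY.DOMAIN.NAME
+^http:\/\/www\.technorati\.com\/blogs.*
+.*rss.*

# skip everything else
-.
"""


def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule whose pattern is found in the URL.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = load_rules(FILTER_TEXT)
for url in (
    "http://www.newsvine.com/_feeds/rss2/index",
    "http://www.technorati.com/blogs/",
    "http://example.com/",
):
    print(url, "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions both new URLs pass the rules shown, so the filter files as quoted would not be what keeps them out of the index.
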
>
>
> Patterns used in index-url-category plugin
>
> rules.properties file
>
> # News
> http://newsrss.bbc.co.uk/rss/*=news
> http://www.newsvine.com/*=news
> .*rss.*=news
> .*\.xml=news
>
> # Blogs
> .*technorati\.com\/blogs.*=blogs
>
> # Web
> .*=web
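
The category rules can be replayed the same way. This is only a sketch under stated assumptions, since the matching semantics of the index-url-category plugin are not shown in this thread. It assumes each line is "regex=category", the rules are tried in file order, and the first match wins. The names load_category_rules, categorize and the whole_url switch are hypothetical. The switch exists because the result can depend on whether the plugin matches the whole URL or only looks for the pattern somewhere inside it.

```python
import re

# Rules copied from the rules.properties excerpt quoted above.
RULES_TEXT = r"""
# News
http://newsrss.bbc.co.uk/rss/*=news
http://www.newsvine.com/*=news
.*rss.*=news
.*\.xml=news

# Blogs
.*technorati\.com\/blogs.*=blogs

# Web
.*=web
"""


def load_category_rules(text):
    """Parse 'regex=category' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last '=' so a pattern may itself contain '='.
        pattern, sep, category = line.rpartition("=")
        if not sep:
            raise ValueError("expected 'regex=category': " + line)
        rules.append((re.compile(pattern), category))
    return rules


def categorize(rules, url, whole_url=True):
    """Return the category of the first matching rule, or None.

    whole_url=True requires the pattern to match the entire URL;
    whole_url=False accepts a match anywhere in the URL.
    """
    for pattern, category in rules:
        matched = pattern.fullmatch(url) if whole_url else pattern.search(url)
        if matched:
            return category
    return None


rules = load_category_rules(RULES_TEXT)
for url in (
    "http://www.newsvine.com/_feeds/rss2/index",
    "http://www.technorati.com/blogs/",
    "http://www.technorati.com/blogs/rss",
):
    print(url, categorize(rules, url), categorize(rules, url, whole_url=False))
```

Two things stand out when the rules are read as regular expressions. A trailing "/*" means "zero or more slashes", not "anything after the slash", so with whole-URL matching the newsvine.com rule alone does not match the feed URL (it is still caught by the ".*rss.*" rule). And because the first match wins, a technorati.com/blogs URL that contains "rss" would land in news rather than blogs.
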
>
> Thank you.
> Dima.
>
>
>
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general