Thank you, Vishal. This part is working well now. I am still figuring out why the URLs have not been properly categorized, though.
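
To narrow it down, here is a minimal Python sketch that replays the two seed URLs against the patterns quoted below. It assumes the first matching rule wins, in file order, and that patterns are matched unanchored (search rather than full match). I have not confirmed that the index-url-category plugin evaluates rules.properties this way, and the names is_accepted and categorize are made up for illustration, so treat it as a sanity check rather than a description of the plugin.

```python
import re

# Seed URLs from the original message.
SEED_URLS = [
    "http://www.newsvine.com/_feeds/rss2/index",
    "http://www.technorati.com/blogs/",
]

# Filter rules as quoted from the urlfilter files: '+' accepts, '-' rejects.
FILTER_RULES = [
    ("+", r"^http:\/\/www\.technorati\.com\/blogs.*"),
    ("+", r".*rss.*"),
    ("-", r"."),
]

# Category rules as quoted from rules.properties, kept in file order.
# ASSUMPTION: the plugin reads them in this order; not verified.
CATEGORY_RULES = [
    (r"http://newsrss.bbc.co.uk/rss/*", "news"),
    (r"http://www.newsvine.com/*", "news"),
    (r".*rss.*", "news"),
    (r".*\.xml", "news"),
    (r".*technorati\.com\/blogs.*", "blogs"),
    (r".*", "web"),
]


def is_accepted(url, rules=FILTER_RULES):
    """Return True if the first matching filter rule is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as rejected


def categorize(url, rules=CATEGORY_RULES):
    """Return the category of the first matching rule, or None."""
    for pattern, category in rules:
        if re.search(pattern, url):
            return category
    return None


if __name__ == "__main__":
    for url in SEED_URLS:
        print(url, is_accepted(url), categorize(url))
```

Under these assumptions both seed URLs pass the filters, the newsvine feed comes out as news and the technorati page as blogs. A technorati URL that contains "rss" would come out as news, because the `.*rss.*` rule sits above the blogs rule. If the index shows something different, the plugin may be applying the rules in another order or with different matching.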
Dima.

----- Original Message -----
From: "Vishal Shah" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Monday, September 04, 2006 5:23 AM
Subject: RE: adding new URLs to nutch index

> Hi Dima,
>
> Which version of Nutch are you using? From 0.8 onwards, the name of
> the urls file has to be urls.txt, and its parent dir has to be passed
> to inject. For example, if your urls.txt is in a dir called NewUrls, then
> your inject cmd would be:
>
> bin/nutch inject crawl/crawldb NewUrls
>
> Also, check your crawl-urlfilter.txt to make sure that these new URLs
> won't be filtered.
>
> Regards,
>
> -vishal.
>
> -----Original Message-----
> From: Dima Gritsenko [mailto:[EMAIL PROTECTED]
> Sent: Monday, September 04, 2006 3:36 PM
> To: [email protected]
> Subject: adding new URLs to nutch index
>
> Hi,
>
> We are indexing DMOZ and we want to add two other URLs for indexing, but
> we seem to have a problem searching those 2 newly added URLs (no results
> returned).
> Here's what we do to add new URLs to the nutch index:
> 1) Created a dir /url with a "url" file that contains these two URLs:
> http://www.newsvine.com/_feeds/rss2/index
> http://www.technorati.com/blogs/
>
> 2) Then the following command is run (it should be adding our extra URLs
> to the nutch DB/index)
> bin/nutch inject crawl/crawldb urls
>
> 3) Then start the recrawl
> bin/recrawl /home/dima/workspace/hapool/ /usr/share/nutch-0.8/crawl/ 3 0
>
> We are also using the index-url-category plugin, which assigns URLs to
> different categories for future filtered search.
> Here's what we do:
>
> Add patterns used in regex-urlfilter.txt
>
> # accept anything else
> +^http:\/\/www\.technorati\.com\/blogs.*
> +.*rss.*
>
> -.
>
> Add patterns used in crawl-urlfilter.txt
>
> # accept hosts in MY.DOMAIN.NAME
> +^http:\/\/www\.technorati\.com\/blogs.*
> +.*rss.*
>
>
> # skip everything else
> -.
>
> Patterns used in index-url-category plugin
>
> rules.properties file
>
> # News
> http://newsrss.bbc.co.uk/rss/*=news
> http://www.newsvine.com/*=news
> .*rss.*=news
> .*\.xml=news
>
> # Blogs
> .*technorati\.com\/blogs.*=blogs
>
> # Web
> .*=web
>
> Thank you.
> Dima.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
