Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "RunningNutchAndSolr" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/RunningNutchAndSolr?action=diff&rev1=68&rev2=69

  {{{
  http://nutch.apache.org/
  }}}
+ * Edit the file conf/regex-urlfilter.txt and replace 
+ {{{
+ # accept anything else
+ +.  
+ }}}
+ 
+ with a regular expression matching the domain you wish to crawl. For example, 
to limit the crawl to the nutch.apache.org domain, the line should read:
+ 
+ {{{
+  +^http://([a-z0-9]*\.)*nutch.apache.org/ 
+ }}} 
+ 
+ This will include any http URL in the nutch.apache.org domain, including its subdomains.
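
To see which URLs the rule above accepts, the pattern can be tried outside Nutch. The following is a minimal Python sketch, not part of Nutch itself: the function name `is_accepted` and the sample URLs are hypothetical, and it assumes the filter applies the expression with a "find" style match after stripping the leading `+` (the include marker).

```python
import re

# Pattern copied from the regex-urlfilter.txt rule, without the leading "+",
# which only marks the rule as an "include" rule.
RULE = re.compile(r"^http://([a-z0-9]*\.)*nutch.apache.org/")

# Stricter variant (an assumption, not from the wiki page): the dots in the
# host name are escaped so they match only a literal ".".
STRICT_RULE = re.compile(r"^http://([a-z0-9]*\.)*nutch\.apache\.org/")


def is_accepted(url, rule=RULE):
    """Return True if the URL matches the include rule."""
    # re.search stands in for a "find" style match; the "^" anchor still
    # forces the match to start at the beginning of the URL.
    return rule.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://nutch.apache.org/",
        "http://wiki.nutch.apache.org/page",
        "https://nutch.apache.org/",
        "http://example.com/",
    ]
    for url in samples:
        print(url, is_accepted(url))
```

Note that the rule only covers the `http` scheme, and that an unescaped `.` matches any single character, so the rule as written is slightly more permissive than the host name suggests.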
   * Run the following command:
  {{{
  bin/nutch crawl urls -dir crawl -depth 3 -topN 5
@@ -102, +115 @@

  <field name="content" type="text" stored="true" indexed="true"/>
  }}}
  
- '''This tutorial was originally constructed and posted by 'waycool' on the 
user lists. It has been edited slightly for integration into the Apache Nutch 
project.'''
- 
