Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "SetupNutchAndTor" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/SetupNutchAndTor?action=diff&rev1=1&rev2=2

  This tutorial provides an end-to-end example of accessing the Tor network 
and getting Nutch to crawl .onion pages. The .onion suffix designates an 
anonymous or pseudonymous address reachable via the Tor network.
  
  == Quick Notes ==
- This tutorial has worked best on Debian and Ubuntu however it has also been 
run on Mac OSX 10.9.4. Best efforts have been made to ensure that documentation 
covers these OS. If not, then [[|please let us know]]
+ This tutorial has worked best on Debian and Ubuntu however it has also been 
run on Mac OSX 10.9.4. Best efforts have been made to ensure that documentation 
covers these OS. If not, then 
[[http://nutch.apache.org/mailing_lists.html|please let us know]]
  
  == Install Tor ==
  
@@ -60, +60 @@

  == Tor Logging ==
  
  If you want, you can configure your Tor to be more useful in its
- logging. For example, add these lines to your /etc/tor/torrc:
+ logging. For example, add these lines to your {{{/etc/tor/torrc}}}:
- 
+ {{{
  SafeLogging 0
  LogTimeGranularity 1 
- 
+ }}}
  == The Socks Proxy Anomaly ==
  
  If, as in the case of Nutch, your crawler can't talk to a SOCKS proxy but 
can use an HTTP proxy, then you'll need to run an HTTP proxy and configure it 
to forward to the SOCKS proxy. To achieve this we select one of the following 
proxies.
@@ -95, +95 @@

  
  That is, uncomment the '''forward-socks5''' option in 
{{{/etc/privoxy/config}}} and make sure it points to {{{127.0.0.1:9050}}}. 
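  
  As a minimal sketch, assuming Tor's SOCKS listener is on its default port 
9050 and that every request should be forwarded through it, the uncommented 
line would look something like:
  {{{
  forward-socks5 / 127.0.0.1:9050 .
  }}}
  Here {{{/}}} is the URL pattern that matches all destinations, and the 
trailing {{{.}}} means that no further HTTP parent proxy is chained after Tor.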
  
- == Nutch Crawler Configuration == 
+ == Nutch Crawler Configuration ==
+ Configure Nutch to only follow domains that end in .onion. This can be done 
via simple URL filtering as described in the main 
[[http://wiki.apache.org/nutch/NutchTutorial|Nutch Tutorial]].
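  
  A rough sketch of what such a filter could look like in 
{{{conf/regex-urlfilter.txt}}} follows. The pattern itself is an assumption 
made for illustration and is not taken from the Nutch Tutorial, so adjust it 
to your needs. The first matching rule decides whether a URL is kept, so the 
final {{{-.}}} rule has to replace the default catch-all {{{+.}}} rule. In the 
real file each rule should begin in the first column.
  {{{
  # accept http(s) URLs whose host name ends in .onion (illustrative pattern)
  +^https?://([a-z0-9-]+\.)*[a-z0-9]+\.onion(:[0-9]+)?(/|$)
  # reject everything else
  -.
  }}}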
+ 
+ http://duskgytldkxiuqc6.onion/ is a fine example URL to test Nutch on, to 
make sure you're able to successfully fetch content and metadata.
+ 
+ Then https://ahmia.fi/onions/ has a list of many thousands more, most of 
which are down, so it should be a good exercise for Nutch.
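  
  For these fetches to go through Tor, Nutch also has to be pointed at the 
HTTP proxy. A minimal sketch for {{{conf/nutch-site.xml}}} follows, assuming 
Privoxy is running locally on its default listen address 
{{{127.0.0.1:8118}}}. Change the values if your proxy listens elsewhere.
  {{{
  <!-- send all HTTP fetches through the local HTTP proxy (assumed: Privoxy) -->
  <property>
    <name>http.proxy.host</name>
    <value>127.0.0.1</value>
  </property>
  <property>
    <name>http.proxy.port</name>
    <value>8118</value>
  </property>
  }}}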
  
  == Conclusion ==
+ This tutorial shows how to use Apache Nutch to crawl hidden services within 
the Tor network. The intention is to illustrate a use case beyond typical HTTP 
protocol crawl cycles. The most important thing is for people to keep this 
documentation up to date. If there is something which does not work, then 
please [[http://nutch.apache.org/mailing_lists.html|let us know]].
  
