Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "SetupNutchAndTor" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/SetupNutchAndTor?action=diff&rev1=1&rev2=2

  This tutorial provides an end-to-end example of accessing the Tor network(s) and getting Nutch crawling .onion pages for which the suffix designates an anonymous or pseudonymous address reachable via the Tor network.

  == Quick Notes ==
- This tutorial has worked best on Debian and Ubuntu however it has also been run on Mac OSX 10.9.4. Best efforts have been made to ensure that documentation covers these OS. If not, then [[|please let us know]]
+ This tutorial has worked best on Debian and Ubuntu however it has also been run on Mac OSX 10.9.4. Best efforts have been made to ensure that documentation covers these OS. If not, then [[http://nutch.apache.org/mailing_lists.html|please let us know]]

  == Install Tor ==

@@ -60, +60 @@

  == Tor Logging ==
  If you want, you can configure your Tor to be more useful in its
- logging. For example, add these lines to your /etc/tor/torrc:
+ logging. For example, add these lines to your {{{/etc/tor/torrc}}}:
-
+ {{{
  SafeLogging 0
  LogTimeGranularity 1
-
+ }}}

  == The Socks Proxy Anomaly ==
  If, as in the case of Nutch, your crawler can't interact with a socks proxy, but it can do an http proxy, then you'll need to run an http proxy and configure it to use a socks proxy. To achieve this we select one of the following proxies.

@@ -95, +95 @@

  That is, uncomment the '''forward-socks5''' option in {{{/etc/privoxy/config}}} and make sure it points to {{{127.0.0.1:9050}}}.

- == Nutch Crawler Configuration ==
+ == Nutch Crawler Configuration ==
+ Configure Nutch to only follow domains that end in .onion. This can be done via simple urlfiltering as described in the main [[http://wiki.apache.org/nutch/NutchTutorial|Nutch Tutorial]]
+
+ http://duskgytldkxiuqc6.onion/ is a fine example url to test Nutch on, to make sure you're able to successfully fetch content and metadata.
+
+ Then https://ahmia.fi/onions/ has a list of many thousands more, most of which are down so it should be a good exercise for Nutch.

  == Conclusion ==
+ This tutorial acts as a mechanism for using Apache Nutch to crawl hidden services within the Tor network. The intention here is to extend/display/elaborate upon a use case other than typical HTTP protocol crawl cycles. Hopefully this tutorial provides that. The most important thing here is for people to maintain this documentation. If there is something which does not work, then please [[http://nutch.apache.org/mailing_lists.html|let us know]]
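
The "only follow domains that end in .onion" rule from the Nutch Crawler Configuration section can be sketched outside Nutch as a small host check. This is only an illustration of the rule, not Nutch's own urlfilter implementation. The function name `is_onion_url` and the `seeds` list are hypothetical, and the two URLs are the ones mentioned in the tutorial text.

```python
from urllib.parse import urlsplit


def is_onion_url(url: str) -> bool:
    """Return True when the URL's host ends in the .onion suffix.

    Hypothetical helper: it mirrors the intent of an .onion-only
    URL filter, it is not part of Nutch.
    """
    # urlsplit lowercases the hostname and yields None when there is none.
    host = urlsplit(url).hostname
    if host is None:
        return False
    # Tolerate a trailing dot on a fully qualified host name.
    return host.rstrip(".").endswith(".onion")


# Candidate seed URLs taken from the tutorial text (assumed seed list).
seeds = [
    "http://duskgytldkxiuqc6.onion/",
    "https://ahmia.fi/onions/",
]

# Keep only the URLs whose host is an .onion address.
onion_seeds = [u for u in seeds if is_onion_url(u)]
print(onion_seeds)
```

The check is made on the parsed host rather than on the whole URL string, so a clearnet page whose path merely ends in ".onion" would not slip through.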

