Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by RenaudRichardet:
http://wiki.apache.org/nutch/SetupProxyForNutch

The comment on the change is:
new page about setting up Nutch to use a proxy

New page:

= Install Tinyproxy =

== Install ==
{{{
sudo apt-get install tinyproxy
}}}

== Configure ==
{{{
sudo vi /etc/tinyproxy/tinyproxy.conf
}}}

Sample configuration. Make sure you set Port and Allow (here, I'm using my localhost):
{{{
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# filters will act as a blacklist
FilterDefaultDeny No
User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
}}}

== Create filters ==
If necessary, create a filter file. Because FilterDefaultDeny is set to No, it acts as a blacklist.
{{{
sudo vi /etc/tinyproxy/filter
}}}
Add the sites to be blocked, one per line:
{{{
google.com
apache.org
}}}

== Commands to Stop, Start, and Restart ==
{{{
sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart
}}}

== Test the proxy with your browser ==
 * In Firefox, open Preferences, go to the General tab, and click Connection settings. Then select Manual Proxy Configuration and enter the host and port you defined above.
 * If you have created the filter above and browse to google.com, the proxy should block the request.

= Configure Nutch =

Copy the proxy configuration (see below) from conf/nutch-default.xml to conf/nutch-site.xml and fill in the host and port of your proxy:
{{{
<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname.
  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
}}}

Now when you crawl sites, Nutch will use your proxy. You can monitor it by watching the Tinyproxy log during a crawl:
{{{
sudo tail -f /var/log/tinyproxy.log
}}}

= More resources =
 * http://ubuntuforums.org/showthread.php?t=122011
 * http://doc.gwos.org/index.php/TinyProxy
 * http://doc.ubuntu-fr.org/serveur/tinyproxy
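
As a command-line alternative to the browser test described above, the proxy can also be checked with curl before starting a crawl. This is only a minimal sketch: it assumes curl is installed, and the helper names `proxy_url` and `check_via_proxy` are made up for illustration and are not part of Tinyproxy or Nutch.

```shell
#!/bin/sh
# Hypothetical helper: build the proxy URL from a host and a port,
# e.g. the values used for http.proxy.host and http.proxy.port.
proxy_url() {
    printf 'http://%s:%s' "$1" "$2"
}

# Hypothetical helper: fetch a URL through the proxy and print only
# the HTTP status code of the response.
#   $1 = proxy host, $2 = proxy port, $3 = URL to fetch
check_via_proxy() {
    curl --silent --output /dev/null --max-time 10 \
         --proxy "$(proxy_url "$1" "$2")" \
         --write-out '%{http_code}\n' "$3"
}
```

For example, `check_via_proxy 127.0.0.1 5555 http://example.com/` should print a success status if the proxy is reachable, while a site listed in the filter file should come back with an error status instead. Each request should also show up in /var/log/tinyproxy.log.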