Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by RenaudRichardet:
http://wiki.apache.org/nutch/SetupProxyForNutch

The comment on the change is:
new page about setting up Nutch to use a proxy

New page:
= Install Tinyproxy =

== Install ==
{{{
sudo apt-get install tinyproxy
}}}

== Configure ==
{{{
sudo vi /etc/tinyproxy/tinyproxy.conf
}}}
Sample configuration. Make sure you set Port and Allow (here I allow my 
localhost and my local subnet):
{{{
Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
Filter "/etc/tinyproxy/filter"
FilterURLs On
# filters will act as a blacklist
FilterDefaultDeny No

User nobody
Group nogroup
ViaProxyName "tinyproxy"
ConnectPort 443
ConnectPort 563
Timeout 600
DefaultErrorFile "/usr/share/tinyproxy/default.html"
StatFile "/usr/share/tinyproxy/stats.html"
Logfile "/var/log/tinyproxy.log"
LogLevel Info
PidFile "/var/run/tinyproxy.pid"
MaxClients 100
MinSpareServers 5
MaxSpareServers 20
StartServers 10
MaxRequestsPerChild 0
}}}
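
To reason about which clients the Allow lines admit, a minimal Python sketch may help. It is not how Tinyproxy itself evaluates access rules; the function names `parse_allow_rules` and `is_client_allowed` are hypothetical, and the sketch assumes Allow values are IP addresses or CIDR ranges (hostname entries are skipped).

```python
import ipaddress


def parse_allow_rules(conf_text):
    """Collect the networks named by Allow lines in a tinyproxy-style config."""
    networks = []
    for line in conf_text.splitlines():
        # Drop anything after '#' and split the directive from its value.
        parts = line.split("#", 1)[0].split()
        if len(parts) != 2 or parts[0] != "Allow":
            continue
        try:
            # strict=False accepts host bits: 192.168.1.110/24 -> 192.168.1.0/24
            networks.append(ipaddress.ip_network(parts[1], strict=False))
        except ValueError:
            # Assumption: non-IP values (e.g. hostnames) are ignored in this sketch.
            continue
    return networks


def is_client_allowed(client_ip, networks):
    """Return True if client_ip falls inside any of the allowed networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


SAMPLE = """Port 5555
Allow 127.0.0.1
Allow 192.168.1.110/24
"""

if __name__ == "__main__":
    rules = parse_allow_rules(SAMPLE)
    for ip in ("127.0.0.1", "192.168.1.42", "192.168.2.7"):
        print(ip, is_client_allowed(ip, rules))
```

The machine running Nutch must have an address covered by one of the Allow lines, otherwise the proxy will refuse its requests.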

== Create filters ==
This step is optional. Because of FilterDefaultDeny No, the filter file acts as a blacklist.
{{{
sudo vi /etc/tinyproxy/filter
}}}
and add the sites to be blocked, one pattern per line:
{{{
google.com
apache.org
}}}
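
The blacklist behaviour can be sketched in Python as follows. This is only an approximation under stated assumptions: Tinyproxy matches filter entries as POSIX regular expressions, whereas this sketch uses Python's `re` module, matches case-insensitively, and uses the hypothetical names `load_patterns` and `is_blocked`.

```python
import re


def load_patterns(filter_text):
    """Compile one pattern per non-empty, non-comment line of a filter file."""
    patterns = []
    for line in filter_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_blocked(url, patterns, default_deny=False):
    """Blacklist when default_deny is False, whitelist when it is True."""
    matched = any(p.search(url) for p in patterns)
    return (not matched) if default_deny else matched


FILTER_FILE = """google.com
apache.org
"""

if __name__ == "__main__":
    pats = load_patterns(FILTER_FILE)
    for url in ("http://www.google.com/", "http://example.org/"):
        print(url, "blocked" if is_blocked(url, pats) else "allowed")
```

Since each entry is treated as a regular expression, the unescaped dot in `google.com` matches any character; writing `google\.com` would be stricter.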

== Commands to Stop, Start, and Restart ==
{{{
sudo /etc/init.d/tinyproxy stop
sudo /etc/init.d/tinyproxy start
sudo /etc/init.d/tinyproxy restart
}}}

== Test the proxy with your browser ==
 * In Firefox, open Preferences, go to the General tab, and click Connection 
Settings. Then select Manual Proxy Configuration and enter the proxy host and 
the port you defined above.
 * If you created the filter above and browse to google.com, the proxy should 
block the request.
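
The same check can be scripted instead of using a browser. The following is a minimal sketch using Python's standard `urllib`; the helper names `build_proxy_opener` and `fetch_status` are hypothetical, and the host and port are assumed to be the ones from the sample configuration above.

```python
import urllib.error
import urllib.request


def build_proxy_opener(host, port):
    """Build an opener that sends HTTP and HTTPS requests through the proxy."""
    proxy_url = "http://%s:%d" % (host, port)
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_status(opener, url, timeout=10):
    """Return the HTTP status code seen when fetching url through the opener."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses (for example from a filtering proxy) arrive as exceptions.
        return err.code
```

Calling `fetch_status(build_proxy_opener("127.0.0.1", 5555), "http://example.org/")` on the proxy machine should return a success status for an unfiltered site and an error status for a filtered one.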

= Configure Nutch =
Copy the proxy configuration (see below) from conf/nutch-default.xml to 
conf/nutch-site.xml and fill in the values of your proxy:
{{{
<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
  <description>The proxy hostname.  If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value>5555</value>
  <description>The proxy port.</description>
</property>
}}}
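
Before crawling, it can be useful to confirm that the two properties are actually present in conf/nutch-site.xml. Below is a minimal Python sketch, not part of Nutch; the names `read_properties` and `proxy_settings` are hypothetical, and the sample assumes the properties sit inside the usual `<configuration>` root element.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a dict of name -> value for every <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def proxy_settings(props):
    """Return (host, port) if a proxy host is set, otherwise None."""
    host = props.get("http.proxy.host", "")
    if not host:
        # An empty host means no proxy is used.
        return None
    port_text = props.get("http.proxy.port", "")
    port = int(port_text) if port_text else None
    return (host, port)


SAMPLE_SITE = """<configuration>
<property>
  <name>http.proxy.host</name>
  <value>192.168.0.157</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>5555</value>
</property>
</configuration>"""

if __name__ == "__main__":
    print(proxy_settings(read_properties(SAMPLE_SITE)))
```

To check a real installation, read the file contents with `open("conf/nutch-site.xml").read()` and pass them to `read_properties`.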

Now if you crawl sites, Nutch will use your proxy. You can monitor it by 
looking at the logs of Tinyproxy during a crawl:
{{{
sudo tail -f /var/log/tinyproxy.log
}}}

= More resources =
 * http://ubuntuforums.org/showthread.php?t=122011
 * http://doc.gwos.org/index.php/TinyProxy
 * http://doc.ubuntu-fr.org/serveur/tinyproxy
