How to get all the crawled pages for a particular domain
Hi,

I have set up Nutch 1.0 on a cluster of 3 nodes. We are running two applications:

1. A Nutch-based search application. We have successfully crawled approx. 25M pages on the 3 nodes, and it is working as expected.

2. An application that extracts information for a particular domain. As of today this application uses a Heritrix-based crawler that crawls the given domain; our algorithms then go through the pages and extract the required information.

Since we are already crawling with Nutch in distributed mode, we don't want to recrawl with another tool like Heritrix for the 2nd application; I want to reuse the same crawled data. The extraction algorithms, however, need all the crawled pages for a particular domain in order to extract all relevant information about that domain.

I thought that if I could somehow feed the Nutch crawl data to the 2nd application, perhaps by writing a Nutch plugin, it would really save us work, money and effort by not recrawling. But how do I get all the crawled pages for a particular domain in my plugin? Where should I look in the Nutch code? Any pointer / idea in this direction will really help.

Thanks,
Bhavin
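A note for anyone searching the archives: in Nutch 1.0 the fetched pages live in the segment directories, and the stock SegmentReader tool (`bin/nutch readseg`) can dump them to plain text, so a custom plugin may not be needed at all. The following is a rough sketch only: the segment path is hypothetical, and the `URL::` record-header format should be double-checked against the dump your Nutch version actually produces.

```shell
# Step 1 (needs a Nutch install; shown as a comment so the sketch below
# stays runnable on its own): dump a segment to plain text with the
# stock SegmentReader tool.
#
#   bin/nutch readseg -dump crawl/segments/20091209120000 segdump
#
# Step 2: keep only the records that belong to one domain. Dump records
# are blank-line separated, so awk's paragraph mode can split them.
extract_domain() {  # usage: extract_domain <dumpfile> <domain>
  # Note: dots in <domain> act as regex wildcards here; good enough
  # for a sketch, but escape them for real use.
  awk -v d="$2" 'BEGIN { RS=""; ORS="\n\n" }
    $0 ~ ("URL:: https?://([^/ ]*\\.)?" d "/") { print }' "$1"
}

# Tiny demo on a fake two-record dump:
cat > /tmp/segdump.txt <<'EOF'
Recno:: 0
URL:: http://www.example.com/page1

Recno:: 1
URL:: http://other.org/page2
EOF
extract_domain /tmp/segdump.txt example.com   # prints record 0 only
```

Running the filter over a full `readseg -dump` output gives the 2nd application one plain-text file per domain without any recrawling.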
Re: Nutch Hadoop 0.20 - Exception
Hi,

Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh. I don't know what to do! I've tried all kinds of stuff but with no luck... :(

*hadoop-eran-jobtracker-master.log*

2009-12-09 12:04:53,965 FATAL mapred.JobTracker - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1610)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:180)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:172)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3699)

*hadoop-eran-namenode-master.log*

2009-12-09 12:04:27,583 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Thanks for trying to help,
Eran

On Sun, Dec 6, 2009 at 3:51 PM, Eran Zinman zze...@gmail.com wrote:
> Hi, Just upgraded to the latest version of Nutch with Hadoop 0.20. I'm getting the following exception in the namenode log and DFS doesn't start:
> 2009-12-06 15:48:32,523 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
> [same namenode stack trace as above]
> Any help will be appreciated ... quite stuck with this. Thanks, Eran
Re: Nutch Hadoop 0.20 - Exception
Eran Zinman wrote:
> Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh

Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Nutch Hadoop 0.20 - Exception
Hi Andrzej,

Thanks for your help (as always). Still getting the same exception when running on a standalone Hadoop cluster. Getting the same exceptions as before - also in the datanode log I'm getting:

2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
        at org.apache.hadoop.ipc.Client.call(Client.java:742)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy4.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
        at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
        at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote:
> Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? [...]
Re: Nutch Hadoop 0.20 - Exception
Hi,

Running new Nutch version status:

1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode).
2. Nutch doesn't work when I set it up to work with Hadoop, either in a single-node or cluster setup.

*I'm getting an exception:*

ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded

I thought it might be a good idea to attach my Hadoop conf files, so here they are:

*core-site.xml*

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.0.0.2:9000/</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
</configuration>

*mapred-site.xml*

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.0.0.2:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/my_crawler/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/my_crawler/filesystem/mapreduce/local</value>
  </property>
</configuration>

*hdfs-site.xml*

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/my_crawler/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/my_crawler/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote:
> Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer [...]
Re: Nutch Hadoop 0.20 - Exception
1) Is this a new or existing Hadoop cluster?
2) What Java version are you using and what is your environment?

Dennis

Eran Zinman wrote:
> Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I setup it to work with Hadoop either in a single or cluster setup. *I'm getting an exception:* ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded [...]
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis,

1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format, and it still didn't work...

2) I'm using:

java version 1.6.0_0
OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)

3) My environment variables:

ORBIT_SOCKETDIR=/tmp/orbit-eran
SSH_AGENT_PID=3533
GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1
TERM=xterm
SHELL=/bin/bash
XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890
GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2
WINDOWID=54653392
GTK_MODULES=canberra-gtk-module
USER=eran
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket
SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh
SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387
USERNAME=eran
DESKTOP_SESSION=default
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
GDM_XSERVER_LOCATION=local
PWD=/home/eran
JAVA_HOME=/usr/lib/jvm/default-java/
LANG=en_US.UTF-8
GDM_LANG=en_US.UTF-8
GDMSESSION=default
HISTCONTROL=ignoreboth
SHLVL=1
HOME=/home/eran
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
LOGNAME=eran
XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3
LESSOPEN=| /usr/bin/lesspipe %s
WINDOWPATH=7
DISPLAY=:0.0
LESSCLOSE=/usr/bin/lesspipe %s %s
XAUTHORITY=/home/eran/.Xauthority
COLORTERM=gnome-terminal
_=/usr/bin/printenv

Thanks,
Eran

On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote:
> 1) Is this a new or existing Hadoop cluster?
> 2) What Java version are you using and what is your environment? [...]
Re: Nutch Hadoop 0.20 - Exception
Did you do a fresh install of Nutch with Hadoop 0.20, or did you just copy over the new jars? A sealing violation means multiple copies of the same jars are being loaded, and the Jetty version changed between Hadoop 0.19 and 0.20.

Dennis

Eran Zinman wrote:
> Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0, OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12), OpenJDK Client VM (build 14.0-b08, mixed mode, sharing) [...]
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis,

Thanks for trying to help. I don't know what "fresh install" means exactly. Here is what I've done:

1) Downloaded the latest version of Nutch from SVN to a new folder.
2) Copied all the custom plugins I've written to the new folder.
3) Edited all configuration files.
4) Executed "ant package".
5) Ran the new Nutch... and got this error.

What did I miss?

Thanks,
Eran

On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote:
> Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple of the same jars being loaded and the Jetty versions changed between 0.19 and 0.20 for Hadoop. [...]
Re: Nutch Hadoop 0.20 - Exception
Hi all, thanks Dennis - you helped me solve the problem. The problem was that I had two versions of Jetty in my lib folder. I deleted the old version and voilà - it works. The problem is that both versions exist in the SVN! Although I took a fresh copy from the SVN, I had both versions in my lib folder. I think we need to remove the old version from the SVN so people like me won't get confused... Thanks! Eran.

On Wed, Dec 9, 2009 at 4:10 PM, Eran Zinman zze...@gmail.com wrote: Hi Dennis, Thanks for trying to help. I don't know what "fresh install" means exactly. Here is what I've done: 1) Downloaded the latest version of Nutch from the SVN to a new folder. 2) Copied all the custom plugins I've written to the new folder. 3) Edited all configuration files. 4) Executed "ant package". 5) Ran the new Nutch... and got this error. What did I miss? Thanks, Eran

On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote: Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple copies of the same jars being loaded, and the Jetty versions changed between 0.19 and 0.20 for Hadoop. Dennis

Eran Zinman wrote: Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work...
2) I'm using: java version 1.6.0_0, OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12), OpenJDK Client VM (build 14.0-b08, mixed mode, sharing). 3) My environment variables: ... Thanks, Eran

On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote: ...
Re: Nutch Hadoop 0.20 - Exception
Done. I have removed the old Jetty jars from the SVN. Thanks for bringing this issue forward. Dennis

Eran Zinman wrote: ...
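The duplicate-Jetty diagnosis above generalizes: a "sealing violation" almost always means two versions of the same jar ended up on the classpath. A minimal sketch of how one might spot duplicated library versions in a lib/ folder (the demo directory and jar names here are fabricated for illustration; in a real checkout you would list $NUTCH_HOME/lib instead):

```shell
#!/bin/sh
# Sketch: find libraries that appear more than once (under different versions)
# in a lib/ directory -- the usual cause of "sealing violation" errors.
# The demo directory and jar names below are made up for illustration.
demo=$(mktemp -d)
touch "$demo/jetty-5.1.4.jar" "$demo/jetty-6.1.14.jar" "$demo/hadoop-0.20.1-core.jar"

# Strip a trailing "-<version>.jar" and report base names occurring twice or more.
dupes=$(ls "$demo" | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d)
echo "duplicated libraries: $dupes"
rm -rf "$demo"
```

Running this prints `duplicated libraries: jetty`, flagging the library that is present under two versions while the lone hadoop jar passes.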
Nutch 1.0 and Office 2007 documents
Hi, I'm also curious as to whether anyone has had success with Nutch parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same errors as seen here: http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949 Is a separate plugin required to parse these documents (i.e., parse-msexcel, parse-mspowerpoint, etc. will *not* work)? I noticed the comment on the above thread - "docx should be parsed, A plugin can be used to Parsed docx file. you get some help info from parse-html plugin and so on." - but didn't find it really helpful. Regards, Joe

This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient: (a) any dissemination or copying of this message is strictly prohibited; and (b) immediately notify the sender by return message and destroy any copies of this message in any form (electronic, paper or otherwise) that you have. The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied, and must be independently verified.
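For context, which parsers Nutch activates is governed by the plugin.includes property in nutch-site.xml. A sketch of what an Office-enabled plugin list might look like (the exact list is illustrative, not authoritative; check the plugins/ directory of your build). Note that the parse-ms* plugins of that era targeted the older binary .doc/.xls/.ppt formats, which is consistent with the .docx/.xlsx/.pptx failures described in the thread:

```xml
<!-- nutch-site.xml (sketch; the plugin list is illustrative, not authoritative) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|msexcel|mspowerpoint|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming the plugin directories to include.</description>
</property>
```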
how to force nutch to do a recrawl
I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete recrawl? thanks, - Vijaya

Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com http://www.sra.com/ Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years. Please consider the environment before printing this e-mail.

This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143.
Re: how to force nutch to do a recrawl
What do you mean by recrawl? Does the following command meet what you need? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to a different one from the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete recrawl? thanks, - Vijaya ...
RE: how to force nutch to do a recrawl
I tried that and it worked a few times, but now I get 0 records selected for fetching.

$ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50
crawl started in: crawl9a
rootUrlDir = urls
threads = 10
depth = 15
topN = 50
Injector: starting
Injector: crawlDb: crawl9a/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl9a/segments/20091209124308
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl9a

Vijaya Peters, SRA International, Inc.

-----Original Message----- From: xiao yang [mailto:yangxiao9...@gmail.com] Sent: Wednesday, December 09, 2009 1:19 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl ...
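A "Generator: 0 records selected" result like the one above usually means no URL in the crawldb is currently due for fetching. One way to verify is to inspect the crawldb with Nutch's readdb tool; a sketch follows (the crawl directory name crawl9a is taken from the session above, and the commands are wrapped in a function so they are documented here without being executed, since they require a working Nutch installation):

```shell
#!/bin/sh
# Sketch: inspect the crawldb to see why the generator selects 0 records.
# "crawl9a" is the crawl directory from the session above; adjust as needed.
# Wrapped in a function so this file documents the commands without running
# them (they require a working Nutch installation).
inspect_crawldb() {
  bin/nutch readdb crawl9a/crawldb -stats              # record counts per fetch status
  bin/nutch readdb crawl9a/crawldb -dump crawldb_dump  # per-URL fetch time and interval
}
echo "inspect_crawldb defined"
```

The -dump output shows each URL's next-fetch time, which makes it obvious whether everything is simply not yet due.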
Re: how to force nutch to do a recrawl
Nutch only recrawls every 30 days by default. So set the number of days adequately and it will recrawl; read nutch-default.xml to get the details. 2009/12/9, xiao yang yangxiao9...@gmail.com: What do you mean by recrawl? ... -- -MilleBii-
RE: how to force nutch to do a recrawl
I tried that too. In nutch-site.xml I added the below, but it had no effect.

<property>
  <name>db.default.fetch.interval</name>
  <value>0</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page. Value was 30.</description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>The default number of seconds between re-fetches of a page. Value was 2592000 (30 days).</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
  <description>The maximum number of seconds between re-fetches of a page. After this period every page in the db will be re-tried, no matter what its status is. Value was 7776000 (90 days).</description>
</property>

Vijaya Peters, SRA International, Inc.

-----Original Message----- From: MilleBii [mailto:mille...@gmail.com] Sent: Wednesday, December 09, 2009 1:27 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl ...
Re: how to force nutch to do a recrawl
What about the configuration in crawl-urlfilter.txt? On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: ...
RE: how to force nutch to do a recrawl
I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there. Please send me an example and I'll give it a try. Thanks!

-----Original Message----- From: xiao yang [mailto:yangxiao9...@gmail.com] Sent: Wednesday, December 09, 2009 1:41 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl What about the configuration in crawl-urlfilter.txt? ...
Re: how to force nutch to do a recrawl
I don't think you can use the nutch crawl command to do that; it is a one-stop-shop command. You probably want to use the individual commands. Type "nutch generate" to get the help and you will see the option -adddays; read that page on the wiki to get a feel for how you should do it: http://wiki.apache.org/nutch/Crawl 2009/12/9 Peters, Vijaya vijaya_pet...@sra.com I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? ...
value was 30 /description /property property namedb.fetch.interval.default/name value3600/value descriptionThe default number of seconds between re-fetches of a page (30 days). value was 2592000 (30 days) /description /property property namedb.fetch.interval.max/name value3600/value descriptionThe maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status. value was 7776000 /description /property Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years P Please consider the environment before printing this e-mail This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143. -Original Message- From: MilleBii [mailto:mille...@gmail.com] Sent: Wednesday, December 09, 2009 1:27 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl Nutch only recrawl every 30 days by default. So you set the numberDays adequately and it wil recrawl read nutch-default.xml to get the details 2009/12/9, xiao yang yangxiao9...@gmail.com: What do you mean by recrawl? Does the following command meets what you need? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to a different one with the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch 1.0 in windows. How do I force Nutch to do a complete recrawl? 
thanks,
- Vijaya

--
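The individual-commands approach suggested in this thread (generate with -adddays, then fetch, updatedb, and reindex) can be sketched as a dry-run script. This is an illustrative sketch only, not a tested recipe: the crawl directory layout, the segment name, and the -adddays value are assumptions, and the command sequence follows the recrawl outline on the wiki Crawl page for Nutch 1.0.

```shell
#!/bin/sh
# Sketch of a step-by-step recrawl with Nutch 1.0 (assumed layout: an
# existing "crawl" dir holding crawldb, linkdb, and segments).
# Dry run: run() prints and records each command instead of executing it;
# replace its body with "$@" to execute for real.
CRAWL=crawl
ADDDAYS=31   # add 31 days to page age so pages on a 30-day interval are reselected

CMDS=""
run() { CMDS="$CMDS$*;"; echo "+ $*"; }

# 1. generate a new fetch list, pretending ADDDAYS days have passed
run bin/nutch generate "$CRAWL/crawldb" "$CRAWL/segments" -adddays "$ADDDAYS"
# 2. fetch the newest segment (real scripts pick it with ls ... | tail -1)
SEGMENT="$CRAWL/segments/20091209120000"   # hypothetical segment name
run bin/nutch fetch "$SEGMENT"
# 3. fold the fetch results back into the crawl db
run bin/nutch updatedb "$CRAWL/crawldb" "$SEGMENT"
# 4. rebuild the link db and index the refreshed segment
run bin/nutch invertlinks "$CRAWL/linkdb" -dir "$CRAWL/segments"
run bin/nutch index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$SEGMENT"
```

Run as a dry run first to inspect the commands; only the generate step takes -adddays, which is what actually forces early reselection.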
RE: how to force nutch to do a recrawl
Okay. I'll dig a little deeper. I saw a few scripts that people had created, but I couldn't get them to work. Thanks much.

Vijaya

-----Original Message-----
From: MilleBii [mailto:mille...@gmail.com]
Sent: Wednesday, December 09, 2009 4:05 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

I don't think you can use the nutch crawl command to do that; it's a one-stop-shop command. You probably want to use the individual commands instead. Type nutch generate to get the help and you will see the -adddays option; read the Crawl page on the wiki to get a feel for how to do it: http://wiki.apache.org/nutch/Crawl
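Part of the confusion in this exchange is units: the deprecated db.default.fetch.interval is expressed in days, while db.fetch.interval.default and db.fetch.interval.max are in seconds. A quick arithmetic check of the values quoted in the thread:

```shell
# The db.fetch.interval.* values discussed above are in seconds.
DAY=$((24 * 60 * 60))        # 86400 seconds in a day
echo $((2592000 / DAY))      # stock db.fetch.interval.default -> 30 (days)
echo $((7776000 / DAY))      # stock db.fetch.interval.max     -> 90 (days)
echo $((3600 / 60 / 60))     # the 3600 set in the thread      -> 1 (hour)
```

So setting both intervals to 3600 tells Nutch every page is stale after one hour, which is why -adddays on generate is the more direct lever for a one-off forced recrawl.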