How to get all the crawled pages for a particular domain
Hi,

I have set up Nutch 1.0 on a cluster of 3 nodes. We are running two applications:

1. A Nutch-based search application. We have successfully crawled approx. 25M pages on the 3 nodes, and it is working as expected.

2. An application that extracts information for a particular domain. As of today this application uses a Heritrix-based crawler that crawls the given domain; our algorithms then go through the pages and extract the required information.

Since we are already crawling with Nutch in distributed mode, we don't want to recrawl with another tool like Heritrix for the 2nd application; I want to reuse the same crawled data. The extraction algorithms, however, need all the crawled pages for a particular domain in order to extract all relevant information about that domain.

I thought that if I could somehow feed the Nutch crawl data to the 2nd application, perhaps by writing a Nutch plugin, it would really save us work, money and effort by not recrawling. But how do I get all the crawled pages for a particular domain in my plugin? Where should I look in the Nutch code? Any pointer / idea in this direction will really help.

Thanks,
Bhavin
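A note for anyone searching the archives: in Nutch 1.0 the fetched pages live in the segment directories, and the stock SegmentReader tool (`bin/nutch readseg`) can dump them to plain text, so a custom plugin may not be needed at all. The following is a rough sketch only: the segment path is hypothetical, and the `URL::` record-header format should be double-checked against the dump your Nutch version actually produces.

```shell
# Step 1 (needs a Nutch install; shown as a comment so the sketch below
# stays runnable on its own): dump a segment to plain text with the
# stock SegmentReader tool.
#
#   bin/nutch readseg -dump crawl/segments/20091209120000 segdump
#
# Step 2: keep only the records that belong to one domain. Dump records
# are blank-line separated, so awk's paragraph mode can split them.
extract_domain() {  # usage: extract_domain <dumpfile> <domain>
  # Note: dots in <domain> act as regex wildcards here; good enough
  # for a sketch, but escape them for real use.
  awk -v d="$2" 'BEGIN { RS=""; ORS="\n\n" }
    $0 ~ ("URL:: https?://([^/ ]*\\.)?" d "/") { print }' "$1"
}

# Tiny demo on a fake two-record dump:
cat > /tmp/segdump.txt <<'EOF'
Recno:: 0
URL:: http://www.example.com/page1

Recno:: 1
URL:: http://other.org/page2
EOF
extract_domain /tmp/segdump.txt example.com   # prints record 0 only
```

Running the filter over a full `readseg -dump` output gives the 2nd application one plain-text file per domain without any recrawling.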
Re: Nutch Hadoop 0.20 - Exception
Hi,

Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh. I don't know what to do! I've tried all kinds of stuff but with no luck... :(

*hadoop-eran-jobtracker-master.log*

2009-12-09 12:04:53,965 FATAL mapred.JobTracker - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.hadoop.mapred.JobTracker.init(JobTracker.java:1610)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:180)
        at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:172)
        at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3699)

*hadoop-eran-namenode-master.log*

2009-12-09 12:04:27,583 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:235)
        at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:220)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:202)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:279)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Thanks for trying to help,
Eran

On Sun, Dec 6, 2009 at 3:51 PM, Eran Zinman zze...@gmail.com wrote:
> Hi, Just upgraded to the latest version of Nutch with Hadoop 0.20. I'm getting the following exception in the namenode log and DFS doesn't start:
> 2009-12-06 15:48:32,523 ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded
> [same namenode stack trace as above]
> Any help will be appreciated ... quite stuck with this. Thanks, Eran
Re: Nutch Hadoop 0.20 - Exception
Eran Zinman wrote:
> Hi, Sorry to bother you guys again, but it seems that no matter what I do I can't run the new version of Nutch with Hadoop 0.20. I am getting the following exceptions in my logs when I execute bin/start-all.sh

Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)?

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
Re: Nutch Hadoop 0.20 - Exception
Hi Andrzej,

Thanks for your help (as always). Still getting the same exception when running on a standalone Hadoop cluster. Getting the same exceptions as before - also in the datanode log I'm getting:

2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
        at org.apache.hadoop.ipc.Client.call(Client.java:742)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy4.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
        at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:314)
        at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:291)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:269)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcher.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
        at sun.nio.ch.IOUtil.read(IOUtil.java:206)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:276)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:12 PM, Andrzej Bialecki a...@getopt.org wrote:
> Do you use the scripts in place, i.e. without deploying the nutch*.job to a separate Hadoop cluster? Could you please try it with a standalone Hadoop cluster (even if it's a pseudo-distributed, i.e. single node)? [...]
Re: Nutch Hadoop 0.20 - Exception
Hi,

Running new Nutch version status:

1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode).
2. Nutch doesn't work when I set it up to work with Hadoop, either in a single-node or cluster setup.

*I'm getting an exception:*

ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded

I thought it might be a good idea to attach my Hadoop conf files, so here they are:

*core-site.xml*

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.0.0.2:9000/</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>
</configuration>

*mapred-site.xml*

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.0.0.2:9001</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/my_crawler/filesystem/mapreduce/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/my_crawler/filesystem/mapreduce/local</value>
  </property>
</configuration>

*hdfs-site.xml*

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/my_crawler/filesystem/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/my_crawler/filesystem/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>

Thanks,
Eran

On Wed, Dec 9, 2009 at 12:22 PM, Eran Zinman zze...@gmail.com wrote:
> Hi Andrzej, Thanks for your help (as always). Still getting same exception when running on standalone Hadoop cluster. Getting same exceptions as before - also in the datanode log I'm getting: 2009-12-09 12:20:37,805 ERROR datanode.DataNode - java.io.IOException: Call to 10.0.0.2:9000 failed on local exception: java.io.IOException: Connection reset by peer [...]
Re: Nutch Hadoop 0.20 - Exception
1) Is this a new or existing Hadoop cluster?
2) What Java version are you using and what is your environment?

Dennis

Eran Zinman wrote:
> Hi, Running new Nutch version status: 1. Nutch runs perfectly if Hadoop is disabled (i.e. running in normal mode). 2. Nutch doesn't work when I setup it to work with Hadoop either in a single or cluster setup. *I'm getting an exception:* ERROR namenode.NameNode - java.lang.SecurityException: sealing violation: can't seal package org.mortbay.util: already loaded [...]
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis,

1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format, and it still didn't work...

2) I'm using:

java version 1.6.0_0
OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12)
OpenJDK Client VM (build 14.0-b08, mixed mode, sharing)

3) My environment variables:

ORBIT_SOCKETDIR=/tmp/orbit-eran
SSH_AGENT_PID=3533
GPG_AGENT_INFO=/tmp/seahorse-Gq6lRI/S.gpg-agent:3557:1
TERM=xterm
SHELL=/bin/bash
XDG_SESSION_COOKIE=1a02c2275727547fa7209ad54a91276c-1260199857.905267-2000911890
GTK_RC_FILES=/etc/gtk/gtkrc:/home/eran/.gtkrc-1.2-gnome2
WINDOWID=54653392
GTK_MODULES=canberra-gtk-module
USER=eran
LS_COLORS=no=00:fi=00:di=01;34:ln=01;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.svgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.rar=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:
GNOME_KEYRING_SOCKET=/tmp/keyring-0Vt0yu/socket
SSH_AUTH_SOCK=/tmp/keyring-0Vt0yu/socket.ssh
SESSION_MANAGER=local/eran:/tmp/.ICE-unix/3387
USERNAME=eran
DESKTOP_SESSION=default
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
GDM_XSERVER_LOCATION=local
PWD=/home/eran
JAVA_HOME=/usr/lib/jvm/default-java/
LANG=en_US.UTF-8
GDM_LANG=en_US.UTF-8
GDMSESSION=default
HISTCONTROL=ignoreboth
SHLVL=1
HOME=/home/eran
GNOME_DESKTOP_SESSION_ID=this-is-deprecated
LOGNAME=eran
XDG_DATA_DIRS=/usr/local/share/:/usr/share/:/usr/share/gdm/
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-E4IJ0hMrD8,guid=c3caaf3e590c65a58904ca7f4b1d1fb3
LESSOPEN=| /usr/bin/lesspipe %s
WINDOWPATH=7
DISPLAY=:0.0
LESSCLOSE=/usr/bin/lesspipe %s %s
XAUTHORITY=/home/eran/.Xauthority
COLORTERM=gnome-terminal
_=/usr/bin/printenv

Thanks,
Eran

On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote:
> 1) Is this a new or existing Hadoop cluster?
> 2) What Java version are you using and what is your environment? [...]
Re: Nutch Hadoop 0.20 - Exception
Did you do a fresh install of Nutch with Hadoop 0.20, or did you just copy over the new jars? A sealing violation means multiple copies of the same jars are being loaded, and the Jetty version changed between Hadoop 0.19 and 0.20.

Dennis

Eran Zinman wrote:
> Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work... 2) I'm using: java version 1.6.0_0, OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12), OpenJDK Client VM (build 14.0-b08, mixed mode, sharing) [...]
Re: Nutch Hadoop 0.20 - Exception
Hi Dennis,

Thanks for trying to help. I don't know what "fresh install" means exactly. Here is what I've done:

1) Downloaded the latest version of Nutch from SVN to a new folder.
2) Copied all the custom plugins I've written to the new folder.
3) Edited all configuration files.
4) Executed "ant package".
5) Ran the new Nutch... and got this error.

What did I miss?

Thanks,
Eran

On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote:
> Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple of the same jars being loaded and the Jetty versions changed between 0.19 and 0.20 for Hadoop. [...]
Re: Nutch Hadoop 0.20 - Exception
Hi all, thanks Dennis - you helped me solve the problem. The problem was that I had two versions of Jetty in my lib folder. I deleted the old version and voilà - it works. The problem is that both versions exist in the SVN! Although I took a fresh copy from the SVN, I had both versions in my lib folder. I think we need to remove the old version from the SVN so people like me won't get confused... Thanks! Eran.

On Wed, Dec 9, 2009 at 4:10 PM, Eran Zinman zze...@gmail.com wrote: Hi Dennis, Thanks for trying to help. I don't know what "fresh install" means exactly. Here is what I've done: 1) Downloaded the latest version of Nutch from the SVN to a new folder. 2) Copied all the custom plugins I've written to the new folder. 3) Edited all configuration files. 4) Executed "ant package". 5) Ran the new Nutch... and got this error. What did I miss? Thanks, Eran

On Wed, Dec 9, 2009 at 3:36 PM, Dennis Kubes ku...@apache.org wrote: Did you do a fresh install of Nutch with Hadoop 0.20 or did you just copy over the new jars? The sealing violation is multiple copies of the same jars being loaded, and the Jetty versions changed between 0.19 and 0.20 for Hadoop. Dennis

Eran Zinman wrote: Hi Dennis, 1) I've initially tried to run on my existing DFS and it didn't work. I then made a backup of my DFS and performed a format and it still didn't work...
2) I'm using: java version 1.6.0_0, OpenJDK Runtime Environment (IcedTea6 1.4.1) (6b14-1.4.1-0ubuntu12), OpenJDK Client VM (build 14.0-b08, mixed mode, sharing). 3) My environment variables: ... Thanks, Eran

On Wed, Dec 9, 2009 at 2:38 PM, Dennis Kubes ku...@apache.org wrote: ...
Re: Nutch Hadoop 0.20 - Exception
Done. I have removed the old Jetty jars from the SVN. Thanks for bringing this issue forward. Dennis

Eran Zinman wrote: ...
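The duplicate-Jetty diagnosis above generalizes: a "sealing violation" almost always means two versions of the same jar ended up on the classpath. A minimal sketch of how one might spot duplicated library versions in a lib/ folder (the demo directory and jar names here are fabricated for illustration; in a real checkout you would list $NUTCH_HOME/lib instead):

```shell
#!/bin/sh
# Sketch: find libraries that appear more than once (under different versions)
# in a lib/ directory -- the usual cause of "sealing violation" errors.
# The demo directory and jar names below are made up for illustration.
demo=$(mktemp -d)
touch "$demo/jetty-5.1.4.jar" "$demo/jetty-6.1.14.jar" "$demo/hadoop-0.20.1-core.jar"

# Strip a trailing "-<version>.jar" and report base names occurring twice or more.
dupes=$(ls "$demo" | sed 's/-[0-9][0-9.]*\.jar$//' | sort | uniq -d)
echo "duplicated libraries: $dupes"
rm -rf "$demo"
```

Running this prints `duplicated libraries: jetty`, flagging the library that is present under two versions while the lone hadoop jar passes.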
Nutch 1.0 and Office 2007 documents
Hi, I'm also curious as to whether anyone has had success with Nutch parsing Office 2007 documents (.pptx, .xlsx, .docx) - I get the same errors as seen here: http://old.nabble.com/How-to-successfully-crawl-and-index-office-2007-documents-in-Nutch-1.0-td26640949.html#a26640949 Is a separate plugin required to parse these documents (i.e., parse-msexcel, parse-mspowerpoint, etc. will *not* work)? I noticed the comment on the above thread - "docx should be parsed, A plugin can be used to Parsed docx file. you get some help info from parse-html plugin and so on." - but didn't find it really helpful. Regards, Joe

This message is confidential to Prodea Systems, Inc unless otherwise indicated or apparent from its nature. This message is directed to the intended recipient only, who may be readily determined by the sender of this message and its contents. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient: (a) any dissemination or copying of this message is strictly prohibited; and (b) immediately notify the sender by return message and destroy any copies of this message in any form (electronic, paper or otherwise) that you have. The delivery of this message and its information is neither intended to be nor constitutes a disclosure or waiver of any trade secrets, intellectual property, attorney work product, or attorney-client communications. The authority of the individual sending this message to legally bind Prodea Systems is neither apparent nor implied, and must be independently verified.
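For context, which parsers Nutch activates is governed by the plugin.includes property in nutch-site.xml. A sketch of what an Office-enabled plugin list might look like (the exact list is illustrative, not authoritative; check the plugins/ directory of your build). Note that the parse-ms* plugins of that era targeted the older binary .doc/.xls/.ppt formats, which is consistent with the .docx/.xlsx/.pptx failures described in the thread:

```xml
<!-- nutch-site.xml (sketch; the plugin list is illustrative, not authoritative) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword|msexcel|mspowerpoint|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  <description>Regular expression naming the plugin directories to include.</description>
</property>
```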
how to force nutch to do a recrawl
I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete recrawl? thanks, - Vijaya

Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com http://www.sra.com/ Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years. Please consider the environment before printing this e-mail.

This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143.
Re: how to force nutch to do a recrawl
What do you mean by recrawl? Does the following command meet what you need? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to a different one from the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch 1.0 on Windows. How do I force Nutch to do a complete recrawl? thanks, - Vijaya ...
RE: how to force nutch to do a recrawl
I tried that and it worked a few times, but now I get 0 records selected for fetching.

$ bin/nutch crawl urls -dir crawl9a -depth 15 -topN 50
crawl started in: crawl9a
rootUrlDir = urls
threads = 10
depth = 15
topN = 50
Injector: starting
Injector: crawlDb: crawl9a/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: crawl9a/segments/20091209124308
Generator: filtering: true
Generator: topN: 50
Generator: jobtracker is 'local', generating exactly one
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.
No URLs to fetch - check your seed list and URL filters.
crawl finished: crawl9a

Vijaya Peters, SRA International, Inc.

-----Original Message----- From: xiao yang [mailto:yangxiao9...@gmail.com] Sent: Wednesday, December 09, 2009 1:19 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl ...
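A "Generator: 0 records selected" result like the one above usually means no URL in the crawldb is currently due for fetching. One way to verify is to inspect the crawldb with Nutch's readdb tool; a sketch follows (the crawl directory name crawl9a is taken from the session above, and the commands are wrapped in a function so they are documented here without being executed, since they require a working Nutch installation):

```shell
#!/bin/sh
# Sketch: inspect the crawldb to see why the generator selects 0 records.
# "crawl9a" is the crawl directory from the session above; adjust as needed.
# Wrapped in a function so this file documents the commands without running
# them (they require a working Nutch installation).
inspect_crawldb() {
  bin/nutch readdb crawl9a/crawldb -stats              # record counts per fetch status
  bin/nutch readdb crawl9a/crawldb -dump crawldb_dump  # per-URL fetch time and interval
}
echo "inspect_crawldb defined"
```

The -dump output shows each URL's next-fetch time, which makes it obvious whether everything is simply not yet due.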
Re: how to force nutch to do a recrawl
Nutch only recrawls every 30 days by default. So set the number of days adequately and it will recrawl; read nutch-default.xml to get the details. 2009/12/9, xiao yang yangxiao9...@gmail.com: What do you mean by recrawl? ... -- -MilleBii-
RE: how to force nutch to do a recrawl
I tried that too. In nutch-site.xml I added the below, but it had no effect.

<property>
  <name>db.default.fetch.interval</name>
  <value>0</value>
  <description>(DEPRECATED) The default number of days between re-fetches of a page. Value was 30.</description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value>
  <description>The default number of seconds between re-fetches of a page. Value was 2592000 (30 days).</description>
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>3600</value>
  <description>The maximum number of seconds between re-fetches of a page. After this period every page in the db will be re-tried, no matter what its status is. Value was 7776000 (90 days).</description>
</property>

Vijaya Peters, SRA International, Inc.

-----Original Message----- From: MilleBii [mailto:mille...@gmail.com] Sent: Wednesday, December 09, 2009 1:27 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl ...
Re: how to force nutch to do a recrawl
What about the configuration in crawl-urlfilter.txt? On Thu, Dec 10, 2009 at 2:29 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: ...
RE: how to force nutch to do a recrawl
I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? I have regular expressions to include/exclude certain extensions and certain urls, but that's all I have in there. Please send me an example and I'll give it a try. Thanks!

-----Original Message----- From: xiao yang [mailto:yangxiao9...@gmail.com] Sent: Wednesday, December 09, 2009 1:41 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl What about the configuration in crawl-urlfilter.txt? ...
Re: how to force nutch to do a recrawl
I don't think you can use the nutch crawl command to do that; it is a one-stop-shop command. You probably want to use the individual commands. Type "nutch generate" to get the help and you will see the option -adddays; read that page on the wiki to get a feel for how you should do it: http://wiki.apache.org/nutch/Crawl 2009/12/9 Peters, Vijaya vijaya_pet...@sra.com I didn't see a setting to override in crawl-urlfilter. How do I set numberDays? ...
value was 30 /description /property property namedb.fetch.interval.default/name value3600/value descriptionThe default number of seconds between re-fetches of a page (30 days). value was 2592000 (30 days) /description /property property namedb.fetch.interval.max/name value3600/value descriptionThe maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what is its status. value was 7776000 /description /property Vijaya Peters SRA International, Inc. 4350 Fair Lakes Court North Room 4004 Fairfax, VA 22033 Tel: 703-502-1184 www.sra.com Named to FORTUNE's 100 Best Companies to Work For list for 10 consecutive years P Please consider the environment before printing this e-mail This electronic message transmission contains information from SRA International, Inc. which may be confidential, privileged or proprietary. The information is intended for the use of the individual or entity named above. If you are not the intended recipient, be aware that any disclosure, copying, distribution, or use of the contents of this information is strictly prohibited. If you have received this electronic information in error, please notify us immediately by telephone at 866-584-2143. -Original Message- From: MilleBii [mailto:mille...@gmail.com] Sent: Wednesday, December 09, 2009 1:27 PM To: nutch-user@lucene.apache.org Subject: Re: how to force nutch to do a recrawl Nutch only recrawl every 30 days by default. So you set the numberDays adequately and it wil recrawl read nutch-default.xml to get the details 2009/12/9, xiao yang yangxiao9...@gmail.com: What do you mean by recrawl? Does the following command meets what you need? bin/nutch crawl urls -dir crawl -depth 3 -topN 50 Change the destination directory to a different one with the last crawl. On Thu, Dec 10, 2009 at 1:44 AM, Peters, Vijaya vijaya_pet...@sra.com wrote: I'm running Nutch 1.0 in windows. How do I force Nutch to do a complete recrawl? 
thanks,
- Vijaya

--
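The individual-commands approach suggested in this thread (generate with -adddays, then fetch, updatedb, and reindex) can be sketched as a dry-run script. This is an illustrative sketch only, not a tested recipe: the crawl directory layout, the segment name, and the -adddays value are assumptions, and the command sequence follows the recrawl outline on the wiki Crawl page for Nutch 1.0.

```shell
#!/bin/sh
# Sketch of a step-by-step recrawl with Nutch 1.0 (assumed layout: an
# existing "crawl" dir holding crawldb, linkdb, and segments).
# Dry run: run() prints and records each command instead of executing it;
# replace its body with "$@" to execute for real.
CRAWL=crawl
ADDDAYS=31   # add 31 days to page age so pages on a 30-day interval are reselected

CMDS=""
run() { CMDS="$CMDS$*;"; echo "+ $*"; }

# 1. generate a new fetch list, pretending ADDDAYS days have passed
run bin/nutch generate "$CRAWL/crawldb" "$CRAWL/segments" -adddays "$ADDDAYS"
# 2. fetch the newest segment (real scripts pick it with ls ... | tail -1)
SEGMENT="$CRAWL/segments/20091209120000"   # hypothetical segment name
run bin/nutch fetch "$SEGMENT"
# 3. fold the fetch results back into the crawl db
run bin/nutch updatedb "$CRAWL/crawldb" "$SEGMENT"
# 4. rebuild the link db and index the refreshed segment
run bin/nutch invertlinks "$CRAWL/linkdb" -dir "$CRAWL/segments"
run bin/nutch index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$SEGMENT"
```

Run as a dry run first to inspect the commands; only the generate step takes -adddays, which is what actually forces early reselection.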
RE: how to force nutch to do a recrawl
Okay. I'll dig a little deeper. I saw a few scripts that people had created, but I couldn't get them to work. Thanks much.

Vijaya

-----Original Message-----
From: MilleBii [mailto:mille...@gmail.com]
Sent: Wednesday, December 09, 2009 4:05 PM
To: nutch-user@lucene.apache.org
Subject: Re: how to force nutch to do a recrawl

I don't think you can use the nutch crawl command to do that; it's a one-stop-shop command. You probably want to use the individual commands instead. Type nutch generate to get the help and you will see the -adddays option; read the Crawl page on the wiki to get a feel for how to do it: http://wiki.apache.org/nutch/Crawl
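Part of the confusion in this exchange is units: the deprecated db.default.fetch.interval is expressed in days, while db.fetch.interval.default and db.fetch.interval.max are in seconds. A quick arithmetic check of the values quoted in the thread:

```shell
# The db.fetch.interval.* values discussed above are in seconds.
DAY=$((24 * 60 * 60))        # 86400 seconds in a day
echo $((2592000 / DAY))      # stock db.fetch.interval.default -> 30 (days)
echo $((7776000 / DAY))      # stock db.fetch.interval.max     -> 90 (days)
echo $((3600 / 60 / 60))     # the 3600 set in the thread      -> 1 (hour)
```

So setting both intervals to 3600 tells Nutch every page is stale after one hour, which is why -adddays on generate is the more direct lever for a one-off forced recrawl.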