Hiring a Nutch Developer

2005-11-04 Thread Nathan Gwilliam
We're looking for a Nutch developer we can hire to build a Nutch search 
engine for our sites.  Are any of you doing side projects?


Nathan Gwilliam
Adoption.com & Families.com
[EMAIL PROTECTED]



Re: Hiring a Nutch Developer

2005-11-04 Thread Arun Kumar Sharma
Hi Nathan,
Please send me more details.

Nathan Gwilliam [EMAIL PROTECTED] wrote:
We're looking for a Nutch developer we can hire to build a Nutch search 
engine for our sites. Are any of you doing side projects?

Nathan Gwilliam
Adoption.com & Families.com
[EMAIL PROTECTED]



WITH WARM REGARDS,
ARUN K. SHARMA (Sr. Java Developer)

Mob: +919815295761
(W): 0172-5079323(ext)21



Re: Hiring a Nutch Developer

2005-11-04 Thread Nathan Gwilliam
I actually have several projects, but let's start with the first.  We 
need to create a search engine that crawls about 20 adoption-related 
sites that we are affiliated with, such as:


adoption.com
fosterparenting.com
crisispregnancy.com
adoption.org
adopting.org
123adoption.com (which includes a bunch of 5-page URLs in its network)
parentprofiles.com
adoptioninformation.com
adoptionshop.com
specialneeds.net (about to launch)
infertilitycentral.com
fertilityforums.com

Then we need to implement a combined site search across all of these 
sites, and also give each site in the group a site search that only 
searches the subset of pages/sites that we indicate from the larger 
database.  In other words, we need a search on adoptionshop.com that 
only searches the products from adoptionshop.com.
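
(A hedged sketch of what that per-site restriction could look like with 
Nutch's query-site filter: one shared index serves every site, and a site: 
clause narrows the results. The NutchBean/Query calls reflect the 0.7-era 
search API as best recalled and should be verified against the release used.)

import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;

public class SiteSearchSketch {
    public static void main(String[] args) throws Exception {
        NutchBean bean = new NutchBean(); // assumes the era's no-arg constructor
        // Restrict the combined index to one property, e.g. products
        // indexed from adoptionshop.com only.
        Query query = Query.parse("baby blanket site:adoptionshop.com");
        Hits hits = bean.search(query, 10);
        System.out.println("total hits: " + hits.getTotal());
    }
}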


We want to be able to give preference to pages based on title, URL, 
keyword density, etc.
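
(For the boosting requirement: the era's nutch-default.xml exposed per-field 
query boosts roughly like the nutch-site.xml overrides sketched below; exact 
property names and sensible values should be checked against the release used.)

<property>
  <name>query.title.boost</name>
  <value>2.0</value>  <!-- weight matches in the title field -->
</property>
<property>
  <name>query.url.boost</name>
  <value>4.0</value>  <!-- weight matches in the URL -->
</property>
<property>
  <name>query.anchor.boost</name>
  <value>2.0</value>  <!-- weight incoming anchor text -->
</property>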


We will provide the server hardware, the graphical templates and the 
URLs.  You would get the site search crawled, indexed and working.


What would you charge us for something like this?  Please include a 
couple of hours in your bid to train our developers on what you have done.


Thanks,
Nathan

Arun Kumar Sharma wrote:

Hi Nathan,
Please send me more details.



Nutch cluster questions.

2005-11-04 Thread Arsen Popovyan
At the moment we are using nutch-nightly (nutch-2005-07-20). We are not pleased 
with the throughput of fetching, parsing, indexing, analyzing, and scoring. 
Our spider currently retrieves approximately 25,000 new results per day. All 
processes run on one machine, and we use the local file system. We assume that 
if we want to raise throughput we need to use a cluster.

1)   Are there any ready-made solutions (intermediate storage, etc.) for 
clustering Nutch?

2)   Please tell us if there is experience clustering Nutch: what throughput 
was achieved, and how many machines were used?

3)   Which tasks can we divide across different machines and which can we 
not? And how must those tasks be synchronized?

4)  Will the spider's speed increase if we use NutchDistributedFileSystem? 
What are its advantages and disadvantages?

5)  We were advised to use the Nutch mapred branch. Should we use it?
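
(For reference, a hedged sketch of the handful of nutch-site.xml overrides 
that pointed the mapred branch at a cluster; the property names follow that 
era's defaults, while the hosts, ports, and task counts below are 
placeholders, not a tested setup.)

<property>
  <name>fs.default.name</name>
  <value>namenode-host:9000</value>   <!-- NDFS namenode; "local" keeps the local FS -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker-host:9001</value> <!-- "local" runs jobs in-process -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>20</value>                   <!-- rough rule: a few per machine -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>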



Re: nutch cluster questions.

2005-11-04 Thread Stefan Groschupf

Please do not cross-post questions!
Check out the map-reduce branch in the SVN. It does everything you are 
looking for, and it works well for me.


Stefan



On 04.11.2005 at 14:32, Arsen Popovyan wrote:

At the moment we are using nutch-nightly (nutch-2005-07-20). [...]






[jira] Created: (NUTCH-123) Cache.jsp sometimes generates NullPointerException

2005-11-04 Thread JIRA
Cache.jsp sometimes generates NullPointerException
--

 Key: NUTCH-123
 URL: http://issues.apache.org/jira/browse/NUTCH-123
 Project: Nutch
Type: Bug
  Components: web gui  
 Environment: All systems
Reporter: Lutischán Ferenc
Priority: Critical


There is a problem with the following line in cached.jsp:

  String contentType = (String) metaData.get("Content-Type");

In the segment data the key is sometimes not exactly "Content-Type"; it may 
appear as "content-type", "Content-type", etc.

The solution: replace the line above with these lines:

String contentTypeKey = "Content-Type";
for (Enumeration eNum = metaData.propertyNames(); eNum.hasMoreElements();) {
    // Find the stored key, whatever its capitalization.
    String name = (String) eNum.nextElement();
    if ("content-type".equalsIgnoreCase(name)) {
        contentTypeKey = name;
        break;
    }
}
final String contentType = (String) metaData.get(contentTypeKey);

Regards,
Ferenc

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting

Rod Taylor wrote:

Every segment that I fetch seems to be missing a part when stored on the
filesystem. The stranger thing is it is always the same part (very
reproducible).


This sounds strange.  Are the datanode errors always on the same host? 
How many hosts are you running this on?


Doug


Re: mapred questions

2005-11-04 Thread Doug Cutting

Ken van Mulder wrote:
First is that the fetcher slows down over time and continues to use more 
and more memory as it goes (which I think is eventually hanging the 
process).


What parser plugins do you have enabled?  These are usually the culprit. 
 Try using 'kill -QUIT' to see what various threads are doing, both at 
the start and later, when it slows and grows.


Second problem is trying to use the crawl. I've tried with a seeds/url 
file containing 4, 2000, and then 100k URLs. Using:


$ bin/nutch crawl seeds

Which goes through its processing and completes, but doesn't visit any 
of the urls in the seeds file. What am I missing to get it to actually 
do the crawl?


Are you using NDFS?  If so, the seeds directory needs to be stored in 
NDFS.  Use 'bin/nutch ndfs -put seeds seeds'.


Doug


[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]

Paul Baclace updated NUTCH-116:
---

Attachment: required_by_TestNDFS_v3.patch

I found and fixed a problem with a standalone DataNode process exiting too 
early (this was not detected by the current unit tests); it was caused by 
changes in the required_by_TestNDFS patch. main() will now join() all the 
subthreads via runAndWait(NutchConf), and run(NutchConf) can be used to start 
subthreads without waiting for them to finish.  The v3 patch has the 
cumulative required_by_TestNDFS changes.

(comments_msgs_and_local_renames_during_TestNDFS.patch is still separate.)


 TestNDFS a JUnit test specifically for NDFS
 ---

  Key: NUTCH-116
  URL: http://issues.apache.org/jira/browse/NUTCH-116
  Project: Nutch
 Type: Test
   Components: fetcher, indexer, searcher
 Versions: 0.8-dev
 Reporter: Paul Baclace
  Attachments: TestNDFS.java, TestNDFS.java, required_by_TestNDFS.patch, 
 required_by_TestNDFS_v2.patch, required_by_TestNDFS_v3.patch

 TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more 
 strictly, pseudo distributed) meaning all daemons run in one process and 
 sockets are used to communicate between daemons.  
 The test permutes various block sizes, number of files, file sizes, and 
 number of datanodes.  After creating 1 or more files and filling them with 
 random data, one datanode is shut down, and then the files are verified. 
 Next, all the random test files are deleted and we test for leakage 
 (non-deletion) by directly checking the real directories corresponding to the 
 datanodes still running.




Re: mapred bug -- bad part calculation?

2005-11-04 Thread Rod Taylor
On Fri, 2005-11-04 at 13:43 -0800, Doug Cutting wrote:
 Rod Taylor wrote:
  Every segment that I fetch seems to be missing a part when stored on the
  filesystem. The stranger thing is it is always the same part (very
  reproducible).
 
 This sounds strange.  Are the datanode errors always on the same host? 
 How many hosts are you running this on?

There is only a single datanode and there are 20 hosts.

-- 
Rod Taylor [EMAIL PROTECTED]



Re: mapred bug -- bad part calculation?

2005-11-04 Thread Rod Taylor
On Fri, 2005-11-04 at 13:43 -0800, Doug Cutting wrote:
 Rod Taylor wrote:
  Every segment that I fetch seems to be missing a part when stored on the
  filesystem. The stranger thing is it is always the same part (very
  reproducible).
 
 This sounds strange.  Are the datanode errors always on the same host? 
 How many hosts are you running this on?

I lied earlier. It still happens with smaller segments, just not as
frequently.

Found this in the namenode log file:

051104 200412 Server connection on port 5466 from 192.168.100.11:
exiting
051104 200438 Server connection on port 5466 from 192.168.100.11:
starting
051104 200438 Cannot start file because pendingCreates is non-null
051104 200438 Server handler on 5466 call error: java.io.IOException:
Cannot create file /opt/sitesell/sbider
_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
java.io.IOException: Cannot create
file /opt/sitesell/sbider_data/nutch/segments/20051104185259/2005110418530
0/crawl_fetch/part-00011/data
at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051104 200440 Server connection on port 5466 from 192.168.100.11:
exiting
051104 200504 Server connection on port 5466 from 192.168.100.11:
starting
051104 200504 Cannot start file because pendingCreates is non-null
051104 200504 Server handler on 5466 call error: java.io.IOException:
Cannot create
file 
/opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
java.io.IOException: Cannot create
file 
/opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051104 200505 Server connection on port 5466 from 192.168.100.11:
exiting
051104 200506 Removing lease [Lease.  Holder: NDFSClient_1755346663,
heldlocks: 0, pendingcreates: 0], leases remaining: 1
051104 200529 Server connection on port 5466 from 192.168.100.11:
starting
051104 201807 Server connection on port 5466 from 192.168.100.11:
exiting
051104 201812 Server connection on port 5466 from 192.168.100.15:
exiting
051104 201823 Server connection on port 5466 from 192.168.100.15:
starting



-- 
Rod Taylor [EMAIL PROTECTED]



[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-11-04 Thread Paul Baclace (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]

Paul Baclace updated NUTCH-116:
---

Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch

 TestNDFS a JUnit test specifically for NDFS
 ---

  Key: NUTCH-116
  URL: http://issues.apache.org/jira/browse/NUTCH-116
  Project: Nutch
 Type: Test
   Components: fetcher, indexer, searcher
 Versions: 0.8-dev
 Reporter: Paul Baclace
  Attachments: TestNDFS.java, TestNDFS.java, 
 comments_msgs_and_local_renames_during_TestNDFS.patch, 
 required_by_TestNDFS.patch, required_by_TestNDFS_v2.patch, 
 required_by_TestNDFS_v3.patch

 TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more 
 strictly, pseudo distributed) meaning all daemons run in one process and 
 sockets are used to communicate between daemons.  
 The test permutes various block sizes, number of files, file sizes, and 
 number of datanodes.  After creating 1 or more files and filling them with 
 random data, one datanode is shut down, and then the files are verified. 
 Next, all the random test files are deleted and we test for leakage 
 (non-deletion) by directly checking the real directories corresponding to the 
 datanodes still running.




Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting

Rod Taylor wrote:

There is only a single datanode and there are 20 hosts.


That's a lot of load on one datanode.  I typically run a datanode on 
every host, accessing the local drives on that host.


Doug


[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-04 Thread Doug Cutting (JIRA)
protocol-httpclient does not follow redirects when fetching robots.txt
--

 Key: NUTCH-124
 URL: http://issues.apache.org/jira/browse/NUTCH-124
 Project: Nutch
Type: Bug
  Components: fetcher  
Versions: 0.8-dev, 0.7.2-dev
Reporter: Doug Cutting


If a site's robots.txt redirects, protocol-httpclient does not correctly fetch 
the robots.txt and effectively ignores it for the site.  See 
http://www.webmasterworld.com/forum11/3008.htm.
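
(Not Nutch's code, but a minimal JDK-only illustration of the desired 
behavior: follow the 3xx on /robots.txt and read the rules from the redirect 
target instead of treating the file as absent. The host below is a 
placeholder.)

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsRedirectSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/robots.txt"); // placeholder host
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The behavior the bug asks for: redirects are followed.
        conn.setInstanceFollowRedirects(true);
        if (conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                for (String line; (line = in.readLine()) != null; ) {
                    System.out.println(line); // the rules actually in force
                }
            }
        }
        // Note: a 404 after redirects conventionally means "no restrictions";
        // silently dropping a redirect must not be conflated with that case.
    }
}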




Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting

Rod Taylor wrote:

I tried running one datanode per machine connecting back to the same SAN
but it seemed pretty clunky.  A crash of any datanode would take down
the entire system (no data replication since it's a common data-store in
the end). Reducing it to a single datanode did not have this impact.


Why use NDFS at all?  Why not just mount the SAN on all hosts?  You're 
not using NDFS as a distributed file system, but rather as a centralized 
file system.


Doug


Re: mapred bug -- bad part calculation?

2005-11-04 Thread Rod Taylor
On Fri, 2005-11-04 at 19:43 -0800, Doug Cutting wrote:
 Rod Taylor wrote:
  I tried running one datanode per machine connecting back to the same SAN
  but it seemed pretty clunky.  A crash of any datanode would take down
  the entire system (no data replication since it's a common data-store in
  the end). Reducing it to a single datanode did not have this impact.
 
 Why use NDFS at all?  Why not just mount the SAN on all hosts?  You're 
 not using NDFS as a distributed file system, but rather as a centralized 
 file system.

I was unable to make the mapred branch work by using 'local' as the
filesystem and having more than one tasktracker. Tasktrackers were
unable to complete any work, although it was quite a while ago when I
last tried (September).

-- 
Rod Taylor [EMAIL PROTECTED]



Re: mapred bug -- bad part calculation?

2005-11-04 Thread Rod Taylor
On Fri, 2005-11-04 at 22:57 -0500, Rod Taylor wrote:
 On Fri, 2005-11-04 at 19:43 -0800, Doug Cutting wrote:
  Rod Taylor wrote:
   I tried running one datanode per machine connecting back to the same SAN
   but it seemed pretty clunky.  A crash of any datanode would take down
   the entire system (no data replication since it's a common data-store in
   the end). Reducing it to a single datanode did not have this impact.
  
  Why use NDFS at all?  Why not just mount the SAN on all hosts?  You're 
  not using NDFS as a distributed file system, but rather as a centralized 
  file system.
 
 I was unable to make the mapred branch work by using 'local' as the
 filesystem and having more than one tasktracker. Tasktrackers were
 unable to complete any work, although it was quite a while ago when I
 last tried (September).

Here you go: local filesystem and a single job tracker on another
machine. When the tasktracker and jobtracker are on the same box there
isn't a problem. When they are on different machines it runs into
issues.

This is using mapred.local.dir on the local machine (not shared between
sbider4 and sbider5):

051104 230802 parsing
file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 230802 parsing
file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 230802
parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 230802 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException:
Premature end of file.
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
at
org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
at
org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
at
org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.localizeTask(TaskTracker.java:332)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.<init>(TaskTracker.java:314)
at
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
at
org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
at
org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown
Source)
at
org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at
javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
... 8 more
051104 230802 Lost connection to JobTracker
[sbider5.sitebuildit.com/192.168.100.14:5464].  Retrying...

This is using a shared mapred.local.dir on the SAN:

051104 232115 parsing
file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 232115 parsing
file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 232115
parsing 
/opt/sitesell/sbider_data/test/local/taskTracker/task_m_l86ntl/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 232116 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException:
Premature end of file.
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
at
org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
at
org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
at
org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.localizeTask(TaskTracker.java:332)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.<init>(TaskTracker.java:314)
at
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
at
org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
at
org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown
Source)
at
org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at
javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
... 8 more

Re: mapred bug -- bad part calculation?

2005-11-04 Thread Doug Cutting

Rod Taylor wrote:

Here you go. local filesystem and a single job tracker on another
machine. When the tasktracker and jobtracker are on the same box there
isn't a problem. When they are on different machines it runs into
issues.

This is using mapred.local.dir on the local machine (not shared between
sbider4 and sbider5):



parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
[Fatal Error] :-1:-1: Premature end of file.


What is mapred.system.dir?  That must be shared.  Also, filenames you 
pass to commands must be pathnames that work on all hosts.


Doug
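
(A hedged sketch of the split Doug describes, echoing the paths that appear 
in the logs above: the job staging area lives on the shared SAN mount, while 
scratch space stays per host.)

<property>
  <name>mapred.system.dir</name>
  <value>/opt/sitesell/sbider_data/test/system</value>
  <!-- must be visible to the jobtracker and every tasktracker -->
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/home/sitesell/local</value>
  <!-- per-host scratch space; need not be shared -->
</property>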


Re: mapred bug -- bad part calculation?

2005-11-04 Thread Rod Taylor
On Fri, 2005-11-04 at 20:41 -0800, Doug Cutting wrote:
 Rod Taylor wrote:
  Here you go. local filesystem and a single job tracker on another
  machine. When the tasktracker and jobtracker are on the same box there
  isn't a problem. When they are on different machines it runs into
  issues.
  
  This is using mapred.local.dir on the local machine (not shared between
  sbider4 and sbider5):
 
  parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
  [Fatal Error] :-1:-1: Premature end of file.
 
 What is mapred.system.dir?  That must be shared.  Also, filenames you 
 pass to commands must be pathnames that work on all hosts.

I had the rest, but failed to override mapred.system.dir (the description
says local directory, which isn't really true if it must be shared).

That worked through the map but failed at the reduce. Both the remote
task tracker and the task tracker on the same physical machine as the
job tracker failed.

Both had similar errors logged:

051104 235758 task_m_r2dcvc
0.6336343% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415
+1758257
051104 235758 Server connection on port 45644 from 192.168.100.13:
exiting
051104 235759 task_m_r2dcvc
0.7225661% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415
+1758257
051104 235800 task_m_r2dcvc
0.8255505% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415
+1758257
051104 235801 task_m_r2dcvc
0.9183419% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415
+1758257
051104 235802 task_m_r2dcvc
1.0% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235802 Task task_m_r2dcvc is done.
051104 235802 Server connection on port 45644 from 192.168.100.13:
exiting
java.io.FileNotFoundException: 
/opt/sitesell/sbider_data/test/system/submit_fubqfe/job.xml (No such file or 
directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:106)
at org.apache.nutch.fs.LocalFileSystem
$LocalNFSFileInputStream.<init>(LocalFileSystem.java:64)
at
org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:108)
at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:57)
at
org.apache.nutch.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:297)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.localizeTask(TaskTracker.java:328)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.<init>(TaskTracker.java:314)
at
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
at
org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
051104 235806 Lost connection to JobTracker
[sbider5.sitebuildit.com/192.168.100.14:5464].  Retrying...
051104 235811 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 235811 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 235811
parsing /home/sitesell/local/taskTracker/task_r_mdnul7/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 235811 SEVERE error parsing conf file:
org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end
of file.
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
at org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.localizeTask(TaskTracker.java:332)
at org.apache.nutch.mapred.TaskTracker
$TaskInProgress.<init>(TaskTracker.java:314)
at
org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
at
org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown
Source)
at
javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
at
org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
... 8 more
051104 235811 Lost connection to JobTracker
[sbider5.sitebuildit.com/192.168.100.14:5464].  Retrying...

-- 
Rod Taylor [EMAIL PROTECTED]



RE: Halloween Joke at Google

2005-11-04 Thread Fuad Efendi
Andrzej,


I am trying to restore a human-oriented web-site tree using anchor text! As a
sample, a page with the anchor text "Motherboards" links to many pages about
concrete motherboards, etc.; we can group information in many cases.

Anchor text is the true subject of the page, but only within the same domain.
BTW, some pages have <meta name="keywords" content="...">, and Nutch doesn't
handle it.
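
(A hedged, JDK-only sketch of pulling that keywords tag out of raw HTML; a 
real Nutch parser plugin would walk the parsed document instead of using a 
regex, and the attribute order/quoting handled here is deliberately minimal.)

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaKeywords {
    // Matches <meta name="keywords" content="..."> (name-first order only).
    private static final Pattern META_KEYWORDS = Pattern.compile(
        "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
        Pattern.CASE_INSENSITIVE);

    public static String extract(String html) {
        Matcher m = META_KEYWORDS.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<head><meta name=\"keywords\" "
            + "content=\"motherboards, hardware\"></head>";
        System.out.println(extract(html)); // prints: motherboards, hardware
    }
}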

Anyway, that's how the PageRank is _supposed_ to work - it should give a 
higher score to sites that are highly linked, and also it should 
strongly consider the anchor text as an indication of the page's true 
subject ... ;-)