Hiring a Nutch Developer
We're looking for a Nutch developer we can hire to build a Nutch search engine for our sites. Are any of you doing side projects? Nathan Gwilliam Adoption.com Families.com [EMAIL PROTECTED]
Re: Hiring a Nutch Developer
Hi Nathan, Please send me more details. Nathan Gwilliam [EMAIL PROTECTED] wrote: We're looking for a Nutch developer we can hire to build a Nutch search engine for our sites. [...] WITH WARM REGARDS, ARUN K. SHARMA (Sr. Java Developer) Mob: +919815295761 (W): 0172-5079323(ext)21
Re: Hiring a Nutch Developer
I actually have several projects, but let's start with the first. We need to create a search engine that crawls about 20 adoption-related sites that we are affiliated with, such as:

adoption.com
fosterparenting.com
crisispregnancy.com
adoption.org
adopting.org
123adoption.com (which includes a bunch of 5-page URLs in its network)
parentprofiles.com
adoptioninformation.com
adoptionshop.com
specialneeds.net (about to launch)
infertilitycentral.com
fertilityforums.com

Then, we need to implement a combined site search for all of these sites, and then have the ability for each of the sites in the group to have a site search that only searches the subset of pages/sites that we indicate from the larger database. In other words, we need a search on adoptionshop.com that only searches the products from adoptionshop.com. We want to be able to give preference to pages based on title, URL, keyword density, etc. We will provide the server hardware, the graphical templates and the URLs. You would get the site search crawled, indexed and working. What would you charge us for something like this? Please include a couple of hours in your bid to train our developers on what you have done. Thanks, Nathan

Arun Kumar Sharma wrote: Hi Nathan, Please send me more details. [...]
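A per-site search over the single combined index can be done by pairing the visitor's query with a required clause on an indexed site field. Below is a minimal sketch against the Lucene 1.4-era API that Nutch builds on; the index path and the presence of url/site/content fields in the index are assumptions here, not something this thread confirms:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SiteRestrictedSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the combined index built from all ~20 sites.
        IndexSearcher searcher = new IndexSearcher("crawl/index");

        // The visitor's query, parsed against the page-content field.
        Query userQuery = QueryParser.parse("adoption books", "content",
                new StandardAnalyzer());

        // Require a match on the site field, so only adoptionshop.com pages return.
        BooleanQuery perSite = new BooleanQuery();
        perSite.add(userQuery, true, false);  // required clause
        perSite.add(new TermQuery(new Term("site", "adoptionshop.com")),
                true, false);                 // required clause

        Hits hits = searcher.search(perSite);
        for (int i = 0; i < Math.min(10, hits.length()); i++) {
            System.out.println(hits.doc(i).get("url"));
        }
        searcher.close();
    }
}

Preferring pages by title or URL matches would then be a matter of field boosts at index or query time, not a change to this filtering.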
nutch cluster questions.
At the moment we are using nutch-nightly (nutch-2005-07-20). We are not pleased with the throughput of fetching, parsing, indexing, analyzing and scoring. Our spider currently retrieves approximately 25,000 new results per day. All processes run on one machine and we use the local file system. We suppose that if we want to raise throughput we need to use a cluster.
1) Are there any ready-made solutions for running Nutch on a cluster?
2) Please tell us whether anyone has experience clustering Nutch, what throughput was achieved, and how many machines were used.
3) Which tasks can we divide across different machines and which can we not? And how must those tasks be synchronized?
4) Will the spider's speed increase if we use NutchDistributedFileSystem? What are its advantages and disadvantages?
5) We were advised to use the nutch mapred branch. Should we use it?
Re: nutch cluster questions.
Please do not cross-post questions! Check out the map reduce branch in svn. Map reduce will do everything you are looking for, and it works well for me. Stefan

On 04.11.2005 at 14:32, Arsen Popovyan wrote: At the moment we are using nutch-nightly (nutch-2005-07-20). We are not pleased with the throughput of fetching, parsing, indexing, analyzing and scoring. [...]
[jira] Created: (NUTCH-123) Cache.jsp sometimes generates NullPointerException
Cache.jsp sometimes generates NullPointerException -- Key: NUTCH-123 URL: http://issues.apache.org/jira/browse/NUTCH-123 Project: Nutch Type: Bug Components: web gui Environment: All systems Reporter: Lutischán Ferenc Priority: Critical

There is a problem with the following line in cached.jsp:

String contentType = (String) metaData.get("Content-Type");

In the segment data the header key is not always exactly Content-Type; it may be content-type or Content-type, etc. The solution is to insert these lines above the line in question:

String content = null;
for (Enumeration eNum = metaData.propertyNames(); eNum.hasMoreElements();) {
    content = (String) eNum.nextElement();
    if ("content-type".equalsIgnoreCase(content)) {
        break;
    }
}
final String contentType = (String) metaData.get(content);

Regards, Ferenc
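For reference, the same case-insensitive lookup as a self-contained Java sketch; the class and method names here are hypothetical, not taken from cached.jsp:

import java.util.Enumeration;
import java.util.Properties;

public class HeaderLookup {
    /** Scans all keys, since servers send Content-Type, content-type, Content-type, etc. */
    public static String getIgnoreCase(Properties metaData, String name) {
        for (Enumeration e = metaData.propertyNames(); e.hasMoreElements();) {
            String key = (String) e.nextElement();
            if (name.equalsIgnoreCase(key)) {
                return metaData.getProperty(key);
            }
        }
        return null; // header absent; the caller must handle this instead of hitting an NPE
    }

    public static void main(String[] args) {
        Properties metaData = new Properties();
        metaData.setProperty("content-type", "text/html");
        System.out.println(getIgnoreCase(metaData, "Content-Type")); // prints text/html
    }
}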
Re: mapred bug -- bad part calculation?
Rod Taylor wrote: Every segment that I fetch seems to be missing a part when stored on the filesystem. The stranger thing is it is always the same part (very reproducible). This sounds strange. Are the datanode errors always on the same host? How many hosts are you running this on? Doug
Re: mapred questions
Ken van Mulder wrote: First is that the fetcher slows down over time and continues to use more and more memory as it goes (which I think is eventually hanging the process). What parser plugins do you have enabled? These are usually the culprit. Try using 'kill -QUIT' to see what various threads are doing, both at the start and later, when it slows and grows. Second problem is trying to use the crawl. I've tried with a seeds/url file containing 4, 2000 and then 100k urls. Using: $ bin/nutch crawl seeds Which goes through its processing and completes, but doesn't visit any of the urls in the seeds file. What am I missing to get it to actually do the crawl? Are you using NDFS? If so, the seeds directory needs to be stored in NDFS. Use 'bin/nutch ndfs -put seeds seeds'. Doug
[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: required_by_TestNDFS_v3.patch I found and fixed a problem with a standalone DataNode process exiting too early (this was not detected by the current unit tests); this was caused by changes in the required_by_TestNDFS patch; main() will now join() all the subthreads via runAndWait(NutchConf), and run(NutchConf) can be used to start subthreads without waiting for them to finish. The v3 patch has the cumulative required_by_TestNDFS changes. (comments_msgs_and_local_renames_during_TestNDFS.patch is still separate.) TestNDFS a JUnit test specifically for NDFS --- Key: NUTCH-116 URL: http://issues.apache.org/jira/browse/NUTCH-116 Project: Nutch Type: Test Components: fetcher, indexer, searcher Versions: 0.8-dev Reporter: Paul Baclace Attachments: TestNDFS.java, TestNDFS.java, required_by_TestNDFS.patch, required_by_TestNDFS_v2.patch, required_by_TestNDFS_v3.patch TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more strictly, pseudo distributed), meaning all daemons run in one process and sockets are used to communicate between daemons. The test permutes various block sizes, numbers of files, file sizes, and numbers of datanodes. After creating one or more files and filling them with random data, one datanode is shut down, and then the files are verified. Next, all the random test files are deleted and we test for leakage (non-deletion) by directly checking the real directories corresponding to the datanodes still running.
Re: mapred bug -- bad part calculation?
On Fri, 2005-11-04 at 13:43 -0800, Doug Cutting wrote: Rod Taylor wrote: Every segment that I fetch seems to be missing a part when stored on the filesystem. The stranger thing is it is always the same part (very reproducible). This sounds strange. Are the datanode errors always on the same host? How many hosts are you running this on? There is only a single datanode and there are 20 hosts. -- Rod Taylor [EMAIL PROTECTED]
Re: mapred bug -- bad part calculation?
On Fri, 2005-11-04 at 13:43 -0800, Doug Cutting wrote: Rod Taylor wrote: Every segment that I fetch seems to be missing a part when stored on the filesystem. The stranger thing is it is always the same part (very reproducible). This sounds strange. Are the datanode errors always on the same host? How many hosts are you running this on?

I lied earlier. It still happens with smaller segments, just not as frequently. Found this in the namenode log file:

051104 200412 Server connection on port 5466 from 192.168.100.11: exiting
051104 200438 Server connection on port 5466 from 192.168.100.11: starting
051104 200438 Cannot start file because pendingCreates is non-null
051104 200438 Server handler on 5466 call error: java.io.IOException: Cannot create file /opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
java.io.IOException: Cannot create file /opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
    at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
    at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051104 200440 Server connection on port 5466 from 192.168.100.11: exiting
051104 200504 Server connection on port 5466 from 192.168.100.11: starting
051104 200504 Cannot start file because pendingCreates is non-null
051104 200504 Server handler on 5466 call error: java.io.IOException: Cannot create file /opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
java.io.IOException: Cannot create file /opt/sitesell/sbider_data/nutch/segments/20051104185259/20051104185300/crawl_fetch/part-00011/data
    at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at org.apache.nutch.ipc.RPC$1.call(RPC.java:186)
    at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051104 200505 Server connection on port 5466 from 192.168.100.11: exiting
051104 200506 Removing lease [Lease. Holder: NDFSClient_1755346663, heldlocks: 0, pendingcreates: 0], leases remaining: 1
051104 200529 Server connection on port 5466 from 192.168.100.11: starting
051104 201807 Server connection on port 5466 from 192.168.100.11: exiting
051104 201812 Server connection on port 5466 from 192.168.100.15: exiting
051104 201823 Server connection on port 5466 from 192.168.100.15: starting

-- Rod Taylor [EMAIL PROTECTED]
[jira] Updated: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS
[ http://issues.apache.org/jira/browse/NUTCH-116?page=all ] Paul Baclace updated NUTCH-116: --- Attachment: comments_msgs_and_local_renames_during_TestNDFS.patch TestNDFS a JUnit test specifically for NDFS --- Key: NUTCH-116 URL: http://issues.apache.org/jira/browse/NUTCH-116 Project: Nutch Type: Test Components: fetcher, indexer, searcher Versions: 0.8-dev Reporter: Paul Baclace Attachments: TestNDFS.java, TestNDFS.java, comments_msgs_and_local_renames_during_TestNDFS.patch, required_by_TestNDFS.patch, required_by_TestNDFS_v2.patch, required_by_TestNDFS_v3.patch TestNDFS is a JUnit test for NDFS using pseudo multiprocessing (or more strictly, pseudo distributed), meaning all daemons run in one process and sockets are used to communicate between daemons. The test permutes various block sizes, numbers of files, file sizes, and numbers of datanodes. After creating one or more files and filling them with random data, one datanode is shut down, and then the files are verified. Next, all the random test files are deleted and we test for leakage (non-deletion) by directly checking the real directories corresponding to the datanodes still running.
Re: mapred bug -- bad part calculation?
Rod Taylor wrote: There is only a single datanode and there are 20 hosts. That's a lot of load on one datanode. I typically run a datanode on every host, accessing the local drives on that host. Doug
[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt
protocol-httpclient does not follow redirects when fetching robots.txt -- Key: NUTCH-124 URL: http://issues.apache.org/jira/browse/NUTCH-124 Project: Nutch Type: Bug Components: fetcher Versions: 0.8-dev, 0.7.2-dev Reporter: Doug Cutting If a site's robots.txt redirects, protocol-httpclient does not correctly fetch the robots.txt and effectively ignores it for the site. See http://www.webmasterworld.com/forum11/3008.htm.
Re: mapred bug -- bad part calculation?
Rod Taylor wrote: I tried running one datanode per machine connecting back to the same SAN but it seemed pretty clunky. A crash of any datanode would take down the entire system (no data replication since it's a common data-store in the end). Reducing it to a single datanode did not have this impact. Why use NDFS at all? Why not just mount the SAN on all hosts? You're not using NDFS as a distributed file system, but rather as a centralized file system. Doug
Re: mapred bug -- bad part calculation?
On Fri, 2005-11-04 at 19:43 -0800, Doug Cutting wrote: Rod Taylor wrote: I tried running one datanode per machine connecting back to the same SAN but it seemed pretty clunky. A crash of any datanode would take down the entire system (no data replication since it's a common data-store in the end). Reducing it to a single datanode did not have this impact. Why use NDFS at all? Why not just mount the SAN on all hosts? You're not using NDFS as a distributed file system, but rather as a centralized file system. I was unable to make the mapred branch work by using 'local' as the filesystem and having more than one tasktracker. Tasktrackers were unable to complete any work, although it was quite a while ago when I last tried (September). -- Rod Taylor [EMAIL PROTECTED]
Re: mapred bug -- bad part calculation?
On Fri, 2005-11-04 at 22:57 -0500, Rod Taylor wrote: On Fri, 2005-11-04 at 19:43 -0800, Doug Cutting wrote: Rod Taylor wrote: I tried running one datanode per machine connecting back to the same SAN but it seemed pretty clunky. A crash of any datanode would take down the entire system (no data replication since it's a common data-store in the end). Reducing it to a single datanode did not have this impact. Why use NDFS at all? Why not just mount the SAN on all hosts? You're not using NDFS as a distributed file system, but rather as a centralized file system. I was unable to make the mapred branch work by using 'local' as the filesystem and having more than one tasktracker. Tasktrackers were unable to complete any work, although it was quite a while ago when I last tried (September).

Here you go: local filesystem and a single job tracker on another machine. When the tasktracker and jobtracker are on the same box there isn't a problem; when they are on different machines it runs into issues.

This is using mapred.local.dir on the local machine (not shared between sbider4 and sbider5):

051104 230802 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 230802 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 230802 parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 230802 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
    at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
    at org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
    at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:332)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:314)
    at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
    at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
    at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
    ... 8 more
051104 230802 Lost connection to JobTracker [sbider5.sitebuildit.com/192.168.100.14:5464]. Retrying...

This is using a shared mapred.local.dir on the SAN:

051104 232115 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 232115 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 232115 parsing /opt/sitesell/sbider_data/test/local/taskTracker/task_m_l86ntl/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 232116 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
    at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
    at org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
    at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:332)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:314)
    at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
    at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
    at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
    ... 8 more
Re: mapred bug -- bad part calculation?
Rod Taylor wrote: Here you go. local filesystem and a single job tracker on another machine. When the tasktracker and jobtracker are on the same box there isn't a problem. When they are on different machines it runs into issues. This is using mapred.local.dir on the local machine (not shared between sbider4 and sbider5): parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml [Fatal Error] :-1:-1: Premature end of file. What is mapred.system.dir? That must be shared. Also, filenames you pass to commands must be pathnames that work on all hosts. Doug
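A minimal sketch of what that override could look like in nutch-site.xml, assuming the 0.8-dev property names used in this thread and the <nutch-conf> root element of that era's config files; both paths are hypothetical:

<nutch-conf>
  <property>
    <name>mapred.system.dir</name>
    <!-- must live on storage the jobtracker and every tasktracker can see -->
    <value>/shared/mapred/system</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- per-host scratch space; need not be shared -->
    <value>/local/mapred</value>
  </property>
</nutch-conf>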
Re: mapred bug -- bad part calculation?
On Fri, 2005-11-04 at 20:41 -0800, Doug Cutting wrote: Rod Taylor wrote: Here you go. local filesystem and a single job tracker on another machine. When the tasktracker and jobtracker are on the same box there isn't a problem. When they are on different machines it runs into issues. This is using mapred.local.dir on the local machine (not shared between sbider4 and sbider5): parsing /home/sitesell/localt/taskTracker/task_m_o59djj/job.xml [Fatal Error] :-1:-1: Premature end of file. What is mapred.system.dir? That must be shared. Also, filenames you pass to commands must be pathnames that work on all hosts.

Had the rest, but failed to override system.dir (its description says local directory, which isn't really true if it must be shared). That worked through the map but failed at the reduce. Both the remote tasktracker and the tasktracker on the same physical machine as the jobtracker failed. Both had similar errors logged:

051104 235758 task_m_r2dcvc 0.6336343% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235758 Server connection on port 45644 from 192.168.100.13: exiting
051104 235759 task_m_r2dcvc 0.7225661% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235800 task_m_r2dcvc 0.8255505% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235801 task_m_r2dcvc 0.9183419% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235802 task_m_r2dcvc 1.0% /opt/sitesell/sbider_data/test/urls/list-oct31:167034415+1758257
051104 235802 Task task_m_r2dcvc is done.
051104 235802 Server connection on port 45644 from 192.168.100.13: exiting
java.io.FileNotFoundException: /opt/sitesell/sbider_data/test/system/submit_fubqfe/job.xml (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.<init>(LocalFileSystem.java:64)
    at org.apache.nutch.fs.LocalFileSystem.openRaw(LocalFileSystem.java:108)
    at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:57)
    at org.apache.nutch.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:297)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:328)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:314)
    at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
    at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
    at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
051104 235806 Lost connection to JobTracker [sbider5.sitebuildit.com/192.168.100.14:5464]. Retrying...
051104 235811 parsing file:/opt/nutch-0.8_7/conf/nutch-default.xml
051104 235811 parsing file:/opt/nutch-0.8_7/conf/mapred-default.xml
051104 235811 parsing /home/sitesell/local/taskTracker/task_r_mdnul7/job.xml
[Fatal Error] :-1:-1: Premature end of file.
051104 235811 SEVERE error parsing conf file: org.xml.sax.SAXParseException: Premature end of file.
java.lang.RuntimeException: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:358)
    at org.apache.nutch.util.NutchConf.getProps(NutchConf.java:293)
    at org.apache.nutch.util.NutchConf.get(NutchConf.java:94)
    at org.apache.nutch.mapred.JobConf.getJar(JobConf.java:81)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.localizeTask(TaskTracker.java:332)
    at org.apache.nutch.mapred.TaskTracker$TaskInProgress.<init>(TaskTracker.java:314)
    at org.apache.nutch.mapred.TaskTracker.offerService(TaskTracker.java:214)
    at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:268)
    at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:633)
Caused by: org.xml.sax.SAXParseException: Premature end of file.
    at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
    at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:172)
    at org.apache.nutch.util.NutchConf.loadResource(NutchConf.java:318)
    ... 8 more
051104 235811 Lost connection to JobTracker [sbider5.sitebuildit.com/192.168.100.14:5464]. Retrying...

-- Rod Taylor [EMAIL PROTECTED]
RE: Halloween Joke at Google
Andrzej, I am trying to restore a human-oriented web-site tree using anchor text! As a sample, a page reached through the anchor text Motherboards links to many pages about specific motherboards, etc.; we can group information this way in many cases. Anchor text is the true subject of the page, at least within the same domain. BTW, some pages have <META name="keywords" content="...">, and Nutch doesn't handle it. Anyway, that's how PageRank is _supposed_ to work - it should give a higher score to sites that are highly linked, and it should also strongly consider the anchor text as an indication of the page's true subject ... ;-)
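A minimal, hypothetical sketch of pulling such keywords out of raw HTML; this is a plain regular expression, not Nutch's parser API, and it assumes the name attribute precedes content (a real parser would handle attribute order and spacing more robustly):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaKeywords {
    // Matches <META name="keywords" content="..."> with either quote style, any case.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=[\"']keywords[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    public static String extract(String html) {
        Matcher m = META.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<html><head><META name=\"keywords\" "
                + "content=\"motherboards, hardware\"></head></html>";
        System.out.println(extract(page)); // prints: motherboards, hardware
    }
}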