We tried to develop a solution that counts the number of pages from each
domain.

We planned to do it as follows:

.map - input: k = UTF8 (URL of a page), v = CrawlDatum; output: k = UTF8
(domain of the page), v = UrlAndPage, a structure implementing Writable
that holds the URL of the page and its CrawlDatum

.reduce - input: k = UTF8 (domain of a page), v = an iterator over the list
of UrlAndPage values; output: k = UTF8 (URL of a page), v = CrawlDatum

.in the map function we parsed the domain out of the URL, created a
UrlAndPage structure, and passed the pair to the OutputCollector

.in reduce we counted how many elements the iterator yields, stored that
count in each CrawlDatum, and then passed the new (URL, CrawlDatum) pairs
to the OutputCollector (a sketch follows this list)
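
In code, the two functions look roughly like this (a minimal sketch against
the non-generic mapred API of that era; UrlAndPage itself is sketched after
the stack trace below, and storing the count via CrawlDatum.setScore is
purely illustrative, not necessarily where it should live):

import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.PostUpdateFilter.UrlAndPage;

// map: (url, CrawlDatum) -> (domain, UrlAndPage)
public class DomainMapper implements Mapper {
  public void configure(JobConf job) {}
  public void close() {}

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter)
      throws IOException {
    // parse the domain out of the URL
    String domain = new URL(key.toString()).getHost();
    output.collect(new UTF8(domain),
                   new UrlAndPage((UTF8) key, (CrawlDatum) value));
  }
}

// reduce: (domain, [UrlAndPage...]) -> (url, CrawlDatum)
public class DomainReducer implements Reducer {
  public void configure(JobConf job) {}
  public void close() {}

  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter)
      throws IOException {
    // The iterator can only be walked once, and the framework may reuse
    // the object it hands out, so buffer deep copies to get the total
    // count before emitting anything.
    List pages = new ArrayList();
    while (values.hasNext()) {
      pages.add(copy((UrlAndPage) values.next()));
    }
    for (Iterator it = pages.iterator(); it.hasNext();) {
      UrlAndPage p = (UrlAndPage) it.next();
      CrawlDatum datum = p.getDatum();
      datum.setScore(pages.size());  // illustrative place to store the count
      output.collect(p.getUrl(), datum);
    }
  }

  // deep copy through Writable serialization
  private static UrlAndPage copy(UrlAndPage p) throws IOException {
    DataOutputBuffer out = new DataOutputBuffer();
    p.write(out);
    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    UrlAndPage fresh = new UrlAndPage();
    fresh.readFields(in);
    return fresh;
  }
}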
 

The following problem arose: as far as we can see, the input and output
types of map and reduce have to be the same, but in our case they differ,
and this caused an error like this:

060505 183200 task_0104_m_000000_3 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060505 183200 task_0104_m_000000_3 Caused by: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance0(Class.java:335)
060505 183200 task_0104_m_000000_3      at java.lang.Class.newInstance(Class.java:303)
060505 183200 task_0104_m_000000_3      at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)

 

We took this to mean that in Hadoop it is impossible for map and reduce to
have different input/output types, so we moved to another scheme that runs
two jobs: the first job carries the map function and the second the reduce
task, and the two jobs declare different classes for their input and output
parameters. The new map and reduce do the same work as described above; a
rough sketch of the driver follows.
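
The driver for the two-job scheme might look like this (a sketch against
the classic JobConf API; the paths are placeholders, the class names follow
the sketches above, and we assume the Hadoop version at hand offers
setMapOutputKeyClass/setMapOutputValueClass to declare the second job's
intermediate types):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.UTF8;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.PostUpdateFilter;
import org.apache.nutch.crawl.PostUpdateFilter.UrlAndPage;

public class DomainCountDriver {
  public static void main(String[] args) throws Exception {
    // Job 1: map (url, CrawlDatum) -> (domain, UrlAndPage); the reduce
    // side is the identity, so the declared output classes are exactly
    // the map output classes and no type conflict arises.
    JobConf job1 = new JobConf(PostUpdateFilter.class);
    job1.setInputPath(new Path("crawldb/current"));       // placeholder
    job1.setOutputPath(new Path("tmp/pages-by-domain"));  // intermediate data
    job1.setMapperClass(DomainMapper.class);
    job1.setReducerClass(IdentityReducer.class);
    job1.setOutputFormat(SequenceFileOutputFormat.class);
    job1.setOutputKeyClass(UTF8.class);
    job1.setOutputValueClass(UrlAndPage.class);
    JobClient.runJob(job1);

    // Job 2: identity map, then reduce (domain, UrlAndPage*) -> (url, CrawlDatum).
    JobConf job2 = new JobConf(PostUpdateFilter.class);
    job2.setInputPath(new Path("tmp/pages-by-domain"));
    job2.setOutputPath(new Path("crawldb/with-counts"));  // placeholder
    job2.setInputFormat(SequenceFileInputFormat.class);
    job2.setMapperClass(IdentityMapper.class);
    job2.setReducerClass(DomainReducer.class);
    // the intermediate (map-output) types differ from the final output types
    job2.setMapOutputKeyClass(UTF8.class);
    job2.setMapOutputValueClass(UrlAndPage.class);
    job2.setOutputKeyClass(UTF8.class);
    job2.setOutputValueClass(CrawlDatum.class);
    JobClient.runJob(job2);
  }
}

(If setMapOutputKeyClass/setMapOutputValueClass really are available in the
version we run, perhaps even a single job could declare its intermediate
types this way?)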

 

 

We'd like to ask your advice on the best way to handle tasks like these. Is
the second scheme a good approach? Are there other, better ways to do this?





