We have been trying to develop a solution that counts the number of pages for each domain.
We thought to do it like this:

- map: input key UTF8 (page URL), value CrawlDatum; output key UTF8 (page domain), value UrlAndPage, a Writable we implemented that holds the page URL and its CrawlDatum.
- reduce: input key UTF8 (page domain), value an iterator over UrlAndPage; output key UTF8 (page URL), value CrawlDatum.
- In map we parsed the domain out of the URL, built a UrlAndPage, and put it into the OutputCollector.
- In reduce we counted how many elements the iterator yields, stored that count in each CrawlDatum, and then emitted new (URL, CrawlDatum) pairs through the OutputCollector.

The following problem arose: as far as we can tell, the input and output types of map and reduce have to be the same, but in our case they differed, and we got errors like this:

060505 183200 task_0104_m_000000_3 java.lang.RuntimeException: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3   at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:366)
060505 183200 task_0104_m_000000_3   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
060505 183200 task_0104_m_000000_3   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:129)
060505 183200 task_0104_m_000000_3   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:755)
060505 183200 task_0104_m_000000_3 Caused by: java.lang.InstantiationException: org.apache.nutch.crawl.PostUpdateFilter$UrlAndPage
060505 183200 task_0104_m_000000_3   at java.lang.Class.newInstance0(Class.java:335)
060505 183200 task_0104_m_000000_3   at java.lang.Class.newInstance(Class.java:303)
060505 183200 task_0104_m_000000_3   at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:364)

We concluded that it is impossible in Hadoop to have different input/output types for map and reduce, so we decided to use another scheme, which runs two jobs.
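As a side note on the trace itself: java.lang.InstantiationException is what Class.newInstance() throws when a class cannot be constructed reflectively via a no-argument constructor, and the name PostUpdateFilter$UrlAndPage indicates UrlAndPage is a nested class, which must be declared static for such reflective construction to work. A minimal plain-Java demo of that failure mode (class names here are illustrative, not from Nutch):

```java
// Plain-Java sketch of the failure mode behind the trace above: a non-static
// inner class has no no-arg constructor (its implicit constructor takes the
// enclosing instance), so Class.newInstance() throws InstantiationException,
// while a static nested class with a usable no-arg constructor works.
public class Demo {
    class Inner {}          // non-static inner class: implicit ctor takes a Demo
    static class Nested {}  // static nested class: has a usable no-arg ctor

    static boolean canInstantiate(Class<?> c) {
        try {
            c.newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            return false;   // InstantiationException or IllegalAccessException
        }
    }

    public static void main(String[] args) {
        System.out.println("Inner:  " + canInstantiate(Inner.class));   // false
        System.out.println("Nested: " + canInstantiate(Nested.class));  // true
    }
}
```

So if UrlAndPage is declared as an inner class of PostUpdateFilter, making it a static nested class with a public no-argument constructor may let Hadoop instantiate it reflectively, independently of the type-mismatch question.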
The first job has the map function; the second job has the reduce task. The two jobs use different classes for their input and output parameters, and the new map and reduce do the same work as described above.

We would like to ask your advice on the best way to handle tasks like these. Is the second approach good? Are there other, better ways to do this?

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
