[EMAIL PROTECTED] wrote:
> Hello,
>
> I can not index website with Nutch and Hadoop.
> I spend 15 days to try that nutch 0.8.1 work but with no success.
>
> I use :
> * jdk1.5.0_10 or 1.4.2_13 (i have the same problem with these 2 JDK)
> * Nutch-0.8.1
> * Hadoop-0.4.0 from Nutch-0.8.1
> And I made configuration with http://wiki.apache.org/nutch/NutchHadoopTutorial
> I have ONLY one server.
>
> I start Hadoop-0.4.0 (start-all.sh) with no errors in the logs.
>
> The directories for crawl and url are created.
> I use dfs put command line to put them in NDFS File System.
>
> [nutch-0.8.1]$ bin/hadoop dfs -lsr
> /user/webadm/crawls <dir>
> /user/webadm/urls <dir>
> /user/webadm/urls/url-fr.txt <r 1> 44
>
> And when I crawl with nutch 0.8.1, I have this message error:
>
> [nutch-0.8.1]$ bin/nutch crawl urls/url-fr.txt -dir crawls/crawl-fr -depth 10 -topN 50
> crawl started in: crawls/crawl-fr
> rootUrlDir = urls/url-fr.txt
> threads = 10
> depth = 10
> topN = 50
> Injector: starting
> Injector: crawlDb: crawls/crawl-fr/crawldb
> Injector: urlDir: urls/url-fr.txt
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Input directory /user/webadm/urls/url-fr.txt in localhost:9000 is invalid.
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:274)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:327)
>     at org.apache.nutch.crawl.Injector.inject(Injector.java:138)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:105)
>
.. and that's because "urlDir: urls/url-fr.txt" is not a directory, but a file. The first argument to the crawl command has to be a directory, so give only "urls" as the input directory. Nutch will read all text files inside that directory.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
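
For reference, here is a rough sketch of the corrected invocation for the crawl quoted above. The paths and options are taken from the original command. The variable names (seed_path, url_dir, crawl_cmd) are only illustrative, and the snippet just prints the command rather than running it, so treat it as a sketch and not a tested recipe for your setup.

```shell
# Path that was passed originally: this is a file, not a directory.
seed_path="urls/url-fr.txt"

# Use the directory that contains the seed file as the crawl input.
url_dir=$(dirname "$seed_path")

# Same options as the original command, with the directory as the first argument.
crawl_cmd="bin/nutch crawl $url_dir -dir crawls/crawl-fr -depth 10 -topN 50"

# Print the command; run it from the nutch-0.8.1 directory once it looks right.
echo "$crawl_cmd"
```

With this change the Injector line in the log should report the directory (urls) as urlDir instead of the single file.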
