NUTCH_CRAWLING
Hi,

I used the command below to crawl:

  bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50

and I am getting the following error:

  Dedup: adding indexes in: crawl_NEW1/indexes
  Exception in thread "main" java.io.IOException: Job failed!
          at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
          at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
          at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Can anyone help me resolve this problem? Thank you in advance.
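One way to narrow this down is to rerun just the failing step and then read the full stack trace from the Hadoop log. This is a hedged sketch, assuming a stock Nutch 1.x bin/nutch script where the dedup command maps to org.apache.nutch.indexer.DeleteDuplicates (the class in the trace above):

  # Re-run only the dedup step on the indexes the crawl produced
  # (crawl_NEW1 is the -dir value from the failed crawl command above).
  bin/nutch dedup crawl_NEW1/indexes

  # The underlying cause is usually in the Hadoop log, not on the console:
  tail -n 100 logs/hadoop.log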
Nutch-based Application for Windows - New Release
All,

Version 2.1 has now been released. This version adds the following:

1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build 2009-09-09).
2. Updated Tomcat from 6.0.16 to 6.0.20.
3. Fixed bugs related to running in non-English locales.
4. Fixed a bug in the uninstaller (improved handling of situations where the application is still running while attempting to uninstall).
5. Added support for scheduled crawls*.
6. Added support for automatic startup on reboot (scheduled restart*).

The following page describes the application in detail, and contains links to the download sites:

http://www.whelanlabs.com/content/SearchEngineManager.htm

Enjoy!

Regards,
John

* Scheduled Tasks has been successfully tested on XP. This feature is not yet supported on Vista.
Problems crawling >500K Pages with Hadoop/Nutch
It was my understanding that using Hadoop and HDFS would allow me to crawl millions or even billions of pages with Nutch. I have a 4-node Hadoop cluster with Nutch installed, and I have been having problems when I try to crawl the relatively small amount of 500K pages. The error I have been getting is:

  org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
  failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index
  for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201
  because current leaseholder is trying to recreate file.

My goal is to crawl 1.6M TLDs and then all the links from those TLDs, but this error has been the major stumbling block.

Thanks,

Eric Osgood
Cal Poly - Computer Engineering, Moon Valley Software
eosg...@calpoly.edu, e...@lakemeadonline.com
www.calpoly.edu/~eosgood, www.lakemeadonline.com
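The thread does not record a fix, but one known trigger for AlreadyBeingCreatedException in a reduce phase is a second (speculative or retried) attempt of the same reduce task trying to recreate the segment output file while the first attempt still holds the HDFS lease - note the "_r_..._2" attempt id in the trace. A sketch of the usual workaround, assuming a Hadoop 0.19/0.20-era hadoop-site.xml (the fragment is illustrative, not from the thread):

  <!-- Run only one attempt of each reduce task at a time, so a single
       attempt holds the HDFS lease on each crawl_fetch output file. -->
  <property>
    <name>mapred.reduce.tasks.speculative.execution</name>
    <value>false</value>
  </property>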
RE: http keep alive
I'd like to add: Keep-Alive is not polite. It ties up a dedicated listener on the server side. Establishing a TCP socket (the handshake) takes time; that is why Keep-Alive exists for web servers - to improve the performance of subsequent requests. However, it allocates a dedicated listener for a specific IP port / remote client. What will happen with a classic setting of 150 processes in HTTPD 1.3 if 150 robots try to use the Keep-Alive feature?

http://www.linkedin.com/in/liferay

> protocol-httpclient can support keep-alive. However, I think that it
> won't help you much. Please consider that Fetcher needs to wait some
> time between requests, and in the meantime it will issue requests to
> other sites. This means that if you want to use keep-alive connections
> then the number of open connections will climb up quickly, depending on
> the number of unique sites on your fetchlist, until you run out of
> available sockets. On the other hand, if the number of unique sites is
> small, then most of the time the Fetcher will wait anyway, so the
> benefit from keep-alives (for you as a client) will be small - though
> there will still be some benefit for the server side.
>
> --
> Best regards,
> Andrzej Bialecki <><
> http://www.sigram.com  Contact: info at sigram dot com
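To put numbers on that scenario, an illustrative Apache 1.3-era configuration (the directive names are real; the values are hypothetical and not from the thread):

  # At most 150 concurrent server processes:
  MaxClients 150
  KeepAlive On
  MaxKeepAliveRequests 100
  # Seconds an idle keep-alive connection keeps its process pinned:
  KeepAliveTimeout 15

With 150 polite robots each holding a keep-alive connection and pausing between requests, all 150 processes sit idle but occupied, and ordinary visitors are locked out until a timeout frees a process.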
Re: Recrawling Nutch
Nutch doesn't do a good job of storing or testing the Last-Modified time of the pages it has crawled. I made the following changes, which seem to help a lot:

snowbird:~/src/nutch/trunk> svn diff
Index: src/java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- src/java/org/apache/nutch/fetcher/Fetcher.java	(revision 817382)
+++ src/java/org/apache/nutch/fetcher/Fetcher.java	(working copy)
@@ -21,6 +21,7 @@
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.UnknownHostException;
+import java.text.ParseException;
 import java.util.*;
 import java.util.Map.Entry;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -42,6 +43,7 @@
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.*;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.protocol.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.scoring.ScoringFilters;
@@ -742,6 +744,23 @@
       datum.setStatus(status);
       datum.setFetchTime(System.currentTimeMillis());
+      LOG.debug("metadata = " + (content != null ? content.getMetadata() : "content-null"));
+      LOG.debug("modified? = " + ((content != null && content.getMetadata() != null) ? content.getMetadata().get("Last-Modified") : "content-null"));
+      if (content != null && content.getMetadata() != null && content.getMetadata().get("Last-Modified") != null)
+      {
+        String lastModifiedStr = content.getMetadata().get("Last-Modified");
+
+        try
+        {
+          long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr);
+          LOG.debug("last modified = " + lastModifiedStr + " = " + lastModifiedDate);
+          datum.setModifiedTime(lastModifiedDate);
+        }
+        catch (ParseException e)
+        {
+          LOG.error("unable to parse " + lastModifiedStr, e);
+        }
+      }
       if (pstatus != null) datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);
 
       ParseResult parseResult = null;
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(revision 817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java	(working copy)
@@ -84,8 +84,10 @@
       if (CrawlDatum.hasDbStatus(datum))
         dbDatum = datum;
       else if (CrawlDatum.hasFetchStatus(datum)) {
-        // don't index unmodified (empty) pages
-        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+        /*
+         * Where did this person get the idea that unmodified pages are empty?
+        // don't index unmodified (empty) pages
+        if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
           fetchDatum = datum;
       } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() ||
                  CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
     }
 
     if (!parseData.getStatus().isSuccess() ||
-        fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+        (fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
       return;
     }
 
Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(revision 817382)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java	(working copy)
@@ -124,11 +124,14 @@
       reqStr.append("\r\n");
     }
 
-    reqStr.append("\r\n");
     if (datum.getModifiedTime() > 0) {
-      reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
+      String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
+      Http.LOG.debug("modified time: " + httpDate);
+      reqStr.append("If-Modified-Since: " + httpDate);
       reqStr.append("\r\n");
     }
+    reqStr.append("\r\n");
 
     byte[] reqBytes = reqStr.toString().getBytes();

On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN wrote:
>
> "We are looking at picking up updates in a recrawl - how do I get the
> fetcher to read the recently built segment, get to the URL, and decide
> whether to get the content based on whether the URL has been updated
> since?"
>
> Shreekanth Prabhu

--
http://www.linkedin.com/in/paultomblin
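For anyone trying the patch, here is a minimal standalone sketch of the If-Modified-Since round trip it enables, using only the two HttpDateFormat calls that appear in the diff; the date value and the demo flow are illustrative, not from the thread:

  import java.text.ParseException;
  import org.apache.nutch.net.protocols.HttpDateFormat;

  public class IfModifiedSinceSketch {
    public static void main(String[] args) throws ParseException {
      // First fetch: the server answers 200 OK with a Last-Modified header.
      // (Example value; the patched Fetcher reads it from content metadata
      // and stores it via datum.setModifiedTime(...).)
      String lastModified = "Tue, 13 Oct 2009 16:16:41 GMT";
      long modifiedTime = HttpDateFormat.toLong(lastModified);

      // Recrawl: the patched HttpResponse replays the stored time as an
      // If-Modified-Since request header.
      StringBuffer reqStr = new StringBuffer("GET / HTTP/1.0\r\n");
      if (modifiedTime > 0) {
        reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(modifiedTime));
        reqStr.append("\r\n");
      }
      reqStr.append("\r\n");
      System.out.print(reqStr);
      // An unchanged page now answers "304 Not Modified", which the
      // IndexerMapReduce change above keeps indexable instead of dropping.
    }
  }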
Recrawling Nutch
"We are looking at picking up updates in a recrawl - How do I get the the fetcher to read the recently built segment, get to the url and decide whether to get the content based on whether the url has been updated since? " Shreekanth Prabhu -- View this message in context: http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html Sent from the Nutch - User mailing list archive at Nabble.com.
Re: http keep alive
Marko Bauhardt wrote:

> Hi. Is there a way to use HTTP keep-alive with Nutch? Do protocol-http
> or protocol-httpclient support keep-alive? I can't find any use of
> keep-alive in the code or in the configuration files.

protocol-httpclient can support keep-alive. However, I think that it won't help you much. Please consider that Fetcher needs to wait some time between requests, and in the meantime it will issue requests to other sites. This means that if you want to use keep-alive connections then the number of open connections will climb up quickly, depending on the number of unique sites on your fetchlist, until you run out of available sockets. On the other hand, if the number of unique sites is small, then most of the time the Fetcher will wait anyway, so the benefit from keep-alives (for you as a client) will be small - though there will still be some benefit for the server side.

--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com  Contact: info at sigram dot com
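To make the socket arithmetic concrete: commons-httpclient 3.x, the library behind the protocol-httpclient plugin, keeps one pooled connection open per host it has recently fetched from, so with keep-alive the pool grows with the number of unique sites on the fetchlist. A minimal sketch of that behavior; the class name and pool limits are hypothetical, not Nutch code:

  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
  import org.apache.commons.httpclient.methods.GetMethod;
  import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

  public class KeepAliveSketch {
    public static void main(String[] args) throws Exception {
      MultiThreadedHttpConnectionManager manager =
          new MultiThreadedHttpConnectionManager();
      HttpConnectionManagerParams params = new HttpConnectionManagerParams();
      params.setMaxTotalConnections(100);        // hypothetical overall cap
      params.setDefaultMaxConnectionsPerHost(1); // politeness: one per site
      manager.setParams(params);

      HttpClient client = new HttpClient(manager);
      for (String url : args) {
        GetMethod get = new GetMethod(url);
        try {
          client.executeMethod(get);  // HTTP/1.1 keep-alive is the default
          get.getResponseBodyAsString();
        } finally {
          // Returns the connection to the pool without closing the socket;
          // the TCP connection stays open for reuse against the same host.
          get.releaseConnection();
        }
      }
      manager.shutdown();
    }
  }

With many distinct hosts, each pooled socket sits open between polite delays, which is exactly how the open-connection count climbs toward the socket limit.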
http keep alive
Hi. Is there a way to use HTTP keep-alive with Nutch? Do protocol-http or protocol-httpclient support keep-alive? I can't find any use of keep-alive in the code or in the configuration files.

Thanks,
Marko