NUTCH_CRAWLING

2009-10-14 Thread meh

Hi,

bin/nutch crawl urls -dir crawl_NEW1 -depth 3 -topN 50 

I have used the above command to crawl. 

I am getting the following error. 

Dedup: adding indexes in: crawl_NEW1/indexes 
Exception in thread "main" java.io.IOException: Job failed! 
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604) 
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:135) 


Can anyone help me resolve this problem?

Thank you in advance. 

-- 
View this message in context: 
http://www.nabble.com/NUTCH_CRAWLING-tp25903220p25903220.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Nutch-based Application for Windows - New Release

2009-10-14 Thread John Whelan

All,

Version 2.1 has now been released. This version adds the following:

1. Updated Nutch from 1.0-dev (build 2008-10-28) to 1.1-dev (build 2009-09-09).
2. Updated Tomcat from 6.0.16 to 6.0.20.
3. Fixed bugs related to running in non-English locales.
4. Fixed a bug in the uninstaller. (Improved handling of situations where the
application is still running when an uninstall is attempted.)
5. Added support for scheduled crawls*.
6. Added support for automatic startup on reboot (Scheduled restart*).

The following page describes the application in detail, and contains links
to the download sites:
http://www.whelanlabs.com/content/SearchEngineManager.htm

Enjoy!

Regards,
John

* Scheduled Tasks has been successfully tested on XP. This feature is not yet
supported on Vista.
-- 
View this message in context: 
http://www.nabble.com/Nutch-based-Application-for-Windows---New-Release-tp25902543p25902543.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Problems crawling >500K Pages with Hadoop/Nutch

2009-10-14 Thread Eric Osgood
It was my understanding that using Hadoop and HDFS would allow me to  
crawl millions or even billions of pages with Nutch. I have a 4 node  
hadoop cluster with nutch installed. I have been having problems when  
I try to crawl the relatively small number of 500K pages. The error I
have been getting is:


org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /user/hadoop/crawl/segments/20091013161641/crawl_fetch/part-00015/index for DFSClient_attempt_200910131302_0011_r_15_2 on client 192.168.1.201 because current leaseholder is trying to recreate file.
My goal is to crawl 1.6M TLDs and then all the links from those TLDs, but
this error has been the major stumbling block.
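[Editorial note: a common cause of AlreadyBeingCreatedException during the fetch phase is a second task attempt (a retry or a speculative attempt) racing the original attempt to create the same part file. As a hedged suggestion only, not a confirmed diagnosis of this exact trace, disabling speculative execution in the Hadoop job configuration is worth trying:]

```xml
<!-- hadoop-site.xml fragment (property names from 0.19/0.20-era Hadoop).
     Prevents a speculative second task attempt from trying to create the
     same segment file as the original attempt. -->
<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
</property>
<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
</property>
```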


Thanks,


Eric Osgood
-
Cal Poly - Computer Engineering, Moon Valley Software
-
eosg...@calpoly.edu, e...@lakemeadonline.com
-
www.calpoly.edu/~eosgood, www.lakemeadonline.com



RE: http keep alive

2009-10-14 Thread Fuad Efendi
I'd like to add:

Keep-Alive is not polite. It ties up a dedicated listener on the server side. 
Establishing a TCP socket involves a per-connection handshake that takes time; 
that's why Keep-Alive exists for web servers - to improve the performance of 
subsequent requests. However, it allocates a dedicated listener for a specific 
port / remote client... 

What will happen with the classic setting of 150 processes in HTTPD 1.3 when 
150 robots try to use the Keep-Alive feature?

==
http://www.linkedin.com/in/liferay


> 
> protocol-httpclient can support keep-alive. However, I think that it
> won't help you much. Please consider that Fetcher needs to wait some
> time between requests, and in the meantime it will issue requests to
> other sites. This means that if you want to use keep-alive connections
> then the number of open connections will climb up quickly, depending on
> the number of unique sites on your fetchlist, until you run out of
> available sockets. On the other hand, if the number of unique sites is
> small, then most of the time the Fetcher will wait anyway, so the
> benefit from keep-alives (for you as a client) will be small - though
> there will be still some benefit for the server side.
> 
> 
> 
> --
> Best regards,
> Andrzej Bialecki <><
>   ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com





Re: Recrawling Nutch

2009-10-14 Thread Paul Tomblin
Nutch doesn't do a good job of storing or testing the Last-Modified
time of pages it has crawled.  I made the following changes, which seem
to help a lot:

snowbird:~/src/nutch/trunk> svn diff
Index: src/java/org/apache/nutch/fetcher/Fetcher.java
===================================================================
--- src/java/org/apache/nutch/fetcher/Fetcher.java  (revision 817382)
+++ src/java/org/apache/nutch/fetcher/Fetcher.java  (working copy)
@@ -21,6 +21,7 @@
 import java.net.MalformedURLException;
 import java.net.URL;
 import java.net.UnknownHostException;
+import java.text.ParseException;
 import java.util.*;
 import java.util.Map.Entry;
 import java.util.concurrent.atomic.AtomicInteger;
@@ -42,6 +43,7 @@
 import org.apache.nutch.metadata.Metadata;
 import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.net.*;
+import org.apache.nutch.net.protocols.HttpDateFormat;
 import org.apache.nutch.protocol.*;
 import org.apache.nutch.parse.*;
 import org.apache.nutch.scoring.ScoringFilters;
@@ -742,6 +744,23 @@

   datum.setStatus(status);
   datum.setFetchTime(System.currentTimeMillis());
+  LOG.debug("metadata = " + (content != null ? content.getMetadata() : "content-null"));
+  LOG.debug("modified? = " + ((content != null && content.getMetadata() != null) ? content.getMetadata().get("Last-Modified") : "content-null"));
+  if (content != null && content.getMetadata() != null && content.getMetadata().get("Last-Modified") != null)
+  {
+  String lastModifiedStr = content.getMetadata().get("Last-Modified");
+
+  try
+  {
+  long lastModifiedDate = HttpDateFormat.toLong(lastModifiedStr);
+  LOG.debug("last modified = " + lastModifiedStr + " = " + lastModifiedDate);
+  datum.setModifiedTime(lastModifiedDate);
+  }
+  catch (ParseException e)
+  {
+  LOG.error("unable to parse " + lastModifiedStr, e);
+  }
+  }
   if (pstatus != null)
datum.getMetaData().put(Nutch.WRITABLE_PROTO_STATUS_KEY, pstatus);

   ParseResult parseResult = null;
Index: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
===================================================================
--- src/java/org/apache/nutch/indexer/IndexerMapReduce.java (revision 817382)
+++ src/java/org/apache/nutch/indexer/IndexerMapReduce.java (working copy)
@@ -84,8 +84,10 @@
 if (CrawlDatum.hasDbStatus(datum))
   dbDatum = datum;
 else if (CrawlDatum.hasFetchStatus(datum)) {
-  // don't index unmodified (empty) pages
-  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)
+  /*
+   * Where did this person get the idea that unmodified pages are empty?
+   // don't index unmodified (empty) pages
+  if (datum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED) */
 fetchDatum = datum;
 } else if (CrawlDatum.STATUS_LINKED == datum.getStatus() || CrawlDatum.STATUS_SIGNATURE == datum.getStatus()) {
@@ -108,7 +110,7 @@
 }

 if (!parseData.getStatus().isSuccess() ||
-fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS) {
+(fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_SUCCESS && fetchDatum.getStatus() != CrawlDatum.STATUS_FETCH_NOTMODIFIED)) {
   return;
 }

Index: src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
===================================================================
--- src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (revision 817382)
+++ src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java (working copy)
@@ -124,11 +124,14 @@
 reqStr.append("\r\n");
   }

-  reqStr.append("\r\n");
   if (datum.getModifiedTime() > 0) {
-reqStr.append("If-Modified-Since: " + HttpDateFormat.toString(datum.getModifiedTime()));
+   String httpDate = HttpDateFormat.toString(datum.getModifiedTime());
+   Http.LOG.debug("modified time: " + httpDate);
+reqStr.append("If-Modified-Since: " + httpDate);
 reqStr.append("\r\n");
   }
+  reqStr.append("\r\n");

   byte[] reqBytes= reqStr.toString().getBytes();
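[Editorial note: the Last-Modified handling added to Fetcher.java above leans on Nutch's HttpDateFormat. A self-contained sketch of the same parsing using only the JDK, on the assumption that HttpDateFormat applies an RFC 1123-style pattern; the class and helper names here are illustrative, not Nutch code:]

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Plain-JDK sketch of the Last-Modified parsing the Fetcher.java patch adds.
public class LastModifiedSketch {

    // Parse an HTTP date header into epoch milliseconds; -1 mirrors the
    // patch's behavior of logging and skipping an unparseable value.
    static long toLong(String httpDate) {
        SimpleDateFormat fmt =
            new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        try {
            return fmt.parse(httpDate).getTime();
        } catch (ParseException e) {
            return -1L;
        }
    }

    public static void main(String[] args) {
        System.out.println(toLong("Thu, 01 Jan 1970 00:00:00 GMT")); // epoch start: 0
        System.out.println(toLong("not a date"));                    // -1
    }
}
```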


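[Editorial note: the HttpResponse.java hunk above moves the blank line that terminates the HTTP header block so it comes after the If-Modified-Since header; in the original order the header landed after the end of the headers and servers ignored it. A minimal sketch of the corrected layout, with a hypothetical helper rather than Nutch code:]

```java
// Sketch of the corrected request layout from the HttpResponse.java hunk:
// the If-Modified-Since header must precede the blank line ending the headers.
public class IfModifiedSinceSketch {

    static String buildRequest(String path, String host, String ifModifiedSince) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/1.0\r\n");
        req.append("Host: ").append(host).append("\r\n");
        if (ifModifiedSince != null) {
            req.append("If-Modified-Since: ").append(ifModifiedSince).append("\r\n");
        }
        req.append("\r\n"); // header/body separator comes last
        return req.toString();
    }

    public static void main(String[] args) {
        String req = buildRequest("/", "example.com", "Wed, 14 Oct 2009 09:40:00 GMT");
        // the conditional header sits inside the header block
        System.out.println(req.indexOf("If-Modified-Since") < req.indexOf("\r\n\r\n")); // true
    }
}
```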

On Wed, Oct 14, 2009 at 9:40 AM, sprabhu_PN wrote:
>
> "We are looking at picking up updates in a recrawl - how do I get the
> fetcher to read the recently built segment, get to the URL, and decide
> whether to fetch the content based on whether the URL has been updated
> since?"
>
> Shreekanth Prabhu
> --
> View this message in context: 
> http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>



-- 
http://www.linkedin.com/in/paultomblin


Recrawling Nutch

2009-10-14 Thread sprabhu_PN

"We are looking at picking up updates in a recrawl - how do I get the
fetcher to read the recently built segment, get to the URL, and decide
whether to fetch the content based on whether the URL has been updated
since?"

Shreekanth Prabhu
-- 
View this message in context: 
http://www.nabble.com/Recrawling--Nutch-tp25891294p25891294.html
Sent from the Nutch - User mailing list archive at Nabble.com.



Re: http keep alive

2009-10-14 Thread Andrzej Bialecki

Marko Bauhardt wrote:

hi.
is there a way to use http-keep-alive with nutch?
do protocol-http or protocol-httpclient support keep-alive?

i can't find any use of http-keep-alive in the code or in the
configuration files.


protocol-httpclient can support keep-alive. However, I think that it 
won't help you much. Please consider that Fetcher needs to wait some 
time between requests, and in the meantime it will issue requests to 
other sites. This means that if you want to use keep-alive connections 
then the number of open connections will climb up quickly, depending on 
the number of unique sites on your fetchlist, until you run out of 
available sockets. On the other hand, if the number of unique sites is 
small, then most of the time the Fetcher will wait anyway, so the 
benefit from keep-alives (for you as a client) will be small - though 
there will be still some benefit for the server side.
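[Editorial note: the JDK's own HttpURLConnection client shows the same trade-off in miniature: persistent connections are reused per host but the idle-connection pool is kept deliberately small. These are standard JDK networking properties, shown as a general illustration only; protocol-httpclient manages its connections separately.]

```java
// Standard JDK networking properties governing client-side keep-alive
// (general illustration; not Nutch configuration).
public class KeepAliveProps {
    public static void main(String[] args) {
        // reuse sockets across requests to the same host (JDK default: true)
        System.setProperty("http.keepAlive", "true");
        // idle persistent connections kept per destination (JDK default: 5)
        System.setProperty("http.maxConnections", "5");
        System.out.println(System.getProperty("http.keepAlive")); // true
    }
}
```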




--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



http keep alive

2009-10-14 Thread Marko Bauhardt

hi.
is there a way to use http-keep-alive with nutch?
do protocol-http or protocol-httpclient support keep-alive?

i can't find any use of http-keep-alive in the code or in the
configuration files.


thanks
marko