One-line patch Re: [Nutch-dev] New pdf and doc support

john Tue, 29 Jun 2004 12:29:17 -0700

Hi, Doug,

Here is a one-line patch that lets you tell nutch not to truncate content:
with it, a negative http.content.limit disables the download size limit.
Nothing essential, just a convenience.
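
For illustration, the patched check can be sketched in isolation as below.
This is only a minimal sketch: the class ContentLimitSketch and the method
effectiveLength are hypothetical names and not part of nutch, and the int
types are an assumption. The maxContent parameter stands in for
Http.MAX_CONTENT, which is set from http.content.limit.

```java
// Hypothetical helper, not part of nutch: mirrors the patched check in isolation.
final class ContentLimitSketch {

    /**
     * Returns how many bytes should be downloaded.
     * maxContent plays the role of Http.MAX_CONTENT (http.content.limit);
     * a negative value is taken to mean "no limit".
     */
    static int effectiveLength(int contentLength, int maxContent) {
        if (maxContent >= 0 && contentLength > maxContent) {
            return maxContent;   // truncate to the configured limit
        }
        return contentLength;    // within the limit, or limit disabled
    }

    public static void main(String[] args) {
        System.out.println(effectiveLength(100000, 65536)); // prints 65536
        System.out.println(effectiveLength(100000, -1));    // prints 100000
        System.out.println(effectiveLength(1000, 65536));   // prints 1000
    }
}
```

With a limit of 65536, a 100000-byte response is cut to 65536 bytes; with a
limit of -1 it is left whole, which is what PDFBox and POI need.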


John

--------------------------- patch.txt ----------------------------------------

--- src/plugin/protocol-http/src/java/net/nutch/protocol/http/HttpResponse.java.ori    2004-06-07 14:04:21.000000000 -0700
+++ src/plugin/protocol-http/src/java/net/nutch/protocol/http/HttpResponse.java    2004-06-28 17:30:26.000000000 -0700
@@ -184,7 +184,8 @@
         throw new HttpException("bad content length: "+contentLengthString);
       }
     }
-    if (contentLength > Http.MAX_CONTENT)   // limit download size
+    if (Http.MAX_CONTENT >= 0
+      && contentLength > Http.MAX_CONTENT)   // limit download size
       contentLength  = Http.MAX_CONTENT;
 
     ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
--- ./conf/nutch-default.xml.ori        2004-06-16 10:31:30.000000000 -0700
+++ ./conf/nutch-default.xml    2004-06-29 12:03:52.000000000 -0700
@@ -70,8 +70,10 @@
 <property>
   <name>http.content.limit</name>
   <value>65536</value>
-  <description>The default length limit for downloaded content, in
-   bytes.  Content longer than this is truncated.</description>
+  <description>The length limit for downloaded content, in bytes.
+  If this value is nonnegative (>=0), content longer than it will be truncated;
+  otherwise, no truncation at all.
+  </description>
 </property>
 
 <property>


On Tue, Jun 29, 2004 at 12:14:59PM -0700, [EMAIL PROTECTED] wrote:
> On Mon, Jun 28, 2004 at 09:25:04PM -0700, Jacques Grove wrote:
> > Hi all,
> > 
> > Great job on the new pdf and doc file support (which arrived just about
> > a week before I wanted to start hacking on it).  Anyway, I have some
> > comments, based on the intranet crawl/search I use nutch for.  Neither
> > are directly nutch's fault, but I wanted to mention them for the record:
> > 
> > - The pdf engine nutch uses, PDFBox, doesn't do very well on a (largish)
> > subset of real-world pdf files.  The most common errors I see are (from
> > the crawler):
> 
> Yes, neither PDFBox nor POI can handle 100% of the *.pdf or *.doc files out there.
> If there are better libs that people like, we can always switch.
> 
> However, you definitely want to make sure that file contents are not
> truncated when crawled (by default, nutch truncates at 65536 bytes;
> check ./conf/nutch-default.xml), since neither lib can currently deal with
> incomplete files.
> 
> I am in the process of testing some finished code that allows
> external programs to be used as parsers. This might add a little
> flexibility.
> 
> John
> 
