Hi, Doug,
Here is a one-line patch that lets nutch skip content truncation when
http.content.limit is set to a negative value.
Nothing essential, just a convenience.
John
--------------------------- patch.txt ----------------------------------------
--- src/plugin/protocol-http/src/java/net/nutch/protocol/http/HttpResponse.java.ori 2004-06-07 14:04:21.000000000 -0700
+++ src/plugin/protocol-http/src/java/net/nutch/protocol/http/HttpResponse.java 2004-06-28 17:30:26.000000000 -0700
@@ -184,7 +184,8 @@
throw new HttpException("bad content length: "+contentLengthString);
}
}
- if (contentLength > Http.MAX_CONTENT) // limit download size
+ if (Http.MAX_CONTENT >= 0
+ && contentLength > Http.MAX_CONTENT) // limit download size
contentLength = Http.MAX_CONTENT;
ByteArrayOutputStream out = new ByteArrayOutputStream(Http.BUFFER_SIZE);
--- ./conf/nutch-default.xml.ori 2004-06-16 10:31:30.000000000 -0700
+++ ./conf/nutch-default.xml 2004-06-29 12:03:52.000000000 -0700
@@ -70,8 +70,10 @@
<property>
<name>http.content.limit</name>
<value>65536</value>
- <description>The default length limit for downloaded content, in
- bytes. Content longer than this is truncated.</description>
+ <description>The length limit for downloaded content, in bytes.
+ If this value is nonnegative (>=0), content longer than it will be truncated;
+ otherwise, no truncation at all.
+ </description>
</property>
<property>
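
For anyone skimming the patch, the check it introduces can be sketched in isolation roughly as below. This is only an illustration, not the actual HttpResponse code: the class name ContentLimitSketch and the method effectiveLength are made up for the example, and maxContent stands in for Http.MAX_CONTENT (the http.content.limit setting).

```java
// Hypothetical stand-alone sketch of the patched check; the class and
// method names are assumptions and do not exist in nutch.
final class ContentLimitSketch {

    /**
     * Returns the number of bytes that would be read for a response.
     *
     * @param contentLength length announced by the server, in bytes
     * @param maxContent    configured limit (stands in for Http.MAX_CONTENT);
     *                      a negative value means "do not truncate"
     */
    static int effectiveLength(int contentLength, int maxContent) {
        // Truncate only when a nonnegative limit is configured and exceeded.
        if (maxContent >= 0 && contentLength > maxContent) {
            return maxContent;
        }
        return contentLength;
    }
}
```

With the default limit of 65536, a 100000-byte response is cut to 65536 bytes; with a limit of -1 the same response keeps all 100000 bytes.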
On Tue, Jun 29, 2004 at 12:14:59PM -0700, [EMAIL PROTECTED] wrote:
> On Mon, Jun 28, 2004 at 09:25:04PM -0700, Jacques Grove wrote:
> > Hi all,
> >
> > Great job on the new pdf and doc file support (which arrived just about
> > a week before I wanted to start hacking on it). Anyway, I have some
> > comments, based on the intranet crawl/search I use nutch for. Neither
> > are directly nutch's fault, but I wanted to mention them for the record:
> >
> > - The pdf engine nutch uses, PDFBox, doesn't do very well on a (largish)
> > subset of real-world pdf files. The most common errors I see are (from
> > the crawler):
>
> Yes, neither PDFBox nor POI can handle 100% of the *.pdf or *.doc files out there.
> If there are better libs that people like, we can always switch.
>
> However, you definitely want to make sure that file contents are not
> truncated when crawled (by default, nutch truncates at 65536 bytes;
> check ./conf/nutch-default.xml), since neither lib can currently deal with
> incomplete files.
>
> I am in the process of testing some finished code that allows
> external programs to be used as parsers. This might add a little
> flexibility.
>
> John
>
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers