On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:

But when I execute ./nutch crawl, it shows messages like "fetch okay,
but can't parse http://(omit...).pdf", reason: failed <omit..> content
truncated at 70709 bytes. Parser can't handle incomplete pdf file.

Haven't had time to go through the complete code (not sure I'd
understand it, anyway), but this looks like you need to set
file.content.limit to, say, 16777216. If you're crawling over http
rather than intranet shares, the property you need to set is
http.content.limit.
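
For what it's worth, here is a rough sketch of what the override might
look like in conf/nutch-site.xml. I'm assuming the usual property
format used in nutch-default.xml, and 16777216 is only the example
value from above, not a recommendation. Use file.content.limit instead
of http.content.limit if you are crawling file shares.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed override: raise the size limit for content fetched over
       http so that large PDFs are not truncated before parsing.
       The value is in bytes and is only an example. -->
  <property>
    <name>http.content.limit</name>
    <value>16777216</value>
  </property>
</configuration>
```

Pages that were already fetched and truncated would presumably need to
be fetched again before the parser sees the complete file.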

Hope it helps.


t.n.a.
