On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:
But when I execute ./nutch crawl, it shows messages like "fetch okay, but can't parse http://(omit...).pdf" reason: failed <omit..> content truncated at 70709 bytes. Parser can't handle incomplete pdf file.
Haven't had time to go through the complete code (not sure I'd understand it, anyway), but this looks like you need to set file.content.limit to, say, 16777216. If you're crawling over http rather than intranet shares, the property you need to set is http.content.limit.

Hope it helps.

t.n.a.
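
P.S. In case it saves you a lookup, here is a rough sketch of what the override might look like. I'm assuming you put it in conf/nutch-site.xml, inside that file's existing root element, and that your version uses the same <property> layout as nutch-default.xml. The 16777216 (16 MB) value is just the example figure from above, not a recommendation; pick whatever comfortably covers your largest PDFs.

```xml
<!-- Sketch only: paste inside the root element of conf/nutch-site.xml. -->
<!-- Limit for documents fetched over http; 16777216 bytes = 16 MB (example value). -->
<property>
  <name>http.content.limit</name>
  <value>16777216</value>
</property>

<!-- Same idea for documents fetched from file shares (file: URLs). -->
<property>
  <name>file.content.limit</name>
  <value>16777216</value>
</property>
```

You only need the one that matches how you crawl. If I remember the description in nutch-default.xml correctly, a negative value turns truncation off altogether, but check that against your own copy before relying on it. Pages that were already fetched truncated will need to be fetched again before the parser can handle them.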
