Re: Could anyone teache me how to index the title or content of PDF?

Tomi NA Sat, 02 Sep 2006 02:19:32 -0700

On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:

But when I run ./nutch crawl, I get messages like "fetch okay, but
can't parse http://(omit...).pdf", reason: failed <omit..> content
truncated at 70709 bytes. Parser can't handle incomplete pdf file.


Haven't had time to go through the complete code (not sure I'd
understand it, anyway), but it looks like the fetcher is cutting the
file off before the parser sees it, so you need to raise
file.content.limit to, say, 16777216 (16 MB). If you're crawling over
http rather than intranet shares, the property you need to set is
http.content.limit.
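
For illustration, a minimal sketch of what that override might look like.
It assumes the properties go in conf/nutch-site.xml (the usual place for
local overrides of nutch-default.xml) and that 16777216 bytes is large
enough for your PDFs; the value is only the example figure from above,
not a recommended setting.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: local overrides (file location is an assumption) -->
<configuration>
  <!-- Maximum bytes kept per document fetched over http -->
  <property>
    <name>http.content.limit</name>
    <value>16777216</value>
  </property>
  <!-- Maximum bytes kept per document fetched via file: URLs -->
  <property>
    <name>file.content.limit</name>
    <value>16777216</value>
  </property>
</configuration>
```

Only the property matching the protocol you crawl with should matter.
Documents fetched before the change were already stored truncated, so
they would most likely need to be fetched again before they parse.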

Hope it helps.


t.n.a.
