Thanks for your help. I am crawling over HTTP and set http.content.limit in nutch-default.xml as follows:

    <property>
      <name>http.content.limit</name>
      <value>16777216</value>
      <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be
      truncated; otherwise, no truncation at all.
      </description>
    </property>

But it still shows the same error:

    "fetch okay, but can't parse http://(omit...).pdf" reason: failed <omit..> content truncated at 70709 bytes. Parse can't handle incomplete pdf file.

What did I do wrong?

Thanks

Tomi NA wrote:
> On 9/1/06, Frank Huang <[EMAIL PROTECTED]> wrote:
>
>> But when I execute ./nutch crawl there show some messages like "fetch
>> okay, but can`t parse http://(omit...).pdf" reason:failed <omit..>content
>> truncated at 70709 bytes. Parse can`t handle incomplete pdf file.
>
> Haven't had time to go through the complete code (not sure I'd
> understand it, anyway), but this looks like you need to set
> file.content.limit to, say, 16777216. If you're crawling over http
> rather than intranet shares, the property you need to set is
> http.content.limit
>
> Hope it helps.
>
> t.n.a.
