Hi All,

I am currently setting up Nutch to crawl an intranet site. I have been working through various issues and resolving them by googling and searching the Nutch User forum.
The problem I have now is that, while crawling, I get the following error:

Error parsing: http://xxxxxx.xxxx/svn/repos/sgs/trunk/xdoc/sample.doc: failed(2,0): Can't be handled as Microsoft document. java.lang.StringIndexOutOfBoundsException: String index out of range: -106938

The document above is about 2.6 MB, so it appears that the file size might be the cause. Could you please confirm my observation and suggest a workaround? Maybe I should set content.limit to 1 MB.

Any help or suggestion is greatly appreciated.

Thanks,
Lakshman
