Hi All,

I am setting up Nutch to crawl an intranet site. I have been working
through various issues and resolving them by googling and searching the
Nutch user forum.

The current problem is that, while crawling, I get the following error:
Error parsing: http://xxxxxx.xxxx/svn/repos/sgs/trunk/xdoc/sample.doc:
failed(2,0): Can't be handled as Microsoft document.
java.lang.StringIndexOutOfBoundsException: String index out of range:
-106938

The above document is about 2.6 MB.

It appears that the file size might be the reason. Could you please confirm
my observation and suggest a workaround? Maybe I should set content.limit
to 1 MB.
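
For reference, this is roughly the override I have in mind. It is only a
sketch: I am assuming that http.content.limit is the relevant property for
an HTTP crawl (file.content.limit would be the counterpart for file: URLs)
and that the override belongs in conf/nutch-site.xml.

```xml
<!-- conf/nutch-site.xml: overrides values from nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumption: the crawl fetches over HTTP, so this is the limit that applies -->
    <name>http.content.limit</name>
    <!-- 1 MB expressed in bytes (1024 * 1024) -->
    <value>1048576</value>
  </property>
</configuration>
```

I realize a 1 MB limit would still cut off this 2.6 MB file, so I am not
sure whether truncating the content helps or is itself what breaks the
parser.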

Any help/suggestion is greatly appreciated.

Thanks
Lakshman

