Is there any way to index partial content of doc/xls/rtf files? If it's not possible, please let me know.

ogjunk-nutch wrote:
>
> I *think* you have to fetch the *full* content of MS Word docs (and PDFs
> and RTFs and ...) if you want parsers that handle those documents to be
> able to parse them. A partial MS Word/PDF/RTF/... document is considered
> invalid/broken. Try opening it with MS Word, for example -- it will not
> work.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: m.harig <[EMAIL PROTECTED]>
>> To: [email protected]
>> Sent: Thursday, June 5, 2008 3:27:18 AM
>> Subject: Re: nutch file content limit
>>
>>
>> thanks
>>
>> my situation is this.. i've 100 MS-WORD files . each has 15MB in size...
>>
>> if i set file.content.limit as 5MB. when nutch goes for fetching it can't
>> parse the content. it says Can't handle as Microsoft document. and its
>> failed.. how do i index partial content of those documents. any1 help me
>> out of this
>>
>>
>> this is my error
>>
>> Can't be handled as Microsoft document. java.io.IOException: Cannot remove block[ 20839 ]; out of range
>> --
>> View this message in context:
>> http://www.nabble.com/nutch-file-content-limit-tp17640376p17663787.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.
>
>
>

--
View this message in context: http://www.nabble.com/nutch-file-content-limit-tp17640376p17686729.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.
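
For reference, a minimal sketch of the setting discussed in this thread, assuming the override goes in conf/nutch-site.xml (the usual place for local overrides of nutch-default.xml). It is not a way to index partial content. It only lifts the limit so that each file is fetched whole and the parser gets a valid document, as Otis suggests. The description text is my own wording, not taken from the Nutch distribution.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside the <configuration> element. -->
<property>
  <name>file.content.limit</name>
  <!-- A negative value disables truncation for the file:// protocol.
       A non-negative value is a byte limit; longer content is cut off,
       which is what breaks the MS Word parser in the message above. -->
  <value>-1</value>
  <description>Fetch file:// content in full so binary formats such as
  MS Word can be parsed (illustrative wording).</description>
</property>
```

With 100 files of about 15 MB each, fetching in full will use more memory and disk than a 5 MB cap would. If only part of the text should end up in the index, that trimming would presumably have to happen after parsing, not at fetch time.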
