Then you should open a bug report with TIKA and attach the files that do not parse. Often the problem is in one of TIKA's underlying parser libraries, such as Apache POI, and then there is nothing the TIKA developers can do. Another TIKA issue may already cover the same problem, so just search the issue tracker!
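Before searching the tracker it can help to see which library actually threw the innermost exception. Below is a minimal sketch of that triage, using only the JDK. The class name RootCauseTriage, its methods, and the simulated stack frame are made up for illustration and are not part of Tika or Solr. It assumes the package prefixes seen in the traces in this thread (org.apache.poi, org.pdfbox, org.apache.tika), plus org.apache.pdfbox for newer PDFBox releases.

```java
import java.io.IOException;

/** Rough triage helper: guesses which library threw the root cause of a parse failure. */
final class RootCauseTriage {

    /** Follows getCause() down to the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    /** Returns a library name based on the first recognisable stack frame of the root cause. */
    static String likelyOrigin(Throwable t) {
        for (StackTraceElement frame : rootCause(t).getStackTrace()) {
            String cls = frame.getClassName();
            if (cls.startsWith("org.apache.poi.")) return "Apache POI";
            if (cls.startsWith("org.pdfbox.") || cls.startsWith("org.apache.pdfbox.")) return "PDFBox";
            if (cls.startsWith("org.apache.tika.")) return "Apache Tika";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // Simulated failure; the frame below is hypothetical, not a real POI class.
        IOException cause = new IOException("block[ 0 ] already removed");
        cause.setStackTrace(new StackTraceElement[] {
            new StackTraceElement("org.apache.poi.example.Hypothetical", "read", "Hypothetical.java", 1)
        });
        Exception wrapped = new RuntimeException("TIKA-198: Illegal IOException", cause);
        System.out.println(likelyOrigin(wrapped)); // prints "Apache POI"
    }
}
```

The frames are scanned from the top because the frame that threw comes first. Tika frames appear further down in every such trace, so they only win when Tika itself raised the exception. This is only a heuristic for deciding which tracker to search first.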
Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

From: Deepak Singh [mailto:[email protected]]
Sent: Wednesday, March 09, 2011 2:09 PM
To: [email protected]
Subject: Re: Solr Exception

I downloaded apache-solr-3.1; it is still giving the TIKA exception.

On Wed, Mar 9, 2011 at 5:11 PM, Deepak Singh <[email protected]> wrote:

Oh, thanks for the better solution.

On Wed, Mar 9, 2011 at 4:36 PM, Uwe Schindler <[email protected]> wrote:

Hi,

These are all bugs in Apache TIKA, not Solr. Some of them are already fixed in later TIKA versions (so you may try the soon-to-be-released Solr 3.1 version, which bundles a newer TIKA).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

From: Deepak Singh [mailto:[email protected]]
Sent: Wednesday, March 09, 2011 12:03 PM
To: [email protected]
Subject: Re: Solr Exception

HTTP ERROR: 500 (INTERNAL SERVER ERROR)

For DOC files:
org.apache.tika.exception.TikaException:
- Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1248f2
  Caused by: org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims to have a size of 16 bytes. However, it exceeds 16 bytes.
- TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.OfficeParser@1248f2
  Caused by: java.io.IOException: block[ 0 ] already removed - does your POIFS have circular or duplicate block references?

For PDF files:
org.apache.tika.exception.TikaException:
- Unexpected RuntimeException from org.apache.tika.parser.Pdfparser@1b4cd65
  Caused by: java.lang.ClassCastException: org.pdfbox.cos.COSArray cannot be cast to org.pdfbox.cos.COSDictionar
  Caused by: java.lang.NullPointerException
- Unable to extract PDF content

HTTP ERROR: 400 (BAD REQUEST)
- This error comes when some fields are missing
  ERROR: unknown field 'language' (Ex: content_status, description, version)

On Wed, Mar 9, 2011 at 4:19 PM, Gora Mohanty <[email protected]> wrote:

Hi,

This is probably better directed to the user list. Also, please provide details of the exceptions from your log files.

Regards,
Gora
