On Thu, May 22, 2025 at 07:04:17PM +0000, Keese, Brian W wrote:
> More information... in my test sample of one, just now, I changed
> "textextractor.use-temp-file = true" to "textextractor.use-temp-file = false"
> in dspace.cfg and then the pdf text was parsed successfully. I'll dig into
> the temp file code to see if I can nail down the root cause. I'm guessing
> something about the parser plug-in interface has changed.

Interesting. I may try that.

More data: I fetched tika-app 3.1.0 and opened one of the offending
files. It warns twice about "Empty COSName at offset blah" but has no
trouble reading the file or displaying content.

> On Thursday, May 22, 2025 at 10:32:41 AM UTC-5 [email protected] wrote:
> On Thu, May 22, 2025 at 02:51:21PM +0000, Keese, Brian W wrote:
> > We have a recurring issue with PDF submissions that cause filter-media to
> > fail when parsing text. Maybe 10% of submissions cause errors that look
> > like below. I tried to upgrade apache tika beyond 2.9.2, thinking there
> > might have been a bug fix. But I can't get the build to finish because of
> > dependency conflicts in tika and I don't know enough about maven to get
> > past them. Has anyone solved this or can anyone suggest a solution?
> >
> > Thanks,
> > Brian
> >
> > 2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
> > 2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
> > 2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream:
> >     Item Handle: 2022/33585
> >     Bundle Name: ORIGINAL
> >     File Size: 25537966
> >     Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5)
> >     Asset Store: 0
> >     Internal ID: 139753933998022015522955092400058404315
> > 2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e
> > Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
> > 2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
> > 2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
>
> You are not alone. We have hundreds of these. Other PDF tools have
> no problem with those files.

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
library.indianapolis.iu.edu
