More information: in my test sample of one, just now, I changed 
"textextractor.use-temp-file = true" to "textextractor.use-temp-file = 
false" in dspace.cfg, and the PDF text was then parsed successfully. I'll 
dig into the temp file code to see if I can nail down the root cause. My 
guess is that something about the parser plug-in interface has changed. 
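
For anyone who wants to try the same thing, the change amounts to the 
fragment below. This is only a sketch of my local edit: the property name 
and values are the ones quoted above, the comment lines are my own notes, 
and it has been tested on a single PDF, so treat it as a workaround to try 
rather than a confirmed fix.

```
# dspace.cfg
# Was: textextractor.use-temp-file = true
# With false, filter-media extracted text from the one PDF tested.
textextractor.use-temp-file = false
```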

On Thursday, May 22, 2025 at 10:32:41 AM UTC-5 [email protected] wrote:

> On Thu, May 22, 2025 at 02:51:21PM +0000, Keese, Brian W wrote:
> > We have a recurring issue with PDF submissions that cause filter-media
> > to fail when parsing text. Maybe 10% of submissions cause errors that look
> > like below. I tried to upgrade apache tika beyond 2.9.2, thinking there
> > might have been a bug fix. But I can't get the build to finish because of
> > dependency conflicts in tika and I don't know enough about maven to get
> > past them. Has anyone solved this or can anyone suggest a solution?
> > Thanks,
> > Brian
> > 2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
> > 2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
> > 2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream: Item Handle: 2022/33585 Bundle Name: ORIGINAL File Size: 25537966 Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5) Asset Store: 0 Internal ID: 139753933998022015522955092400058404315
> > 2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
> > 2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
> > 2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
>
> You are not alone. We have hundreds of these. Other PDF tools have
> no problem with those files.
>
> -- 
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> library.indianapolis.iu.edu
>

-- 
All messages to this mailing list should adhere to the Code of Conduct: 
https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
--- 
You received this message because you are subscribed to the Google Groups 
"DSpace Technical Support" group.
To view this discussion visit 
https://groups.google.com/d/msgid/dspace-tech/80a10c91-b86e-406f-9443-139304e6e31an%40googlegroups.com.