On Thu, May 22, 2025 at 02:51:21PM +0000, Keese, Brian W wrote:
> We have a recurring issue with PDF submissions that cause filter-media to
> fail when parsing text. Roughly 10% of submissions produce errors like the
> ones below. I tried upgrading Apache Tika beyond 2.9.2, thinking there might
> have been a bug fix, but I can't get the build to finish because of
> dependency conflicts in Tika, and I don't know enough about Maven to get past
> them. Has anyone solved this, or can anyone suggest a solution?
> Thanks,
> Brian
> 2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
> 2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
> 2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream: Item Handle: 2022/33585 Bundle Name: ORIGINAL File Size: 25537966 Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5) Asset Store: 0 Internal ID: 139753933998022015522955092400058404315
> 2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
> 2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
> 2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed

You are not alone.  We have hundreds of these.  Other PDF tools have
no problem with those files.
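
For anyone wanting to measure how widespread this is on their own
installation, the following is a minimal sketch, not a DSpace tool, that
scans a filter-media log for failures of the kind quoted above. It assumes
one log entry per line and the exact message wording shown in the quoted
excerpt; the function name summarize_failures and the log path passed on
the command line are hypothetical.

```python
import re
import sys
from collections import Counter

# Patterns based only on the log excerpt quoted in this thread (assumption:
# other DSpace versions may word these messages differently).
HANDLE_RE = re.compile(r"ERROR filtering, skipping bitstream: Item Handle: (\S+)")
CAUSE_RE = re.compile(r"Caused by: (\S+)")


def summarize_failures(log_lines):
    """Return (handles, causes).

    handles: item handles that failed filtering, in first-seen order.
    causes:  Counter of exception class names found after 'Caused by:'.
    """
    handles = []
    causes = Counter()
    for line in log_lines:
        handle_match = HANDLE_RE.search(line)
        if handle_match and handle_match.group(1) not in handles:
            handles.append(handle_match.group(1))
        cause_match = CAUSE_RE.search(line)
        if cause_match:
            causes[cause_match.group(1)] += 1
    return handles, causes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python scan_log.py /path/to/dspace.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        failed, cause_counts = summarize_failures(log_file)
    print(f"{len(failed)} item(s) failed filtering")
    for name, count in cause_counts.most_common():
        print(f"{count:6d}  {name}")
```

The resulting handle list could then be used to pull the affected PDFs
aside for testing against other Tika versions.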

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
library.indianapolis.iu.edu
