Hello,
We have a recurring issue where certain PDF submissions cause filter-media 
to fail during text extraction. Roughly 10% of submissions produce errors 
like the ones in the log below. I tried upgrading Apache Tika beyond 2.9.2, 
thinking the problem might have been fixed in a later release, but I can't 
get the build to finish because of dependency conflicts around Tika, and I 
don't know enough about Maven to get past them. Has anyone solved this, or 
can anyone suggest a solution?
Thanks,
Brian
2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream: Item Handle: 2022/33585 Bundle Name: ORIGINAL File Size: 25537966 Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5) Asset Store: 0 Internal ID: 139753933998022015522955092400058404315
2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
