Hello,

We have a recurring issue with PDF submissions that cause filter-media to fail when parsing text. Maybe 10% of submissions produce errors like the ones below.

I tried upgrading Apache Tika beyond 2.9.2, thinking there might have been a bug fix, but I can't get the build to finish because of dependency conflicts in Tika, and I don't know enough about Maven to get past them.

Has anyone solved this, or can anyone suggest a solution?

Thanks,
Brian

2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream:
    Item Handle: 2022/33585
    Bundle Name: ORIGINAL
    File Size: 25537966
    Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5)
    Asset Store: 0
    Internal ID: 139753933998022015522955092400058404315
2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e
    Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
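To pin down how many items are affected (the 10% figure is only an estimate), something like the following could tally the failing handles and exception types from saved filter-media output. This is a rough sketch, not a DSpace tool: it assumes the log layout shown above, and the function name `summarize` is made up for illustration.

```python
import re
from collections import Counter

# Zero-width match in front of the timestamp that starts each log entry,
# e.g. "2025-05-10 12:07:22.775 ". Assumes the layout shown in the post.
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3} )")
HANDLE = re.compile(r"Item Handle:\s*(\S+)")
CAUSE = re.compile(r"Caused by:\s*(\S+)")


def summarize(log_text):
    """Return (failed_handles, cause_counts) for filter-media log text."""
    # Collapse all whitespace so wrapped and single-line entries look alike.
    flat = " ".join(log_text.split())
    handles = []
    causes = Counter()
    for entry in ENTRY_START.split(flat):
        # Only ERROR-level entries are of interest here.
        if " ERROR " not in entry:
            continue
        handle = HANDLE.search(entry)
        if handle and handle.group(1) not in handles:
            handles.append(handle.group(1))
        cause = CAUSE.search(entry)
        if cause:
            causes[cause.group(1)] += 1
    return handles, causes
```

Calling `summarize()` on the contents of a saved log returns the list of handles that failed and a count per exception class, which would show whether every failure is the same `StringIndexOutOfBoundsException` or a mix of problems.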
