Re: [dspace-tech] Errors with tika pdfparser in filter-media when extracting text

Brian Keese Fri, 23 May 2025 09:25:45 -0700

I was not able to figure out the problem with the way parsing is done when 
use-temp-file is set to true. I did confirm that it doesn't matter if 
max-chars is in effect (settings of 100000 and -1 yield the same results).
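
For anyone following along, the settings discussed here would look roughly 
like this in dspace.cfg (or a local.cfg override). This is only a sketch: 
the thread refers to the second setting just as "max-chars", so the full key 
name below is an assumption and should be checked against your own config.

```
# Sketch only; verify key names against the dspace.cfg shipped with your version.

# Workaround reported in this thread: with "false" the sample PDF parsed.
textextractor.use-temp-file = false

# Assumed full key for "max-chars". With use-temp-file = true, values of
# 100000 and -1 gave the same failure, so this setting is not the trigger.
textextractor.max-chars = 100000
```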


Maybe the best bet is to find a way to upgrade the Tika version in the 
build, but I don't know how to get past the dependency conflicts. I found 
this relevant (but older) ticket, but I don't know how to apply the fix: 
https://issues.apache.org/jira/browse/TIKA-2598
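
A note on reading the exception in the quoted log: "begin 66, end 24, length 
90" is the wording some JDK versions use when a substring-style call gets a 
begin index larger than its end index. The sketch below reproduces that 
condition with plain java.lang.String. It is not Tika code, the class and 
method names are made up for illustration, and whether the parser really 
fails in a substring call is an assumption, since the log shows only the 
message. The exact message text also varies between JDK releases.

```java
final class SubstringBoundsDemo {

    // Returns true when text.substring(begin, end) is rejected with
    // StringIndexOutOfBoundsException for a string of the given length.
    static boolean substringFails(int begin, int end, int length) {
        String text = "x".repeat(length); // String.repeat needs Java 11+
        try {
            text.substring(begin, end);
            return false;
        } catch (StringIndexOutOfBoundsException e) {
            // Message wording depends on the JDK release.
            System.out.println(e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) {
        // Same numbers as in the log: begin is past end, so this fails.
        System.out.println(substringFails(66, 24, 90));
        // Swapped indices are valid for a 90-character string.
        System.out.println(substringFails(24, 66, 90));
    }
}
```

If that reading is right, the offsets are being computed in the wrong order 
somewhere in the parsing path, which would be worth mentioning in any Tika 
bug report.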

On Friday, May 23, 2025 at 9:38:01 AM UTC-5 [email protected] wrote:

> On Thu, May 22, 2025 at 07:04:17PM +0000, Keese, Brian W wrote:
> > More information... in my test sample of one, just now, I changed
> > "textextractor.use-temp-file = true" to "textextractor.use-temp-file =
> > false" in dspace.cfg and then the pdf text was parsed successfully. I'll
> > dig into the temp file code to see if I can nail down the root cause. I'm
> > guessing something about the parser plug-in interface has changed.
>
> Interesting. I may try that.
>
> More data: I fetched tika-app 3.1.0 and opened one of the offending
> files. It warns twice about "Empty COSName at offset blah" but has no
> trouble reading the file or displaying content.
>
> > On Thursday, May 22, 2025 at 10:32:41 AM UTC-5 [email protected] wrote:
> > On Thu, May 22, 2025 at 02:51:21PM +0000, Keese, Brian W wrote:
> > > We have a recurring issue with PDF submissions that cause filter-media
> > > to fail when parsing text. Maybe 10% of submissions cause errors that look
> > > like below. I tried to upgrade apache tika beyond 2.9.2, thinking there
> > > might have been a bug fix. But I can't get the build to finish because of
> > > dependency conflicts in tika and I don't know enough about maven to get
> > > past them. Has anyone solved this or can anyone suggest a solution?
> > > Thanks,
> > > Brian
> > >
> > > 2025-05-10 12:07:22.585 INFO filter-media - 139 @ The script has started
> > > 2025-05-10 12:07:22.587 INFO filter-media - 139 @ File: Wuli, Diana (DM Cello).pdf.txt
> > > 2025-05-10 12:07:22.775 ERROR filter-media - 139 @ ERROR filtering, skipping bitstream: Item Handle: 2022/33585 Bundle Name: ORIGINAL File Size: 25537966 Checksum: 26d2ff8f679b7e4ddca975f3390766fc (MD5) Asset Store: 0 Internal ID: 139753933998022015522955092400058404315
> > > 2025-05-10 12:07:22.776 ERROR filter-media - 139 @ Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5c9b293e Caused by: java.lang.StringIndexOutOfBoundsException begin 66, end 24, length 90
> > > 2025-05-10 12:07:22.779 INFO filter-media - 139 @ SKIPPED: bitstream 67dabaac-db02-4371-88ea-1129e41e4e2e (item: 2022/33585) because 'Wuli, Diana (DM Cello).pdf.jpg' already exists
> > > 2025-05-10 12:07:22.789 INFO filter-media - 139 @ The script has completed
> > 
> > You are not alone. We have hundreds of these. Other PDF tools have
> > no problem with those files.
>
> -- 
> Mark H. Wood
> Lead Technology Analyst
>
> University Library
> Indiana University Indianapolis
> 755 W. Michigan Street
> Indianapolis, IN 46202
> 317-274-0749
> library.indianapolis.iu.edu
>

