Dear All, I am having issues with text extraction from PDFs that contain non-Latin characters and East Asian languages. I tried switching from the PDFBox-based PDFFilter to XPDF, but it does not extract the text properly either. If I extract the text using the command-line tools (i.e. java -jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 <inputfile> <outputtextfile> for PDFBox and pdftotext -enc UTF-8 for XPDF), the text is extracted properly.
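One guess I have not verified: if the media filter passes the extracted text through a Writer or a getBytes() call that falls back to the JVM default charset, and the JVM running filter-media has a non-UTF-8 default, the text could be damaged after extraction even though the extractor itself handled it correctly. The sketch below uses only the JDK to show that effect. The class and method names are my own and are not taken from DSpace, PDFBox or XPDF.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of DSpace.
class CharsetRoundTrip {

    // Writes the text through a Writer that encodes with writerCharset,
    // then decodes the resulting bytes as UTF-8, as a UTF-8 consumer would.
    static String writeThenReadAsUtf8(String text, Charset writerCharset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, writerCharset)) {
            writer.write(text);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Japanese and Russian sample text, written with escapes so the
        // source file encoding does not matter.
        String sample = "\u65e5\u672c\u8a9e \u0420\u0443\u0441\u0441\u043a\u0438\u0439";

        System.out.println("JVM default charset: " + Charset.defaultCharset());

        // An explicit UTF-8 writer always preserves the text.
        System.out.println("UTF-8 writer preserved text: "
                + sample.equals(writeThenReadAsUtf8(sample, StandardCharsets.UTF_8)));

        // A writer using the JVM default preserves it only if the default is UTF-8.
        System.out.println("Default writer preserved text: "
                + sample.equals(writeThenReadAsUtf8(sample, Charset.defaultCharset())));
    }
}
```

If the last line prints false in the environment that runs filter-media, starting that JVM with -Dfile.encoding=UTF-8 would be one thing to try. Again, this is only an assumption on my part about where the text might get damaged.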
Has anybody encountered this issue, and if so, how did you solve it? I looked at XPDF2Text.java, and line 53 does include the UTF-8 encoding ("@COMMAND@", "-q", "-enc", "UTF-8", "@infile@", "-"). I'm wondering why the text is not extracted properly when I run filter-media, yet it is when I run the tools from the command line. In PDFFilter.java, I tried using PDFTextStripper pts = new PDFTextStripper("UTF-8"), but the result is still the same. I would greatly appreciate any hints, tips, suggestions and help.

Thanks in advance and regards,
euler