[Dspace-tech] Issues in Media Filter PDF Text Extractor (PDFFilter and XPDF)

euler Mon, 08 Jun 2015 06:21:59 -0700

Dear All,

I am having issues with text extraction from PDFs that contain non-Latin
characters and East Asian languages. I tried switching from PDFBox's
PDFFilter to XPDF, but it does not extract the text properly either. If I
extract the text using the command-line tools (i.e. java
-jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 <inputfile>
<outputtextfile> for PDFBox and pdftotext -enc UTF-8 for XPDF), the text is
extracted properly.


Has anybody encountered this issue, and if so, how did you solve it? I looked
at XPDF2Text.java, and line 53 does include the UTF-8 encoding
("@COMMAND@", "-q", "-enc", "UTF-8", "@infile@", "-"). I'm wondering why the
text is not extracted properly when I run filter-media, yet it is when I run
the same tools from the command line. In PDFFilter.java, I tried using
PDFTextStripper pts = new PDFTextStripper("UTF-8"), but the result is still
the same.
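
One possible cause, stated here as an untested assumption and not something
confirmed in the DSpace source: the extractor itself may be emitting correct
UTF-8 bytes, but the Java side may be decoding them with the JVM default
charset, which can differ between an interactive shell and the environment
filter-media runs in. The sketch below is plain Java and does not use any
DSpace or PDFBox API. The class name CharsetDemo and the decode helper are
hypothetical; the sketch only shows how the same UTF-8 bytes come out garbled
when decoded with a single-byte charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of DSpace, PDFBox, or XPDF.
class CharsetDemo {

    // Decode raw bytes (for example, bytes captured from a child process's
    // standard output) using an explicitly chosen charset.
    public static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Three CJK characters, written as escapes so the source file's own
        // encoding cannot affect the demonstration.
        String original = "\u65e5\u672c\u8a9e";

        // Stand-in for what an extractor run with "-enc UTF-8" would emit.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);
        String decodedAsLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 decode matches original: "
                + decodedAsUtf8.equals(original));
        System.out.println("ISO-8859-1 decode matches original: "
                + decodedAsLatin1.equals(original));
    }
}
```

If the default charset printed under the account that runs filter-media is
not UTF-8, that would be worth checking first, for instance by starting the
JVM with -Dfile.encoding=UTF-8 and rerunning the filter.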

I would greatly appreciate any hints, tips, suggestions, or help.

Thanks in advance and regards,
euler




