Hi euler

This seems similar to http://dspace.2283337.n4.nabble.com/Character-encoding-issues-in-Discovery-search-results-tp4675835p4675839.html
Perhaps it can help.

euler wrote on 08/06/15 at 15:00:
Dear All,

I am having issues with text extraction from PDFs that contain non-Latin
characters and East Asian languages. I tried switching to xpdf from PDFBox's
PDFFilter, but it does not extract the text properly either. If I extract
the text from the PDF using the command line tools (i.e. java
-jar pdfbox-app-1.8.7.jar ExtractText -encoding UTF-8 <inputfile>
<outputtextfile> for PDFBox and pdftotext -enc UTF-8 for xpdf), the text is
extracted properly.

Has anybody encountered this issue, and how did you solve it? I looked at
XPDF2Text.java, and line 53 does include the UTF-8 encoding
("@COMMAND@", "-q", "-enc", "UTF-8", "@infile@", "-"). I'm wondering why the
text is not extracted properly when I run filter-media, yet it is when I run
the tools from the command line. In PDFFilter.java, I tried using
PDFTextStripper pts = new PDFTextStripper("UTF-8"), but the result is still
the same.
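
One possible explanation, offered only as an assumption and not confirmed anywhere in this thread, is that the bytes produced by the extractor are correct UTF-8 but are later decoded with the JVM's default charset, which can differ between a shell session and the environment in which filter-media runs. The sketch below does not use DSpace, PDFBox or xpdf at all. The class name ExplicitCharsetSketch and the helper readAll are hypothetical names for illustration. It only shows how the same UTF-8 bytes decode correctly with an explicit charset and incorrectly with a mismatched one.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {

    // Hypothetical helper: read a whole stream using an explicitly named
    // charset instead of relying on the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Three East Asian characters, standing in for extracted PDF text.
        String original = "\u65e5\u672c\u8a9e";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset round-trips the text.
        String decodedAsUtf8 =
                readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);

        // Decoding the same bytes as ISO-8859-1 yields mojibake.
        String decodedAsLatin1 =
                readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 round trip intact: " + decodedAsUtf8.equals(original));
        System.out.println("ISO-8859-1 round trip intact: " + decodedAsLatin1.equals(original));

        // The default charset is what an InputStreamReader without an
        // explicit charset argument would use in this JVM.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
    }
}
```

If the default charset printed in the environment that runs filter-media is not UTF-8, that would be consistent with this guess, though it would not prove it. Starting the JVM with -Dfile.encoding=UTF-8 is one way to test the hypothesis.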

Would greatly appreciate any hints, tips, suggestions and help.

Thanks in advance and regards,
euler



--
View this message in context: 
http://dspace.2283337.n4.nabble.com/Issues-in-Media-Filter-PDF-Text-Extractor-PDFFilter-and-XPDF-tp4678283.html
Sent from the DSpace - Tech mailing list archive at Nabble.com.

------------------------------------------------------------------------------
_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette


--
        *Antoine Snyers*
/2888 Loker Avenue East, Suite 315, Carlsbad, CA. 92010/
/Esperantolaan 4, Heverlee 3001, Belgium/
www.atmire.com <http://atmire.com/website/?q=services&utm_source=emailfooter&utm_medium=email&utm_campaign=antoine>
