Re: [Dspace-tech] Issues in Media Filter PDF Text Extractor (PDFFilter and XPDF)

2015-06-08 Thread Antoine Snyers
Hi euler, This seems similar to http://dspace.2283337.n4.nabble.com/Character-encoding-issues-in-Discovery-search-results-tp4675835p4675839.html Perhaps it can help. euler wrote on 08/06/15 at 15:00: Dear All, I am having issues with the text extraction of PDFs having non-Latin characters

[Dspace-tech] Issues in Media Filter PDF Text Extractor (PDFFilter and XPDF)

2015-06-08 Thread euler
Dear All, I am having issues with the text extraction of PDFs having non-Latin characters and East Asian languages. I tried switching from PDFBox's PDFFilter to XPDF, but it is also not properly extracting the text from the PDF. If I tried to extract the text from the PDF using the command line

Re: [Dspace-tech] Issues in Media Filter PDF Text Extractor (PDFFilter and XPDF)

2015-06-08 Thread euler
Hi Antoine, Thanks for the response. I did stumble upon that thread when searching for a solution. What I discovered was that even though the extracted text does not show the proper characters when viewed in the browser, if I download it and open it in a text editor, it is showing the proper