Hi,

Omar Chiyean wrote:
> Hi there...
> I'm new with PDFBox and i'm extracting text from some pdf and letting them
> in a String variable. Now my problem is the latin characters as accentued
> letter are not suited as they would.
>
> How can I set the charset or how can i see the charset returned from the
> TextStripper from PDFBox??
>
> I read it was UTF-16BE but when i get byte code with this charset and
> translate it to ISO-8859-1 i get letter separated with a space and no luck
> with accented letters...
>
> So whats wrong or can you help me to correct this?? I'm using PDFBOX 0.7.3

First of all, I suggest updating to PDFBox 0.8. It includes a lot of
improvements and bug fixes.

Back to your question: you can choose the charset you need before
extraction. Have a look at ExtractText for an example of how to use the
text extraction.
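
As a rough illustration of the "letters separated with a space" symptom from
the question, here is a small sketch that uses only the Java standard library,
not PDFBox. The class name CharsetDemo, its methods, and the sample string are
made up for this example, and it assumes Java 7 or later for StandardCharsets.
It assumes the extracted text is already a correct Java String: encoding it as
UTF-16BE and then reading those bytes as ISO-8859-1 turns every zero high byte
into a NUL character, while encoding the String directly in the target charset
avoids that.

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {

    // Reproduces the approach described in the question: encode as UTF-16BE,
    // then interpret those bytes as ISO-8859-1. Each Latin-1 character becomes
    // two characters, the first being NUL (often displayed as a space).
    static String misdecoded(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    // Encodes the String directly in the target charset, one byte per
    // character for text that fits in ISO-8859-1.
    static byte[] toLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for text returned by the text stripper.
        String extracted = "caf\u00e9";

        System.out.println(misdecoded(extracted).length()); // 8 chars, every other one NUL
        System.out.println(toLatin1(extracted).length);     // 4 bytes
    }
}
```

If the accented letters are already wrong inside the String itself, this
conversion will not repair them; in that case the charset has to be chosen
at extraction time, as ExtractText does.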
BR
Andreas Lehmkühler