On 12/13/2014 08:47 AM, Eric Dodémont wrote:
> I am converting PDF files to fixed-layout ePub files, which mainly means
> converting PDF to HTML.
>
> I noticed something strange.
>
> When converting PDF files produced by Scribus, the HTML displays very well,
> but when I copy/paste selected text, there are no "spaces" in the text!
>
> E.g.:
>
> - On the screen you see: "The red car was behind the house."
> - The copy/paste gives: "Theredcarwasbehindthehouse."
>
> I found a PDF file produced by InDesign, and with that file the problem is
> not there.
>
> After analyzing the fonts embedded in the PDF files with Acrobat, I noticed:
>
> - InDesign: fonts contain <SPACE> (code 20 in hex, 32 in decimal).
> - Scribus: fonts do not contain <SPACE> (code 20 in hex, 32 in decimal).
>
> I am using PDFTron to convert PDF files to ePub files. When I use the
> pdf2htmlEX tool (available for Linux and Windows), the problem is not there.
>
> It seems that PDFTron will only insert a space in the text when the code 32
> is in the text.
>
> How come there are no spaces with code 32 in the PDF produced by Scribus?
>
> I know there are a lot of different spaces: U+0020 SPACE, U+00A0 NO-BREAK
> SPACE, U+2000 EN QUAD 1 en (= 1/2 em), U+2001 EM QUAD 1 em, etc.
>
It may be a bit more complex than you think. The first question is: where are you copying from? I presume you mean highlighting text, then pressing Ctrl+C or some equivalent. You might be better off using the PDF viewer to extract the text.

Next, what encoding are you using? Scribus uses UTF-8, but some other piece of software might use something else.

I just tried this in Fedora by opening a Scribus-generated PDF in Adobe Reader, highlighting text, copying, then pasting into a text editor (in this case Emacs), and saw all the spaces as I'd expect.

Greg
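P.S. One way to see what the pasted text really contains, rather than judging by eye, is to count the space code points in it. The snippet below is only a minimal sketch: the helper name `space_report` is made up for illustration, and it assumes the pasted text has already been decoded into a Python string. It counts every character in the Unicode "space separator" category (Zs), which covers U+0020, U+00A0, U+2000, U+2001 and the other spaces mentioned above.

```python
import unicodedata
from collections import Counter


def space_report(text):
    """Count each Unicode space separator (general category Zs) in text.

    Hypothetical helper for illustration; returns a dict mapping a label
    such as "U+0020 SPACE" to the number of occurrences.
    """
    counts = Counter(ch for ch in text if unicodedata.category(ch) == "Zs")
    return {
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}": n
        for ch, n in counts.items()
    }


# Text as it appears on screen, and text as it came out of the copy/paste.
print(space_report("The red car was behind the house."))  # {'U+0020 SPACE': 6}
print(space_report("Theredcarwasbehindthehouse."))        # {}
```

An empty result means the extraction step produced no space characters at all, while a result listing something other than U+0020 would suggest the spaces survived but use a different code point.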