Justine, I have done some recent work on the pdftohtml utility.  I recommend 
grabbing the latest version from git and building it.  It's not too hard.

The latest version of pdftohtml generates valid XHTML and has the width and 
height in the body div.

Your problem with Arabic may also be fixed in the latest version (make sure 
you use the "complex" option).  I don't see how backwards rendering could 
happen with that version, but if it does, I'd love to know more about it.  I 
would expect it to write out the characters in the same order in which they 
are printed on the page, from left to right.  This could cause issues with 
ligatures, however.
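
To illustrate why writing characters in left-to-right page order can make 
right-to-left words come out reversed, here is a minimal Python sketch.  It is 
not pdftohtml's actual code; the function name visual_to_logical is 
hypothetical, and a plain reversal is a simplification (real bidirectional 
text needs the Unicode Bidirectional Algorithm, and ligatures complicate it 
further).

```python
# Minimal sketch, NOT pdftohtml's implementation: shows how emitting glyphs
# in left-to-right page order reverses a right-to-left word.

def visual_to_logical(chars, rtl=True):
    """Turn characters collected left to right on the page into reading order.

    Assumption: the whole run has a single direction. Mixed-direction text
    would need the Unicode Bidirectional Algorithm instead of a reversal.
    """
    return "".join(reversed(chars)) if rtl else "".join(chars)


# An Arabic word in logical (reading) order: seen, lam, alef, meem.
logical = "\u0633\u0644\u0627\u0645"

# On the page the first letter sits at the far right, so a left-to-right
# sweep over the glyph positions meets the letters in the opposite order.
visual = list(reversed(logical))

# Writing the glyphs out exactly as swept gives a reversed word.
emitted = "".join(visual)

# Reversing the run restores the reading order.
restored = visual_to_logical(visual)
```

Here emitted is the reversed word, and restored matches the original again.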

--josh

From: Justine Guillaumont <[email protected]>
Date: Thu, 22 Sep 2011 04:23:02 -0700
To: "[email protected]" <[email protected]>
Subject: [poppler] poppler util pdftohtml

Hi all,

My name is Justine Guillaumont. I am completing my engineering studies with a 
6-month internship, working on the open-source project WebLab 
(http://weblab-project.org/).
I am currently using poppler-0.16.7 (I tried to install poppler-0.17.4 but 
libpoppler.so.17 is missing).
One of the purposes of my internship is to transform PDF files into XHTML files 
that give the same structured display. To do this, I use 
pdftohtml -nodrm -p -s (to obtain HTML) and then a script and XSL (to obtain 
XHTML). I encountered several problems with pdftohtml that I would like to 
share in order to have your opinion.
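
For readers who want to script the first step of this pipeline, a rough Python 
sketch follows. The flags -nodrm, -p and -s are the ones quoted above; the 
function names and the file name are hypothetical, and mapping the "complex" 
option to -c is an assumption to check against your own pdftohtml -h output.

```python
import subprocess


def build_pdftohtml_command(pdf_path, complex_output=False):
    """Build the argument list for the pdftohtml call quoted in the message.

    Assumption: '-c' selects the "complex" output mode; verify with
    `pdftohtml -h` for the installed version.
    """
    cmd = ["pdftohtml", "-nodrm", "-p", "-s"]
    if complex_output:
        cmd.append("-c")
    cmd.append(pdf_path)
    return cmd


def convert(pdf_path, complex_output=False):
    """Run pdftohtml; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_pdftohtml_command(pdf_path, complex_output), check=True)


if __name__ == "__main__":
    # 'input.pdf' is a placeholder file name.
    print(" ".join(build_pdftohtml_command("input.pdf", complex_output=True)))
```

The XSL step that turns the resulting HTML into XHTML is not shown, since the 
message does not describe it.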

1) Would it be possible to have the width and height on the DIV tag in the 
BODY?
I noticed that we have them with pdftohtml -xml (in the TEXT tags) but not with 
pdftohtml -nodrm -p -s. I tried to modify your code (HtmlOutputDev.cc) but I 
only managed to collect the width and height of the first word of the DIV.

2) The HTML generated by pdftohtml does not validate with the W3C validator 
(http://validator.w3.org/).
This is a pity, because not much needs to change to obtain valid HTML 4 or 
XHTML. If you like, I can send you the XSL I made to transform the HTML 
generated by pdftohtml -p -s into valid HTML 4.

3) With Arabic PDFs, pdftohtml seems to read the PDF correctly (from right to 
left) but to write the HTML backwards (from left to right). All 
words are reversed. Will this be corrected soon?
Please find attached an example of this problem.

Regards,

Justine Guillaumont
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
