Re: [poppler] pdftohtml using Poppler

Josh Richardson Wed, 20 Jul 2011 11:17:37 -0700

You should have read the list sooner - you could have saved some time.  :-)


Rotated text is solved, but hasn't been committed to the freedesktop 
repository.  See https://bugs.freedesktop.org/show_bug.cgi?id=38586 .

Fonts are mostly solved; see 
https://bugs.freedesktop.org/show_bug.cgi?id=39385 .

It works, but you have to extract the fonts separately.  Ideally, we would 
include the font-extraction code in Poppler.  For now you can use MuPDF's 
pdfextract tool to extract the fonts, and FontForge operates on them just 
fine.  MuPDF doesn't seem to be as forgiving as Poppler with ill-formed 
PDF documents, though.
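
As a rough illustration of the FontForge step only (a sketch, not code from 
the patches): the helper names and file names below are hypothetical, and it 
assumes a `fontforge` binary on the PATH and a font file that has already 
been extracted from the PDF.

```python
import subprocess


def fontforge_convert_cmd(src, dst):
    """Build a FontForge command line that opens src and writes dst.

    Uses FontForge's native scripting language: Open($1) loads the first
    argument and Generate($2) writes the second; the output format is
    chosen from the extension of dst (for example .ttf).
    """
    return ["fontforge", "-lang=ff", "-c", "Open($1); Generate($2)", src, dst]


def convert_font(src, dst):
    """Run the conversion; raises CalledProcessError if FontForge fails."""
    subprocess.run(fontforge_convert_cmd(src, dst), check=True)
```

For example, `convert_font("extracted-font.pfb", "extracted-font.ttf")` would 
convert one extracted font, where both file names are placeholders.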

Z-index is a problem I've thought about too, but I haven't had any use cases 
yet, so we haven't tackled it.  I believe that what should be done, as you 
guessed, is that the text and graphics output devices need to be combined into 
a single device.  What makes this a bit tricky is that I believe we have to 
support the "HAVE_SPLASH" preprocessor flag.  So, in order to derive the 
HtmlOutputDev from the SplashOutputDev, it would probably have to be done 
conditionally.  Ok, so you combine them.  Then what?

If you look closely at the patches in the second bug referenced above, you'll 
see that we're keeping track of the bounding box of each drawing operation, 
using a new ImageProperties class.  Then we coalesce them to find the regions 
of the "big background image" to extract into individual html images.  Now, if 
you were to extend that to also keep track of text showing operations, you 
could use the order of those overlapping regions in the list as the z-index.
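
To make the idea concrete, here is a minimal sketch, not the code from the 
patches.  The names `Op`, `overlaps`, `union`, `build_layers` and `z_indices` 
are hypothetical, and it makes a simplifying assumption: an operation is only 
merged into the layer painted immediately before it, so paint order is never 
reordered.

```python
from dataclasses import dataclass


@dataclass
class Op:
    kind: str    # "text" or "graphics"
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def build_layers(ops):
    """Coalesce each op into the previous layer when kinds match and boxes overlap."""
    layers = []
    for op in ops:
        last = layers[-1] if layers else None
        if last is not None and last.kind == op.kind and overlaps(last.bbox, op.bbox):
            last.bbox = union(last.bbox, op.bbox)
        else:
            layers.append(Op(op.kind, op.bbox))
    return layers


def z_indices(ops):
    """The position of each layer in paint order doubles as its z-index."""
    return [(z, layer.kind, layer.bbox) for z, layer in enumerate(build_layers(ops))]
```

With two overlapping graphics operations, then a text run, then another 
graphics operation, this yields three layers: the merged graphics region 
below the text, and the last graphics region above it.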

Currently, the image extraction algorithm isn't using the alpha channel.  You 
would obviously need to fix that.  I don't know if the underlying Poppler 
library starts with a blank canvas (alpha = 0) or with a blank white canvas 
(alpha = 1, color = 0xFFFFFF), but I believe you would need the former, not the 
latter.
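
A small sketch of why the transparent canvas matters, under the assumption 
that the rendered region is available as rows of alpha values (0 to 255); the 
function name `painted_bbox` is hypothetical and this is not Poppler code.  
On a canvas that starts at alpha = 0, every pixel with alpha > 0 was touched 
by a drawing operation, so the painted area can be recovered directly.  On an 
opaque white canvas, an untouched pixel and a pixel painted white look the 
same.

```python
def painted_bbox(alpha):
    """Return (x0, y0, x1, y1) covering all pixels with alpha > 0, or None.

    alpha is a list of rows; each row is a list of alpha values (0..255).
    x1 and y1 are exclusive.
    """
    xs = [x for row in alpha for x, a in enumerate(row) if a > 0]
    ys = [y for y, row in enumerate(alpha) if any(a > 0 for a in row)]
    if not xs:
        return None  # nothing was painted on this canvas
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```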

Hope it helps.  Let us know what you're working on, so that we don't duplicate 
effort and create competing solutions to the same problems.  Btw., what Stephen 
and I are working on now is to fix text spacing.  Now that we're using the 
right font, we're much closer, but there are still a bunch of issues.  
Stephen's already made some great progress, and we'll be submitting patches 
soon.

We look forward to working with you!

Best, --josh

From: Akash Agrawal <[email protected]>
Date: Wed, 20 Jul 2011 05:36:58 -0700
To: "[email protected]" <[email protected]>
Subject: [poppler] pdftohtml using Poppler

Hi All,

My name is Akash Agrawal and I am working on producing a full-fledged PDF to 
HTML solution. I investigated Poppler and made a lot of custom changes for my 
requirements. I got your names from the revision log in the pdftohtml source 
files. I would appreciate it if you could address my queries. I am currently 
stuck on two issues:

 1.  z-index
 2.  Fonts

z-index: In its current implementation, Poppler's pdftohtml puts all the 
non-text data into one image and uses that image as a background image in the 
HTML. But some PDFs have images/graphics over the text, and the current 
solution fails in that case. I looked into the Gfx and OutputDevice code and 
didn't reach a good workable solution for this case. I would be highly 
indebted if you could suggest some pointers.

Fonts: Fonts are the biggest problem here. Currently pdftohtml outputs all 
fonts as Times (the default font), so I fixed that by emitting the exact font 
names (with the tag, because multiple versions of the same font might be 
present in a PDF). I also made non-horizontal text part of the image, because 
rotating the glyphs did not seem like a good idea given the time at hand. I am 
also able to extract the font data, but I am having difficulty extracting 
encoding info such as the cmap. Your pointers on this would be very much 
appreciated. FYI, I am using FontForge to convert the extracted fonts to a 
common format (TTF in my case). I am thinking of applying the cmaps using 
FontForge. Please let me know if you suggest otherwise.

I look forward to a positive response from your side on this, and to a strong 
technical relationship.

Regards,
Akash Agrawal
http://tech-queries.blogspot.com/

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
