Many thanks, Josh. Very interesting to hear you're working on this. Indeed, I tested quite a few things, and there seems to be little that can be done with WebKit letter-spacing (it only has an effect when the difference is large). The bug has been known for years, but strangely enough nothing has been done: https://bugs.webkit.org/show_bug.cgi?id=20606
No issue with font extraction then; I just wondered whether it was normal not to get the otf/woff/eot files, and apparently it is. I would love those scripts. Since I'm working on OS X, I use FontXChange, which works fine but is not a good way to automate this. I'll have a look at the other points then.

--
Clément Wehrung
06 88 10 65 91

On Wednesday, 26 October 2011 at 18:55, Josh Richardson wrote:

> Yes, I'm aware of the Gecko vs. WebKit issue. I have a colleague checking with the WebKit developers; apparently a fix is underway for the decimal issues, but we're unsure when it will be ready. In the meantime, I tried using text-align-last, but WebKit doesn't seem to honor that. I tried text-align-justify, but WebKit seems to never reduce spacing in order to justify, so it breaks differently from the original document.
>
> Currently I'm working on a new option for pdftohtml which will place each word in its own span. While heavy, this should overcome some of WebKit's current limitations and make these pages more usable on Safari/Chrome, etc., although the character-spacing limitation will mean that all the justification will happen between words, which is less ideal than how it will work on Firefox.
>
> I'm not sure exactly what your issue with font extraction is. Font extraction is relatively simple code with no external dependency, so that should be working. I have not built font *conversion* into web-enabled formats (WOFF/TTF) into pdftohtml, because I think FontForge, etc. is more suitable for that particular task. I have a couple of Python scripts to do it, which, if it's acceptable to the Poppler maintainers, I'd be happy to check into the repository.
>
> Best, --josh
>
> From: Clément Wehrung <[email protected]>
> Date: Wed, 26 Oct 2011 08:14:09 -0700
> To: Josh Richardson <[email protected]>
> Cc: Clément Wehrung <[email protected]>, "[email protected]" <[email protected]>, Alec Taylor <[email protected]>
> Subject: Re: [poppler] pdftohtml does not preserve fonts
>
> Sure, but from what I can reproduce, there are (I believe) two issues here:
> 1) Justification is more complicated with WebKit, because optimizeLegibility does not (really) work in WebKit, and because WebKit handles decimal values poorly in word-spacing and not at all in letter-spacing.
> 2) Due to kerning (I can send you a screenshot comparing two texts overlaid in Photoshop) / letter-spacing / word-spacing (?), lines are much longer in WebKit. Hence, if you have for example "footnotes" as in this PDF, they don't end up at the right place in the text (all the more so if you have a PDF from an InDesign export, where there may be "metrics" that cause some text to overlap other text; you can always remove all metrics before exporting to PDF, which avoids part of the issue).
>
> NB: I can't get the extracted fonts to work, but I can send them to you in otf if you want (I don't know whether extraction is failing because of my installation).
>
> PDF file: BugWebkit.pdf (http://cl.ly/0L3g2I1r3G2a0T0o3622)
>
> --
> Clément Wehrung
> 06 88 10 65 91
>
> On Wednesday, 26 October 2011 at 14:35, Clément Wehrung wrote:
>
> > You can understand the issue better here (Firefox vs Safari on Mac/iOS):
> >
> > http://dev.nurves.com/pdf2html/-6.html
> >
> > Cf. footnotes
> >
> > WebKit.png (http://cl.ly/3c1B2V1X2u2C2f0M2L0L)
> > Firefox.png (http://cl.ly/0Q111C3u2g3T2U1D3U2u)
> > --
> > Clément Wehrung
> > 06 88 10 65 91
> >
> > On Wednesday, 26 October 2011 at 14:26, Clément Wehrung wrote:
> >
> > > Hi Josh,
> > >
> > > Thanks for all this. I'm already looking at the code now, but I've run into some issues with WebKit rendering compared to Firefox (where it looks really amazing!). I know WebKit has a bug with letter-spacing (it does not take decimals into account), but there's more to it, since text-rendering:optimizeLegibility; only partly works. I'm trying to see how we could keep text boxes from ending up on top of one another. I can show you some screenshots if you want.
> > >
> > > btw, why have you chosen not to use only the background image for all graphics? Is it in order to achieve some image over text?
> > >
> > > Thanks,
> > >
> > > Clement
> > >
> > > --
> > > Clément Wehrung
> > > 06 88 10 65 91
> > >
> > > On Tuesday, 25 October 2011 at 00:41, Josh Richardson wrote:
> > >
> > > > Ok, sent you a read-only access invitation for now. Thanks for your offer to help. Here is my bigger issues list to give you a flavor: a lot of fun things to do. Let me know what you want to do with pdftohtml!
> > > >
> > > > Translate drawing operations into canvas with SVG
> > > > Find better way to calculate vertical positioning, by looking at browser source code
> > > > z-index handling -- currently text is never masked by graphics
> > > > Algorithmic extraction of TOC
> > > > Algorithmic extraction of page numbering (Alec may be working on this)
> > > > Algorithmic identification of chapters
> > > > Right-to-left text, proper display (e.g. Arabic, Hebrew)
> > > > Algorithmic detection of text flow (Stephen may be working on this)
> > > > Detection / removal of duplicate images
> > > > Jpg vs. png selection; automatically choose the best format for each image
> > > >
> > > > --josh
> > > >
> > > > From: Clément Wehrung <[email protected]>
> > > > Date: Mon, 24 Oct 2011 15:27:23 -0700
> > > > To: Josh Richardson <[email protected]>
> > > > Cc: "[email protected]" <[email protected]>, Alec Taylor <[email protected]>
> > > > Subject: Re: [poppler] pdftohtml does not preserve fonts
> > > >
> > > > Sure! Do you have a link for the repo so that I can already have a look (I couldn't figure out which one it is right now)? I'm really interested in helping you; if you need something on any specific topic, don't hesitate. Many thanks again,
> > > >
> > > > Clément
> > > >
> > > > On Mon, Oct 24, 2011 at 8:01 PM, Josh Richardson <[email protected]> wrote:
> > > > > Can you give me a couple of days? I want to try to get a repo hosted on, e.g., bitbucket, which is connected to my repo, so that it's easier to keep everything in sync. Alec Taylor set up a repo there already, which you can use to get an immediate snapshot if needed.
> > > > >
> > > > > Best, --josh
> > > > >
> > > > > On 10/24/11 10:45 AM, "iclems" <[email protected]> wrote:
> > > > >
> > > > > >Dear Josh,
> > > > > >
> > > > > >As I'm working on a pdftohtml project which requires font preservation, I'd be really interested in getting this too. Do you think it's possible?
> > > > > >
> > > > > >Thanks,
> > > > > >
> > > > > >Clement
> > > > > >[email protected]
> > > > > >
> > > > > >Josh Richardson wrote:
> > > > > >>
> > > > > >> Preserving fonts is not integrated into the master repository yet. If you like, I can send you a patched version of Poppler which will do it. You'll still have to run your own process (like FontForge) to convert the fonts into a web-usable format, but it's straightforward as long as the fonts have a mapping to Unicode, and doable even without.
> > > > > >>
> > > > > >> --josh
> > > > > >>
> > > > > >> From: M Naveed Akram <[email protected]>
> > > > > >> Date: Fri, 30 Sep 2011 06:52:14 -0700
> > > > > >> To: "[email protected]" <[email protected]>
> > > > > >> Subject: [poppler] pdftohtml does not preserve fonts
> > > > > >>
> > > > > >> Hi,
> > > > > >>
> > > > > >> I have been using the 0.16 release of poppler-utils, but I am facing a problem. When converting PDF to HTML using pdftohtml, it does not preserve fonts in the output HTML. How can I solve this issue? Please help.
> > > > > >>
> > > > > >> _______________________________________________
> > > > > >> poppler mailing list
> > > > > >> [email protected]
> > > > > >> http://lists.freedesktop.org/mailman/listinfo/poppler
> > > > > >>
> > > > > >
> > > > > >--
> > > > > >View this message in context: http://old.nabble.com/pdftohtml-does-not-preserve-fonts-tp32569116p32712084.html
> > > > > >Sent from the Free Desktop - poppler mailing list archive at Nabble.com.
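For reference, the font-conversion step discussed in this thread (turning the fonts extracted by pdftohtml into WOFF/TTF with FontForge, rather than by hand with FontXChange) could probably be automated along the following lines. This is only a sketch and not Josh's scripts: it assumes the `fontforge` command-line tool is on the PATH, that the extracted fonts sit in a single directory with one of the extensions listed in the code, and the names `build_convert_command` and `convert_fonts` are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path

# FontForge native-script one-liner: open the first argument, write the second.
# FontForge picks the output format from the output file's extension.
FF_SCRIPT = "Open($1); Generate($2)"

# Assumption: extensions the extracted font files might carry.
SOURCE_EXTENSIONS = (".otf", ".ttf", ".pfa", ".pfb")


def build_convert_command(src, fmt, fontforge_bin="fontforge"):
    """Return the argv list that converts `src` into `fmt` ("woff" or "ttf")."""
    if fmt not in ("woff", "ttf"):
        raise ValueError("unsupported target format: %s" % fmt)
    src = Path(src)
    dst = src.with_suffix("." + fmt)
    return [fontforge_bin, "-lang=ff", "-c", FF_SCRIPT, str(src), str(dst)]


def convert_fonts(font_dir, formats=("woff", "ttf"), fontforge_bin="fontforge"):
    """Convert every font file found in `font_dir`; return the commands that ran."""
    if shutil.which(fontforge_bin) is None:
        raise RuntimeError("%s not found on PATH" % fontforge_bin)
    commands = []
    for src in sorted(Path(font_dir).iterdir()):
        if src.suffix.lower() not in SOURCE_EXTENSIONS:
            continue
        for fmt in formats:
            if src.suffix.lower() == "." + fmt:
                continue  # the file is already in this format
            cmd = build_convert_command(src, fmt, fontforge_bin)
            subprocess.run(cmd, check=True)  # raises if FontForge fails
            commands.append(cmd)
    return commands
```

Calling `convert_fonts("fonts")` would then write a `.woff` and a `.ttf` file next to each extracted font. EOT is deliberately left out of this sketch, and fonts without a Unicode mapping may still need manual attention, as Josh notes.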
