rendering HTML to PDF with Flying Saucer, Jython, and HTML Tidy

Kragen Javier Sitaker Sun, 03 Oct 2010 21:44:38 -0700

Flying Saucer (http://xhtmlrenderer.dev.java.net/) includes code to
render XHTML documents to PDF, using CSS stylesheets to precisely
control the document's appearance. For me, this is a big improvement
over either a manual-layout word processor or TeX.


This is a quick Jython script I wrote to integrate Flying Saucer with
HTML Tidy, so that I didn't have to write the document in XHTML
directly, and with some fonts I wanted to use in my document.

I'm using it with Flying Saucer R8pre2.  I haven't yet tried the final
R8 release, but I imagine it's compatible.

#!/usr/bin/jython
# -*- coding: utf-8 -*-
"""Render an HTML file to PDF using Flying Saucer.

This program has the following advantages over
`org.xhtmlrenderer.simple.PDFRenderer`, the example that comes with
Flying Saucer:

- it is written in Jython rather than Java, so it should be simpler to
  read, understand, and reuse;
- courtesy of shelling out to HTML Tidy, it takes HTML input rather
  than XHTML;
- it imports some fonts from `fontdir`.

You still have to set `CLASSPATH` to include Flying Saucer and iText,
though.  I invoke it from the `xhtmlrenderer` (Flying Saucer)
top-level directory as follows:

    time CLASSPATH='build/classes:lib/itext-paulo-155.jar' \
        wherever/pdfwithfonts.py input.html > output.pdf


Or from somewhere with the jar files:

    time CLASSPATH=core-renderer.jar:itext-paulo-155.jar \
        ./pdfwithfonts.py input.html > output.pdf

One weird problem is that it doesn’t automatically build a PDF index
(the bookmark outline) from the headings of the HTML file.  Flying
Saucer expects some invalid HTML like this in `<head>` to tell it what
you want to put in your PDF index:

    <bookmarks>
        <bookmark name='1. Foo bar baz' href='#1'>
          <bookmark name='1.1 Baz quux' href='#1.2'>
          </bookmark>
        </bookmark>
        <bookmark name='2. Foo bar baz' href='#2'>
          <bookmark name='2.1 Baz quux' href='#2.2'>
          </bookmark>
        </bookmark>
    </bookmarks>

I may look into building these bookmarks automatically.
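
A minimal sketch of what that could look like, in plain Python.  The
name `bookmarks_xml` and the `(level, title, anchor)` tuples are made
up for this sketch and are not part of Flying Saucer; collecting the
headings from the document is left out, and `xml.sax.saxutils` is
assumed to be available (it is in CPython; Jython 2.1 is untested):

```python
from xml.sax.saxutils import quoteattr

def bookmarks_xml(headings):
    # `headings` is a list of (level, title, anchor) tuples in document
    # order, e.g. (1, '1. Foo bar baz', '1') for <h1 id="1">.
    # (Hypothetical input format; gathering it is not shown here.)
    lines = ['<bookmarks>']
    open_levels = []    # heading levels of the <bookmark>s still open

    def close_one():
        open_levels.pop()
        lines.append('  ' * (len(open_levels) + 1) + '</bookmark>')

    for level, title, anchor in headings:
        # A heading closes every open bookmark at its level or deeper.
        while open_levels and open_levels[-1] >= level:
            close_one()
        lines.append('  ' * (len(open_levels) + 1)
                     + '<bookmark name=%s href=%s>'
                     % (quoteattr(title), quoteattr('#' + anchor)))
        open_levels.append(level)

    # Close whatever is still open at the end of the document.
    while open_levels:
        close_one()
    lines.append('</bookmarks>')
    return '\n'.join(lines)

if __name__ == '__main__':
    print(bookmarks_xml([(1, '1. Foo bar baz', '1'),
                         (2, '1.1 Baz quux', '1.1'),
                         (1, '2. Foo bar baz', '2')]))
```

The resulting text would still have to be spliced into `<head>` of the
XHTML that Tidy produces, before handing it to the renderer.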

I’d also like to be able to compile this program with `jythonc` so I
can run it without Jython installed, but I can’t figure out how.

"""

import org.xhtmlrenderer.pdf, com.lowagie.text.pdf, java.io, os, sys

# I’m developing this with Jython 2.1.
try:
    True
except NameError:
    True, False = 1, 0

# BROKEN AND NOT USED; see comment.
def add_font_directory(resolver, dirname, embedded=True):
    """Add a directory of .afm and corresponding .pfb files.

    This doesn’t work, because the ITextFontResolver isn’t
    discriminating enough, so if you add all of these fonts, your
    Nimbus Sans L bold text ends up as Nimbus Sans L 'italic' (really
    oblique), your Nimbus Sans L regular text ends up as Nimbus Sans L
    condensed, your URW Palladio L regular text ends up as URW
    Palladio L bold, and your URW Palladio L bold text ends up as URW
    Palladio L italic (which is really quite a nice font, but not what
    you asked for).

    If you just add the four files containing the specific fonts you
    need, things work better.  See `add_fonts` for that.

    """

    encoding = com.lowagie.text.pdf.BaseFont.CP1252 # Is this right?

    for fileobj in java.io.File(dirname).listFiles():
        if fileobj.name.lower().endswith('.afm'):
            path = fileobj.absolutePath
            resolver.addFont(path, encoding, embedded, path[:-4] + '.pfb')

class TidyFailed(Exception): pass

def tidy_file(infile, outfile):
    "Invoke HTML Tidy to generate XHTML.  Not suitable for malicious input."

    tidy_rv = os.system('tidy -utf8 -asxhtml "%s" > "%s"' % (infile, outfile))

    success       = 0
    tidy_warnings = 1    # see tidy(1) man page, section "EXIT STATUS"

    if tidy_rv not in [success, tidy_warnings]:
        raise TidyFailed(tidy_rv)

fontdir = "/usr/share/fonts/type1/gsfonts/"

def add_fonts(font_resolver):
    """Imports a couple of specific fonts I like to use from `fontdir`.

    Although the documentation for Flying Saucer says that only
    TrueType fonts can be added to the font resolver, it actually
    supports `.afm` files with associated `.pfb` files as well.

    """
    # This version doesn’t work (see comments on add_font_directory):
    # add_font_directory(font_resolver, fontdir)

    encoding = com.lowagie.text.pdf.BaseFont.CP1252 # is this right?

    fontnames = ["n019003l",        # Nimbus Sans L regular
                 "n019004l",        # Nimbus Sans L bold
                 "p052003l",        # URW Palladio L regular
                 "p052004l",        # URW Palladio L bold
                ]

    for fontname in fontnames:
        font = fontdir + fontname

        # True here is embedded=True, i.e. embed the font in the .pdf
        font_resolver.addFont(font + ".afm", encoding, True, font + ".pfb")

def main(infile, outfile):
    """Render HTML from filename `infile` to a PDF on the file obj `outfile`.

    Interpolates `infile` into a command line and then writes a
    temporary file `tmp.xhtml` in the current directory, so don’t use
    it where an attacker could supply the input filename or control
    the current working directory.

    """
    tmpname = 'tmp.xhtml'
    tidy_file(infile, tmpname)

    r = org.xhtmlrenderer.pdf.ITextRenderer()
    add_fonts(r.fontResolver)

    # XXX can’t just `r.document =` because that calls setDocument(String)
    r.setDocument(java.io.File(tmpname))
    r.layout()
    r.createPDF(outfile)

if __name__ == '__main__':
    main(sys.argv[1], sys.stdout)
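
The `tidy_file` function above interpolates the file names into a
shell command line, which is why its docstring warns about malicious
input.  A minimal sketch of a variant that avoids the shell, assuming
a Python that has the `subprocess` module (CPython 2.4 or later;
Jython 2.1, which the script above targets, does not have it).  The
names `tidy_command` and `tidy_file_noshell` are made up for this
sketch:

```python
import subprocess

# 0 = success, 1 = warnings; see tidy(1) man page, section "EXIT STATUS"
TIDY_OK = (0, 1)

def _as_path(name):
    # A name starting with '-' would be parsed as an option;
    # './-x.html' names the same file without that ambiguity.
    if name.startswith('-'):
        return './' + name
    return name

def tidy_command(infile, outfile):
    # -o names the output file, so no shell redirection is needed.
    return ['tidy', '-utf8', '-asxhtml',
            '-o', _as_path(outfile), _as_path(infile)]

def tidy_file_noshell(infile, outfile):
    # The argument list goes straight to the program, so quotes,
    # spaces, or `;` in the file names never reach a shell.
    rv = subprocess.call(tidy_command(infile, outfile))
    if rv not in TIDY_OK:
        raise RuntimeError('tidy exited with status %r' % (rv,))
```

This only removes the shell from the picture; the fixed `tmp.xhtml`
name in the current directory is a separate problem that it does not
address.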