Re: LyX to Kindle, points to remember

Steve Litt Tue, 06 Dec 2011 09:28:53 -0800

On Tuesday, December 06, 2011 04:57:07 AM Guenter Milde wrote:
> On 2011-12-04, Steve Litt wrote:
> > Hi all,
> > 
> > I'm making this thread in hopes that everyone who knows something
> > on the subject will add something, and at the end we'll all know
> > how to go from LyX to Kindle.
> > 
> > I'm quickly coming to the conclusion that the best way to do it
> > is to write a post-processor for Alex's HTML. Simple, modular,
> > and I can do it myself (and obviously make it free software).
> > The post-processor might need to read the LyX file itself to get
> > a few bits of information not in Alex's HTML file.
> > 
> > So far I've discovered that LyX pagebreaks don't translate to
> > Kindle page breaks (start at top of reader). Therefore, to every
> > <h1> should be added the property style="page-break-before:
> > always;". That way every chapter starts at the top of the
> > reader.
> 
> I think this could/should be done better in a global CSS rule.


Very possibly. I know so little about CSS it didn't even occur to me.
I'm a parse and change kinda guy, so that's just what I did. For
later, how do I write a CSS rule that makes every <h1> start on a new
page? The more I think about it, putting this in CSS would be
better because the author can change it without changing Python code.
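
A minimal sketch of the global-rule idea, assuming the HTML has a
</head> tag and that the Kindle conversion honors page-break-before
from a stylesheet the same way it does from a style attribute (not
verified here). The names PAGE_BREAK_CSS and add_global_css are
made up for illustration:

```python
import re

# Global rule: every <h1> starts on a new page.
# (Hypothetical constant name, used only in this sketch.)
PAGE_BREAK_CSS = "h1 { page-break-before: always; }"


def add_global_css(html, css=PAGE_BREAK_CSS):
    """Return html with a <style> element inserted just before </head>."""
    style = '<style type="text/css">\n' + css + "\n</style>\n"
    # Touch only the first closing head tag, whatever its letter case.
    new_html, count = re.subn(
        r"</head>",
        lambda m: style + m.group(0),
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        raise ValueError("no </head> tag found")
    return new_html
```

With the rule in one place, an author who wants different behavior
edits the CSS text rather than the code that walks the tags.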

> 
> > I assume the official interpreter of the LyX project is Python,
> > and if that assumption's true I'll make the postprocessor in
> > Python. This post processing will be a heck of a lot easier if
> > someone can point me to a good XML or HTML parser for Python, so
> > I can look at nodes and attributes instead of trying to parse
> > tags. I've noticed Alex's HTML appears very standard, with
> > matching start and end tags and the like. This should
> > theoretically make my job easier. So anyone have any suggestions
> > for HTML or XML parsers for Python?
> 
> beautifulsoup http://pypi.python.org/pypi/BeautifulSoup/3.2.0
> and the standard library xml.* submodules:
> 
> xml.dom
> xml.dom.minidom
> xml.dom.pulldom
> xml.etree.ElementTree
> xml.parsers.expat
> xml.sax
> xml.sax.handler
> xml.sax.saxutils
> xml.sax.xmlreader

I've chosen HTMLParser because:

1) Most widely available documentation
2) Seems to be native Python
3) Event driven, no need to build dom tree
4) Relatively easy
5) It works

#3 was important to me because if somebody has a 400K-word book with
lots of little paragraphs and lots of character styles in each
paragraph, that whole albatross won't need to be in RAM at one time.
Of course such a book would take a long time to parse, but someone
with a book that size is probably used to things taking a long time.
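
A rough sketch of what the event-driven approach can look like, not
the actual postprocessor. It uses the Python 3 module html.parser
(in Python 2 the module was named HTMLParser). The class name
H1PageBreaker and the function add_page_breaks are made up for
illustration, and the sketch assumes it is fine to replace any
existing style attribute on <h1>:

```python
from html import escape
from html.parser import HTMLParser

PAGE_BREAK_STYLE = "page-break-before: always;"


class H1PageBreaker(HTMLParser):
    """Copy HTML to `out` event by event, adding a style to each <h1>."""

    def __init__(self, out):
        # Keep entities as separate events so they are copied unchanged.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        if tag != "h1":
            # Any other start tag is copied exactly as it was written.
            self.out.write(self.get_starttag_text())
            return
        # Assumption: an existing style attribute on <h1> is replaced.
        attrs = [(k, v) for k, v in attrs if k != "style"]
        attrs.append(("style", PAGE_BREAK_STYLE))
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.write("<" + " ".join(parts) + ">")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.write("<?%s>" % data)


def add_page_breaks(src, dst, chunk_size=64 * 1024):
    """Stream text from file object src to dst one chunk at a time."""
    parser = H1PageBreaker(dst)
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
```

Because feed() is called once per chunk and every event is written
out right away, only the current chunk plus any unfinished tag sits
in memory, however long the book is.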

From what I hear, lxml is by far the fastest, but my understanding is
it parses to a DOM tree. According to
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
HTMLParser is one of the faster ones, though nowhere near as fast as
lxml, and it's fairly easy on RAM because it outputs events, not a
DOM tree.

My postprocessor is a proof of concept -- its design decisions can be 
changed later.

[clip]
 
> Keep up the work.

Thank you!

SteveT
 
Steve Litt
Author: The Key to Everyday Excellence
http://www.troubleshooters.com/bookstore/key_excellence.htm
Twitter: http://www.twitter.com/stevelitt
