On Tuesday, December 06, 2011 04:57:07 AM Guenter Milde wrote:
> On 2011-12-04, Steve Litt wrote:
> > Hi all,
> >
> > I'm making this thread in hopes that everyone who knows something
> > on the subject will add something, and at the end we'll all know
> > how to go from LyX to Kindle.
> >
> > I'm quickly coming to the conclusion that the best way to do it
> > is to write a post-processor for Alex's HTML. Simple, modular,
> > and I can do it myself (and obviously make it free software).
> > The post-processor might need to read the LyX file itself to get
> > a few bits of information not in Alex's HTML file.
> >
> > So far I've discovered that LyX pagebreaks don't translate to
> > Kindle page breaks (start at top of reader). Therefore, to every
> > <h1> should be added the property style="page-break-before:
> > always;". That way every chapter starts at the top of the
> > reader.
>
> I think this could/should be done better in a global CSS rule.

Very possibly. I know so little about CSS that it didn't even occur to
me. I'm a parse-and-change kinda guy, so that's just what I did. For
later, how do I write a CSS rule that makes every <h1> start on a new
page? The more I think about it, the better putting this in CSS looks,
because the author can change it without changing Python code.

> > I assume the official interpreter of the LyX project is Python,
> > and if that assumption's true I'll make the postprocessor in
> > Python. This post processing will be a heck of a lot easier if
> > someone can point me to a good XML or HTML parser for Python, so
> > I can look at nodes and attributes instead of trying to parse
> > tags. I've noticed Alex's HTML appears very standard, with
> > matching start and end tags and the like. This should
> > theoretically make my job easier. So anyone have any suggestions
> > for HTML or XML parsers for Python?
>
> beautifulsoup http://pypi.python.org/pypi/BeautifulSoup/3.2.0
> and the standard library xml.* submodules:
>
> xml.dom
> xml.dom.minidom
> xml.dom.pulldom
> xml.etree.ElementTree
> xml.parsers.expat
> xml.sax
> xml.sax.handler
> xml.sax.saxutils
> xml.sax.xmlreader

I've chosen HTMLParser because:

1) Most ubiquitous documentation
2) Seems to be native Python
3) Event driven, no need to build dom tree
4) Relatively easy
5) It works

#3 was important to me: if somebody has a 400K word book with lots of
little paragraphs and lots of character styles in each paragraph, that
whole albatross won't need to be in RAM at one time. Of course such a
book would take a long time to parse, but probably someone with a book
that size is used to things taking a long time. From what I hear, lxml
is by far the fastest, but my understanding is it parses to a dom tree.
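
To make the event-driven idea concrete, here is a minimal sketch of such a pass, not the actual postprocessor. It assumes Python 3 (where the module is html.parser) and reasonably well-formed input; the names H1PageBreaker and add_h1_page_breaks are made up for illustration. It writes each event straight back out and only touches <h1> start tags, so no tree is ever built.

```python
import io
from html import escape
from html.parser import HTMLParser

PAGE_BREAK = "page-break-before: always;"


class H1PageBreaker(HTMLParser):
    """Hypothetical sketch: re-emit HTML event by event, styling each <h1>."""

    def __init__(self, out):
        # convert_charrefs=False keeps entities such as &amp; as separate
        # events, so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.write = out.write  # any file-like object with write()

    @staticmethod
    def _with_page_break(attrs):
        """Return attrs with the page-break rule merged into style."""
        attrs = list(attrs)
        for i, (name, value) in enumerate(attrs):
            if name == "style":
                existing = (value or "").strip().rstrip(";")
                if "page-break-before" in existing:
                    return attrs  # already present, leave it alone
                merged = existing + "; " + PAGE_BREAK if existing else PAGE_BREAK
                attrs[i] = (name, merged)
                return attrs
        attrs.append(("style", PAGE_BREAK))
        return attrs

    def _emit_tag(self, tag, attrs, closer=">"):
        # The parser hands over unescaped attribute values, so escape them
        # again on the way out. Tag names arrive lowercased.
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.write("<" + " ".join(parts) + closer)

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            attrs = self._with_page_break(attrs)
        self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.write("</%s>" % tag)

    def handle_data(self, data):
        self.write(data)

    def handle_entityref(self, name):
        self.write("&%s;" % name)

    def handle_charref(self, name):
        self.write("&#%s;" % name)

    def handle_comment(self, data):
        self.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.write("<!%s>" % decl)

    def handle_pi(self, data):
        self.write("<?%s>" % data)


def add_h1_page_breaks(html_text):
    """Convenience wrapper for small inputs; returns the rewritten HTML."""
    out = io.StringIO()
    parser = H1PageBreaker(out)
    parser.feed(html_text)
    parser.close()
    return out.getvalue()
```

For a big book, the same class could be given an open output file and fed the input a chunk at a time with feed(), so neither the input nor the output has to sit in RAM whole. The global-rule alternative suggested above would be one line of CSS, h1 { page-break-before: always; }, placed in a <style> element or a linked stylesheet; whether a particular Kindle converter honors it is something to test rather than assume.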

From what I hear
(http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/),
HTMLParser is one of the faster ones, though nowhere near as fast as
lxml, and it's fairly easy on RAM because it outputs events, not a dom
tree.

My postprocessor is a proof of concept -- its design decisions can be
changed later.

[clip]

> Keep up the work.

Thank you!

SteveT

Steve Litt
Author: The Key to Everyday Excellence
http://www.troubleshooters.com/bookstore/key_excellence.htm
Twitter: http://www.twitter.com/stevelitt
