Re: eLyXer for Document Parsing

slitt Sat, 04 Feb 2012 10:08:30 -0800

On Sat, 4 Feb 2012 10:03:00 -0700
Rob Oakes <lyx-de...@oak-tree.us> wrote:


> Dear eLyXer Users and Developers,
> 
> I'm still at work on the import/export module for Microsoft Word
> documents. I'm making pretty good progress. I've got a rough
> prototype that works pretty well and I'm now starting to refine it.
> 
> My approach up to now has been to use regular expressions to match
> portions of the document and then use a library to translate those to
> the corresponding Word XML structures. It's working pretty well with
> my simple test documents.
> 
> Before going too far with this approach, though, I wanted to post
> another general query.
> 
> In the eLyXer library, there is already a robust set of tools used
> for converting LyX documents to HTML. Does anyone know if the library
> is written in such a way that getting a generic in-memory
> representation of the document would be possible? It would be awesome
> to re-use as much existing code for the Word document export as
> possible. That would allow me to support a broader number of
> features, and gives me a framework for working with maths.
> 
> Any thoughts Alex (and others)? I've downloaded the sources and have
> begun to work through them, but before spending hours to days trying
> to wrap my head around them, I thought I would ask.


This is obviously an Alex question, so I'll go ahead and answer it :-)

Not only possible but easy if you do things the Steve Litt way. eLyXer
quickly punches out HTML that's clean enough to read with an XML
parser, I think. So, eLyXer converts to HTML, and then your program's
DOMbuilder module converts that HTML to in-memory DOM. No muss, no
fuss, no bother, no picking apart eLyXer code (it's big and not
immediately obvious, not a single weekend task).
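
A minimal sketch of that second step, assuming the HTML really is
well-formed XML (I haven't verified that for every document). The
sample string and the helper names are made up for illustration; they
are not actual eLyXer output or part of its API. It uses only Python's
standard xml.dom.minidom:

```python
from xml.dom import minidom

# Hypothetical stand-in for eLyXer output; real output will differ.
SAMPLE_HTML = """<html><body>
<h1 class="title">My Book</h1>
<div class="section"><h2>Intro</h2><p>First paragraph.</p></div>
</body></html>"""


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def headings(html_text):
    """Build an in-memory DOM and return (tag, text) for each heading."""
    dom = minidom.parseString(html_text)
    found = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE and node.tagName in ("h1", "h2", "h3"):
            found.append((node.tagName, text_of(node)))
        for child in node.childNodes:
            walk(child)

    walk(dom.documentElement)
    return found


if __name__ == "__main__":
    for tag, text in headings(SAMPLE_HTML):
        print(tag, text)
```

If the HTML contains named entities such as &nbsp; without a DTD that
declares them, a strict XML parser will reject it, so that is worth
checking on a real document before committing to this route.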

One more question: You sure you want to go in-memory? What happens if a
guy has a 1200 page book with 100 chapters, each containing 10 sections,
each containing 10 subsections, and tries to parse it on a machine with
512 MB RAM? You in a heap of trouble, son. He'll be swapped halfway into
the next century. If instead you used an event parser (e.g. SAX) with a
few stacks, it will probably be slower, and it will be much harder to
write, but for practical purposes there won't be an upper limit on input
file size.
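
Here's a rough sketch of the event-parser idea using Python's standard
xml.sax, again assuming the HTML is well-formed XML. The handler class,
the outline() helper and the sample input are hypothetical names for
illustration only. One stack tracks the currently open elements, and
nothing resembling a full tree is ever built:

```python
import xml.sax

HEADING_TAGS = ("h1", "h2", "h3")


class OutlineHandler(xml.sax.ContentHandler):
    """Collect headings from a stream of parse events."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the currently open elements
        self.buffer = None   # text pieces of the heading being read, else None
        self.outline = []    # (tag, nesting depth, text) tuples

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name in HEADING_TAGS:
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name in HEADING_TAGS and self.buffer is not None:
            self.outline.append((name, len(self.stack), "".join(self.buffer)))
            self.buffer = None
        self.stack.pop()


def outline(xml_text):
    """Stream-parse the text and return the collected headings."""
    handler = OutlineHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.outline


if __name__ == "__main__":
    sample = "<html><body><h1>My Book</h1><div><h2>Intro</h2><p>Text.</p></div></body></html>"
    for tag, depth, text in outline(sample):
        print(tag, depth, text)
```

This sketch still keeps the list of headings in memory, which is small.
A real converter would write its output as each event arrives, and for
a large file on disk it would call xml.sax.parse() on the file rather
than load the whole text into a string first.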

SteveT
