Re: eLyXer for Document Parsing

Rob Oakes Sun, 05 Feb 2012 08:49:27 -0800

On Feb 5, 2012, at 2:04 AM, Abdelrazak Younes wrote:

> Strong suggestion: use LyX proper. I am quite sure you already know that 
> because I saw some patches from you in this area but I'll explain anyway: 
> LyX's html own export is so good and fast because it effectively knows the 
> in-memory representation of the document. You can't be faster nor more 
> accurate than that. I mean, unless you want to rewrite LyX in python.


Extremely good point. I'm also more comfortable with the HTML export available 
in LyX. I was initially interested in eLyXer because I thought I might be able 
to use it to help with an import filter as well. I'm not sure that it can, 
though. As you note in your email, it doesn't create a document model.

> IIUC you want a single module in python for both import and export in python. 
> But I don't think this is a valid argument. As for the word to lyx format 
> conversion, if you want to use this epub library there must be a way to use 
> that in C++ I'm sure…

I thought about using Python because I'd found a tool capable of generating docx 
for me. After working with it a little more, though, I'm less enamored with it. 
docx is a pretty straightforward file format, and there are quite a few things 
in the tool that are sloppily implemented.

> AFAIK, eLyXer doesn't construct a document model. So you'd better spend this 
> time reading the C++ code for exporting to html/xhtml ;-)

Following Steve's suggestion, I decided to try the "easy" way and directly 
parse the XHTML created by eLyXer. It turns out that it's not only easy, but 
probably the best way forward. There are some excellent libraries for reading 
XML in Python. Using lxml, in particular, looks like a good solution. You 
generate the XHTML, parse it with lxml, and then iterate over the elements, 
translating as you go. My current script is about 50 lines long, and can be 
used with either LyX's native XHTML output or eLyXer's. To add new features, 
you add additional cases describing how to translate the XHTML.
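
To illustrate the parse-and-translate loop described above, here is a minimal 
sketch. It is not the actual 50-line script. It assumes the standard library's 
xml.etree.ElementTree as a stand-in for lxml (lxml.etree offers a largely 
compatible API), and the sample document, the handler names, and the 
(kind, text) output tuples are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# XHTML elements live in this namespace; ElementTree prefixes every tag with it.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical sample input, standing in for XHTML exported by LyX or eLyXer.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Introduction</h1>
    <p>First paragraph with <em>emphasis</em>.</p>
    <table><tr><td>ignored</td></tr></table>
  </body>
</html>"""


def text_of(element):
    # Collect all text inside the element, including text of nested tags.
    return "".join(element.itertext()).strip()


def translate_heading(element):
    return ("heading", text_of(element))


def translate_paragraph(element):
    return ("paragraph", text_of(element))


# One entry per supported XHTML tag (assumed mapping).
# Supporting a new feature means adding another case here.
HANDLERS = {
    "h1": translate_heading,
    "p": translate_paragraph,
}


def translate(xhtml):
    root = ET.fromstring(xhtml)
    body = root.find(XHTML_NS + "body")
    output = []
    for element in body:
        tag = element.tag.replace(XHTML_NS, "")
        handler = HANDLERS.get(tag)
        if handler is not None:  # tags without a handler are skipped
            output.append(handler(element))
    return output


result = translate(SAMPLE)
```

In this sketch the table is silently dropped because no handler is registered 
for it. A real converter would probably want to log or flag unsupported 
elements rather than skip them.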

Which brings us to an important point: there's already a pretty good LyX -> 
XHTML -> LibreOffice -> Word pathway for translating documents. Unless I 
directly implement Word as another backend (which, while a lot of work, isn't 
difficult), I'm not sure there's much reason for a direct MS Word export. The 
real need seems to be for an MS Word import, anyway.

Cheers,

Rob
