Re: [NTG-context] Doc to ConTeXt [was Re: HTML to ConTeXt]

Andrea Valle Fri, 09 Nov 2007 18:08:52 -0800

Hi to all (Idris, in particular, as we are always dealing with the same problems...),

I just want to share some thoughts about the ol' damn' problem of converting to ConTeXt from Word et al.

As I told Andrea: For relatively simple documents (like the kind we use in
academic journals) it seems we can now

1) convert doc to odt using OOo
2) convert odt to markdown using

As suggested by Idris, I subscribed to the pandoc list, but I have to say that the activity there is not exactly like that on the ConTeXt list... So the actual support for ConTeXt conversion is not convincing. Moreover, it's always better to do the work hands-on, on your own machine...

My problem is to convert a series of academic journals to ConTeXt. They come from the Humanities, so there is little structure (basically, main body and footnotes). Far be it from me to automate everything; I'd just like to be faster and more accurate in the conversion. (I have no particular interest in figures, as they are few, and not much in references: they tend to be typographically inconsistent if done in a WYSIWYG environment, and so are difficult to parse.)

Moreover, as the journal has already been published, we need to work with the final PDFs.

After wasting my time with an awful PDF-to-HTML converter by Acrobat, I discovered this, which you may all know:

http://pdftohtml.sourceforge.net/

The HTML conversion is very, very good, both in the resulting rendering and in the sources, but after some tweaking I got interested in the XML conversion it also allows. The XML format essentially encodes the information related to the page; typically each line is an element. Plus, bold and italics are marked simply as <b> and <i>.

I'm still struggling to understand how to do something really practical with XML processing in ConTeXt, so I switched back to Python.

I used an incremental SAX parser with some replacements.
This is today's draft.
Original:
http://www.semiotiche.it/andrea/membrana/02%20imp.pdf

Recomposed (no setup at all, only \enableregime[utf]):
http://www.semiotiche.it/andrea/membrana/02imp.pdf
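
For readers who want to try something similar, here is a minimal sketch of the incremental SAX idea. It is not the script used for the files above. It assumes the XML contains one <text> element per line, with <b> and <i> children, as described earlier; the names LineHandler and parse_incrementally are made up for illustration, and TeX special characters are not escaped.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class LineHandler(ContentHandler):
    """Collect one ConTeXt-marked string per <text> element (assumed layout)."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buf = None  # None while we are outside a <text> element

    def startElement(self, name, attrs):
        if name == "text":
            self._buf = []
        elif self._buf is not None and name == "b":
            self._buf.append("{\\bf ")
        elif self._buf is not None and name == "i":
            self._buf.append("{\\em ")

    def endElement(self, name):
        if name == "text":
            self.lines.append("".join(self._buf))
            self._buf = None
        elif self._buf is not None and name in ("b", "i"):
            self._buf.append("}")

    def characters(self, content):
        # Character data may arrive in several pieces; just accumulate it.
        if self._buf is not None:
            self._buf.append(content)


def parse_incrementally(chunks):
    """Feed the XML to the parser piece by piece and return the lines."""
    parser = xml.sax.make_parser()
    handler = LineHandler()
    parser.setContentHandler(handler)
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return handler.lines
```

Feeding the parser in chunks means a large converted file never has to be held in memory as a tree; the handler only keeps the current line buffer and the list of finished lines.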

pdf --> pdftoxml --> xml --> python script --> tex --> pdf

I recovered paragraphs, bold, emphasis and footnotes, stripping dashes and reassembling the text with footnote references. Not bad as a first step.
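
The dash stripping and footnote reassembly could look roughly like the sketch below. Again, this is only an illustration under stated assumptions, not the actual script: join_lines and insert_footnotes are hypothetical helpers, the rule for removing end-of-line hyphens is a simple heuristic, and the [[n]] footnote marker is an invented placeholder for however the references are tagged in an earlier pass.

```python
import re


def join_lines(lines):
    """Join extracted lines into one paragraph, undoing end-of-line hyphenation.

    Heuristic (an assumption): a trailing '-' preceded by a letter and
    followed by a line starting in lowercase is a hyphenation dash.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if (out.endswith("-") and len(out) > 1
                and out[-2].isalpha() and line[0].islower()):
            out = out[:-1] + line  # drop the dash and glue the word together
        elif out:
            out += " " + line
        else:
            out = line
    return out


def insert_footnotes(body, notes):
    """Replace hypothetical [[n]] markers with \\footnote{...} from a dict."""
    def repl(match):
        key = match.group(1)
        if key in notes:
            return "\\footnote{" + notes[key] + "}"
        return match.group(0)  # leave unknown markers untouched
    return re.sub(r"\[\[(\d+)\]\]", repl, body)
```

For example, join_lines(["a conver-", "sion"]) gives "a conversion". Genuine hyphenated compounds that happen to break at the end of a line will also lose their hyphen, so the result still needs proofreading.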


I guess that you XML gurus could probably do this much more easily and cleanly.

So, I mean, just for my very specific needs, I can probably take Word sources, convert to PDF and then finally reach ConTeXt as discussed.


Just some ideas to share with the list

Best

-a-




--------------------------------------------------
Andrea Valle
--------------------------------------------------
CIRMA - DAMS
Università degli Studi di Torino
--> http://www.cirma.unito.it/andrea/
--> [EMAIL PROTECTED]
--------------------------------------------------

I did this interview where I just mentioned that I read Foucault. Who doesn't in university, right? I was in this strip club giving this guy a lap dance and all he wanted to do was to discuss Foucault with me. Well, I can stand naked and do my little dance, or I can discuss Foucault, but not at the same time; too much information.

(Annabel Chong)

___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : https://foundry.supelec.fr/projects/contextrev/
wiki     : http://contextgarden.net
___________________________________________________________________________________
