Re: [NTG-context] ignore not closed tags in XML input

Taco Hoekwater via ntg-context Mon, 16 May 2022 11:14:05 -0700


> On 16 May 2022, at 18:50, Pablo Rodriguez via ntg-context 
> <ntg-context@ntg.nl> wrote:
> 
> On 5/16/22 17:30, Hans van der Meer via ntg-context wrote:
>> Can't you use an editor with grep, searching for something like the
>> pattern <meta.*^/>?
> 
> Many thanks for your reply, dr. van der Meer.
> 
> If I want to typeset the whole book
> (https://seumasjeltzz.github.io/LinguaeGraecaePerSeIllustrata/), I will
> have to download and sanitize over 20 HTML files.


Which can be done with a couple of command lines. Xmllint usually does a good
job of cleaning up dodgy html input:

  xmllint --html --xmlout crappy.html > nice.xml

(As good as can be expected from a program, anyway).

> It is really a pity that ConTeXt cannot totally ignore any given XML elements.

This statement is a little unfair: the problem is exactly that your input is 
NOT proper XML.
 
If it were proper XML, ConTeXt would not have problems with it. ConTeXt
explicitly has the capability to handle XML files, which your input simply is
not. In fact, it is sloppy HTML-esque data that modern web browsers happen to
be able to handle more or less correctly. It is not valid HTML either, because
valid HTML has to be valid SGML, which your input clearly is not.

That said, tools like xmllint exist for this stuff. Just write a small batch
driver file in some scripting language ((power)shell, Lua, Python, Perl, etc.)
to preprocess the HTML stuff into clean XML, and you should be fine.
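
A minimal sketch of such a driver in Python, assuming xmllint is on the PATH
and the downloaded pages all sit in one directory. The function names and the
directory layout are made up for illustration; --output is used instead of a
shell redirect so that no shell is involved.

```python
import subprocess
from pathlib import Path


def xmllint_command(html_file, xml_file):
    """Build the xmllint call shown above, writing the result via --output."""
    return ["xmllint", "--html", "--xmlout",
            "--output", str(xml_file), str(html_file)]


def convert_all(source_dir, target_dir, run=subprocess.run):
    """Run xmllint on every *.html file in source_dir (hypothetical layout).

    Returns the list of .xml paths that xmllint was asked to write.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for html_file in sorted(source_dir.glob("*.html")):
        xml_file = target_dir / (html_file.stem + ".xml")
        # The exit status is not checked in this sketch; xmllint prints its
        # complaints about the HTML on stderr, so look there if a file
        # comes out wrong.
        run(xmllint_command(html_file, xml_file))
        converted.append(xml_file)
    return converted
```

Calling convert_all("html", "xml") would then leave one .xml file per
downloaded page in the xml directory, ready to be fed to ConTeXt.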

Taco

— 
Taco Hoekwater              E: t...@bittext.nl
genderfluid (all pronouns)



___________________________________________________________________________________
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://context.aanhet.net
archive  : https://bitbucket.org/phg/context-mirror/commits/
wiki     : http://contextgarden.net
___________________________________________________________________________________
