That is extremely good advice that I absolutely intend to follow. Here's a
bit about what I'm doing:

At Digital Ricoeur (https://digitalricoeur.org/), we have a corpus of
hundreds of XML documents and growing, some of them book-length. These must
be validated against a custom DTD that is derived from the TEI standard.
(We also have a number of additional, project-specific validity
requirements that we check with Racket contracts. These end up covering
most of validation, but having a standard tool working from the DTD is
important as a sanity check.)

Currently, I validate the documents using xmllint (a program included with
libxml2) running in an external process. This mostly works well, especially
when all of the documents are valid (which of course we always hope is the
case). However, things don't go so well if even one document is invalid:
xmllint doesn't provide structured output, just error messages written to
standard error and a non-zero exit code if any of the files were invalid.
Currently, when that happens, we fall back to invoking xmllint on each file
individually. That takes an extremely long time.
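
As a rough illustration of the batch-then-fallback flow described above,
here is a minimal Python sketch. It is not the project's actual code: the
function names and the DTD file name `project.dtd` are hypothetical, and the
exact xmllint flags the project uses are an assumption (`--noout` and
`--dtdvalid` are standard xmllint options).

```python
import subprocess
from typing import Callable, Dict, Sequence


def xmllint_ok(paths: Sequence[str], dtd: str = "project.dtd") -> bool:
    """Return True if xmllint exits with status 0 for the whole batch.

    The DTD path is a placeholder; error messages on stderr are discarded
    because they are not structured output.
    """
    cmd = ["xmllint", "--noout", "--dtdvalid", dtd, *paths]
    return subprocess.run(cmd, stderr=subprocess.DEVNULL).returncode == 0


def validate_all(
    paths: Sequence[str],
    batch_ok: Callable[[Sequence[str]], bool] = xmllint_ok,
) -> Dict[str, bool]:
    """Validate everything in one call; on failure, retry file by file."""
    paths = list(paths)
    if not paths or batch_ok(paths):
        # Fast path: a single process covers the whole corpus.
        return {p: True for p in paths}
    # Slow path: one extra process per file, which is what makes a single
    # invalid document so expensive.
    return {p: batch_ok([p]) for p in paths}
```

With N files and at least one invalid, this costs N + 1 xmllint
invocations instead of one.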

Obviously there are a number of possible approaches. I've considered, for
example, partitioning the list of files when some are invalid so that
hopefully we could find some sub-groups that can be done all at once. What
I'm currently exploring, though, is writing a helper program in Racket
using the FFI. It will probably read a list of paths from standard input
and write a hash table to standard output mapping each path to its
validation result. I still plan to run the validation in a separate process
(I don't like segfaults); the subprocess will just happen to be implemented
in Racket as well, communicate in s-expressions, and not have to be invoked
repeatedly to track down which specific files are invalid. (Of course, I
haven't implemented this yet, so we'll see how it turns out in practice.)
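
For comparison, the partitioning idea mentioned at the start of this
paragraph could look roughly like the following Python sketch. It is only an
illustration under stated assumptions: `find_invalid` and `batch_ok` are
hypothetical names, `batch_ok` stands in for "run the validator once on this
list of files and report whether it exited successfully", and the approach
assumes a batch fails only because some file in it is invalid.

```python
from typing import Callable, List, Sequence


def find_invalid(
    paths: Sequence[str],
    batch_ok: Callable[[Sequence[str]], bool],
) -> List[str]:
    """Return the invalid files by recursively halving failing batches."""
    paths = list(paths)
    if not paths or batch_ok(paths):
        # The whole group passed in one call, so nothing here is invalid.
        return []
    if len(paths) == 1:
        # A failing group of one pins the failure on that file.
        return paths
    mid = len(paths) // 2
    # Split the failing group and check each half separately.
    return find_invalid(paths[:mid], batch_ok) + find_invalid(paths[mid:], batch_ok)
```

When only a few files are invalid, large valid sub-groups are cleared with a
single call each, so far fewer validator processes are needed than with a
file-by-file fallback. When most files are invalid, it degrades to more
calls than checking each file once.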

All of that said, though, thank you for the links to Oleg Kiselyov's work
on validation! In the long term I would love to have a real XML validator
in pure Racket and leave libxml2 behind altogether. I had been looking at
some of the sxml packages (though all of my code now uses x-expressions in
the sense of the xml module, and actually a restricted subset of those),
but I hadn't seen the first link you sent, in particular.

-Philip


On Mon, Aug 27, 2018 at 9:34 AM Neil Van Dyke <n...@neilvandyke.org> wrote:

> Rather than use FFI, would it work for your purposes to have the libxml2
> code in a separate process from Racket?  That would avoid the likely C
> memory bugs corrupting your Racket process.
>
> https://www.cvedetails.com/vulnerability-list/vendor_id-1962/product_id-3311/Xmlsoft-Libxml2.html
>
> I've done this before for XML in Racket, to get DSig support, when I
> couldn't cost-justify implementing it in pure Racket at the time. (W3C
> standards tend to be big and complicated, and your implementation of
> DSig has to be perfectly compliant in many regards, to work at all.)
>
> Another possible option is to do what validation and other XML behavior
> you need in pure Racket.  Oleg Kiselyov did some work on validation,
> and, if you have the time, you might implement more.
> http://okmij.org/ftp/Scheme/xml.html#validation
> https://pkgs.racket-lang.org/package/sxml
> https://www.neilvandyke.org/racket/sxml-intro/
>
> XML validation is good for system robustness, but every C library we
> pull into a Racket process makes us less confident about robustness in a
> different way.
>
>
