Hi all,

I finally pushed an initial draft of the Commons XML Factory project I proposed back in December [1]:

https://github.com/copernik-eu/commons-xml-factory

The library is a single `XmlFactories` class with factory methods that return hardened JAXP factories for:

- DocumentBuilderFactory
- SAXParserFactory
- XMLInputFactory
- TransformerFactory
- SchemaFactory
- XPathFactory

Internally, each factory method dispatches to a per-implementation `XmlProvider` that applies the correct hardening for that implementation. The SPI is open via `ServiceLoader`, but providers for the JDK, Xerces, Woodstox and Saxon are bundled.

It's fair to ask whether this is worth a library at all: a per-factory hardening recipe is only a handful of lines, and most projects wrote their own years ago. Two observations:

First, that handful of lines is exactly what people forget or get subtly wrong. The 2025 Java XXE CVEs bear this out: Apache Tika (CVE-2025-54988, CVE-2025-66516), WebDriverManager (CVE-2025-4641), CycloneDX (CVE-2025-64518), GeoServer (CVE-2025-58360).

Second, the correct recipe depends on which JAXP implementation is actually on the classpath, and that's often not what the developer thinks. A library author tests against the JDK, observes that FEATURE_SECURE_PROCESSING transitively restricts ACCESS_EXTERNAL_* (JEP 185), and writes a minimal hardening block. The library is then deployed in an application that pulls in external Xerces transitively: JEP 185 no longer applies, ACCESS_EXTERNAL_* is not honored, and the minimal block is no longer sufficient.

The draft intentionally offers no configuration: it hardens at one level and fails fast if it encounters an implementation it doesn't recognize. Before extending it, I'd like feedback on whether the proposed direction makes sense.

I see three plausible hardening levels worth supporting:

1. No DOCTYPE allowed. Eliminates the entire class of DTD-based attacks. This is what the draft implements.

2. DOCTYPE allowed, no external resources loaded. Internal entities work (for users who need HTML-style named entities, for example), entity expansion limits are enforced, but nothing is fetched from outside the document.

3. DOCTYPE allowed, user-supplied resolver. The caller provides an EntityResolver; we wrap it so that if the resolver returns null for an unknown reference, we throw rather than falling through to the parser's default URL-fetching behavior. This closes SAX's most common footgun while letting integrators implement classpath-scoped loading, XML catalogs, and similar.

The draft also addresses the secondary-source problem for TransformerFactory (stylesheet loading) and SchemaFactory (schema imports). Currently both are locked down as tightly as primary input, but this is probably a place where two distinct levels make sense: users often have trusted stylesheets or schemas they want to load via xsl:import or xs:include, separate from the question of what to allow in the document being transformed or validated.

Two things I'd particularly appreciate feedback on:

- Does the three-level model above cover the use cases you'd want to bring to this library?
- For the secondary-source question, is there appetite for a separate axis, or should primary and secondary be tied together under a single level?

Piotr

[1] https://lists.apache.org/thread/b2tjc15vjkgsrxxkc8phlnt6801hx4xz
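
P.S. To make level 1 (no DOCTYPE allowed) concrete, here is a rough sketch of what such a recipe could look like for the JDK's built-in DocumentBuilderFactory. It is an illustration only, not the code in the draft: the class and method names are made up, and the `disallow-doctype-decl` feature is the Xerces-style one, which other implementations may not recognize (which is the whole reason for the per-implementation providers).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration; not part of commons-xml-factory.
final class NoDoctypeSketch {

    // Builds a factory that refuses any document containing a DOCTYPE.
    // Assumes a Xerces-derived implementation such as the one in the JDK.
    static DocumentBuilderFactory hardened() throws ParserConfigurationException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        // Xerces-specific feature: a DOCTYPE declaration becomes a fatal error.
        dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        dbf.setXIncludeAware(false);
        dbf.setExpandEntityReferences(false);
        return dbf;
    }

    // Returns true if parsing the given XML fails with a SAXException.
    static boolean rejects(String xml) {
        try {
            hardened().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
            return false;
        } catch (SAXException e) {
            return true;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("<!DOCTYPE r [<!ENTITY e \"x\">]><r>&e;</r>"));
        System.out.println(rejects("<r>plain</r>"));
    }
}
```

On an implementation that does not know the feature URI, `setFeature` throws ParserConfigurationException, which is roughly the fail-fast behavior described above.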

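
Similarly, the resolver wrapping described for level 3 could be sketched along these lines. Again this is only an assumption about the shape, not the draft's API: `StrictResolverSketch`, `strict` and the `urn:example:known` system identifier are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Objects;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical illustration; not part of commons-xml-factory.
final class StrictResolverSketch {

    // Wraps a caller-supplied resolver so that "null" (meaning "I don't know
    // this reference") becomes an error instead of letting the parser fall
    // back to its default behavior of fetching the system identifier.
    static EntityResolver strict(EntityResolver delegate) {
        Objects.requireNonNull(delegate, "delegate");
        return (publicId, systemId) -> {
            InputSource source = delegate.resolveEntity(publicId, systemId);
            if (source == null) {
                throw new SAXException("Unresolved external reference: publicId="
                        + publicId + ", systemId=" + systemId);
            }
            return source;
        };
    }

    // Example delegate: knows exactly one system identifier.
    static final EntityResolver KNOWN_ONLY = (publicId, systemId) ->
            "urn:example:known".equals(systemId)
                    ? new InputSource(new StringReader("<!ELEMENT r ANY>"))
                    : null;

    // Returns true if the wrapped resolver refuses the given system identifier.
    static boolean rejects(String systemId) {
        try {
            strict(KNOWN_ONLY).resolveEntity(null, systemId);
            return false;
        } catch (SAXException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("http://example.invalid/unknown.dtd"));
        System.out.println(rejects("urn:example:known"));
    }
}
```

The wrapped resolver would then be installed with `DocumentBuilder.setEntityResolver` or `XMLReader.setEntityResolver`; the integrator decides what is known, and everything else fails loudly.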