Hi,

I am part of the standardisation group working on NeTEx, a public transport standard for network and timetable exchange. It is available as XSD on GitHub <https://github.com/NeTEx-CEN/NeTEx> under a GPL license.

One of the main problems we face is validating XML documents of 100 MB and more against this schema: syntax validation, but especially identity constraint validation. I am looking for something faster than libxml2/xmllint, and I notice that many tools, if not all, are bottlenecked on single-threaded performance. I am trying to find a generic way to overcome this, and I am surprised at how difficult it is to find one. Parallel syntax validation using sharding could work for us, but identity constraint validation needs all parts of the document, so I would expect there to be a better way.
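
To make the sharding problem concrete, here is a minimal sketch of why a keyref cannot be checked per shard. Everything in it is hypothetical (the Shard type and the function names are mine, not from Xerces, libxml2 or XSD), and it simplifies by treating all keys as one flat set instead of one set per named constraint.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical illustration only: not part of Xerces, libxml2 or XSD.
struct Shard {
  std::vector<std::string> keys;     // values declared via xs:key in this shard
  std::vector<std::string> keyrefs;  // values referenced via xs:keyref in this shard
};

// Return every keyref value in `refs` that has no matching entry in `keys`.
std::vector<std::string> unresolved(const std::unordered_set<std::string>& keys,
                                    const std::vector<std::string>& refs) {
  std::vector<std::string> missing;
  for (const auto& r : refs)
    if (keys.find(r) == keys.end()) missing.push_back(r);
  return missing;
}

// One shard in isolation: references to keys defined in other shards look broken.
std::vector<std::string> check_isolated(const Shard& s) {
  std::unordered_set<std::string> keys(s.keys.begin(), s.keys.end());
  return unresolved(keys, s.keyrefs);
}

// All shards together: collect every key first, then resolve every keyref.
std::vector<std::string> check_merged(const std::vector<Shard>& shards) {
  std::unordered_set<std::string> keys;
  for (const auto& s : shards) keys.insert(s.keys.begin(), s.keys.end());
  std::vector<std::string> missing;
  for (const auto& s : shards) {
    const auto m = unresolved(keys, s.keyrefs);
    missing.insert(missing.end(), m.begin(), m.end());
  }
  return missing;
}
```

The collection step could in principle run per shard in parallel; only the final lookup needs the merged key set. Whether an existing validator can be split up this way is exactly what I do not know.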

I arrived here from the CodeSynthesis XSD mailing list. I am specifically interested in any effort that can make identity constraint validation faster, or in "XML Screamer"-like approaches.


Now to my question. When I compare Xerces Java and Xerces C++ on the same file, I notice the following.

The Java version is capable of doing this:

org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'StopArea_KeyRef' with value 'SYNTUS:StopArea:60103,20200422' not found for identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'ScheduledStopPoint_KeyRef' with value 'SYNTUS:ScheduledStoppoint:50203005,20200422' not found for identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'TransportAdministrativeZone_KeyRef' with value 'NL:AdministrativeZone:AL,any' not found for identity constraint of element 'PublicationDelivery'.
org.xml.sax.SAXParseException; lineNumber: 1499081; columnNumber: 23; Key 'Operator_KeyRef' with value 'SYNTUS,20200422' not found for identity constraint of element 'PublicationDelivery'.


While the C++ version does:

/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV-pushed.xml:1499081:23 error: identity constraint key for element 'PublicationDelivery' not found
(duplicated: 1196 times)


So I am missing the key/value report and instead get an ocean of duplicates from which I cannot work out the cause. Could anyone help me resolve this?


I am currently using this reference code from the XSD project:


int main()
{
  xml_schema::properties props;
  props.schema_location ("http://www.netex.org.uk/netex",
                         "file:///home/skinkie/Sources/NeTEx-NL/xsd/netex-nl.xsd");

  try
  {
    //
    // Parse, work with object model, serialize.
    //
    netex::PublicationDelivery_ ("/var/tmp/NeTEx_SYNTUS_20200422_New_NDOV.xml", 0, props);
  }
  catch (const xml_schema::exception& e)
  {
    // Parsing and validation errors end up here.
    cerr << e << endl;
    return 1;
  }
  catch (const xml_schema::properties::argument&)
  {
    cerr << "invalid property argument (empty namespace or location)" << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf16_string&)
  {
    cerr << "invalid UTF-16 text in DOM model" << endl;
    return 1;
  }
  catch (const xsd::cxx::xml::invalid_utf8_string&)
  {
    cerr << "invalid UTF-8 text in object model" << endl;
    return 1;
  }

  return 0;
}

--
Stefan
