Hi,

2010/11/14 Frédéric Demians <[email protected]>:
> A MARCXML document is very simple XML which doesn't need a full
> fledged XML parser. I'm just saying that as soon as MARCXML records as
> stored in Koha are valid, if it isn't already the case, we can avoid
> using an heavy-weighted parser which impact performance and isn't
> required. We need of course to continue to use a SAX parser for incoming
> records.
I've measured, and your parser is, in fact, pretty fast -- *if* you feed it only MARCXML that meets narrower constraints than the MARC21slim schema permits. However, I see no good reason to limit Koha to that artificial restriction; having biblioitems.marcxml contain MARCXML that validates against MARC21slim is sufficient. Having two parsers do similar work is an invitation for subtle bugs.

The pure Perl parser you propose currently doesn't handle namespace prefixes (which are allowed in MARC21slim records), wouldn't handle any situation where the attributes aren't in the order you expect (attribute order is not significant per the XML specification), and will blithely accept non-well-formed XML without complaining (this is *not* a good thing). It also doesn't recognize and correctly handle XML entities. Obviously you could address much of this in your code, but I suspect you'd end up with an XML parser that is slower and still buggier than any of the standard parser modules.

Fortunately, I've found an approach that is significantly faster than MARC::File::XML/SAX: dropping SAX from MARC::File::XML entirely and using XML::LibXML's DOM parser instead [1]. It is faster [2] than using XML::LibXML::SAX::Parser [3], XML::SAX::Expat [4], and even XML::SAX::ExpatXS [5]. A pure Perl approach based on your work [6] does win the race [7], but it also fails some of MARC::File::XML's test cases, and I'm sure it would lose its speed advantage once extended to handle the full range of what constitutes a valid MARCXML document.

But, one might ask, what about memory usage with a DOM parser? MARC::File::XML as used by Koha (and in general) is geared towards parsing one record at a time; it doesn't currently have any provision for loading an entire file into memory. A DOM tree for a typical MARCXML record is not a big deal, and even a record with several thousand items attached would still be manageable.
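To illustrate the kind of variation a conformant parser has to cope with, here's a minimal sketch (not Koha or MARC::File::XML code; the record and field values are made up for the example) showing XML::LibXML's DOM API handling a record that uses a namespace prefix and an "unexpected" attribute order -- both perfectly valid per MARC21slim, and both things a parser keyed to a fixed textual layout would get wrong:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical record: note the "marc:" namespace prefix and the
# ind2/ind1/tag attribute order, both valid per the MARC21slim schema.
my $marcxml = <<'XML';
<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:leader>00000nam a2200000 a 4500</marc:leader>
  <marc:controlfield tag="001">123456</marc:controlfield>
  <marc:datafield ind2=" " ind1="1" tag="245">
    <marc:subfield code="a">An example title</marc:subfield>
  </marc:datafield>
</marc:record>
XML

my $doc = XML::LibXML->load_xml( string => $marcxml );
my $ns  = 'http://www.loc.gov/MARC21/slim';

# Namespace-aware lookup copes with any prefix (or none), and DOM
# attribute access is order-independent by construction.
my @lines;
for my $field ( $doc->documentElement->getElementsByTagNameNS( $ns, 'datafield' ) ) {
    my $line = sprintf '%s %s%s', $field->getAttribute('tag'),
        $field->getAttribute('ind1'), $field->getAttribute('ind2');
    for my $sf ( $field->getElementsByTagNameNS( $ns, 'subfield' ) ) {
        $line .= sprintf ' $%s %s', $sf->getAttribute('code'), $sf->textContent;
    }
    push @lines, $line;
}
print "$_\n" for @lines;
```

Feed it malformed XML instead and load_xml dies with a parse error, which is exactly the behaviour you want from something guarding the integrity of stored records.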
(Of course, as we all know, one of the most significant gains to be had will come from changing Koha so that it doesn't embed item data in bib MARC tags as a matter of course.)

In fact, we already have proof that we'd be no worse off as far as memory consumption is concerned: XML::LibXML::SAX::Parser, as it happens, isn't a traditional SAX parser. What it does is load the XML document into a DOM tree, then walk the tree and fire off SAX events. In other words, we're *already* using a DOM parser.

In any event, I would be grateful if people would test the DOM version of MARC::File::XML. It passes MARC::File::XML's test suite, but more testing to verify that it won't break things would help a great deal.

By the way, I also tried XML::Twig, but it didn't turn out to be faster than XML::LibXML::SAX::Parser, and in some cases was slower.

[1] http://git.librarypolice.com/?p=marcpm.git;a=shortlog;h=refs/heads/use-dom-instead-of-sax
[2] http://librarypolice.com/nytprof/run-libxml-dom-2/index.html
[3] http://librarypolice.com/nytprof/run-sax-libxml-sax-parser/
[4] http://librarypolice.com/nytprof/run-sax-expat/index.html
[5] http://librarypolice.com/nytprof/run-sax-expatxs/index.html
[6] http://git.librarypolice.com/?p=marcpm.git;a=shortlog;h=refs/heads/pure-perl
[7] http://librarypolice.com/nytprof/run-pp/

Regards,

Galen
--
Galen Charlton
[email protected]

_______________________________________________
Koha-devel mailing list
[email protected]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/
