Hi all, I'm new to Nim, but from what I've seen so far, I like it a lot. I really hope Nim's community and adoption will grow.

I was wondering whether I could use Nim for some specific tasks I have to deal with, one of which is parsing (huge) XML files. I've read some of the online documentation and started with something reasonably simple. I took the SwissProt.xml file, 109 MB of uncompressed XML, which can be downloaded from: <http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html>

I decided to extract only the content of the `<Species>` element, along with the "id" attribute of its parent and the element tag, and write the output to a text file. I'm forced to parse on the fly in streaming mode, because my real target XML files contain no newlines and are more than 10 GB in size.
Here's the code:

```nim
import streams, parsexml, times

var filename = "SwissProt.xml"
var s = newFileStream(filename, fmRead)
if s == nil: quit("cannot open the file " & filename)

var attrkey = ""
var attrval = ""
var elemstart = ""
var data = ""
var line = ""

let time = cpuTime()
let f = open("nim_output.csv", fmWrite)

var x: XmlParser
open(x, s, filename)
while true: # walk through XML
  case x.kind
  of xmlAttribute:
    attrkey = x.attrKey
    if attrkey == "id":
      attrval = x.attrValue # remember the most recent "id" attribute
  of xmlElementStart:
    elemstart = x.elementName
  of xmlCharData:
    data = x.charData
    if elemstart == "Species":
      line = attrval & ";" & elemstart & ";" & data
      f.writeLine(line)
  of xmlEof: break # end of file reached
  else: discard # ignore other events
  x.next()

echo "Time taken: ", cpuTime() - time
x.close()
f.close() # flush and close the output file
```

The output looks like this:

```
100K_RAT;Species;Rattus norvegicus (Rat)
104K_THEPA;Species;Theileria parva
108_LYCES;Species;Lycopersicon esculentum (Tomato)
10KD_VIGUN;Species;Vigna unguiculata (Cowpea)
110K_PLAKN;Species;Plasmodium knowlesi
11S3_HELAN;Species;Helianthus annuus (Common sunflower)
(...)
```

However, I'm not that happy with the performance. I'm using Nim 1.2 with no special compilation options. On my laptop (a fairly old DELL Latitude 5480, Intel Core i5 7th gen, 8 GB RAM, Windows 10) execution takes around 20-21 seconds, while the same task in Python 3.8, using the lxml library for XML parsing with its iterparse construct, takes around 8-9 seconds. I don't expect Nim to be necessarily faster (lxml is a popular and efficient library written in C), but I hoped it would at least be in the same range.

Am I coding this the wrong way for this kind of task? I also tried replacing copy assignments such as `attrval = x.attrValue` with `shallowCopy`, but I didn't gain much (possibly 1 second). Is there a fast way to walk through XML nodes / elements, especially when I need to skip most of them?

Thank you.
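For reference, the Python side of the comparison can be sketched roughly as follows. This is a minimal sketch, not the exact script that was timed: it assumes the standard library's `xml.etree.ElementTree.iterparse` in place of lxml (so that it runs without extra packages), and the function name `extract_species` is made up for illustration. Like the Nim loop, it remembers the most recent "id" attribute and writes one line per `Species` element.

```python
import xml.etree.ElementTree as ET


def extract_species(xml_path, out_path):
    """Stream xml_path and write 'id;Species;text' lines to out_path.

    Hypothetical helper: stdlib iterparse stands in for lxml here.
    """
    last_id = ""  # most recent "id" attribute seen, mirroring the Nim loop
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # "start" gives access to attributes, "end" to the complete text
        for event, elem in ET.iterparse(xml_path, events=("start", "end")):
            if event == "start":
                if "id" in elem.attrib:
                    last_id = elem.attrib["id"]
            else:
                if elem.tag == "Species":
                    out.write(f"{last_id};{elem.tag};{elem.text or ''}\n")
                    count += 1
                # drop the element's text and children once it is finished
                elem.clear()
    return count
```

Calling `extract_species("SwissProt.xml", "py_output.csv")` should produce lines in the same `id;Species;text` format as above. The output file name here is only an example, and timings with the standard library parser may well differ from the lxml figures quoted earlier.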