XML parsing performance

tcheran Sat, 24 Apr 2021 06:45:15 -0700

Hi all, I'm new to Nim, but from what I've seen so far, I like it a lot. I 
really hope Nim's community and adoption will grow. I was wondering whether I 
could use Nim for some specific tasks I have to deal with, one of which is 
parsing (huge) XML files. I've read a bit of the online documentation and 
started with something reasonably simple. I took the SwissProt.xml file, a 
109 MB uncompressed XML file that can be downloaded from: 
<http://aiweb.cs.washington.edu/research/projects/xmltk/xmldata/www/repository.html>
 I decided to extract only the content of the <Species> element, along with 
the "id" attribute of its parent element and the element tag, and write the 
output to a text file. I have to use stream mode and on-the-fly parsing, 
because my real target XML files contain no newlines and are larger than 10 GB.


Here's the code:
    
    
    import streams, parsexml, times
    var filename = "SwissProt.xml"
    
    var s = newFileStream(filename, fmRead)
    if s == nil: quit("cannot open the file " & filename)
    
    var attrkey = ""
    var attrval = ""
    var elemstart = ""
    var data = ""
    var line = ""
    
    let time = cpuTime()
    let f = open("nim_output.csv", fmWrite)
    var x: XmlParser
    open(x, s, filename)
    while true:
      #walk through XML
      case x.kind
      of xmlAttribute:
        attrkey = x.attrKey
        if attrkey == "id":
          attrval = x.attrValue
      
      of xmlElementStart:
        elemstart = x.elementName
      
      of xmlCharData:
        data = x.charData
        if elemstart == "Species":
          line = attrval & ";" & elemstart & ";" & data
          f.writeLine(line)
      of xmlEof: break # end of file reached
      else: discard # ignore other events
      x.next()
    echo "Time taken: ", cpuTime() - time
    x.close()
    f.close() # flush and close the output file
    
    

The output is like this:

    100K_RAT;Species;Rattus norvegicus (Rat)
    104K_THEPA;Species;Theileria parva
    108_LYCES;Species;Lycopersicon esculentum (Tomato)
    10KD_VIGUN;Species;Vigna unguiculata (Cowpea)
    110K_PLAKN;Species;Plasmodium knowlesi
    11S3_HELAN;Species;Helianthus annuus (Common sunflower)

(...)

However, I'm not that happy with the performance. I'm using Nim 1.2 with no 
special compilation options. On my laptop (a fairly old Dell Latitude 5480, 
Intel Core i5 7th gen, 8 GB RAM, Windows 10) execution takes around 20-21 
seconds, while the same task in Python 3.8, using the lxml library for XML 
parsing with its iterparse construct, takes around 8-9 seconds. I don't expect 
Nim to be necessarily faster (lxml is a popular and efficient library written 
in C), but I expected it to be at least in the same order of magnitude. Am I 
coding this the wrong way for this kind of task? I also tried replacing copy 
assignments such as attrval = x.attrValue with shallowCopy, but I didn't gain 
much (possibly 1 second). Is there a fast way to walk through XML nodes / 
elements (especially when I need to skip most of them)? Thank you.
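
One variant of the loop that could be tried is sketched below. It reads the parser's string fields only when an event is actually relevant and tracks a boolean flag instead of copying every element name. The proc name `extractSpecies` and the variables `lastId` and `inSpecies` are hypothetical, chosen for illustration; the sketch has not been run against the real file and no timing is claimed for it. It also matches `xmlElementOpen`, the event parsexml reports for a start tag that carries attributes. Separately, a build with no options is a debug build, so compiling with `nim c -d:release` is probably worth measuring before changing any code.

```nim
import streams, parsexml

proc extractSpecies(input: Stream; name: string): seq[string] =
  ## Collects "id;Species;text" records from an XML stream.
  ## Illustrative sketch only: names and structure are assumptions.
  var x: XmlParser
  open(x, input, name)
  defer: x.close()
  var lastId = ""        # most recent "id" attribute value seen
  var inSpecies = false  # true while inside a <Species> element
  while true:
    case x.kind
    of xmlAttribute:
      # Only copy the value when the key is the one we care about.
      if x.attrKey == "id":
        lastId = x.attrValue
    of xmlElementStart, xmlElementOpen:
      # xmlElementOpen is emitted for start tags that have attributes.
      inSpecies = x.elementName == "Species"
    of xmlElementEnd:
      inSpecies = false
    of xmlCharData:
      if inSpecies:
        result.add(lastId & ";Species;" & x.charData)
    of xmlEof:
      break  # end of input reached
    else:
      discard  # ignore all other events
    x.next()

when isMainModule:
  # Tiny in-memory sample instead of the 109 MB file.
  let sample = """<root><Entry id="100K_RAT"><Species>Rattus norvegicus (Rat)</Species></Entry></root>"""
  for rec in extractSpecies(newStringStream(sample), "sample"):
    echo rec
```

For the real file, the same proc could be given `newFileStream(filename, fmRead)` and each record written with `writeLine` as it is found, rather than collected in a seq, so memory use stays flat on multi-gigabyte inputs.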
