Hi again,

So I am trying to parse some large (2-9 GB) XML files for an idea I
had for using JDB. My plan was to use XSLT to flatten these things out
(they are deeply nested structures), but figured I'd do a quick and
easy test to make sure I had a reasonable grip on J's facilities
before diving in.

Unfortunately, the code I came up with is so slow that I don't
think it's even worth pursuing. Can anyone offer some tips on how
I might speed it up? I profiled it with the J performance monitor,
and it reported that 75% of the time was spent in cdcallback (which
makes me think there's nothing I can do short of writing something in
C/C++, but maybe I'm wrong). Here's the code (loosely adapted from
Oleg & John Baker's SAX examples).

For the record, I created two smaller test files (500KB and 6MB), and
the code below works correctly on both of them. I've also written
Python code using lxml's element tree module, and it can process the
2 GB file in about 60 seconds; I let this J code run for 30 minutes
and then killed it.

Any ideas?

Thanks,
Dan

require 'jmf'
require 'files dir'
require 'xml/sax'

saxclass 'xp'

startDocument=: 3 : 0
ids=: ''
)


startElement=: 4 : 0
NB. collect the _Id attribute of each Node element
if. y-:,'Node' do.
  ids=: ids,< x getAttribute '_Id'
end.
)


endDocument=: 3 : 0
s: ids
)

NB. =========================================================
cocurrent 'base'

fn=: 'c:/data/test/2GBfile.xml'

unmap_jmf_ 'xfile' NB. Hokey, but for debugging

JCHAR map_jmf_ 'xfile';fn

process_xp_ xfile
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
