Re: [Pharo-dev] tricks for XML parsing.

monty Thu, 14 Jul 2016 02:12:52 -0700


> Sent: Thursday, July 14, 2016 at 2:50 AM
> From: "Jan Vrany" <[email protected]>
> To: [email protected]
> Subject: Re: [Pharo-dev] tricks for XML parsing.
>
> On Thu, 2016-07-14 at 01:58 +0200, monty wrote:
> > Thanks for the link.
> > 
> > In-place parsing is a non-starter because it means storing the entire
> > input as a string in memory, so you could only parse files that fit
> > in Pharo's address space. The multi-gigabyte OpenStreetMap docs the
> > article mentions would be unparsable with SAX in a 32-bit VM.
> 
> I do not understand. I only know expat which does - AFAIK - in-place
> parsing and surely does not need the whole input in memory.


From the article, footnote 3: "This creates a lifetime dependency–the entire
source buffer must outlive all document nodes for the technique to work"
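
To illustrate that lifetime dependency, here is a minimal sketch in Python (not
taken from the article). The helper `tag_name_views` and the sample document are
hypothetical; the sketch only assumes that an in-place parser hands out views
into the source buffer instead of copies, which is what pins the buffer in
memory.

```python
import re

# Hypothetical "in-place" tokenizer: it returns views into the source
# buffer rather than copying the tag names out of it.
TAG = re.compile(rb"<([A-Za-z_][\w.-]*)")


def tag_name_views(buf):
    view = memoryview(buf)
    # Each slice is a zero-copy window onto the same underlying bytes.
    return [view[m.start(1):m.end(1)] for m in TAG.finditer(buf)]


source = bytearray(b"<osm><node id='1'/><node id='2'/></osm>")
names = tag_name_views(source)
print([bytes(n).decode() for n in names])

# While any "node" still points into the buffer, the buffer cannot be
# emptied or resized: it has to outlive everything that refers to it.
try:
    source.clear()
except BufferError as err:
    print("buffer is pinned by the nodes:", err)
```

With a multi-gigabyte input, the equivalent of `source` would have to stay
resident for as long as any node is alive, which is the concern raised above.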

> > There is always the option of an FFI-based parser, but it shouldn't
> > be a hybrid like Python's minidom (FFI Expat with a Python DOM
> > implementation), 
> > because something like that already exists in Smalltalk/X (FFI Expat
> > with a Smalltalk DOM) 
> 
> I guess you refer to the implementation I did ages ago. 
> 
> > and it was slower than a St/X port of XMLParser in my tests (I assume
> > due to the FFI overhead), so it's probably not worth it. 
> 
> Very, very interesting. Where can I find the benchmarks? 

This was well over a year ago, and it was DOM parsing. I was testing whether
St/X (your branch, I think) could be supported by XMLParser in addition to
Pharo, Squeak, and GS, but I ran into too many incompatibilities: Monticello
not working (I had to load .st files), #new not sending #initialize, not being
able to modify the value of a dictionary association directly, and #lf/#cr
weirdness. So I gave up, but not before hacking it enough to kind of run and
comparing it with the other parsers.

> I just ran a very simple benchmark on a 112MB document
> (http://www.xml-benchmark.org/downloads.html) and the results are quite the
> opposite:
> 
> Benchmark result:
> Generated at: 14-07-2016 07:32:25 AM
> 
> Benchmark           Execution Time [ms]  # of M&S GCs [1]  # of newspace GCs [1]  Parameters
> BenchmarkXML
>   SAX - VW                        93418                 0                   2060
>   SAX - XMLSuite                   9921                 0                    410
> 
> As you can see, the latter is roughly 10 times faster. 

That's the VW parser, which is slower than XMLParser. And again, my test was of
DOM parsing.
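
For readers comparing the two kinds of numbers: a SAX run only delivers events,
while a DOM run also builds and keeps a tree. The following is a rough sketch of
the SAX side using Python's Expat binding. The function name
`count_elements_streaming`, the chunk size, and the generated document are
assumptions for illustration. The input is fed chunk by chunk, so only one
chunk needs to be held at a time.

```python
import io
import xml.parsers.expat


def count_elements_streaming(stream, chunk_size=64 * 1024):
    """Count start tags per element name without building a tree."""
    counts = {}

    def on_start(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start

    # Feed the document incrementally; each chunk can be discarded
    # once Expat has consumed it.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)
    parser.Parse(b"", True)  # signal end of input
    return counts


# Small generated document standing in for a large file (assumption).
doc = b"<root>" + b"<item n='1'>x</item>" * 1000 + b"</root>"
counts = count_elements_streaming(io.BytesIO(doc), chunk_size=256)
print(counts)
```

A DOM benchmark on the same input would additionally pay for allocating a node
per element, so the two kinds of results are not directly comparable.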

> I agree my implementation, which uses Expat, is clearly suboptimal
> and needs to be improved (for example, it does not use an ILC-based
> send to the driver, so you get a lot of cache misses, and it does a lot
> of unnecessary memcpy()s, but this can be easily improved)

Your implementation was fine, and its XPath/Query support in particular was
very impressive. I wasn't attacking you. My point was just that the hybrid
approach built on Expat (which is a non-validating parser, BTW) should be
avoided, in case anyone is considering it, based on my experience with minidom
vs lxml.etree in Python and with St/X's v2 parser. A parser based on libxml2,
Xerces, or something else for SAX, DOM, XPath, etc. is probably a better way of
creating an alternative to pure-Smalltalk parsers.
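
Anyone who wants to check the hybrid-vs-native difference on their own machine
could start from a sketch like the one below. It is not the test I ran. It uses
only the Python standard library, with xml.etree.ElementTree standing in for
lxml.etree (in CPython it normally uses a C accelerator for tree building, but
it is not libxml2-based). The helpers `make_doc` and `time_it` and the document
shape are made up for illustration.

```python
import time
import xml.dom.minidom
import xml.etree.ElementTree as ET


def make_doc(n_items):
    """Generate a synthetic document (hypothetical shape)."""
    items = "".join(
        f"<item id='{i}'><name>n{i}</name></item>" for i in range(n_items)
    )
    return f"<root>{items}</root>".encode()


def time_it(fn, data):
    """Return (result, elapsed seconds) for a single call of fn(data)."""
    start = time.perf_counter()
    result = fn(data)
    return result, time.perf_counter() - start


data = make_doc(5_000)

# Hybrid: Expat does the parsing, the DOM nodes are Python objects.
dom, t_minidom = time_it(xml.dom.minidom.parseString, data)

# Stand-in for an all-C parser plus tree builder (assumption, see above).
tree, t_etree = time_it(ET.fromstring, data)

n_minidom = len(dom.getElementsByTagName("item"))
n_etree = len(tree.findall("item"))
print(f"minidom: {t_minidom:.3f}s  ElementTree: {t_etree:.3f}s")
```

A single run like this is only indicative. For real numbers, repeat the
measurement several times and use a larger input.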

> Jan
>   
> 
> > But a non-hybrid parser with everything (including the DOM) done in C
> > should definitely be faster.
> > 
> > > Sent: Wednesday, July 13, 2016 at 10:27 AM
> > > From: stepharo <[email protected]>
> > > To: "Pharo Development List" <[email protected]>
> > > Subject: [Pharo-dev] tricks for XML parsing.
> > > 
> > > Hi guys
> > > 
> > > these free books may be interesting for you
> > > 
> > >      http://aosabook.org/
> > > 
> > > http://aosabook.org/en/posa/parsing-xml-at-the-speed-of-light.html
> > > 
> > > 
> > > stef
> > > 
> > > 
> > > 
> > 
> 
>
