Re: Fast XML parser?

Octavian Rasnita Sun, 28 Oct 2012 08:55:01 -0700

From: "Shlomi Fish" <shlo...@shlomifish.org>


Hi Octavian,

On Thu, 25 Oct 2012 14:33:15 +0300
"Octavian Rasnita" <orasn...@gmail.com> wrote:

Hi,

Can you recommend an XML parser which is faster than XML::Twig?
I need to use an XML parser that can parse the XML files chunk by chunk and
which works faster (much faster) than XML::Twig, because I tried using this
module but it is very slow.

XML::LibXML contains several event-based parsers including the SAX parser and
the pull-parser. Can you try using them?

Regards,

Shlomi Fish


Hi Shlomi,

I tried to use XML::LibXML::Reader, which uses the pull parser, and I read that:


""
However, it is also possible to mix Reader with DOM. At every point the
user may copy the current node (optionally expanded into a complete
sub-tree) from the processed document to another DOM tree, or to
instruct the Reader to collect sub-document in form of a DOM tree
""

So I tried:

use XML::LibXML::Reader;

my $xml = 'path/to/xml/file.xml';

my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";

while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );    # works fine

    my $doc = $reader->document;

    my $timestamp  = $doc->getElementsByTagName( 'Timestamp' );    # doesn't work well
    my @lexem_text = $doc->getElementsByTagName( 'Form' );         # doesn't work fine
}

So I could get that attribute well, but I couldn't get the rest of the
sub-elements because, for example, when I printed the variable $timestamp, it
sometimes printed its value two or three times run together.

I couldn't find an example of using XML::LibXML to read the XML file element
by element and then read each element's child elements directly.
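
A possible explanation for the repeated values, offered as an assumption rather than a confirmed diagnosis: `$reader->document` returns the document the Reader is building, not just the current <Lexem>, and `getElementsByTagName` called in scalar context returns an XML::LibXML::NodeList whose string form is the text of all matched nodes joined together. The sketch below shows that stringification behaviour with a plain DOM document. It reuses the timestamp values from the sample XML and does not use the Reader at all.

```perl
use strict;
use warnings;
use XML::LibXML;

# Two <Lexem> elements, each with its own <Timestamp> (values from the sample).
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<Lexems>
  <Lexem id="1"><Timestamp>1346826989</Timestamp></Lexem>
  <Lexem id="2"><Timestamp>1346826989</Timestamp></Lexem>
</Lexems>
XML

# Scalar context: one NodeList object covering every <Timestamp> in $doc.
my $timestamp = $doc->getElementsByTagName('Timestamp');

# Printing the NodeList joins the text of all matched nodes into one string.
print "$timestamp\n";

# List context: the individual element nodes, printed one per line.
my @timestamps = $doc->getElementsByTagName('Timestamp');
print $_->textContent, "\n" for @timestamps;
```

If this is what happens in the loop above, the search needs to be limited to the current <Lexem> subtree instead of the whole document.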


The XML I want to parse looks like the one below. It is just much bigger.

I want to read each <Lexem> element one by one (and I've done this
successfully), then read its id attribute (also done this well), but I also
want to read its sub-elements, using something like:


$reader->read_some_element('Form')
or
$reader->{Form}

which should read just the <Form> element directly below the <Lexem> element,
but not the <Form> elements below <InflectedForm>,

and then read the elements under the <InflectedForm> element using something
like:


$reader->read_another_element( '/InflectedForm/Form' )
or like
$reader->{InflectedForm}{Form}

or using the $doc object...

I tried a lot of methods for reading the elements of the current <Lexem>
element, but with no good results.



<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
  <Lexem id="1">
    <Timestamp>1346826989</Timestamp>
    <Form>aa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>1346826989</Timestamp>
    <Form>aaa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aaa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="3">
    <Timestamp>1346826989</Timestamp>
    <Form>aaleni&#039;an</Form>
    <InflectedForm>
      <InflectionId>25</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>26</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
  </Lexem>
</Lexems>
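
One way the per-element reading described above might be done is to let the Reader stop at each <Lexem>, copy just that subtree with `copyCurrentNode(1)`, and run relative XPath queries on the copy. This is a minimal sketch, not a benchmarked solution. The helper name `read_lexems` and the hash keys are hypothetical, and the inline XML is a shortened version of the sample. A relative path such as 'Form' matches only direct children of the copied <Lexem>, so the <Form> elements under <InflectedForm> are not picked up by it.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: returns one hash ref per <Lexem> element.
# Arguments are passed straight to XML::LibXML::Reader->new,
# e.g. ( location => 'file.xml' ) or ( string => $xml ).
sub read_lexems {
    my (%source) = @_;
    my $reader = XML::LibXML::Reader->new(%source)
        or die "cannot create reader";

    my @lexems;
    while ( $reader->nextElement('Lexem') == 1 ) {
        my $id = $reader->getAttribute('id');

        # Deep-copy only the current <Lexem> subtree into a small DOM node.
        my $lexem = $reader->copyCurrentNode(1);

        # One record per <InflectedForm> child of this <Lexem>.
        my @inflected = map {
            +{
                inflection_id => $_->findvalue('InflectionId'),
                form          => $_->findvalue('Form'),
            }
        } $lexem->findnodes('InflectedForm');

        push @lexems, {
            id        => $id,
            timestamp => $lexem->findvalue('Timestamp'),
            form      => $lexem->findvalue('Form'),   # direct child only
            inflected => \@inflected,
        };
    }
    return @lexems;
}

# Shortened copy of the sample document, for illustration only.
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
  <Lexem id="1">
    <Timestamp>1346826989</Timestamp>
    <Form>aa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="3">
    <Timestamp>1346826989</Timestamp>
    <Form>aaleni&#039;an</Form>
    <InflectedForm>
      <InflectionId>25</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>26</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
  </Lexem>
</Lexems>
XML

for my $lexem ( read_lexems( string => $xml ) ) {
    print "Lexem $lexem->{id}: $lexem->{form} ($lexem->{timestamp})\n";
    print "  inflection $_->{inflection_id}: $_->{form}\n"
        for @{ $lexem->{inflected} };
}
```

For a real file, the call would presumably be `read_lexems( location => $path )`. Only one <Lexem> subtree is copied into a DOM at a time, but whether this is fast enough compared with XML::Twig would have to be measured.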


Thanks.

Octavian.





