From: "Shlomi Fish" <shlo...@shlomifish.org>

Hi Octavian,

On Thu, 25 Oct 2012 14:33:15 +0300
"Octavian Rasnita" <orasn...@gmail.com> wrote:

> Hi,
>
> Can you recommend an XML parser which is faster than XML::Twig?
>
> I need an XML parser that can parse XML files chunk by chunk and that works much faster than XML::Twig, because I tried that module and it is very slow.

XML::LibXML contains several event-based parsers including the SAX parser and
the pull-parser. Can you try using them?

Regards,

Shlomi Fish


Hi Shlomi,

I tried to use XML::LibXML::Reader, which uses the pull parser, and I read that:

""
However, it is also possible to mix Reader with DOM. At every point the
user may copy the current node (optionally expanded into a complete
sub-tree) from the processed document to another DOM tree, or to
instruct the Reader to collect sub-document in form of a DOM tree
""

So I tried:

use XML::LibXML::Reader;

my $xml = 'path/to/xml/file.xml';

my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";

while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );    # works fine

    my $doc = $reader->document;

    my $timestamp  = $doc->getElementsByTagName( 'Timestamp' );    # doesn't work well
    my @lexem_text = $doc->getElementsByTagName( 'Form' );         # doesn't work well
}


So I could read the attribute, but I couldn't get the sub-elements: for example, when I printed $timestamp, it sometimes printed the value two or three times, run together. I couldn't find an example of using XML::LibXML to read an XML file element by element and then read each element's child elements directly.

The XML I want to parse looks like the one below. It is just much bigger.
I want to read each <Lexem> element one by one (this works), then read its id attribute (this also works), but I also want to read its sub-elements, using something like:

$reader->read_some_element('Form')
or
$reader->{Form}

which should read just the <Form> element directly below the <Lexem> element, but not the <Form> elements below <InflectedForm>,

and then read the elements under the <InflectedForm> element using something like:

$reader->read_another_element( '/InflectedForm/Form' )
or like
$reader->{InflectedForm}{Form}

or using the $doc object...

I tried many methods for reading the elements of the current <Lexem> element, but none gave good results.


<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
  <Lexem id="1">
    <Timestamp>1346826989</Timestamp>
    <Form>aa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>1346826989</Timestamp>
    <Form>aaa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aaa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="3">
    <Timestamp>1346826989</Timestamp>
    <Form>aaleni&#039;an</Form>
    <InflectedForm>
      <InflectionId>25</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>26</InflectionId>
      <Form>aaleni&#039;an</Form>
    </InflectedForm>
  </Lexem>
</Lexems>


Thanks.

Octavian.
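
One possible approach, sketched under assumptions and not tested against the full file: $reader->document appears to expose whatever tree the reader currently holds, not just the current <Lexem>, so getElementsByTagName can match elements from neighbouring entries. Copying only the current element with copyCurrentNode(1) and then using XPath expressions relative to that copy keeps the lookup local. The helper name read_lexems and the hash keys below are hypothetical, chosen for illustration; nextElement, getAttribute, copyCurrentNode, findvalue and findnodes are existing XML::LibXML methods.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::Reader;

# Hypothetical helper (not part of XML::LibXML): returns one hash ref
# per <Lexem>. Arguments are passed straight to the Reader constructor,
# e.g. location => 'path/to/xml/file.xml' or string => $xml_text.
sub read_lexems {
    my (%reader_args) = @_;

    my $reader = XML::LibXML::Reader->new(%reader_args)
        or die "cannot create reader";

    my @lexems;
    while ( $reader->nextElement('Lexem') > 0 ) {
        my $id = $reader->getAttribute('id');

        # Deep-copy only the current <Lexem> subtree into a DOM node,
        # so the XPath lookups below cannot see other <Lexem> elements.
        my $lexem = $reader->copyCurrentNode(1);

        push @lexems, {
            id        => $id,
            timestamp => $lexem->findvalue('./Timestamp'),
            # Only the <Form> directly below <Lexem>
            form      => $lexem->findvalue('./Form'),
            # Only the <Form> elements below each <InflectedForm>
            inflected => [
                map { $_->textContent }
                    $lexem->findnodes('./InflectedForm/Form')
            ],
        };
    }
    return @lexems;
}

# Usage with a small inline sample shaped like the XML above.
my $sample = <<'XML';
<Lexems>
  <Lexem id="3">
    <Timestamp>1346826989</Timestamp>
    <Form>aaleni&#039;an</Form>
    <InflectedForm><InflectionId>25</InflectionId><Form>aaleni&#039;an</Form></InflectedForm>
    <InflectedForm><InflectionId>26</InflectionId><Form>aaleni&#039;an</Form></InflectedForm>
  </Lexem>
</Lexems>
XML

for my $lexem ( read_lexems( string => $sample ) ) {
    printf "id=%s timestamp=%s form=%s inflected=%d\n",
        $lexem->{id}, $lexem->{timestamp}, $lexem->{form},
        scalar @{ $lexem->{inflected} };
}
```

Because each copy holds a single <Lexem>, memory use should stay roughly proportional to one entry rather than to the whole file, which is the point of using the Reader. Whether this is fast enough compared with XML::Twig would need to be measured on the real data.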




