From: "Shlomi Fish" <shlo...@shlomifish.org>
Hi Octavian,
On Thu, 25 Oct 2012 14:33:15 +0300
"Octavian Rasnita" <orasn...@gmail.com> wrote:
Hi,
Can you recommend an XML parser which is faster than XML::Twig?
I need to use an XML parser that can parse the XML files chunk by chunk and
which works faster (much faster) than XML::Twig, because I tried using this
module but it is very slow.
XML::LibXML contains several event-based parsers including the SAX parser and
the pull-parser. Can you try using them?
Regards,
Shlomi Fish
Hi Shlomi,
I tried to use XML::LibXML::Reader, which uses the pull parser, and I read
that:
""
However, it is also possible to mix Reader with DOM. At every point the
user may copy the current node (optionally expanded into a complete
sub-tree) from the processed document to another DOM tree, or to
instruct the Reader to collect sub-document in form of a DOM tree
""
So I tried:
use XML::LibXML::Reader;
my $xml = 'path/to/xml/file.xml';
my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";
while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );    # works fine
    my $doc = $reader->document;
    my $timestamp = $doc->getElementsByTagName( 'Timestamp' );    # doesn't work well
    my @lexem_text = $doc->getElementsByTagName( 'Form' );    # doesn't work well
}
So I could get the attribute without problems, but I couldn't get the
sub-elements: for example, when I printed $timestamp, it sometimes printed
its value two or three times run together.
I couldn't find an example of using XML::LibXML to read the XML file
element by element and then read each element's child elements directly.
The XML I want to parse looks like the one below. It is just much bigger.
I want to read one by one each <Lexem> element (and I've done this
successfully), then read its id attribute (also done this well), but I also
want to read its sub elements, using something like:
$reader->read_some_element('Form')
or
$reader->{Form}
which should read only the <Form> element directly below the <Lexem> element,
not the <Form> elements below <InflectedForm>,
and then read the elements under the <InflectedForm> element using something
like:
$reader->read_another_element( '/InflectedForm/Form' )
or like
$reader->{InflectedForm}{Form}
or using the $doc object...
I tried to use a lot of methods for reading the elements of the current
<Lexem> element, but with no good results.
<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
  <Lexem id="1">
    <Timestamp>1346826989</Timestamp>
    <Form>aa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>1346826989</Timestamp>
    <Form>aaa</Form>
    <InflectedForm>
      <InflectionId>84</InflectionId>
      <Form>aaa</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="3">
    <Timestamp>1346826989</Timestamp>
    <Form>aaleni'an</Form>
    <InflectedForm>
      <InflectionId>25</InflectionId>
      <Form>aaleni'an</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>26</InflectionId>
      <Form>aaleni'an</Form>
    </InflectedForm>
  </Lexem>
</Lexems>
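For reference, the access pattern described above might be sketched roughly as follows. This is only a sketch under assumptions: it takes the quoted documentation's suggestion of copying the current node as an expanded sub-tree (copyCurrentNode(1)) and then runs relative XPath queries on that copy, so that './Form' matches only the <Form> directly below <Lexem>. The inline sample string, the @lexems array and the hash keys are made up for illustration, and nothing here says how its speed compares with XML::Twig.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Shortened copy of the sample data, inlined so the sketch is self-contained.
# For a real file, use: XML::LibXML::Reader->new( location => $path )
my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
<Lexem id="1">
<Timestamp>1346826989</Timestamp>
<Form>aa</Form>
<InflectedForm>
<InflectionId>84</InflectionId>
<Form>aa</Form>
</InflectedForm>
</Lexem>
<Lexem id="3">
<Timestamp>1346826989</Timestamp>
<Form>aaleni'an</Form>
<InflectedForm>
<InflectionId>25</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
<InflectedForm>
<InflectionId>26</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
</Lexem>
</Lexems>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot parse the XML string";

my @lexems;    # illustrative container, not part of any API

while ( $reader->nextElement('Lexem') ) {
    my $id = $reader->getAttribute('id');

    # Deep copy of the current <Lexem> only, not of the whole document.
    my $lexem = $reader->copyCurrentNode(1);

    # Relative XPath: './X' matches only direct children of <Lexem>.
    my $timestamp = $lexem->findvalue('./Timestamp');
    my $form      = $lexem->findvalue('./Form');

    # One hash per <InflectedForm>, each with its own child values.
    my @inflected = map {
        +{
            inflection_id => $_->findvalue('./InflectionId'),
            form          => $_->findvalue('./Form'),
        }
    } $lexem->findnodes('./InflectedForm');

    push @lexems, {
        id        => $id,
        timestamp => $timestamp,
        form      => $form,
        inflected => \@inflected,
    };
}

for my $l (@lexems) {
    printf "Lexem %s: timestamp=%s form=%s inflections=%s\n",
        $l->{id}, $l->{timestamp}, $l->{form},
        join( ',', map { $_->{inflection_id} } @{ $l->{inflected} } );
}
```

Because each copy is a small tree of its own, queries on it cannot pick up <Timestamp> or <Form> elements from other <Lexem> elements, which may be what happened when querying $reader->document.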
Thanks.
Octavian.