Aamer Akhter <[EMAIL PROTECTED]> writes:
We're starting a project that includes a parser that would convert a huge swath of multi-device output (output of explicit commands) to an XML tokenized form. ... I've written Parse::RecDescent grammars for some of this output in the past, but speed was the biggest issue. Has anybody done speed comparisons between Perl6::Rules and P::RD?
I haven't done a benchmark -- I'd expect P6::R to be faster, I don't know by how much. Still, if you're parsing huge amounts of output:
* Is the data's grammar context-free or regular? If the latter, you might want to consider building a regular parser (e.g. Perl's regex engine).
I'm not too steeped in parser-lingo (unfortunately), but I believe that the output is context-free. Currently I'm working along the lines of a regex-heavy P::RD idea:
text:
Cisco Internetwork Operating System Software
IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Copyright (c) 1986-2003 by cisco Systems, Inc.
Compiled Thu 21-Aug-03 03:05 by kellythw
Image text-base: 0x60008C40, data-base: 0x61A82000
ROM: System Bootstrap, Version 12.2(4r)B2, RELEASE SOFTWARE (fc2)
BOOTLDR: 7200 Software (C7200-KBOOT-M), Version 12.1(8a)E, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
dut7200 uptime is 7 weeks, 5 days, 16 hours, 5 minutes
System returned to ROM by reload at 23:10:44 UTC Thu Mar 11 2004
System image file is "disk0:c7200-js-mz.122-18.S"
cisco 7206VXR (NPE400) processor (revision A) with 229376K/32768K bytes of memory.
Processor board ID 29553442
R7000 CPU at 350Mhz, Implementation 39, Rev 3.3, 256KB L2, 4096KB L3 Cache
6 slot VXR midplane, Version 2.7
Last reset from power-on
Bridging software.
X.25 software, Version 3.0.0.
SuperLAT software (copyright 1990 by Meridian Technology Corp).
TN3270 Emulation software.
Primary Rate ISDN software, Version 1.1.
PCI bus mb0_mb1 has 800 bandwidth points
PCI bus mb2 has 200 bandwidth points
WARNING: PCI bus mb0_mb1 Exceeds 600 bandwidth points
1 Ethernet/IEEE 802.3 interface(s)
3 FastEthernet/IEEE 802.3 interface(s)
1 Gigabit Ethernet/IEEE 802.3 interface(s)
4 Channelized T1/PRI port(s)
125K bytes of non-volatile configuration memory.
47040K bytes of ATA PCMCIA card at slot 0 (Sector size 512 bytes).
8192K bytes of Flash internal SIMM (Sector size 256K).
Configuration register is 0x2102
grammar:
<autotree>
test: IOS compiled instance hardware configRegister
IOS: /[\s\S]*? \s+(\S+)\s Software\s+\((\S+)\),\s Version\s(\S+) /x
    {$return = { software=>$2, family=>$1, version=>$3, };}
compiled: /[\s\S]*?Compiled ([\s\S]+?) by/ {$return=$1}
instance: uptime image
image: /[\s\S]*?image file is \"([\s\S]+?)\"\n/ {$return=$1}
uptime: /[\s\S]*?uptime is ([\s\S]+?)\n/ {$return=$1}
configRegister: /[\s\S]*?Configuration register is (\S+)/ {$return= $1}
hardware: platform memory
platform: /[\s\S]*?(cisco|CISCO) (\S+) \(/ {$return=$2}
memory: /[\s\S]*?(\d+)K\/(\d+)K bytes of memory\./ {$return={main=>$1,io=>$2}}
=========
output:
<parse>
  <__RULE__>test</__RULE__>
  <configRegister>0x2102</configRegister>
  <hardware>
    <memory>
      <io>32768</io>
      <main>229376</main>
    </memory>
    <__RULE__>hardware</__RULE__>
    <platform>7206VXR</platform>
  </hardware>
  <compiled>Thu 21-Aug-03 03:05</compiled>
  <instance>
    <__RULE__>instance</__RULE__>
    <uptime>7 weeks, 5 days, 16 hours, 5 minutes</uptime>
    <image>disk0:c7200-js-mz.122-18.S</image>
  </instance>
  <IOS>
    <software>C7200-JS-M</software>
    <version>12.2(18)S,</version>
    <family>7200</family>
  </IOS>
</parse>
The benefit is that I can give the output of the 'show version' command to any of the tokens under test: and still get my data, but I can also give it to the top-level test: token.
But this is just one example; other commands have tabular output, while still others have nested, repeating output.
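For comparison, here is a rough sketch of the "regular parser" route for this one command: the same captures as the grammar above, run straight through Perl's regex engine with no P::RD layer. The %field table and the parse_show_version name are made up for illustration, the sample text is trimmed from the 'show version' output above, and the version pattern stops at the comma that (\S+) picks up. I haven't timed this against the grammar, so treat it as a starting point for a benchmark rather than a result.

```perl
use strict;
use warnings;

# Hypothetical field table: one pattern per value, adapted from the grammar
# above. The leading /[\s\S]*?/ is dropped because an unanchored match
# already scans forward through the text.
my %field = (
    family         => qr/\s(\S+)\sSoftware\s+\(\S+\),\sVersion\s/,
    software       => qr/\sSoftware\s+\((\S+)\),\sVersion\s/,
    version        => qr/\sSoftware\s+\(\S+\),\sVersion\s([^\s,]+)/,
    compiled       => qr/Compiled (.+?) by/,
    uptime         => qr/uptime is (.+)/,
    image          => qr/image file is "([^"]+)"/,
    platform       => qr/(?:cisco|CISCO) (\S+) \(/,
    memory_main    => qr/(\d+)K\/\d+K bytes of memory\./,
    memory_io      => qr/\d+K\/(\d+)K bytes of memory\./,
    configRegister => qr/Configuration register is (\S+)/,
);

sub parse_show_version {
    my ($text) = @_;
    my %out;
    for my $name (keys %field) {
        # First match wins; fields that are absent are simply left out.
        $out{$name} = $1 if $text =~ $field{$name};
    }
    return \%out;
}

# Sample input, trimmed from the 'show version' text above.
my $sample = <<'END';
Cisco Internetwork Operating System Software
IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
Compiled Thu 21-Aug-03 03:05 by kellythw
dut7200 uptime is 7 weeks, 5 days, 16 hours, 5 minutes
System image file is "disk0:c7200-js-mz.122-18.S"
cisco 7206VXR (NPE400) processor (revision A) with 229376K/32768K bytes of memory.
Configuration register is 0x2102
END

my $got = parse_show_version($sample);
printf "%-15s %s\n", $_, $got->{$_} for sort keys %$got;
```

Because every pattern is independent, the same function still returns whatever it can find when handed only a fragment of the output, which is the property I like about the grammar.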
* Are the context-free parts fairly small? If so, then segmenting the input into these sections before parsing might be a good idea.
I think we're already moving in that direction. For example, the device can accept a command 'show version | include IOS', which gives you the single line:
IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE
which you could pass directly to the IOS: token.
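Something like the following is what I have in mind for the segmenting step when only the full output is on hand. It is a sketch only: include() is a hypothetical local stand-in for the device-side '| include' filter, parse_ios_line() is an invented name, and the pattern is the IOS: regex from the grammar with the leading /[\s\S]*?/ removed (plus a version capture that stops at the comma).

```perl
use strict;
use warnings;

# IOS pattern adapted from the grammar; no leading /[\s\S]*?/ is needed
# because the input is already a single line.
my $ios_re = qr/\s(\S+)\sSoftware\s+\((\S+)\),\sVersion\s([^\s,]+)/;

# Hypothetical helper: keep only the lines that match $pattern,
# roughly what 'show version | include IOS' does on the device.
sub include {
    my ($pattern, $text) = @_;
    return grep { /$pattern/ } split /\n/, $text;
}

# Hypothetical helper: parse one pre-filtered line, or return nothing
# if the line does not look like an IOS banner line.
sub parse_ios_line {
    my ($line) = @_;
    my ($family, $software, $version) = $line =~ $ios_re or return;
    return { family => $family, software => $software, version => $version };
}

# Sample input, trimmed from the 'show version' text earlier in this message.
my $text = <<'END';
Cisco Internetwork Operating System Software
IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Configuration register is 0x2102
END

my @lines = include(qr/IOS/, $text);
my $ios   = parse_ios_line($lines[0]);
print join(', ', map { "$_=$ios->{$_}" } sort keys %$ios), "\n";
```

The point is just that the expensive pattern only ever sees one short line, so there is far less text for it to scan and backtrack over.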
* How familiar are you with Lex and Yacc? These are old-school tools that are very stable and efficient.
In any case, I'd be wary of P6::R since, as the docs imply, it tickles bugs in the perl regex engine if you try to do anything too complicated.
yeah, i think damian beat that into me. ;-)
/s