Sean O'Rourke wrote:

Aamer Akhter <[EMAIL PROTECTED]> writes:

We're starting a project that includes a parser that would convert a
huge swath of multi-device output (output of explicit commands) to an
XML tokenized form.
...
I've written Parse::RecDescent grammars for some of this output in
the past but speed was the biggest issue. Has anybody done speed
comparisons between Perl6::Rules and prd?


I haven't done a benchmark. I'd expect P6::R to be faster, but I don't
know by how much.  Still, if you're parsing huge amounts of output:

* Is the data's grammar context-free or regular?  If the latter, you
  might want to consider building a regular parser (e.g. Perl's regex
  engine).

I'm not too steeped in parser lingo (unfortunately), but I believe that the output is context-free. Currently I'm working along the lines of a regex-heavy prd idea:


text:
Cisco Internetwork Operating System Software
IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Copyright (c) 1986-2003 by cisco Systems, Inc.
Compiled Thu 21-Aug-03 03:05 by kellythw
Image text-base: 0x60008C40, data-base: 0x61A82000


ROM: System Bootstrap, Version 12.2(4r)B2, RELEASE SOFTWARE (fc2)
BOOTLDR: 7200 Software (C7200-KBOOT-M), Version 12.1(8a)E, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)


dut7200 uptime is 7 weeks, 5 days, 16 hours, 5 minutes
System returned to ROM by reload at 23:10:44 UTC Thu Mar 11 2004
System image file is "disk0:c7200-js-mz.122-18.S"

cisco 7206VXR (NPE400) processor (revision A) with 229376K/32768K bytes of memory.
Processor board ID 29553442
R7000 CPU at 350Mhz, Implementation 39, Rev 3.3, 256KB L2, 4096KB L3 Cache
6 slot VXR midplane, Version 2.7

Last reset from power-on
Bridging software.
X.25 software, Version 3.0.0.
SuperLAT software (copyright 1990 by Meridian Technology Corp).
TN3270 Emulation software.
Primary Rate ISDN software, Version 1.1.

PCI bus mb0_mb1 has 800 bandwidth points
PCI bus mb2 has 200 bandwidth points
WARNING: PCI bus mb0_mb1 Exceeds 600 bandwidth points

1 Ethernet/IEEE 802.3 interface(s)
3 FastEthernet/IEEE 802.3 interface(s)
1 Gigabit Ethernet/IEEE 802.3 interface(s)
4 Channelized T1/PRI port(s)
125K bytes of non-volatile configuration memory.

47040K bytes of ATA PCMCIA card at slot 0 (Sector size 512 bytes).
8192K bytes of Flash internal SIMM (Sector size 256K).
Configuration register is 0x2102



grammar:

<autotree>

test: IOS compiled instance hardware configRegister

IOS: /[\s\S]*?
    \s+(\S+)\s
    Software\s+\((\S+)\),\s
    Version\s(\S+)
    /x
    {$return = {
        software=>$2,
        family=>$1,
        version=>$3,
    };}


compiled: /[\s\S]*?Compiled ([\s\S]+?) by/ {$return=$1}

instance: uptime image

image: /[\s\S]*?image file is \"([\s\S]+?)\"\n/ {$return=$1}

uptime: /[\s\S]*?uptime is ([\s\S]+?)\n/ {$return=$1}

configRegister: /[\s\S]*?Configuration register is (\S+)/ {$return= $1}

hardware: platform memory

platform: /[\s\S]*?(cisco|CISCO) (\S+) \(/ {$return=$2}

memory: /[\s\S]*?(\d+)K\/(\d+)K bytes of memory\./ {$return={main=>$1,io=>$2}}

=========
output:
<parse>
  <__RULE__>test</__RULE__>
  <configRegister>0x2102</configRegister>
  <hardware>
    <memory>
      <io>32768</io>
      <main>229376</main>
    </memory>
    <__RULE__>hardware</__RULE__>
    <platform>7206VXR</platform>
  </hardware>
  <compiled>Thu 21-Aug-03 03:05</compiled>
  <instance>
    <__RULE__>instance</__RULE__>
    <uptime>7 weeks, 5 days, 16 hours, 5 minutes</uptime>
    <image>disk0:c7200-js-mz.122-18.S</image>
  </instance>
  <IOS>
    <software>C7200-JS-M</software>
    <version>12.2(18)S,</version>
    <family>7200</family>
  </IOS>
</parse>

The benefit is that I can give the output of the 'show version' command to any of the tokens under test: and still get my data, but I can also give it to the top-level test: token.

But this is just one example. Other commands have tabular output, while still others have nested, repeating output.

* Are the context-free parts fairly small?  If so, then segmenting the
  input into these sections before parsing might be a good idea.
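
A minimal sketch of the segmenting idea in plain Perl, assuming blank lines are a good enough section boundary for 'show version' output. The sample text is an abbreviated excerpt of the output quoted earlier in this message; the %pattern table and its field names are hypothetical and only there for illustration:

```perl
use strict;
use warnings;

# Abbreviated excerpt of the 'show version' output quoted earlier.
my $text = <<'END';
dut7200 uptime is 7 weeks, 5 days, 16 hours, 5 minutes
System returned to ROM by reload at 23:10:44 UTC Thu Mar 11 2004
System image file is "disk0:c7200-js-mz.122-18.S"

cisco 7206VXR (NPE400) processor (revision A) with 229376K/32768K bytes of memory.
Processor board ID 29553442

8192K bytes of Flash internal SIMM (Sector size 256K).
Configuration register is 0x2102
END

# Split on blank lines so each pattern only scans one small segment
# instead of lazily skipping through the whole output.
my @segments = grep { /\S/ } split /\n\s*\n/, $text;

# Hypothetical field table: one small pattern per value of interest.
my %pattern = (
    uptime         => qr/uptime is (.+)$/m,
    image          => qr/image file is "([^"]+)"/,
    configRegister => qr/^Configuration register is (\S+)/m,
);

my %found;
for my $segment (@segments) {
    for my $name (keys %pattern) {
        $found{$name} = $1 if $segment =~ $pattern{$name};
    }
}

print "$_: $found{$_}\n" for sort keys %found;
```

Each segment could just as well be handed to a small grammar rule instead of a bare regex; the point is only that no single pattern has to walk the full text.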

I think we're moving in that direction already. For example, the device can accept a command 'show version | include IOS', which gives you the single line:


IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE

which you could pass directly to the IOS: token.
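
For a single line like that, a plain regex may be enough without any grammar at all. This is only a sketch: parse_ios_line is a hypothetical helper name, and the pattern is adapted from the IOS: rule above, with one deliberate change (the version capture stops at the comma, so it yields 12.2(18)S rather than 12.2(18)S,):

```perl
use strict;
use warnings;

# Hypothetical helper: extract family, software and version from one
# IOS banner line. Returns a hash ref, or nothing if the line doesn't match.
sub parse_ios_line {
    my ($line) = @_;
    return unless $line =~ /
        \s+(\S+)\s                  # family, e.g. 7200
        Software\s+\((\S+)\),\s     # software image name in parentheses
        Version\s([^\s,]+)          # version, stopping before the comma
    /x;
    return { family => $1, software => $2, version => $3 };
}

my $line = 'IOS (tm) 7200 Software (C7200-JS-M), Version 12.2(18)S, EARLY DEPLOYMENT RELEASE';
my $ios  = parse_ios_line($line);

printf "family=%s software=%s version=%s\n",
    @{$ios}{qw(family software version)};
```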


* How familiar are you with Lex and Yacc? These are old-school tools that are very stable and efficient.

In any case, I'd be wary of P6::R since, as the docs imply, it
tickles bugs in the perl regex engine if you try to do anything too
complicated.

yeah, i think damian beat that into me. ;-)



/s
