I am trying to write a parser to handle human-generated "info
        files" that accompany the type of legal live concert
        recordings you can find at http://bt.etree.org (see for
        example http://bklyn.org/~cae/info-files/mmw2002-04-20.txt and
        numerous other examples in http://bklyn.org/~cae/info-files/).

        These generally follow a common structure, but since they are
        typed up by hand there can be a lot of variation. The overall
        structure is usually something along the lines of band name,
        date, venue, source and transfer information, and then
        setlist/tracking info.

        Because of the irregular structure, I am finding that writing
        a pure token-based parser is pretty tricky.  I have a
        halfway-decent line-oriented parser that I've implemented
        mostly as a bunch of "if" statements which test against some
        state variables and regular expressions which match certain
        tell-tale strings (for example different brands of
        microphones, DAT decks, concert hall names, state
        abbreviations, etc).  For some masochistic reason though, I've
        decided that I need to reimplement this using a proper grammar
        and Parse::RecDescent seems like a good fit.  But maybe not.
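
        A rough sketch of the line-oriented approach described above,
        written in Python purely for illustration. It is not taken
        from InfoFile.pm; the patterns, the state names, and the
        classify_lines function are all assumptions, and a real list
        of tell-tale strings would be much longer.

```python
import re

# Hypothetical tell-tale patterns (assumptions, not the ones in InfoFile.pm).
SOURCE_RE = re.compile(r"\b(schoeps|neumann|akg|dat)\b", re.IGNORECASE)
DISC_RE = re.compile(r"^\s*(disc|cd)\s*\d+\b", re.IGNORECASE)
SET_RE = re.compile(r"^\s*set\s+([ivx]+|\d+)\b", re.IGNORECASE)
TRACK_RE = re.compile(r"^\s*\d+[.)]?\s+\S")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")


def classify_lines(text):
    """Label each non-blank line, using the previous label as the state."""
    labeled = []
    state = "start"
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if DISC_RE.match(stripped):
            state = "disc"
        elif SET_RE.match(stripped):
            state = "set"
        elif state in ("disc", "set", "track") and TRACK_RE.match(stripped):
            state = "track"
        elif SOURCE_RE.search(stripped):
            # A microphone or deck name anywhere in the line marks source info.
            state = "source"
        elif DATE_RE.search(stripped):
            state = "date"
        elif state == "start":
            state = "artist"
        elif state == "date":
            state = "venue"
        # Otherwise keep the current state: the line continues the section.
        labeled.append((state, stripped))
    return labeled
```

        Each line is tested as a whole, so a keyword can appear
        anywhere in it, and the state variable carries the context
        from one line to the next.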

        As I said, I'm having difficulty with the token-based nature
        of P::RD.  In some cases I want things split up word-wise,
        but in others I'd prefer to look for strings anywhere within a
        line (e.g. microphone names like "Schoeps" are a pretty good
        indicator that I'm dealing with source info, and source info
        is pretty much guaranteed to span an entire line).

        Here's my line-based parser:

                http://bklyn.org/~cae/InfoFile.pm

        Here's the skeletal Parse::RecDescent parser I'm trying to use
        to do the same thing:

                http://bklyn.org/~cae/parser

        I've tried my hand at using the <skip> directive with a little
        luck (see the "artist" rule which seems to work well), and
        also some spectacular failures: if I try to use it in the
        source or sourceinfo rules, things end up not matching.

        I'm also having difficulty with some of my rules being too
        greedy, and I am not sure how to stop them.  For example, the
        "source" rule as written often ends up gobbling tokens like
        "Disc 1", which I'm hoping to match with the "disc" rule, or
        "Set I", which I try to match with the "set" rule.  I've
        tried using ...!rule a bit, but again with little luck.  I'd
        like to have some way to tell the parser that a newline should
        (usually) signal the end of a rule.
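
        To make the desired behaviour concrete, here is a minimal
        sketch in Python (not Parse::RecDescent) of a "source" matcher
        that treats each line as one unit and uses a negative
        lookahead, similar in spirit to ...!rule, to refuse lines that
        look like disc or set headers. The header pattern and the
        split_source helper are assumptions made up for this example.

```python
import re

# A line qualifies as source info only if it does NOT start like a header.
# The header shapes ("Disc 1", "CD 2", "Set I") are assumed, not exhaustive.
NOT_HEADER = re.compile(
    r"^(?!\s*(?:disc|cd|set)\s*(?:\d+|[ivx]+)\b).*\S",
    re.IGNORECASE,
)


def split_source(lines):
    """Take leading source lines; stop at the first header or blank line."""
    source = []
    for i, line in enumerate(lines):
        if not NOT_HEADER.match(line):
            # Hand the header and everything after it back to the caller.
            return source, lines[i:]
        source.append(line.strip())
    return source, []
```

        Because matching happens one line at a time, the end of a line
        always ends the match, and the lookahead keeps "Disc 1" or
        "Set I" available for their own rules.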

        If anyone has any advice, I'd greatly appreciate it.  It may
        be the case that the data set I'm working with is just NOT
        suited to this type of parsing, but I don't think I know
        enough about the solution domain to reach this decision
        myself.

-- 
Caleb Epstein |  bklyn . org  |
    cae at    | Brooklyn Dust |    Th' MIND is the Pizza Palace of th' SOUL
bklyn dot org |   Bunny Mfg.  |
