SUMMARY:  I have a set of parse rules that work, but I'm wondering if
I've missed something braindead obvious.

If you aren't interested in parsing, or can't stand to look at regular
expressions, please don't read any further!

BACKGROUND:  I'm converting some web-based content to XML.  The pages
were created (by somebody else) using an outliner application that has
a "save to HTML" feature.  The HTML is mildly bloated but consistent.
The generated code renders the hierarchy with <ul>, <li>, and </ul>
tags (not an </li> in sight, grumble, gripe, frown!), but parsing out
the structure was mostly a yawn.  The item content was another story
entirely!

Pretending (for the sake of simplicity) that the data is an org chart,
each data item is of the format

    title ":" name "[" address "]" "(" email ")"

where the fields (title, name, address, and email) are human-typed,
with arbitrary content and spacing.  Just to make it interesting,
though, the address and email fields are BOTH optional (and when one
of them is omitted, so is its surrounding punctuation).

PROBLEM:  Given a string in the above format, I need to parse out the
four fields (using empty strings as the value of missing fields).

COMMENT:  As a long-term text hacker (Perl for several years, AWK
before that), I thought that I'd dash it right out.  After all, in
Perl it's a single statement with a touch of verification:

  ($title, $name, $address, $email) =
    ($item =~
      m{^([^:]*):\s+([^\[\(]*\S)\s*(?:\[(.*)\])?\s*(?:\((.*)\))?\s*$});
  $address = "" unless defined $address;
  $email   = "" unless defined $email  ;

For those who aren't Perlmongers, that "line noise with a mission" in
the third line is ugly in appearance, but simple in concept.  The =~
binds a variable on the left to the RE operation on the right (simple
pattern matching, in this case).  Much of the ugliness is due to the
need to escape characters that both occur in the data and have meaning
to the RE grammar.  The RE enclosed in m{...} specifies the following:

  ^             beginning of the target string
    (           beginning of result group 1, which contains ...
      [^:]*     ... any non-colon characters, (zero or more) ...
    )           ... and that's all for result group 1
    :           a literal colon
    \s+         some (one or more) whitespace characters
    (           beginning of result group 2, containing ...
      [^\[\(]*  ... any characters NOT [ or (, (zero or more) ...
      \S        ... ending in a non-whitespace character ...
    )           ... that's all for result group 2
    \s*         any (zero or more) whitespace
    (?:         a sequence of ...
      \[        ... a literal left-bracket ...
      (.*)      ... a group of any characters (result 3) ...
      \]        ... a literal right-bracket ...
    )?          ... and this whole sequence is optional
    \s*         any (zero or more) whitespace
    (?:         a sequence of ...
      \(        ... a literal left-paren ...
      (.*)      ... a group of any characters (result 4) ...
      \)        ... a literal right-paren ...
    )?          ... and this sequence is optional
    \s*         any (zero or more) trailing whitespace
  $             end of the target string
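
For anyone who wants to poke at the breakdown above, here is a minimal
sketch of the same pattern written with Perl's /x modifier, so the
commentary lives inside the RE itself.  The helper name parse_item and
the last (deliberately malformed) sample string are made up for
illustration; the other three strings are the unobfuscated items from
my test data.  The sketch also returns an empty list when an item does
not match at all, which the one-statement version does not check for.

```perl
use strict;
use warnings;

# The pattern from the statement above, spread out with /x.  Under /x,
# whitespace in the pattern is ignored and "#" starts a comment.
my $item_re = qr{
    ^
    ( [^:]* )            # 1: title, any run of non-colon characters
    :                    # literal colon
    \s+
    ( [^\[\(]* \S )      # 2: name, ending in a non-whitespace character
    \s*
    (?: \[ (.*) \] )?    # 3: address, only if the brackets are present
    \s*
    (?: \( (.*) \) )?    # 4: email, only if the parentheses are present
    \s*
    $
}x;

# Hypothetical helper (not in the original post): returns the four fields,
# or an empty list when the item does not match at all.
sub parse_item {
    my ($item) = @_;
    my @fields = $item =~ $item_re;
    return unless @fields;
    # Optional groups that took no part in the match come back undefined.
    return map { defined $_ ? $_ : "" } @fields;
}

for my $item ("Trainee: Bobby Shaftoe [555 Silver Buckle Ln] (no email)",
              "Contractor: Mary Lamb [1111 Shepherd's Cove, Apt 7]",
              "Executive: Mary Mary Q. Contrary",
              "no colon in this one") {       # made-up failure case
    my @fields = parse_item($item);
    if (@fields) { printf "{%s} {%s} {%s} {%s}\n", @fields }
    else         { print  "no match: $item\n" }
}
```

Run as is, this should print one brace-delimited line per well-formed
item (with {} for each missing field) and "no match: ..." for the last
one.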

Well...  When I tried to write REBOL parse rules for this task, I
kept tripping over my shoelaces.  Nothing major, but lots of little
annoyances that kept telling me, "You have not yet grasped the
pebble, Grasshopper!".  The last such annoyance to be bludgeoned
into submission had to do with getting rid of trailing whitespace
in the (otherwise correctly parsed) data fields.

After much meditation on the mantram "BNF, not RE!", I ended up with
the following script, which embeds the equivalent (?) parsing in a
loop that runs through a few test cases.

;==========(begin script)==========
REBOL []

testdata: [
  "Manager: John Doe [123 Jones St] ([EMAIL PROTECTED])"
  "Employee: Jane Doe ([EMAIL PROTECTED])"
  "Trainee: Bobby Shaftoe [555 Silver Buckle Ln] (no email)"
  "Contractor: Mary Lamb [1111 Shepherd's Cove, Apt 7]"
  "Executive: Mary Mary Q. Contrary"
]

noncolon: complement charset ":"
nongroup: complement charset "[()]"
nongrpbl: complement charset "[()] "

blank:    charset " ^-"
blanks:  [any blank]
wort:    [some nongrpbl]
worter:  [blanks  wort]
satz:    [some worter]

foreach item testdata [
  print item
  set [title name address email] ["" "" "" ""]
  parse/all item [
    copy title to ":" skip
    blanks
    copy name satz
    [blanks "[" copy address to "]" skip | none]
    [blanks "(" copy email   to ")" skip | none]
    blanks
    end
  ]
  print join copy "^-" reduce [
    "{" title "} {" name "} {" address "} {" email "}"
  ]

]
;==========(end of script)==========

It troubles me that I wrote so much code to do such a simple piece of
string munching!  Have I missed an obvious simplification (or simpler
approach entirely), or do I just have to get over the fact that I'm
not in Kansas any more?

Any feedback, suggestions, advice, etc. will be appreciated.

-jn-
