Re: Regexps

Nadav Har'El Wed, 11 Feb 2004 03:55:21 -0800

This thread is becoming hackers-il material...

On Wed, Feb 11, 2004, Arik Baratz wrote about "RE: Regexps":
> I would like to suggest that there are two forms of multiple-field 
> variable-field-length 
> flat-file formats.
> 
> One form uses 'enclosure' for fields that need seperation, like CSV. In this method,
>...
> The first form cannot be parsed using a regular expression IMHO. I bet I can


I showed in a previous email how CSV *can* be parsed by a regular expression,
albeit one that uses Perl extended syntax "lookahead".

I agree with you that it's impossible to with the "classic" regular-expression
syntax (which basically has only "*", "|"). But it *is* possible to do it
with a deterministic finite automaton (state machine), and even easier than
writing a regular expression: all the "state" you need to remember is the
parity of the number of double-quotes you have seen so far. When you saw an
even number of quotes and come across a comma, this is a field split.
Otherwise, the comma is part of the field. It's as easy as that.

Here's the automaton to find a comma marking an end of the field:
(the default action here on an unknown character, not drawn, is to loop into
the same state)

            
    +------+        +-----+
    |      | (")--> |     |
    | EVEN |        | ODD |
    |      | <--(") |     |
    +------+        +-----+
      (,)
       |
       v      
  END OF FIELD

if you also want to replace two consecutive quotes by one, this slightly
complicates the automaton and adds a few more states. Here's an automaton
that *outputs* the next field (with "out(..)" actions on some edges):

                             +-----+
                             | ODD |    out(c)
                   --------->| TMP |(c)---\
                  /          +-----+      |
                (")           (")         v
             +------+  out(")  |       +-----+
         +(c)|      |<--------/  out(")|     |(c)+
  out(c) |   | EVEN |            ----->| ODD |   | out(c)
         \-->|      |           /      |     |<--/
             +------+          (")     +-----+
               (,) ^         +-----+     (")
                |  \------(c)|EVEN |      /
                |    out(c)  | TMP |<-----
                v            +-----+
           END OF FIELD

(at least, I think. I'm just inventing this as I go along :) it's much easier
to program in a normal programming language...)


> (like keeping the state with an outside variable). The problem IMHO is that
> you must use a lookahead greater than 1, that's why the perl lookahead
> extension works.

Perl's lookahead for counting the number of quotes *after* the current comma,
probably generates some sort of non-deterministic finite automaton code. As
I explained above, unless I'm mistaken, it's even simpler to remember the
parity of the quotes *before* this comma (one bit of state) and write a
deterministic finite automaton directly.

The reason I gave a regular expression is because this is what the original
poster asked for. I also think the trick I came up with (looking for an
even number of quotes) is cool ;)


-- 
Nadav Har'El                        |   Wednesday, Feb 11 2004, 19 Shevat 5764
[EMAIL PROTECTED]             |-----------------------------------------
Phone: +972-53-790466, ICQ 13349191 |Share your knowledge. It's a way to
http://nadav.harel.org.il           |achieve immortality.

=================================================================
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]

Re: Regexps

Reply via email to