This thread is becoming hackers-il material...
On Wed, Feb 11, 2004, Arik Baratz wrote about "RE: Regexps":
> I would like to suggest that there are two forms of multiple-field
> variable-field-length
> flat-file formats.
>
> One form uses 'enclosure' for fields that need seperation, like CSV. In this method,
>...
> The first form cannot be parsed using a regular expression IMHO. I bet I can
I showed in a previous email how CSV *can* be parsed by a regular expression,
albeit one that uses Perl extended syntax "lookahead".
I agree with you that it's impossible to with the "classic" regular-expression
syntax (which basically has only "*", "|"). But it *is* possible to do it
with a deterministic finite automaton (state machine), and even easier than
writing a regular expression: all the "state" you need to remember is the
parity of the number of double-quotes you have seen so far. When you saw an
even number of quotes and come across a comma, this is a field split.
Otherwise, the comma is part of the field. It's as easy as that.
Here's the automaton to find a comma marking an end of the field:
(the default action here on an unknown character, not drawn, is to loop into
the same state)
+------+ +-----+
| | (")--> | |
| EVEN | | ODD |
| | <--(") | |
+------+ +-----+
(,)
|
v
END OF FIELD
if you also want to replace two consecutive quotes by one, this slightly
complicates the automaton and adds a few more states. Here's an automaton
that *outputs* the next field (with "out(..)" actions on some edges):
+-----+
| ODD | out(c)
--------->| TMP |(c)---\
/ +-----+ |
(") (") v
+------+ out(") | +-----+
+(c)| |<--------/ out(")| |(c)+
out(c) | | EVEN | ----->| ODD | | out(c)
\-->| | / | |<--/
+------+ (") +-----+
(,) ^ +-----+ (")
| \------(c)|EVEN | /
| out(c) | TMP |<-----
v +-----+
END OF FIELD
(at least, I think. I'm just inventing this as I go along :) it's much easier
to program in a normal programming language...)
> (like keeping the state with an outside variable). The problem IMHO is that
> you must use a lookahead greater than 1, that's why the perl lookahead
> extension works.
Perl's lookahead for counting the number of quotes *after* the current comma,
probably generates some sort of non-deterministic finite automaton code. As
I explained above, unless I'm mistaken, it's even simpler to remember the
parity of the quotes *before* this comma (one bit of state) and write a
deterministic finite automaton directly.
The reason I gave a regular expression is because this is what the original
poster asked for. I also think the trick I came up with (looking for an
even number of quotes) is cool ;)
--
Nadav Har'El | Wednesday, Feb 11 2004, 19 Shevat 5764
[EMAIL PROTECTED] |-----------------------------------------
Phone: +972-53-790466, ICQ 13349191 |Share your knowledge. It's a way to
http://nadav.harel.org.il |achieve immortality.
=================================================================
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word "unsubscribe" in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]