"Nadav Har'El" <[EMAIL PROTECTED]> writes:

> On Tue, Feb 10, 2004, Oleg Goldshmidt wrote about "Re: Regexps":
> > In general, handling string literals with regexps is not trivial,
> > because you need to take into account escaped ", as in
> > 
> > "foo \"sna fu\" bar"
> 
> This may not be relevant for his situation. 

True, which is why I suggested a simple solution. 

To tell you the truth, I knew of perl's lookahead, but I am not much
of a perl-monger and I didn't remember the syntax, and I don't know of
any regexp engine other than perl that supports this very useful
feature.[1]

So I thought of the most straightforward (not necessarily the best)
way to pair the quotes and process portions of the input accordingly.

> What happens if there are quotes in one of the fields? Each
> double-quote is replaced by two of them, keeping the evenness of the
> number of quotes (quote parity) and allowing exactly the same method
> of splitting on commas, and allowing for an easy reverse
> transformation.

Well, you are specifying an input convention that may or may not be
applicable. I am sure I don't need to give examples of usage of
backslash-escaped quotes in string literals.

The escape convention should be specified. From Tal's description, for
instance, it is not quite clear what the output from

"Nadav said, ""Hi, Oleg,"" and turned back to his code."

should be - maybe the right output is reproducing the input verbatim
(there is no unquoted whitespace)? So I put a disclaimer about all
sorts of assumptions made, and only went through paired quotes, not
checking for more general odd/even cases. Of course, with an escape
sequence that is not based on merely doubling the escaped character
the odd/even rule breaks down, and I didn't think of it at all.
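To make the odd/even point concrete, here is a minimal sketch in Python (used purely for illustration; the thread itself is about perl and awk). The function name `split_by_quote_parity` and the sample lines with fields `1` and `2` are made up for this example. It splits on a comma only when the number of double quotes seen so far is even.

```python
def split_by_quote_parity(line, sep=","):
    """Split on sep wherever the count of double quotes seen so far is even."""
    fields, current, quotes = [], [], 0
    for ch in line:
        if ch == '"':
            quotes += 1
        if ch == sep and quotes % 2 == 0:
            # Even parity: we are outside any quoted field, so this is a separator.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


# Doubled-quote convention: parity is preserved, the middle field stays whole.
doubled = '1,"Nadav said, ""Hi, Oleg,"" and turned back to his code.",2'
print(split_by_quote_parity(doubled))

# Backslash-escape convention: the escaped quote flips the parity,
# so the comma inside the string literal is wrongly taken as a separator.
escaped = r'1,"foo \"sna, fu\" bar",2'
print(split_by_quote_parity(escaped))
```

With the doubled quotes the line comes back as three fields. With the backslash-escaped quotes it comes back as four, because the parity count cannot tell an escaped quote from a real one.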

Another potential regexp pitfall is that - for better and for worse -
different regexp engines behave differently, to the point of matching
different things given the same regexp. Therefore, it may be unsafe to
ask for a regexp without specifying the type of engine (or a specific
tool, such as perl or awk). See [2] for some issues with the regexp in
the code below.
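A small illustration of engines disagreeing on the same regexp, sketched in Python (my choice for the example, not something from the thread). Python's `re`, like perl, takes the first alternative that matches. POSIX-style engines such as awk's are specified to return the leftmost-longest match instead.

```python
import re

# Leftmost-first semantics (Python, perl): the first alternative that
# succeeds wins, even though a longer match is available.
m = re.match(r'a|ab', 'ab')
print(m.group(0))

# A POSIX leftmost-longest engine (awk, for one) is expected to match
# the whole of "ab" with this same pattern and input.
```

So a regexp checked under one engine may match a different substring under another, which is why naming the tool matters.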

[1] A really useful feature would be a general "lookbehind", i.e. "match
    anything but a double quote unless *preceded* by an odd number of
    consecutive backslashes." Perl's lookbehind is limited to fixed-width
    patterns, so it cannot express that.
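For what it is worth, a fixed-width lookbehind can only look at one preceding backslash, which is not enough. The sketch below uses Python's `re` module, which does provide fixed-width lookbehind. The pattern names and sample strings are made up for this example. It compares that approach with a pattern that needs no lookbehind at all and instead consumes each backslash together with the character it escapes.

```python
import re

# Fixed-width lookbehind: a quote NOT immediately preceded by a backslash.
naive_quote = re.compile(r'(?<!\\)"')

# No lookbehind: inside the quotes, take either an ordinary character
# or a backslash plus whatever character follows it.
literal = re.compile(r'"(?:[^"\\]|\\.)*"')

simple = r'say "foo \"sna fu\" bar" now'
tricky = r'path "C:\\" and "x"'   # escaped backslash, then a real closing quote

print([m.start() for m in naive_quote.finditer(simple)])
print(literal.findall(simple))

print([m.start() for m in naive_quote.finditer(tricky)])
print(literal.findall(tricky))
```

On `simple` both approaches agree. On `tricky` the lookbehind skips the closing quote after `\\` and finds an odd number of quotes, while the second pattern pairs the quotes correctly.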

[2] Basing a parser on matching quoted strings as a whole makes it a
    bit difficult to report unmatched quotes. The code below does a
    pretty good job on backslash-escaped quotes, but no warranty is
    implied ;-)

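#!/bin/gawk -f

# Return str without its first len characters.
function tail(str,len) { return substr(str,len+1); }

# Print str with every run of blanks and tabs replaced by a newline.
function trprint(str) { gsub(/[ \t]+/,"\n",str); printf("%s",str); }

{
    str = $0;
    pos = 0;
    # The 3-argument match() is a gawk extension. quoted[1] is the string
    # literal itself; whitespace right after it is swallowed by the match
    # because the literal is printed with its own newline.
    while (q = match(str,/("([^"\\]|\\.)*")[ \t]*/,quoted)) {
        # process as appropriate
        trprint(substr(str,1,q-1));
        printf("%s\n",quoted[1]);
        # track progress for error reporting below: the match covers
        # characters q..q+RLENGTH-1, so that many are consumed
        # (q+RLENGTH would silently drop one extra character)
        len = q+RLENGTH-1;
        pos += len;
        # move on
        str = tail(str,len);
    }
    # at this point we have no quoted strings left
    if (q = match(str,/".*$/)) {
        printf("%s:%d: unmatched quote at position %d\n",
               FILENAME,NR,pos+q) > "/dev/stderr";
        exit(1);
    }
    # process what remains
    trprint(str);
    printf("\n");
}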

-- 
Oleg Goldshmidt | [EMAIL PROTECTED]
