I've been trying to use regex for fast parsing of short XML fragments.
The first attempt went quite well:
----------cutme-----------
$ cat xmlfragparse.sh
xmltext='<h1><div> a text </div></h1>'

dummy="${xmltext//~(Ex-g)(?:
        (<.+>)|
        ([^[><]]+)
        )/dummy}"
print -v .sh.match
$ ksh xmlfragparse.sh
(
        (
                [0]='<h1>'
                [1]='<div>'
                [2]=' '
                [3]=a
                [4]=' '
                [5]=t
                [6]=e
                [7]=x
                [8]=t
                [9]=' '
                [10]='</div>'
                [11]='</h1>'
        )
        (
                [0]='<h1>'
                [1]='<div>'
                [10]='</div>'
                [11]='</h1>'
        )
        (
                [2]=' '
                [3]=a
                [4]=' '
                [5]=t
                [6]=e
                [7]=x
                [8]=t
                [9]=' '
        )
)
----------cutme-----------

This is all OK.


However, if I try to add support for XML comments, all hell breaks loose:

----------cutme-----------
$ cat xmlfragparsecomment.sh

xmltext='<h1><div> a text </div><!-- a comment (<disabled>) --></h1>'

dummy="${xmltext//~(Ex-g)(?:
        (<[^[!]].+>)| # xml tags
        ([^[><]]+)|   # xml text
        (<!--.+-->)   # xml comments
        )/dummy}"
print -v .sh.match
$ ksh xmlfragparsecomment.sh
(
        (
                [0]=h
                [1]=1
                [2]=d
                [3]=i
                [4]=v
                [5]=' '
                [6]=a
                [7]=' '
                [8]=t
                [9]=e
                [10]=x
                [11]=t
                [12]=' '
                [13]=/
                [14]=d
                [15]=i
                [16]=v
                [17]='<!-- a comment (<disabled>) -->'
                [18]=/
                [19]=h
                [20]=1
        )
        (
                [0]=
        )
        (
                [0]=h
                [1]=1
                [2]=d
                [3]=i
                [4]=v
                [5]=' '
                [6]=a
                [7]=' '
                [8]=t
                [9]=e
                [10]=x
                [11]=t
                [12]=' '
                [13]=/
                [14]=d
                [15]=i
                [16]=v
                [18]=/
                [19]=h
                [20]=1
        )
        (
                [17]='<!-- a comment (<disabled>) -->'
        )
)
----------cutme-----------

This is all wrong. To keep the tag subpattern from matching XML
comments such as <!-- a comment -->, I changed it from (<.+>) in the
original to (<[^[!]].+>) and added a separate subpattern for comments,
(<!--.+-->), but it does not work. Tags are no longer matched; look at
the empty [0]= in the output.
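
For comparison, the tokenization the three subpatterns are meant to
produce can be sketched outside of ksh pattern matching. This is only
a cross-check under stated assumptions, not a fix for the ksh pattern
above: it swaps in grep -oE (the -o option is a common GNU/BSD
extension) with plain POSIX ERE syntax, and the function name
tokenize is hypothetical. Because POSIX ERE has no non-greedy
operator, the comment alternative assumes at most one comment per line.

```shell
# Hypothetical helper, not part of the original script: split an XML
# fragment into comments, tags, and text runs, one token per output line.
tokenize() {
    # Alternatives:
    #   <!--.*-->      XML comment (greedy: assumes one comment per line)
    #   <[^!>][^>]*>   tag whose first character after '<' is not '!'
    #   [^<>]+         run of text between tags
    printf '%s\n' "$1" | grep -oE '<!--.*-->|<[^!>][^>]*>|[^<>]+'
}

xmltext='<h1><div> a text </div><!-- a comment (<disabled>) --></h1>'
tokenize "$xmltext"
```

With this input the comment, including the <disabled> inside it,
should come out as a single token, and <h1>, <div>, </div> and </h1>
as separate tag tokens, which is the split the ksh pattern is aiming
for.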

Olga
-- 
      ,   _                                    _   ,
     { \/`o;====-    Olga Kryzhanovska   -====;o`\/ }
.----'-/`-/     [email protected]   \-`\-'----.
 `'-..-| /       http://twitter.com/fleyta     \ |-..-'`
      /\/\     Solaris/BSD//C/C++ programmer   /\/\
      `--`                                      `--`
