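This bracket expression is malformed:
        [^[><]]
It is parsed as two REs, the bracket expression
        [^[><]
followed by the literal character
        ]
To put a literal ']' in the set it has to come first, right after the '^':
        [^]><[]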
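
The same bracket-expression rule applies to any POSIX regex tool, so the difference between the two forms can be checked outside ksh. The following is a minimal sketch using sed rather than the libast engine; the function names `malformed` and `corrected` and the sample string are only for illustration and do not come from the original script.

```shell
#!/bin/sh
# Illustration only: sed stands in for the libast regex engine here.
text='More [TEXT].'

# Malformed form: parsed as the bracket expression "[^[><]"
# followed by a literal "]", so only "T]" is replaced.
malformed() { printf '%s\n' "$1" | sed 's/[^[><]]/_/g'; }

# Corrected form: "]" placed first after "^" is a literal member of the
# set, so every character except ']', '>', '<' and '[' is replaced.
corrected() { printf '%s\n' "$1" | sed 's/[^]><[]/_/g'; }

malformed "$text"   # prints: More [TEX_.
corrected "$text"   # prints: _____[____]_
```

With the malformed form the pattern needs a trailing ']' after each character it matches, which is why the text is not matched as intended.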

On Thu, 21 Jun 2012 02:07:20 +0200 Roland Mainz wrote:
> AFAIK we found a bug in the libast regex engine which manifests itself
> when it should match&&capture text with '[' characters.

> The following example (derived from Olga's previous work on a
> quick&&dirty XML document scanner) shows the issue (note the "[TEXT]"
> in variable "xmltext"):
> -- snip --
> xmltext='<h1><div> a text </div>More [TEXT].<!-- a comment (<disabled>) --></h1>'

> # parse
> dummy="${xmltext//~(Ex)(?:
>        (<!--.+-->)+?| # xml comments
>        (<.+>)+?|      # xml tags
>        ([^[><]]+)+?   # xml text
>        )/dummy}"

> # debug output
> printf 'dummy=%q\n' "${dummy}"
> print -v .sh.match

> # rebuild the original text, based on our matches
> nameref nodes_all=.sh.match[0]                # contains all matches
> nameref nodes_comments=.sh.match[1]   # contains only XML comment matches
> nameref nodes_tags=.sh.match[2]               # contains only XML tag matches
> nameref nodes_text=.sh.match[3]               # contains only XML text matches
> integer i
> for (( i = 0 ; i <= ${#nodes_all[@]} ; i++ )) ; do
>       [[ -v nodes_comments[i] ]] && printf '%s' "${nodes_comments[i]}"
>       [[ -v nodes_tags[i]     ]] && printf '%s' "${nodes_tags[i]}"
>       [[ -v nodes_text[i]     ]] && printf '%s' "${nodes_text[i]}"
> done
> printf '\n'
> -- snip --

> If I run the example I get the following output. The first sign of trouble
> is the '[' character in the "...dummydummy[dummy..." output. It looks
> like the '[' simply wasn't matched by any of the patterns:
> -- snip --
> $ ./arch/sol11.i386-64/bin/ksh xmlparse.sh
> dummy='dummydummydummydummydummydummydummydummydummydummydummydummydummydummydummydummy[dummydummydummydummydummydummydummydummy'
> (
>         (
>                 [0]='<h1>'
>                 [1]='<div>'
>                 [2]=' '
>                 [3]=a
>                 [4]=' '
>                 [5]=t
>                 [6]=e
>                 [7]=x
>                 [8]=t
>                 [9]=' '
>                 [10]='</div>'
>                 [11]=M
>                 [12]=o
>                 [13]=r
>                 [14]=e
>                 [15]=' '
>                 [16]=T
>                 [17]=E
>                 [18]=X
>                 [19]=T
>                 [20]=']'
>                 [21]=.
>                 [22]='<!-- a comment (<disabled>) -->'
>                 [23]='</h1>'
>         )
>         (
>                 [22]='<!-- a comment (<disabled>) -->'
>         )
>         (
>                 [0]='<h1>'
>                 [1]='<div>'
>                 [10]='</div>'
>                 [23]='</h1>'
>         )
>         (
>                 [2]=' '
>                 [3]=a
>                 [4]=' '
>                 [5]=t
>                 [6]=e
>                 [7]=x
>                 [8]=t
>                 [9]=' '
>                 [11]=M
>                 [12]=o
>                 [13]=r
>                 [14]=e
>                 [15]=' '
>                 [16]=T
>                 [17]=E
>                 [18]=X
>                 [19]=T
>                 [20]=']'
>                 [21]=.
>         )
> )
> <h1><div> a text </div>More TEXT].<!-- a comment (<disabled>) --></h1>
> -- snip --

> Glenn: What do you think? It looks like ([^[><]]+)+? does not
> generate matches for '[', right?

> ----

> Bye,
> Roland

> -- 
>   __ .  . __
>  (o.\ \/ /.o) roland.ma...@nrubsig.org
>   \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
>   /O /==\ O\  TEL +49 641 3992797
>  (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers
