Hi!

----

AFAIK we found a bug in the libast regex engine which manifests itself
when it should match & capture text containing '[' characters.

The following example (derived from Olga's previous work on a
quick & dirty XML document scanner) shows the issue (note the "[TEXT]"
in variable "xmltext"):
-- snip --
xmltext='<h1><div> a text </div>More [TEXT].<!-- a comment (<disabled>) --></h1>'

# parse
dummy="${xmltext//~(Ex)(?:
       (<!--.+-->)+?|   # xml comments
       (<.+>)+?|        # xml tags
       ([^[><]]+)+?     # xml text
       )/dummy}"

# debug output
printf 'dummy=%q\n' "${dummy}"
print -v .sh.match

# rebuild the original text, based on our matches
nameref nodes_all=.sh.match[0]          # contains all matches
nameref nodes_comments=.sh.match[1]     # contains only XML comment matches
nameref nodes_tags=.sh.match[2]         # contains only XML tag matches
nameref nodes_text=.sh.match[3]         # contains only XML text matches
integer i
for (( i = 0 ; i < ${#nodes_all[@]} ; i++ )) ; do
        [[ -v nodes_comments[i] ]] && printf '%s' "${nodes_comments[i]}"
        [[ -v nodes_tags[i]     ]] && printf '%s' "${nodes_tags[i]}"
        [[ -v nodes_text[i]     ]] && printf '%s' "${nodes_text[i]}"
done
printf '\n'
-- snip --

If I run the example I get the following output. The first sign of
trouble is the '[' character in the "...dummydummy[dummy..." output:
it looks like the '[' simply wasn't matched by any of the patterns.
The rebuilt text on the last line is missing the '[', too:
-- snip --
$ ./arch/sol11.i386-64/bin/ksh xmlparse.sh
dummy='dummydummydummydummydummydummydummydummydummydummydummydummydummydummydummydummy[dummydummydummydummydummydummydummydummy'
(
        (
                [0]='<h1>'
                [1]='<div>'
                [2]=' '
                [3]=a
                [4]=' '
                [5]=t
                [6]=e
                [7]=x
                [8]=t
                [9]=' '
                [10]='</div>'
                [11]=M
                [12]=o
                [13]=r
                [14]=e
                [15]=' '
                [16]=T
                [17]=E
                [18]=X
                [19]=T
                [20]=']'
                [21]=.
                [22]='<!-- a comment (<disabled>) -->'
                [23]='</h1>'
        )
        (
                [22]='<!-- a comment (<disabled>) -->'
        )
        (
                [0]='<h1>'
                [1]='<div>'
                [10]='</div>'
                [23]='</h1>'
        )
        (
                [2]=' '
                [3]=a
                [4]=' '
                [5]=t
                [6]=e
                [7]=x
                [8]=t
                [9]=' '
                [11]=M
                [12]=o
                [13]=r
                [14]=e
                [15]=' '
                [16]=T
                [17]=E
                [18]=X
                [19]=T
                [20]=']'
                [21]=.
        )
)
<h1><div> a text </div>More TEXT].<!-- a comment (<disabled>) --></h1>
-- snip --

Glenn: What do you think? It looks like ([^[><]]+)+? does not
generate matches for '[', right?
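
For comparison, here is a small cross-check against a different ERE
implementation. This is only a sketch under two assumptions: that the
"xml text" alternative is meant to match any run of characters containing
neither '<' nor '>', and that the local grep supports -o (GNU and BSD
grep do, POSIX does not require it). The helper name "text_runs" is made
up for this example. It says nothing about how libast parses the
original bracket expression; it only shows the result I would expect for
the text node:

```shell
# Cross-check with grep instead of ksh/libast.
# Assumed intent of the "xml text" alternative: one or more
# characters that are neither '<' nor '>'.
text_runs() {
        # print each maximal run matching the class on its own line
        printf '%s\n' "$1" | grep -oE '[^<>]+'
}

# the text node from the example above, including the '[' and ']'
text_runs 'More [TEXT].'
```

Here the whole text node comes back as a single match, with both '['
and ']' included.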

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.ma...@nrubsig.org
  \__\/\/__/  MPEG specialist, C&JAVA&Sun&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)

_______________________________________________
ast-developers mailing list
ast-developers@research.att.com
https://mailman.research.att.com/mailman/listinfo/ast-developers
