I need to parse a Word doc that was saved as HTML. Word creates a lot of junk when saving as HTML, and I have managed to strip most of it out with string replacements.

One remaining item has me baffled: nested span tags. Word seems to create a lot of nested, empty span tags. Removing them seemed easy enough with a regex, but it soon proved difficult.

For example, I have the following piece of HTML from the document:

<span style='font-size:11.0pt;mso-bidi-font-size: 9.0pt;font-weight:normal'>sentence number one.<span style=\"mso-spacerun: yes\"> </span>Sentence number 2.<span style=\"mso-spacerun: yes\"> </span>Sentence number 3.<span style=\"mso-spacerun: yes\"> </span>Sentence Number 4.<span style=\"mso-spacerun: yes\"> </span></span>

Notice the nested span tags that contain nothing but whitespace between the opening and closing tag. I want these removed so that my final version looks like this:

<span style='font-size:11.0pt;mso-bidi-font-size: 9.0pt;font-weight:normal'>sentence number one. Sentence number 2. Sentence number 3. Sentence Number 4.</span>

The problem is that when I search for a pattern like this:
<span.*?>\\s*?</span>
it matches from the first opening span tag through the first closing span tag. So it matches this:
<span style='font-size:11.0pt;mso-bidi-font-size: 9.0pt;font-weight:normal'>sentence number one.<span style=\"mso-spacerun: yes\"> </span>


Thinking about it, that makes sense: I am telling it to match any number of characters after the word span in the opening tag, and to keep going until it finds a greater-than sign followed by optional whitespace and a closing span tag. Since "any character" includes the greater-than sign that ends the outer opening tag, the match runs on into the nested tag. So it is doing what I asked it to do, but not what I want it to do.
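The behavior described above can be reproduced with a short sketch. It assumes java.util.regex rather than the RE class used in the patterns below, so details may differ between the two engines. The class and method names (EmptySpanDemo, firstMatch, stripEmptySpans) and the shortened sample string are hypothetical, made up for illustration. The second pattern is one possible alternative, not a confirmed fix: it uses [^>]* so that the opening tag cannot run past its own greater-than sign.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptySpanDemo {
    // The lazy pattern from the post: '.' may also consume '>' characters,
    // so the match can start at the outer tag and run into the nested one.
    public static final Pattern LAZY = Pattern.compile("<span.*?>\\s*?</span>");

    // Hypothetical alternative: [^>]* can never cross the '>' that closes
    // the opening tag, so only whitespace-only spans can match.
    public static final Pattern EMPTY_SPAN = Pattern.compile("<span[^>]*>\\s*</span>");

    /** Returns the first substring matched by the pattern, or null if none. */
    public static String firstMatch(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? m.group() : null;
    }

    /** Replaces every whitespace-only span with a single space. */
    public static String stripEmptySpans(String html) {
        return EMPTY_SPAN.matcher(html).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the same shape as the Word output.
        String html = "<span style='font-weight:normal'>One."
                + "<span style=\"mso-spacerun: yes\"> </span>Two."
                + "<span style=\"mso-spacerun: yes\"> </span></span>";
        System.out.println(firstMatch(LAZY, html));       // outer tag through first </span>
        System.out.println(firstMatch(EMPTY_SPAN, html)); // only the inner empty span
        System.out.println(stripEmptySpans(html));
    }
}
```

Replacing each empty span with a single space leaves one space before the final closing tag, so it would need trimming to match the desired output exactly. The sketch also assumes that no attribute value contains a greater-than sign.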

It seems like this should be a simple thing to do, so I assume I am just not familiar enough with regex syntax. Any help on this matter would be hugely appreciated. Here are some of the other patterns I have tried. Some of them might be far-fetched, but I was trying anything out of desperation.

RE r = new RE("(<span(.*>\\s*?)</span>)");
RE r = new RE("(<span(.*?)>\\s*</span>)");
RE r = new RE("(<span(.*?)(>\\s*?<)/span>)");
RE r = new RE("(<span.+?>\\s+?</span>)");
RE r = new RE("(<span[^<span]*>\\s*</span>)");
RE r = new RE("<span[^>]*(>\\s*<)/span>");
RE r = new RE("<span.*?>(.*?)</span>"); //no good. pulls out the double span if nested.
RE r = new RE("<span.*?>([^abc]*?)</span>");
RE r = new RE("<span.*?[^<span.*>](></span>)"); //try to force it to not have a nested span
RE r = new RE(">\\s*</span>"); //ok, let's just look for the empty span stuff without opening span


Thanks,

Cheryl!






