Okay gang, it's time we boosted our programming power here using the powerful syntactic language of Regular Expressions.

For those of you who don't know what a Regular _expression_ is, it is basically a small language of normal alphabetic characters fed into an interpreter which then applies the instructions (regular _expression_) to a string of data for reasons of pattern-matching, complex find and replace statements etc. For a real comprehensive overview see the following link

http://en.wikipedia.org/wiki/Regular_expression

I can't highly recommend the wikipedia enough so bookmark it!  

A Regular _expression_ is a powerful tool that you can use to:

  • Validate a specialised format base upon abritrary rules.
  • Extract specific portions of text from data.
  • Replace all the occurrences of a word or phrase based upon a complex set of determinants.
  • Clean up superfluous characters around data.
  • Split complex data into it's many parts.

So what does a regular _expression_ look like? Well it can look like the following:

"^(\d{1,3}\.){3}\d{1,3}$"

Scary looking eh? Not really . The above regular _expression_ search pattern example validates an IP address where the IP address begins at the very left of the variable with no spaces or characters before it, where each number in the IP address must have a minimum of one digit but not more than three and where the patten is repeated four times (sort of ).

Basically what we have is a language understood by the Regular _expression_. How do we use a regular _expression_ string in VBA or _vbscript_. First we need to create the _vbscript_ interpreter. All you do is the following:

Dim Regex as Object
Set Regex = CreateObject("_vbscript_.RegExp")

The Regex object we have created has a number of methods and properties...

PROPERTIES

  • Pattern
  • IgnoreCase
  • Global
METHODS
  • Test
  • Replace
  • Execute

Some other languages provide more properties such as Greedy, BeginOfLine etc...

WHAT THE PROPERTIES DO?

Pattern: This is the regular _expression_ string like the IP address one we saw earlier "^(\d{1,3}\.){3}\d{1,3}". You must have one of these.

IgnoreCase: This instructs the Regex object whether the pattern we have given it is to be treated as case-sensitive. For example suppose we were looking for the word "Frog" as a pattern; if IgnoreCase was set to False then the Regexp object would ignore all instances of the word frog, fROG or FROG and only match Frog. If IgnoreCase was set to true then the Regexp object would match all instances of the word Frog regardless of upper or lower case letters.

Global: This instructs the Regex object whether to match the pattern to the first found instance of the passed in pattern or match all instances within a string of characters. For example if we had a pattern "Frog" and a string "The Frog sat on a log. What a silly Frog" and we wanted to replace "Frog" with "Parrot". When we run the replace method upon the sentence we would either replace only the first instance leaving us with "The Parrot sat on a log. What a silly Frog" or if Global is set to True it would replace all instances leaving us with "The Parrot sat on a log. What a silly Parrot". This is a real simple example and ordinarily we would use the Replace function for this, but regular expressions perform more complex functions than the Replace function can.

WHAT THE METHODS DO?

Test: This method will test a passed in string against the regular _expression_ pattern for a match and return True if a match can be found or False if it cannot. As a simple example say we want to know if a sentence has the word Frog at the end of it. Using the $ sign which tells the Regexp object to look for any text preceding the dollar sign starting at the end of a passed sentence: Regexp.Pattern = "Frog$" Print Regexp.Test ("Here lies a Frog") The result is True. Had we put a space, a number or another letter at the end of the word Frog in the sentence then the answer would be False.

Replace: Replaces any instance of the found pattern with another value. For a more complex, if not silly example, which we could not do with the ordinary VB (VBA) Replace function. Say we want to replace any instance in a sentence where there are three of any alphabetical letters followed by a full stop and a single digit between 1 and 5 with the word "Monkey". Regexp.Pattern = "\s\w{3}\.\d{1}\s+" Regexp.Global = True Print Regexp.Replace("aaa.a fed.1 f1a.3 bbb.2 kljm.2 fff.8 ios.34", " monkey ") The result is "aaa.a monkey f1a.3 monkey kljm.2 monkey ios.34". This example is slightly contrived because I didn't cover some other scenarios that could have appeared within the sentence but you get the gist.

Execute: Retuns a collection of matches for each time the search pattern text is found within a sentence. For example, using the sentence above: Dim Matches as Object Regexp.Pattern = "\s\w{3}\.\d{1}\s+" Regexp.Global = True Set Matches = Regexp.Execute("aaa.a fed.1 f1a.3 bbb.2 kljm.2 fff.8 ios.34") The result is ['fed.1','bbb.2', 'fff.8']. This is handy if you need to extract certain text say from a document or field of data but do not have an exact idea of what the specific chunk of data is surrounded by.

Anyhow, for our first lesson let's write a regular _expression_ example that cleans return characters and other white space from the beginning and end of a string of data.

The syntax or code to deal with return characters, form feeds, tabs, new line and other white space  is \s . The \ is an escape character and tells the interpreter that the following letter, s, is a special command and not a regular letter. On it's own a \s simply means one white space character, but we want to remove any white space characters from the beginning or end of a string of data. If we follow the \s with a * asterisk then we tell the interpreter that we want it to match a pattern or zero or more spaces. (There are three modification signs that we can use , *, ?, + we will cover the differences between them in another lesson.) So as an example:

Say we had a piece of data like the following (where {return}{space} is a return or space character) and we just want the Hello portion of the string.

{return}{return}{space}{space}Hello{space}{return}

A single \s would just match with the first {return} and stop.
A \s* would match {return}{return}{space}{space} and then stop.

To tell the interpreter that we want to look at both the begining and the ending of the string of data for whitespaces we need a couple of other modifiers. The first one is the carat (Shift-Key and Number 6 on your keyboard) which looks like ^. When used as the first character in a regular _expression_ this tells the interpreter that the spaces must start at the very beginning of the string of data with no exceptions. For example:

Say{return}{return}{space}{space}Hello{space}{return}

The above string would return no matches for a ^\s* because the string starts with a character other than a white space character, S.
To find/match white spaces at the end of a string of data we modify the regular _expression_ by putting a $ dollar sign at the end of the _expression_. So \s*$ tells the interpreter the the preceding pattern is to be searched for a the end of a string. As in the above example there will be no matches if the string of data ends with a character other than a white space, eg:

{return}{return}{space}{space}Hello{space}{return}There

The above string would return no matches for a \s*$ because the string starts with a character other than a white space character, e.

Knowing how to modify our regular _expression_ how would we write our regular _expression_ to tell the regular _expression_ interpreter to only look for spaces and return characters at either end of our string of data? You might think to do the following,

regular_expression = "^\s*\s*$" or
regular_expression = "^\s*$"

but both of the above examples would not produce the results you are looking for. Why? Because the interpreter would read this as look for a complete string of white space from beginning to end. We want to tell the interpreter to look for space and return characters at the beginning or at the end of a string of data, because there is something else in the middle of the string we don't want the interpreter to be concerned with. To put in an or statement we use a | pipe character and () braces to separate the arguments. So our regular _expression_ would rightly look like the following:

regular_expression = "(^\s*)|(\s*$)"

The above _expression_ says look for zero to many continuous white spaces starting at the very beginning of a string of data or look for zero to many continuous white spaces starting at the end of a string of data.

Now we have our regular _expression_ worked out, how do we use it with the regular _expression_ interpreter to remove white space from a string of data. See the following example:

Const string_of_data as String = VbCrLf & "        Hello     " & VbCrLf
Dim result_data as String 'The result of our routine.
Dim regular_expression as String
Dim regex as Object

'Create the Microsoft Regular _expression_ Interpreter

Set regex = CreateObject("_vbscript_.RegExp")

'Create our regular _expression_
regular_expression = "(^\s*)|(\s*$)"

'Assign our regular _expression_ to interpreter
regex.Pattern = regular_expression

'Tell our interpreter that we want it to search the entire string, otherwise it will only match the beginning of the string and think it has done it's job. Basically Global says keep looking for matches don't just look for match to our regular _expression_ once.
regex.Global = True

'Replace all the white space in our string of data with nothing and pass back to our result variable.
result_data = regex.Replace(string_of_data, "")

msgbox result_data

After all of the explanation above this code is all it took to replace white space in a variable. This small example which includes the removal of Tabs, Return Characters, Spaces etc already serves it's purpose in making the Trim$ function redundant because Trim$ does not remove such a wide range of white space.

Using this above function in itself can remove copious amounts of extant code that loops through strings removing VbCrLf's. Try it and use it , regular expressions can be fun to create and watch work and with each lesson I'll introduce more complex regular expressions to split complex data better than the Split function can, perform complex search and 'find and replace' operations upon data.

For those who would like to forge ahead and explore regular expressions for themselves I recommend the following sites:

DevGuru (for a good overview of _vbscript_'s Regular _expression_ Interpreter)
http://www.devguru.com/Technologies/_vbscript_/quickref/regexp.html

Microsoft's overview of _vbscript_'s Regular _expression_ Interpreter which includes a table of Regular _expression_ characters and what they do with examples further down the page.
http://msdn.microsoft.com/library/default.asp?url="">

Bueno suerte!
Hasta la vez próxima,

Duane Hennessy.
Bandicoot Software
Tropical Queensland, Australia
(ABN: 33 682 969 957)

Want Increased Productivity?
http://www.bandicootsoftware.com.au




Please zip all files prior to uploading to Files section.




YAHOO! GROUPS LINKS




Reply via email to