Paul Tremblay wrote:
>
> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
>
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>
> @tokens = split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
>
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds.

A few points.

\n is already included in the \s character class, so [^\s\n] can be
written as [^\s].

Braces don't have to be backslashed inside a character class.

When one alternative can match a prefix of what another alternative
matches at the same position, put the longer one first, because Perl
takes the first alternative that succeeds. In your pattern (}) and
(\\}) start with different characters, so their relative order does
not matter, and the two backslash escapes (\\\\) and (\\}) can be
folded into one alternative, \\[\\}].

And finally, you don't need parens around each alternative. One group
around the whole alternation is enough, and it avoids the undef fields
that split returns for every group that did not take part in a match.

So your split could be simplified to:

    my @tokens = split /({\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|})/, $line;

HTH

John
-- 
use Perl;
program fulfillment
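
To illustrate the single-group split, here is a minimal sketch. The helper name tokenize_rtf_line and the sample line are made up for the example and do not come from the thread. The leading brace is escaped as \{ as a precaution for newer Perls. The second split reuses the five-group pattern from the question to show the undef fields it leaves in the result.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one RTF line into
# tokens using a single capture group around the whole alternation.
sub tokenize_rtf_line {
    my ($line) = @_;
    my @fields = split /(\{\\[^\s}{]+|\\[^\s\\}]+|\\[\\}]|\})/, $line;
    # split returns empty strings between adjacent delimiters; drop them.
    return grep { length } @fields;
}

my $sample = '{\b Hello \} there\par}';    # made-up test line
my @tokens = tokenize_rtf_line($sample);
print join('|', @tokens), "\n";            # prints {\b| Hello |\}| there|\par|}

# Same alternatives with one capture group each, as in the question.
# Every match then contributes undef for the groups that did not match.
my @raw = split /(\{\\[^\s}{]+)|(\\[^\s\\}]+)|(\\\\)|(\})|(\\\})/, $sample;
my $undef_count = grep { !defined } @raw;
print "undef fields with five groups: $undef_count\n";
```

With the single group, every element of the result is a defined string, so the token loop does not need to test for undef before looking at each token.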