On Sat, Jul 13, 2002 at 10:57:04PM -0400, Tanton Gibbs wrote:
> I'm not exactly sure what the problems are; however, here are a couple of
> things to try
> 1.) If you don't need to save the value of each of the subexpressions, then
> tell perl so by using ?: after each opening paren.

Once I tokenize the text, I don't use regular expressions at all. Here is an example of an rtf line:

    \pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so early, \rdblquote Joe said. \par

Here are the tokens:

    '\pard' : '\s1' : '\fi720' : '\ldblquote' : ' Big guy, I didn' : '\rquote' : ' t expect to see you so early,' : '\rdblquote' : ' Joe said. ' : '\par'

Each of the escaped sequences represents some type of info that I have to decide what to do with. I use the substr function to determine the nature of the token. Actually, I have simplified the list of tokens. My split function actually produced 31 empty ('') tokens for this one line. So perl is doing a lot of searching.

> 2.) Usually alternation is much slower than doing separate
> regexes...however, in your case separating the regexes is seemingly
> impossible.

I'm not sure what alternation is. But now I am thinking that regexes are really not at all impossible. Perhaps they require a little more thought. That's not what stopped me from using them. I thought that as I encountered more complex rtf, from different (and insidious) versions of Word, I would have to tweak my code so much that I wouldn't be able to maintain it.

However, on second thought, I don't think the problem is that complicated. Let's take a look at the line above. It starts with "\pard", which means "start a paragraph with a new style." The style names are stored in the escaped sequences afterwards. So this style name is "\s1" (style 1), followed by "\fi720". The fi means "first indent by 36 pts." There are a zillion other tokens, all of which I don't understand. What I need to know is when the text starts. I could just look for non-escaped text. But the '\ldblquote' actually marks the start of the text because it means "left quote." You can start to see some of the complexities and why I thought it better to handle one token at a time.

However, I was just playing around with perl.
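For reference, here is a minimal sketch of the token-at-a-time splitting described above. It is not my actual script: the name `tokenize_rtf_line` is hypothetical, and the control-word pattern is a simplifying assumption (a backslash, letters, an optional number, and one optional trailing space). It ignores control symbols such as `\{`. It shows how a capturing group in split keeps the control words, and how grep throws away the empty tokens.

```perl
use strict;
use warnings;

# Hypothetical helper: split one rtf line into control words and text runs.
sub tokenize_rtf_line {
    my ($line) = @_;

    # The capturing group makes split return the delimiters (the control
    # words) along with the text between them.
    # Assumed control-word shape: backslash, letters, optional signed
    # number, and at most one trailing space.
    my @pieces = split /(\\[a-zA-Z]+-?\d*[ ]?)/, $line;

    # split leaves '' between adjacent control words; drop those.
    return grep { length } @pieces;
}

my $line = '\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect '
         . 'to see you so early, \rdblquote Joe said. \par';

my @tokens = tokenize_rtf_line($line);
print join('|', @tokens), "\n";
```

Because the one space after a control word is treated as part of that control word here, the text tokens come out without a leading space, which differs slightly from the token list I showed.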
I substituted every instance of \ldblquote and 4 other control sequences (right quote, em-dash, tab, and right curly). That only took 4 seconds for a 1.8 megabyte document. So I am thinking of doing the simple substitutions first, and then proceeding. For example, if I substitute

    s{\\ldblquote}{<lft_quote/>}g;
    s{\\rdblquote}{<rt_quote/>}g;

then my line looks like this:

    \pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect to see you so early, <rt_quote/> Joe said. \par

Now I can substitute:

    s/\\pard(.*?)\s(?=[^\\])/<para style="$1">/;
    # pard, followed by a space, followed by any character
    # that is not a backslash (the lookahead keeps that
    # character from being swallowed by the match)

The most difficult part will be dealing with footnotes. They look something like this:

    {\footnote \pard \fi720 {\i italics word} text {\b bold words}}

This line contains a nested structure, and I have to determine when it ends, because the paragraph styles are independent of the styles in the main body. For this I will have to use //g as you suggested, and keep counting the open and closed brackets until the count is back to zero.

One last note on why I think I can change my strategy. An rtf line can look like this:

    \pard He was reading {
    \i The Sun Also Rises} when he heard the dog bark.\par

This line should look like

    \pard He was reading {\i The Sun Also Rises} ...

In other words, rtf is so screwed up that it even splits tokens across lines. However, I just read the Perl Cookbook and realize I can do this:

    $/ = "\\par";   # read one record per \par instead of per line
    <read in each record>
    s/\n//g;        # get rid of line endings. This will work. The only
                    # line ending should come at the \par delimiter

Also, rtf does this:

    \pard {\i The Sun Also Rises \par }

I have to read the whole file in and switch it so it reads:

    \pard {\i The Sun Also Rises} \par

I tried this on my big document, and it took only 4 tenths of a second.

In sum, I am thinking that regexes are so super fast in perl that if I choose carefully what to substitute first, I can parse my document much faster.

Thanks!

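Here is a rough sketch of the bracket counting for footnotes mentioned above, assuming the whole footnote is already in one string. The name `extract_group` is hypothetical, and treating any backslash-plus-character pair as "skip it" is an assumption made so that escaped braces (`\{` and `\}`) are not counted. It walks the string with //g and stops when the count of open brackets returns to zero.

```perl
use strict;
use warnings;

# Hypothetical helper: return the balanced {...} group that begins at
# offset $start, or nothing if the brackets never balance.
sub extract_group {
    my ($text, $start) = @_;
    return unless substr($text, $start, 1) eq '{';

    my $depth = 0;
    pos($text) = $start;    # make //g begin scanning at the open bracket

    # Visit only backslash pairs and bare brackets, in order.
    while ($text =~ /(\\.|[{}])/gs) {
        my $tok = $1;
        next if length($tok) == 2;    # \{, \}, \\ or the start of a control word
        $depth += $tok eq '{' ? 1 : -1;
        if ($depth == 0) {
            # pos() now sits just past the closing bracket.
            return substr($text, $start, pos($text) - $start);
        }
    }
    return;    # ran off the end with brackets still open
}

my $rtf = 'Main text.{\footnote \pard \fi720 {\i italics word} '
        . 'text {\b bold words}} More text.';

my $start = index($rtf, '{\footnote');
my $note  = extract_group($rtf, $start);
print "$note\n";
```

Once the footnote group is pulled out this way, its paragraph styles can be handled separately from those of the main body.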
--
Paul Tremblay
[EMAIL PROTECTED]