<posted & mailed> I expect you've seen it, but Microsoft have the RTF spec on MSDN:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/RTFSpec_2.asp

They've got a fairly detailed description of a "Sample RTF Reader Application" in there, which might be useful in determining the best approach to this problem.

HTH,
Daniel

Paul Tremblay wrote:

> On Sat, Jul 13, 2002 at 10:57:04PM -0400, Tanton Gibbs wrote:
>>
>> I'm not exactly sure what the problems are; however, here are a couple of
>> things to try
>> 1.) If you don't need to save the value of each of the subexpressions,
>> then tell perl so by using ?: after each opening paren.
>
> Once I tokenize the text, I don't use regular expressions at all.
>
> Here is an example of an rtf line:
>
> \pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so
> early,
> \rdblquote Joe said. \par
>
> Here are the tokens:
>
> '\pard':'\s1':'\fi720':'\ldblquote':' Big guy, I didn':
> '\rquote':' t expect to see you so early,':'\rdblquote':
> ' Joe said. ':'\par'
>
> Each of the escaped sequences represents some type of info that I
> have to decide what to do with. I use the substr function to
> determine the nature of the token.
>
> Actually, I have simplified the list of tokens. My split
> function actually produced 31 empty ('') tokens for this one
> line. So perl is doing a lot of searching.
>
>> 2.) Usually alternation is much slower than doing separate
>> regexes...however, in your case separating the regexes is seemingly
>> impossible.
>
> I'm not sure what alternation is. But now I am thinking that
> regexes are really not at all impossible. Perhaps they require a
> little more thought. That's not what stopped me from using them.
> I thought that as I encountered more complex rtf, with
> different (and insidious) versions of Word, I would have to
> tweak my code so much that I wouldn't be able to maintain it.
> However, on second thought, I don't think the problem is that
> complicated.
>
> Let's take a look at the line above.
> It starts with "\pard"; this
> means "start a paragraph with a new style." The style names are
> stored in the escaped sequences afterwards. So this style name is
> "\s1" (style 1), followed by "\fi720". The fi means "first indent by 36 pts."
> There are a zillion other tokens, all of which I don't
> understand. What I need to know is when the text starts. I could
> just look for non-escaped text. But the '\ldblquote' actually
> marks the start of the text because it means "left quote."
>
> You can start to see some of the complexities and why I thought
> it better to handle one token at a time. However, I was just
> playing around with perl. I substituted every instance of
> \ldblquote and 4 other control sequences (right quote, em-dash,
> tab, and right curly). That only took 4 seconds for a 1.8
> megabyte document.
>
> So I am thinking of doing the simple substitutions first, and
> then proceeding. For example, if I substitute
>
> s/\\ldblquote/<lft_quote\/>/g;
> s/\\rdblquote/<rt_quote\/>/g;
>
> then my line looks like this:
>
> \pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect to see you so
> early,
> <rt_quote/> Joe said. \par
>
> Now I can substitute:
>
> s/\\pard(.*?)\s(?=[^\\])/<para style="$1">/; # \pard, followed by a
>                                              # space, followed by
>                                              # any character that
>                                              # is not a backslash
>                                              # (checked, not consumed)
>
> The most difficult part will be dealing with footnotes. They look
> something like this:
>
> {\footnote \pard \fi720 {\i italics word} text {\b bold words}}
>
> This line contains a nested structure, and I have to determine
> when it ends, because the paragraph styles are independent of the
> styles in the main body. For this I will have to use //g as you
> suggested, and keep counting the opening and closing brackets until
> the count reaches zero.
>
> One last note on why I think I can change my strategy.
> An rtf line can look like this:
>
> \pard He was reading {
> \i The Sun Also Rises} when he heard the dog bark.\par
>
> This line should look like
>
> \pard He was reading {\i The Sun Also Rises} ...
>
> In other words, rtf is so screwed up that it even splits tokens
> across lines. However, I just read the Perl Cookbook and realized
> I can do this:
>
> $/ = "\\par";
> <read in each line>
> s/\n//g; # get rid of line endings. This will work. The only
>          # line ending should come at the \par delimiter
>
> Also, rtf does this
>
> \pard {\i The Sun Also Rises \par
> }
>
> I have to read the whole file in and switch it so it reads:
>
> \pard {\i The Sun Also Rises} \par
>
> I tried this on my big document, and it took only 4 tenths of a
> second.
>
> In sum, I am thinking that the regexes are so super fast in perl
> that if I choose carefully what to substitute first, I can parse
> my document much faster.
>
> Thanks!
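
On the footnote point in the quoted message: counting braces until the depth returns to zero could be sketched roughly as below. This is only an illustration, not code from the thread. The name extract_footnote is made up, and the sketch assumes the whole RTF text is already in one string and that \{ and \} are escaped literal braces that must not be counted.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first {\footnote ...} group in $rtf,
# including its outer braces, or undef if there is none or it never closes.
sub extract_footnote {
    my ($rtf) = @_;
    my $start = index($rtf, '{\footnote');
    return if $start < 0;

    my $depth = 0;
    pos($rtf) = $start;    # begin scanning at the opening brace
    # Each pass skips plain text, then captures either a backslash plus one
    # character (start of a control word, or an escaped brace) or a bare brace.
    while ($rtf =~ /\G[^\\{}]*(\\.|[{}])/gs) {
        my $tok = $1;
        next if length($tok) == 2;    # backslash sequence: not a group delimiter
        $depth += $tok eq '{' ? 1 : -1;
        # Depth back at zero means the footnote group just closed.
        return substr($rtf, $start, pos($rtf) - $start) if $depth == 0;
    }
    return;    # ran out of input before the group closed
}

my $line = '\pard Body{\footnote \pard \fi720 {\i italics word} text {\b bold words}} more.\par';
my $note = extract_footnote($line);
print defined $note ? "$note\n" : "no complete footnote found\n";
```

Because the scan uses //g with \G, it resumes where the previous match stopped, so the string is walked only once no matter how deeply the groups nest.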