Re: script too slow?

intexo Mon, 15 Jul 2002 00:34:51 -0700

<posted & mailed>

I expect you've seen it, but Microsoft have the RTF spec on MSDN:


http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnrtfspec/html/RTFSpec_2.asp

They've got a fairly detailed description of a "Sample RTF Reader 
Application" in there, which might be useful in determining the best 
approach to this problem.

HTH,
Daniel


Paul Tremblay wrote:

> On Sat, Jul 13, 2002 at 10:57:04PM -0400, Tanton Gibbs wrote:
>>
>> 
>> I'm not exactly sure what the problems are; however, here are a couple of
>> things to try
>> 1.) If you don't need to save the value of each of the subexpressions,
>> then tell perl so by using ?: after each opening paren.
> 
> Once I tokenize the text, I don't use regular expressions at all.
> 
> Here is an example of an rtf line:
> 
> 
> \pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect to see you so
> early,
> \rdblquote  Joe said. \par
> 
> Here are the tokens:
> 
> '\pard':'\s1':'\fi720':'\ldblquote':' Big guy, I didn':
> '\rquote':'t expect to see you so early,':'\rdblquote':
> '  Joe said. ':'\par'
> 
> Each of the escaped sequences represents some type of info that I
> have to decide what to do with. I use the substr function to
> determine the nature of the token.
> 
> Actually, I have simplified the list of tokens. My split
> function actually produced 31 empty ('') tokens for this one
> line. So perl is doing a lot of searching.
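The empty tokens come from split itself: whenever two control words sit next to each other, split returns the empty string between them. Here is a minimal sketch of one way to tokenize without them. The pattern and the name tokenize are my own guesses, not your code, and it only handles plain control words (lowercase letters plus an optional number), not control symbols such as \{ or \'hh.

```perl
use strict;
use warnings;

# Split a line into control words and text runs.
# The capturing parens make split return the control words too;
# the optional trailing space is the delimiter that ends a control word.
sub tokenize {
    my ($line) = @_;
    return grep { length } split /(\\[a-z]+-?\d*) ?/, $line;
}

my @tokens = tokenize('\pard \s1\fi720 \ldblquote Big guy, I didn\rquote t expect \par');
print join(':', map { "'$_'" } @tokens), "\n";
```

The grep { length } is what drops the empty strings, so the rest of the script never has to look at them.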
> 
> 
>> 2.) Usually alternation is much slower than doing separate
>> regexes...however, in your case separating the regexes is seemingly
>> impossible.
> 
> I'm not sure what alternation is. But now I am thinking that
> regexes are really not at all impossible. Perhaps they require a
> little more thought. That's not what stopped me from using them.
> I thought that as I encountered more complex rtf, with
> different (and insidious) versions of word, I would have to
> tweak my code so much that I wouldn't be able to maintain it.
> However, on second thought, I don't think the problem is that
> complicated.
> 
> Let's take a look at the line above. It starts with "\pard", which
> means "start a paragraph with a new style." The style information is
> stored in the escaped sequences afterwards. So this style is
> "\s1" (style 1) and "\fi720". The fi means "first indent by 36 pts."
> There are a zillion other tokens, all of which I don't
> understand. What I need to know is when the text starts. I could
> just look for non-escaped text. But the '\ldblquote' actually
> marks the start of the text because it means "left quote."
> 
> You can start to see some of the complexities and why I thought
> it better to handle one token at a time. However, I was just
> playing around with perl. I substituted every instance of
> \ldblquote and 4 other control sequences (right quote, em-dash,
> tab, and right curly). That only took 4 seconds for a 1.8
> megabyte document.
> 
> So I am thinking of doing the simple substitutions first, and
> then proceeding. For example, if I substitute
> 
> s/\\ldblquote/<lft_quote\/>/g;
> s/\\rdblquote/<rt_quote\/>/g;
> 
> then my line looks like this:
> 
> \pard \s1\fi720 <lft_quote/> Big guy, I didn\rquote t expect to see you so
> early,
> <rt_quote/>  Joe said. \par
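On the earlier question about alternation: it is just the | inside a pattern, as in ldblquote|rdblquote. For simple one-to-one replacements like these, a lookup table plus one alternation lets a single pass over the text do all of them. A minimal sketch, assuming the element names used above; the table and the sub name are only for illustration:

```perl
use strict;
use warnings;

# Control word => replacement text (illustrative, extend as needed).
my %replacement = (
    ldblquote => '<lft_quote/>',
    rdblquote => '<rt_quote/>',
);

# Build one alternation from the table: ldblquote|rdblquote
my $words = join '|', map { quotemeta } sort keys %replacement;

# The lookahead stops a short word from matching the front of a longer one.
my $re = qr/\\($words)(?![a-z])/;

sub replace_simple_controls {
    my ($text) = @_;
    $text =~ s/$re/$replacement{$1}/g;
    return $text;
}

print replace_simple_controls('\pard \s1\fi720 \ldblquote Big guy'), "\n";
```

Whether one alternation beats several separate s///g passes on a 1.8 MB file is something to time rather than assume.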
> 
> now I can substitute:
> 
> s/\\pard(.*?)\s(?=[^\\])/<para style="$1">/;    # pard, followed by a
> # space, followed by (but not consuming)
> # any character that
> # is not a backslash
> 
> 
> The most difficult part will be dealing with footnotes. They look
> something like this:
> 
> {\footnote \pard \fi720 {\i italics word} text {\b bold words}}
> 
> This line contains a nested structure, and I have to determine
> when it ends, because the paragraph styles are independent of the
> styles in the main body. For this I will have to use //g as you
> suggested, and keep counting the open and closed brackets until
> they equal zero.
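The bracket counting could look something like the sketch below. It assumes you already know the offset of the opening brace, and it skips backslash-escaped characters so that \{ and \} in the text do not disturb the count. The name extract_group is made up for the example.

```perl
use strict;
use warnings;

# Return the brace group that starts at offset $start in $rtf,
# including both braces, or nothing if the group never closes.
sub extract_group {
    my ($rtf, $start) = @_;
    my $depth = 0;
    pos($rtf) = $start;                   # begin scanning at the opening brace
    while ($rtf =~ /(\\.|[{}])/gs) {
        my $tok = $1;
        next if length($tok) == 2;        # backslash plus one char: not a real brace
        $depth += $tok eq '{' ? 1 : -1;
        return substr($rtf, $start, pos($rtf) - $start) if $depth == 0;
    }
    return;
}

my $rtf   = 'text {\footnote \pard \fi720 {\i italics word} text {\b bold words}} more';
my $start = index($rtf, '{\footnote');
print extract_group($rtf, $start), "\n";
```

Once the footnote is pulled out as one string, its paragraph styles can be handled separately from the main body.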
> 
> One last note on why I think I can change my strategy. an rtf
> line can look like this:
> 
> \pard He was reading {
> \i The Sun Also Rises} when he heard the dog bark.\par
> 
> This line should look like
> 
> \pard He was reading {\i The Sun Also Rises} ...
> 
> In other words, rtf is so screwed up that it even splits tokens
> across lines. However, I just read the Perl Cookbook and realize
> I can do this:
> 
> $/ = "\\par";
> <read in each line>
> s/\n//g;      # get rid of line endings. This will work. The only
> # line ending should come at the \par delimiter
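One caution about using "\par" as the record separator: the input record separator is matched as a literal string, not a regex, so a record also ends in the middle of every "\pard". A rough alternative, assuming the whole file fits in memory, is to slurp it and split with a pattern that can tell the two apart. The sub names are made up, and escaped backslashes in front of "par" are ignored here.

```perl
use strict;
use warnings;

# Read a whole file into one string by undefining the record separator.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    local $/;
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Split RTF text into paragraphs, each one ending in \par.
sub split_paragraphs {
    my ($text) = @_;
    $text =~ tr/\r\n//d;    # drop line endings, like the s/\n//g above
    # Break after each \par that is not the start of a longer word (\pard).
    return split /(?<=\\par)(?![a-z])/, $text;
}

my @paragraphs = split_paragraphs(
    "\\pard He was reading {\n\\i The Sun Also Rises} when he heard the dog bark.\\par\n"
);
print "$_\n" for @paragraphs;
```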
> 
> Also, rtf does this
> 
> \pard {\i The Sun Also Rises \par
> }
> 
> I have to read the whole file in and switch it so it reads:
> 
> \pard {\i The Sun Also Rises} \par
> 
> I tried this on my big document, and it took only 4 tenths of a
> second.
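For the record, that switch can be done with one substitution over the slurped text. This is only a guess at what you ran; the pattern and the sub name are mine, and it assumes nothing but spaces or line endings sits between the \par and the closing brace.

```perl
use strict;
use warnings;

# Move a closing brace that follows \par back in front of it:
#   "... Rises \par\n}"  becomes  "... Rises} \par"
sub close_group_before_par {
    my ($text) = @_;
    $text =~ s/ ?\\par[ \r\n]*\}/} \\par/g;
    return $text;
}

print close_group_before_par("\\pard {\\i The Sun Also Rises \\par\n}"), "\n";
```

Because the pattern needs a closing brace right after \par, \pard is left alone.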
> 
> In sum, I am thinking that regexes are so fast in perl
> that if I choose carefully what to substitute first, I can parse
> my document much faster.
> 
> Thanks!
> 
> 


