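I'm not sure exactly what the problem is, but here are a few things to try:

1.) If you don't need to save the value of each subexpression, tell perl so
by putting ?: after the opening paren, i.e. (?:...) instead of (...). Keep in
mind that split only returns the delimiters it matched when they are
captured, so wrap the whole alternation in a single capturing group rather
than removing the captures altogether. With five separate groups, split also
returns an undef for every group that did not take part in a match.
2.) Alternation is usually much slower than running separate regexes.
In your case, however, separating the regexes seems impossible.
3.) Have you tried m//g, processing one token at a time instead of saving
them all into an array?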
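
Here is a rough sketch of how 1.) and 3.) might fit together. It is only an
illustration under a few assumptions: the names $token_re and tokenize and the
callback interface are made up for this example, the alternatives are copied
from your split pattern (with the braces escaped), and I have not timed it
against your 1.8 megabyte file, so please measure before relying on it.

```perl
use strict;
use warnings;

# Hypothetical token pattern: the five alternatives from the original split,
# in the same order, wrapped in ONE capturing group instead of five.
my $token_re = qr/(\{\\[^\s\n\}\{]+|\\[^\s\n\\\}]+|\\\\|\}|\\\})/;

# Hypothetical helper: walk the line with m//g and hand each piece (plain
# text or token) to a callback instead of building one big array first.
sub tokenize {
    my ($line, $callback) = @_;
    my $pos = 0;    # end of the previous token
    while ($line =~ /$token_re/g) {
        # Copy the match data before the callback runs any regex of its own.
        my ($token, $start, $end) = ($1, $-[0], $+[0]);
        # Plain text sitting between the previous token and this one.
        $callback->(substr($line, $pos, $start - $pos)) if $start > $pos;
        $callback->($token);
        $pos = $end;
    }
    # Whatever text is left after the last token.
    $callback->(substr($line, $pos)) if $pos < length $line;
    return;
}

# Example use: print each piece on its own line, in brackets.
tokenize('{\rtf1 Hello \b world\b0 }', sub { print "[$_[0]]\n" });
```

Unlike split, this never produces undef or empty fields, so the code that
consumes the tokens does not have to filter them out.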

Good luck!
Tanton
----- Original Message -----
From: "Paul Tremblay" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 13, 2002 10:35 PM
Subject: script too slow?


> I just finished my first version of a script that converts rtf to
> xml and was wondering if I went about writing it the wrong way.
>
> My method was to read in one line at a time and split the lines
> into tokens, and then to read one token at a time. I used this
> line to split up the text:
>
> @tokens =
>   split(/({\\[^\s\n\}{]+)|(\\[^\s\n\\}]+)|(\\\\)|(})|(\\})/,$line);
>
> Splitting up the text on my test file of 1.8 megabytes took 25
> seconds. The entire script took 50 seconds.
>
> I had written a previous uncompleted version in which I relied on
> regular expressions rather than tokens, and this script took only
> 10 seconds to run. I gave up on this method because it seemed
> there would always be an exception that would require another
> regexp.
>
> So why does splitting a text into tokens take so long? Has
> anybody done something similar to what I am trying, and do you
> have any advice?
>
> The good news is that relatively speaking, perl is very, very
> fast. I tried a similar script in python using a lexer called
> plex, and the 1.8 megabyte file took 12 minutes to parse!
>
> In case you are wondering why I'm seemingly obsessed with speed,
> I would like to make this script available to anyone. Right now
> the only free utility for converting rtf to xml is a java
> utility called majix, which deletes your footnotes and only allows
> for 9 user-defined styles. If my perl script is too slow, it won't be
> very useful.
>
> Thanks
>
> Paul
>
>
> --
>
> ************************
> *Paul Tremblay         *
> *[EMAIL PROTECTED]*
> ************************
>
> --
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

