Barak, Ron wrote:
> Hi,
>
> In the attached script, the longest time is spent in the following
> functions (verified by psyco log):
I can't help but wonder why, and whether, you really need all the
rigmarole with file pointers, offsets, and tell() instead of simply:

    for line in open(...):
        # do your processing
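To make that concrete, here is a minimal sketch of the plain-iteration approach, keeping the percentage feedback you said you need. The names (match_lines, path) are mine, not from your script, and the progress figure counts characters rather than bytes, so treat it as approximate:

```python
import os
import re
import sys

def match_lines(path, regex):
    """Yield (stripped_line, groups) for each line of path matching regex.

    The file object is its own line iterator, so no tell()/seek()
    bookkeeping is needed; progress is estimated from characters read.
    """
    total = float(os.path.getsize(path)) or 1.0  # avoid division by zero
    done = 0
    with open(path) as f:
        for line in f:
            done += len(line)  # approximate: characters, not bytes
            sys.stderr.write("%.2f%%\r" % (100.0 * done / total))
            if regex.match(line):
                yield line.rstrip("\n"), regex.findall(line)
```

Iterating the file object directly lets Python buffer the reads for you, which is usually much faster than readline() interleaved with tell() calls.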
>     def match_generator(self, regex):
>         """
>         Generate the next line of self.input_file that
>         matches regex.
>         """
>         generator_ = self.line_generator()
>         while True:
>             self.file_pointer = self.input_file.tell()
>             if self.file_pointer != 0:
>                 self.file_pointer -= 1
>             if (self.file_pointer + 2) >= self.last_line_offset:
>                 break
>             line_ = generator_.next()
>             print "%.2f%% \r" % (((self.last_line_offset -
>                 self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
>             if not line_:
>                 break
>             else:
>                 match_ = regex.match(line_)
>                 groups_ = re.findall(regex, line_)
>                 if match_:
>                     yield line_.strip("\n"), groups_
>     def get_matching_records_by_regex_extremes(self, regex_array):
>         """
>         Function will:
>         Find the record matching the first item of regex_array.
>         Will save all records until the last item of regex_array.
>         Will save the last line.
>         Will remember the position of the beginning of the next line in
>         self.input_file.
>         """
>         start_regex = regex_array[0]
>         end_regex = regex_array[len(regex_array) - 1]
>         all_recs = []
>         generator_ = self.match_generator
>         try:
>             match_start, groups_ = generator_(start_regex).next()
>         except StopIteration:
>             return(None)
>         if match_start != None:
>             all_recs.append([match_start, groups_])
>             line_ = self.line_generator().next()
>             while line_:
>                 match_ = end_regex.match(line_)
>                 groups_ = re.findall(end_regex, line_)
>                 if match_ != None:
>                     all_recs.append([line_, groups_])
>                     return(all_recs)
>                 else:
>                     all_recs.append([line_, []])
>                     line_ = self.line_generator().next()
>     def line_generator(self):
>         """
>         Generate the next line of self.input_file, and update
>         self.file_pointer to the beginning of that line.
>         """
>         while self.input_file.tell() <= self.last_line_offset:
>             self.file_pointer = self.input_file.tell()
>             line_ = self.input_file.readline()
>             if not line_:
>                 break
>             yield line_.strip("\n")
> I was trying to think of optimisations so I could cut down on the
> processing time, but got no inspiration.
> (I need the "print "%.2f%% \r" ..." line for the user's feedback.)
>
> Could you suggest any optimisations?
>
> Thanks,
> Ron.
>
> P.S.: Examples of processing times are:
> * 2m42.782s on two files with a combined size of 792544 bytes
>   (no matches found).
> * 28m39.497s on two files with a combined size of 4139320 bytes
>   (783 matches found).
> These times are quite unacceptable, as a normal input to the program
> would be ten files with a combined size of ~17 MB.
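One further observation: calling re.findall() on a line that match() has already matched scans the line a second time; if you only need the groups captured by the match itself, match_.groups() is free. Also, self.line_generator() builds a brand-new generator on every loop iteration. Below is a sketch of a possible single-pass rewrite of get_matching_records_by_regex_extremes, under the assumption that a record runs from the first start-regex match to the next end-regex match (records_between and its arguments are illustrative names, not from your script):

```python
import re

def records_between(lines, start_regex, end_regex):
    """Collect [line, groups] from the first start_regex match up to
    and including the first end_regex match after it; None if no
    complete record is found."""
    recs = None
    for line in lines:
        line = line.rstrip("\n")
        if recs is None:
            # Still searching for the start of a record.
            if start_regex.match(line):
                recs = [[line, start_regex.findall(line)]]
        else:
            # Inside a record: save lines until the end marker.
            if end_regex.match(line):
                recs.append([line, end_regex.findall(line)])
                return recs
            recs.append([line, []])
    return None
```

Because it takes any iterable of lines, you can feed it an open file object directly and let Python do the buffering, instead of tracking offsets by hand.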
------------------------------------------------------------------------
--
http://mail.python.org/mailman/listinfo/python-list