Barak, Ron wrote:
Hi,

In the attached script, the longest time is spent in the following functions (verified by psyco log):

I cannot help but wonder why, and whether, you really need all the rigmarole with file pointers, offsets, and tells instead of

for line in open(...):
  do your processing.
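
Something like this rough sketch, say (it assumes the path is available as self.input_file_name, a made-up attribute, and that nothing else depends on the byte offsets):

    def match_generator(self, regex):
        """Yield (line, groups) for each line of the file matching regex."""
        # Sketch only: plain iteration replaces the tell()/seek() bookkeeping.
        for line in open(self.input_file_name):
            match_ = regex.match(line)
            if match_:
                yield line.rstrip("\n"), regex.findall(line)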



    def match_generator(self, regex):
        """
        Generate the next line of self.input_file that
        matches regex.
        """
        generator_ = self.line_generator()
        while True:
            self.file_pointer = self.input_file.tell()
            if self.file_pointer != 0:
                self.file_pointer -= 1
            if (self.file_pointer + 2) >= self.last_line_offset:
                break
            line_ = generator_.next()
            # Progress feedback: percentage of the file processed so far.
            print "%.2f%% \r" % (((self.last_line_offset - self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
            if not line_:
                break
            match_ = regex.match(line_)
            if match_:
                # Only extract groups when the line actually matches.
                groups_ = regex.findall(line_)
                yield line_.strip("\n"), groups_
    def get_matching_records_by_regex_extremes(self, regex_array):
        """
        Find the record matching the first item of regex_array,
        save all records up to and including the one matching the
        last item of regex_array, and remember the position of the
        beginning of the next line in self.input_file.
        """
        start_regex = regex_array[0]
        end_regex = regex_array[-1]
        all_recs = []
        try:
            match_start, groups_ = self.match_generator(start_regex).next()
        except StopIteration:
            return None
        if match_start is not None:
            all_recs.append([match_start, groups_])
            line_ = self.line_generator().next()
            while line_:
                match_ = end_regex.match(line_)
                if match_ is not None:
                    # Only extract groups when the closing line matches.
                    all_recs.append([line_, end_regex.findall(line_)])
                    return all_recs
                all_recs.append([line_, []])
                # A fresh generator resumes from the current file position,
                # so this still advances one line at a time.
                line_ = self.line_generator().next()
    def line_generator(self):
        """
        Generate the next line of self.input_file, and update
        self.file_pointer to the beginning of that line.
        """
        while self.input_file.tell() <= self.last_line_offset:
            self.file_pointer = self.input_file.tell()
            line_ = self.input_file.readline()
            if not line_:
                break
            yield line_.strip("\n")
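
For what it is worth, the whole record extraction above can be expressed with plain iteration as well. A sketch only: extract_record and its arguments are made-up names, and it drops the offset bookkeeping entirely:

    def extract_record(path, start_regex, end_regex):
        """Collect [line, groups] pairs from the first start_regex match
        up to and including the next end_regex match; None if no record."""
        all_recs = []
        in_record = False
        for line in open(path):
            line = line.rstrip("\n")
            if not in_record:
                if start_regex.match(line):
                    all_recs.append([line, start_regex.findall(line)])
                    in_record = True
            elif end_regex.match(line):
                all_recs.append([line, end_regex.findall(line)])
                return all_recs
            else:
                all_recs.append([line, []])
        return None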

I was trying to think of optimisations so I could cut down on processing time, but got no inspiration.
(I need the print "%.2f%%   \r" ... line for user feedback.)
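
If you go the plain-iteration route, you can keep that feedback without any tell() calls by counting the bytes you consume yourself. A sketch, with made-up names (lines_with_progress, path):

    import os

    def lines_with_progress(path):
        """Yield stripped lines while printing a percent-done indicator."""
        total = os.path.getsize(path)  # total bytes, for the percentage
        done = 0
        for line in open(path):
            done += len(line)  # bytes consumed so far (approximate on
                               # platforms that translate line endings)
            print "%.2f%% \r" % (done * 100.0 / total),
            yield line.rstrip("\n")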

Could you suggest any optimisations ?
Thanks,
Ron.
P.S.: Examples of processing times are:

        * 2m42.782s  on two files with a combined size of  792544 bytes
          (no matches found).
        * 28m39.497s on two files with a combined size of 4139320 bytes
          (783 matches found).
    These times are quite unacceptable, as a normal input to the program
    would be ten files with a combined size of ~17MB.

