Barak, Ron wrote:
> Hi,
>
> In the attached script, the longest time is spent in the following
> functions (verified by psyco log):
I can't help but wonder why, and whether, you really need all the
rigmarole with file pointers, offsets, and tell() instead of simply:

    for line in open(...):
        # do your processing
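To make that concrete, here is a minimal sketch of the plain-iteration approach, keeping the percentage feedback you said you need. The names (match_lines, path) are mine, not from your script, and the progress figure counts characters rather than bytes, so treat it as approximate:

```python
import os
import re
import sys

def match_lines(path, regex):
    """Yield (stripped_line, groups) for each line of path matching regex.

    The file object is its own line iterator, so no tell()/seek()
    bookkeeping is needed; progress is estimated from characters read.
    """
    total = float(os.path.getsize(path)) or 1.0  # avoid division by zero
    done = 0
    with open(path) as f:
        for line in f:
            done += len(line)  # approximate: characters, not bytes
            sys.stderr.write("%.2f%%\r" % (100.0 * done / total))
            if regex.match(line):
                yield line.rstrip("\n"), regex.findall(line)
```

Iterating the file object directly lets Python buffer the reads for you, which is usually much faster than readline() interleaved with tell() calls.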
>     def match_generator(self, regex):
>         """
>         Generate the next line of self.input_file that
>         matches regex.
>         """
>         generator_ = self.line_generator()
>         while True:
>             self.file_pointer = self.input_file.tell()
>             if self.file_pointer != 0:
>                 self.file_pointer -= 1
>             if (self.file_pointer + 2) >= self.last_line_offset:
>                 break
>             line_ = generator_.next()
>             print "%.2f%% \r" % (((self.last_line_offset -
>                 self.input_file.tell()) / (self.last_line_offset * 1.0)) * 100.0),
>             if not line_:
>                 break
>             else:
>                 match_ = regex.match(line_)
>                 groups_ = re.findall(regex, line_)
>                 if match_:
>                     yield line_.strip("\n"), groups_
>     def get_matching_records_by_regex_extremes(self, regex_array):
>         """
>         Function will:
>         Find the record matching the first item of regex_array.
>         Will save all records until the last item of regex_array.
>         Will save the last line.
>         Will remember the position of the beginning of the next line in
>         self.input_file.
>         """
>         start_regex = regex_array[0]
>         end_regex = regex_array[len(regex_array) - 1]
>         all_recs = []
>         generator_ = self.match_generator
>         try:
>             match_start, groups_ = generator_(start_regex).next()
>         except StopIteration:
>             return(None)
>         if match_start != None:
>             all_recs.append([match_start, groups_])
>             line_ = self.line_generator().next()
>             while line_:
>                 match_ = end_regex.match(line_)
>                 groups_ = re.findall(end_regex, line_)
>                 if match_ != None:
>                     all_recs.append([line_, groups_])
>                     return(all_recs)
>                 else:
>                     all_recs.append([line_, []])
>                     line_ = self.line_generator().next()
>     def line_generator(self):
>         """
>         Generate the next line of self.input_file, and update
>         self.file_pointer to the beginning of that line.
>         """
>         while self.input_file.tell() <= self.last_line_offset:
>             self.file_pointer = self.input_file.tell()
>             line_ = self.input_file.readline()
>             if not line_:
>                 break
>             yield line_.strip("\n")
> I was trying to think of optimisations so I could cut down on the
> processing time, but got no inspiration.
> (I need the "print "%.2f%% \r" ..." line for the user's feedback.)
>
> Could you suggest any optimisations?
>
> Thanks,
> Ron.
>
> P.S.: Examples of processing times are:
> * 2m42.782s on two files with a combined size of 792544 bytes
>   (no matches found).
> * 28m39.497s on two files with a combined size of 4139320 bytes
>   (783 matches found).
> These times are quite unacceptable, as a normal input to the program
> would be ten files with a combined size of ~17 MB.
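One further observation: calling re.findall() on a line that match() has already matched scans the line a second time; if you only need the groups captured by the match itself, match_.groups() is free. Also, self.line_generator() builds a brand-new generator on every loop iteration. Below is a sketch of a possible single-pass rewrite of get_matching_records_by_regex_extremes, under the assumption that a record runs from the first start-regex match to the next end-regex match (records_between and its arguments are illustrative names, not from your script):

```python
import re

def records_between(lines, start_regex, end_regex):
    """Collect [line, groups] from the first start_regex match up to
    and including the first end_regex match after it; None if no
    complete record is found."""
    recs = None
    for line in lines:
        line = line.rstrip("\n")
        if recs is None:
            # Still searching for the start of a record.
            if start_regex.match(line):
                recs = [[line, start_regex.findall(line)]]
        else:
            # Inside a record: save lines until the end marker.
            if end_regex.match(line):
                recs.append([line, end_regex.findall(line)])
                return recs
            recs.append([line, []])
    return None
```

Because it takes any iterable of lines, you can feed it an open file object directly and let Python do the buffering, instead of tracking offsets by hand.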
------------------------------------------------------------------------
--
http://mail.python.org/mailman/listinfo/python-list