I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time, assuming the dtype is already known. The code is attached. What I found is that generators can't be used to avoid constructing a list and then making a tuple from it: it appears you must create a tuple to place a row into a numpy record array. (Specifically, if you remove the tuple() call from f2 in the attached code, you get an error.) Taking multiple lines at a time (using f4) does provide a speed benefit, but it's not very big, since Python's re module won't let you capture more than 100 values and I'm using capturing to extract the values. (Capturing is used because we're allowing the user to use regular expressions to denote delimiters.)
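To make that constraint concrete, here is a minimal sketch, not part of the attached benchmark, of the tuple requirement when filling a record array; the dtype and values are made up for illustration. On the numpy of that era the list assignment raises, so both likely exception types are caught:

import numpy as np

# Hypothetical 3-field dtype, purely for illustration.
dt_demo = np.dtype([('a', 'i4'), ('b', 'f8'), ('c', 'i4')])
arr_demo = np.empty(2, dtype=dt_demo)

row = [1, 2.5, 3]
arr_demo[0] = tuple(row)        # assigning a tuple into a record works

try:
    arr_demo[1] = row           # assigning the bare list has historically raised
except (TypeError, ValueError) as e:
    print('list assignment failed: %s' % e)
else:
    print('list assignment accepted on this numpy version')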
In the example the data is just a bunch of space-delimited integers. f1 splits on the space and uses a list comprehension, f2 splits on the space and uses a generator, f3 uses regular expressions in a manner similar to the current code, f4 uses regular expressions on multiple lines at once, and f5 uses fromiter. (Though fromiter isn't as useful as I'd hoped, because you have to have already parsed out a line, since this is a record array.) f6 and f7 use stripped-down versions of Chris Barker's accumulator idea. The difference is that f6 uses resize when expanding the array while f7 uses np.empty followed by np.append, which avoids the penalty from copying data that np.resize imposes. Note that f6 and f7 use the regular expression capturing line by line, as in f3. To get a feel for the overhead involved in keeping track of string sizes, f8 is just f3 plus a list tracking the largest string sizes seen so far.

The speeds I get using timeit are:

f1     : 1.66 ms
f2     : 2.01 ms
f3     : 2.35 ms
f4(2)  : 3.02 ms  (odd that it starts out worse than f3 when you take 2 lines at a time)
f4(5)  : 2.25 ms
f4(10) : 2.02 ms
f4(15) : 1.93 ms
f4(20) : error
f5     : 2.28 ms  (as I said, fromiter can't do much when it's just filling in a record array; while it's slightly faster than f3, which it's based on, it also loads all the data into a list before creating a numpy array, which is rather suboptimal)
f6     : 3.26 ms
f7     : 2.77 ms  (apparently it's a lot cheaper to do np.empty followed by append than to resize)
f8     : 3.04 ms  (compared to f3, this shows there's a non-trivial performance hit from keeping track of the sizes)

It seems like taking multiple lines at once isn't a big gain when we're limited to 100 captured values at a time (for Python 2.6, at least), especially since taking multiple lines at once would be rather complex: the code must still check each line to see whether it's commented out.

After talking to Chris Farrow, an Enthought developer, and doing some timing on a dataset he was working on, I had loadtable running about 1.7 to 3.3 times as fast as genfromtxt. The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings (between 1.5 and 3.5 times as slow) reflects how many lines are used to check for sizes and dtypes. As it turns out, that checking can be quite expensive, and the majority of the time seems to be spent in the regular expression matching. (Though Chris is using a slight variant of my pull request, and I'm getting function times that are not as bad as his.) The cost of the size and type checking was less apparent in the example I timed in a previous email because in that case there was a huge cost for converting data with commas instead of decimal points, and for the datetime conversion.

To give some further context, I compared np.genfromtxt and np.loadtable on the same 'pseudo-file' f used in the above tests, where the data is just a bunch of integers. The results were:

np.genfromtxt with dtype=None       : 4.45 ms
np.loadtable with defaults          : 5.12 ms
np.loadtable with check_sizes=False : 3.7 ms

So np.loadtable is already competitive with np.genfromtxt apart from the size checking, and even that isn't a huge penalty compared to genfromtxt. Based on all of the above, the accumulator seems like the most promising way to speed things up.
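The exact timeit invocation isn't in this message, so the following is just one plausible way to collect per-call times like those above, assuming the attached script (which defines the pseudo-file f and the functions f1 through f8) is being run as the main module:

import timeit

# Time each no-argument loader; report the best of three runs per call.
for name in ['f1', 'f2', 'f3', 'f5', 'f6', 'f7', 'f8']:
    t = timeit.Timer('%s()' % name, 'from __main__ import %s' % name)
    per_call = min(t.repeat(repeat=3, number=100)) / 100.0
    print('%s : %.2f ms' % (name, per_call * 1000))

# f4 takes the chunk size (number of lines grabbed at once) as an argument.
for ch in [2, 5, 10, 15]:
    t = timeit.Timer('f4(%d)' % ch, 'from __main__ import f4')
    per_call = min(t.repeat(repeat=3, number=100)) / 100.0
    print('f4(%d) : %.2f ms' % (ch, per_call * 1000))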
But it's not completely clear to me by how much, since we still must keep track of the dtypes and the sizes. Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.

-Chris JS

On Thu, Sep 8, 2011 at 3:57 PM, Chris.Barker <chris.bar...@noaa.gov> wrote:
> On 9/8/11 1:43 PM, Christopher Jordan-Squire wrote:
>> I just ran a quick test on my machine of this idea. With
>>
>> dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
>> temp = np.empty((), dtype=dt)
>> temp2 = np.zeros(1,dtype=dt)
>>
>> In [96]: def f():
>>    ...:     l = [0]*3
>>    ...:     l[0] = 2.54
>>    ...:     l[1] = 4
>>    ...:     l[2] = 2.3645
>>    ...:     j = tuple(l)
>>    ...:     temp2[0] = j
>>
>> vs
>>
>> In [97]: def g():
>>    ...:     temp['x'] = 2.54
>>    ...:     temp['y'] = 4
>>    ...:     temp['z'] = 2.3645
>>    ...:     temp2[0] = temp
>>    ...:
>>
>> The timing results were 2.73 us for f and 3.43 us for g. So good idea,
>> but it doesn't appear to be faster. (Though the difference wasn't
>> nearly as dramatic as I thought it would be, based on Pauli's
>> comment.)
>
> my guess is that the lines like: temp['x'] = 2.54 are slower (it
> requires a dict lookup, and a conversion from a python type to a "raw" type)
>
> and
>
> temp2[0] = temp
>
> is faster, as that doesn't require any conversion.
>
> Which means that if you had a larger struct dtype, it would be even
> slower, so clearly not the way to go for performance.
>
> It would be nice to have a higher performing struct dtype scalar -- as
> it is ordered, it might be nice to be able to index it with either the
> name or a numeric index.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
import itertools as it
import numpy as np
import re
from StringIO import StringIO

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return it.izip_longest(fillvalue=fillvalue, *args)

tot = 600
perLine = 5
n = tot/perLine

# Build a pseudo-file of space-delimited integers, perLine per row.
tostr = lambda s: ' '.join(map(str, s))
numbers = grouper(perLine, xrange(tot))
S = '\n'.join(map(tostr, numbers))
f = StringIO(S)

dt = np.dtype(','.join(['i4']*perLine))

# Simplest case: split on space and use a list comprehension.
def f1():
    f.seek(0)
    arr = np.empty(n, dtype=dt)
    for j, line in enumerate(f):
        tmp = [int(i) for i in line.split(' ')]
        arr[j] = tuple(tmp)
    return arr

# Split on space and use a generator instead.
def f2():
    f.seek(0)
    arr = np.empty(n, dtype=dt)
    for j, line in enumerate(f):
        tmp = tuple(int(i) for i in line.split(' '))
        arr[j] = tmp
    return arr

# Regular expression matching one full line of perLine integers.
s = r'(\d+)\s+'
s1 = [s]*(perLine-1)
s1.insert(0, '^')
s1.append(r'(\d+)$')
s2 = ''.join(s1)

# Use a regular expression. This most closely matches what is
# currently done in the loadtable code.
def f3():
    f.seek(0)
    arr = np.empty(n, dtype=dt)
    m = re.compile(s2)
    for j, line in enumerate(f):
        args = m.match(line).groups()
        args = (int(i) for i in args)
        arr[j] = tuple(args)
    return arr

# Use a regular expression and grab multiple lines at once.
# This has been suggested as an alternative to attempt in loadtable.
# Unfortunately Python's re module won't let the caller have more
# than 100 captured groups.
def f4(ch):
    f.seek(0)
    arr = np.empty(n, dtype=dt)
    m = re.compile('\n'.join([s2]*ch), re.MULTILINE)
    g = grouper(ch, iter(f))
    count = 0
    for lines in g:
        args = m.match(''.join(lines)).groups()
        args = (int(i) for i in args)
        args = grouper(perLine, args)
        arr[count:count+ch] = list(tuple(row) for row in args)
        count += ch
    return arr

# Parse with the same regular expression as f3, but accumulate rows in a
# list and build the array at the end with np.fromiter.
def f5():
    f.seek(0)
    m = re.compile(s2)
    l = []
    for line in f:
        args = m.match(line).groups()
        args = (int(i) for i in args)
        l.append(tuple(args))
    return np.fromiter(l, dt)

# The next two are checking times for two versions of accumulator
# numpy arrays. Thanks to Chris Barker for the idea.
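As an aside, and not part of the attached script: here is a quick way to see the capture-group limit that makes f4(20) (20*5 = 100 groups) fail. On Python 2.x the re module refuses patterns with roughly 100 or more groups; the exact cutoff and the exception raised depend on the version, and later Python 3 releases removed the limit entirely, so the exception is caught generically:

import re

# Build patterns with an increasing number of capture groups and see
# where compilation starts to fail. On newer Python 3 this may simply
# print 'ok' for every size.
for ngroups in (99, 100, 120):
    pattern = r'\s+'.join([r'(\d+)'] * ngroups)
    try:
        re.compile(pattern)
        print('%d groups: ok' % ngroups)
    except Exception as e:
        print('%d groups: %s' % (ngroups, e))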
class sAccum:
    def __init__(self, dtype, isize):
        self._arr = np.empty((isize,), dtype=dtype)
        self.size = 0

    def __call__(self):
        return self._arr[:self.size]

    def add(self, row):
        if self.size+1 < self._arr.size:
            self._arr[self.size] = row
            self.size += 1
        else:
            temp = int(self._arr.size * 1.25)
            self._arr = np.resize(self._arr, temp)
            self._arr[self.size] = row
            self.size += 1


class sAccumAppend:
    def __init__(self, dtype, isize):
        self._arr = np.empty((isize,), dtype=dtype)
        self.size = 0

    def __call__(self):
        return self._arr[:self.size]

    def add(self, row):
        if self.size+1 < self._arr.size:
            self._arr[self.size] = row
            self.size += 1
        else:
            temp = np.empty(int(self._arr.size * 0.25), dtype=self._arr.dtype)
            self._arr = np.append(self._arr, temp)
            self._arr[self.size] = row
            self.size += 1


def f6():
    f.seek(0)
    arr = sAccum(dt, 10)
    m = re.compile(s2)
    for j, line in enumerate(f):
        args = m.match(line).groups()
        args = (int(i) for i in args)
        arr.add(tuple(args))
    return arr()


def f7():
    f.seek(0)
    arr = sAccumAppend(dt, 10)
    m = re.compile(s2)
    for j, line in enumerate(f):
        args = m.match(line).groups()
        args = (int(i) for i in args)
        arr.add(tuple(args))
    return arr()


# Adding size checking just to get approximate times for the overhead.
def f8():
    f.seek(0)
    arr = np.empty(n, dtype=dt)
    m = re.compile(s2)
    sizes = [0]*perLine
    for j, line in enumerate(f):
        args = m.match(line).groups()
        sizes = map(max, zip(sizes, map(len, args)))
        args = (int(i) for i in args)
        arr[j] = tuple(args)
    return arr


# Checking against existing functions.
def f9():
    f.seek(0)
    return np.genfromtxt(f, dtype=None)


# The next two are only relevant if you have the loadtable code.
def f10():
    f.seek(0)
    return np.loadtable(f)


def f11():
    f.seek(0)
    return np.loadtable(f, check_sizes=False)
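One last sketch that isn't in the original script: a hypothetical sanity check that the different loaders agree on the parsed records. The same_records helper is my own addition; it compares structured arrays field by field rather than relying on whole-record comparison semantics, and it skips f9 through f11 since those exercise genfromtxt and the loadtable branch rather than the parsers above.

# Hypothetical sanity check, not in the original script: the loaders
# above should all reconstruct the same 120 records of 5 integers.
def same_records(a, b):
    # Compare structured arrays field by field.
    if a.shape != b.shape or a.dtype.names != b.dtype.names:
        return False
    return all(np.array_equal(a[name], b[name]) for name in a.dtype.names)

if __name__ == '__main__':
    ref = f1()
    for func in (f2, f3, f5, f6, f7, f8):
        assert same_records(ref, func()), func.__name__
    assert same_records(ref, f4(10))
    print('all loaders agree')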