What machine spec are you using? Using your last function, line2array5, WITH float conversion, I get the following timings on a mobile quad-core Extreme:

In [24]: a = np.arange(100).astype(str).tostring()

In [25]: a
Out[25]: '0123456789111111111122222222223333333333444444444455555555556666666666777777777788888888889999999999'

In [26]: %timeit line2array(a, 1)
10000 loops, best of 3: 37.1 µs per loop

In [27]: a = np.arange(1000).astype(str).tostring()

In [28]: %timeit line2array(a, 10)
10000 loops, best of 3: 45.2 µs per loop

Cheers,
Chris

On Mon, Jul 27, 2009 at 7:29 PM, Christopher Barker<[email protected]> wrote:
> Hi all,
>
> When I first saw this problem: reading in a fixed-width text file as
> numbers, it struck me that you really should be able to do it, and do it
> well, with numpy by slicing character arrays.
>
> I got carried away, and worked out a number of ways to do it. Lastly was a
> method inspired by a recent thread: "String to integer array of ASCII
> values", which did indeed inspire the fastest way. Here's what I have:
>
> # my naive first attempt:
> def line2array0(line, field_len):
>     nums = []
>     i = 0
>     while i < len(line):
>         nums.append(float(line[i:i+field_len]))
>         i += field_len
>     return np.array(nums)
>
> # list comprehension
> def line2array1(line, field_len):
>     return np.array(map(float,[line[i*field_len:(i+1)*field_len] for i in
>                                range(len(line)/field_len)]))
>
> # convert to a tuple, then to an 'S1' array -- no real reason to do
> # this, as I figured out the next way.
> def line2array2(line, field_len):
>     return np.array(tuple(line), dtype =
>                     'S1').view(dtype='S%i'%field_len).astype(np.float)
>
> # convert directly to a string array, then break into fields.
> def line2array3(line, field_len):
>     return np.array((line,)).view(dtype='S%i'%field_len).astype(np.float)
>
> # use dtype-'c' instead of 'S1' -- better.
> def line2array4(line, field_len):
>     return np.array(line,
>                     dtype='c').view(dtype='S%i'%field_len).astype(np.float)
>
> # and the winner is: use fromstring to go straight to a 'c' array:
> def line2array5(line, field_len):
>     return np.fromstring(line,
>                          dtype='c').view(dtype='S%i'%field_len).astype(np.float)
>
> Here are some timings:
>
> Timing with a 10 number string:
> List comp:                 36.8073430061
> convert to tuple:          57.9741871357
> auto convert:              43.4103589058
> char type:                 46.0047719479
> fromstring:                23.998103857
> without float conversion:  11.4827179909
>
> So list comprehension is pretty fast, but using fromstring, and then slicing
> is much better. The last one is the same thing, but without the conversion
> from strings to float, showing that that's a big chunk of time no matter how
> you slice it.
>
> Timing with a 100 number string:
> List comp:                 163.281736135
> convert to tuple:          333.081432104
> auto convert:              138.934411049
> char type:                 279.897207975
> fromstring:                121.395509005
> without float conversion:  12.8342208862
>
> Interesting -- I thought a longer string would give greater advantage to
> the fromstring approach -- but I was wrong, now the time to parse strings into
> floats is really washing everything else out -- so it doesn't matter much
> how you do it, though I'd go with either list comprehension (which is what I
> think is used in np.genfromtxt), or the fromstring method, which I kind of
> like 'cause it's numpy.
>
> test and timing code attached.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> [email protected]
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
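
For anyone rerunning this on Python 3 with a recent NumPy, here is a minimal sketch of the fromstring idea from the quoted message, not the code that produced the timings above. It rests on a few assumptions: the `np.float` alias was removed in NumPy 1.24 and binary-mode `np.fromstring` is deprecated, so the sketch uses `np.frombuffer` and the builtin `float` instead. The function names `line2array_frombuffer` and `line2array_listcomp` are hypothetical, and the input is assumed to be pure ASCII with a length that is an exact multiple of `field_len`.

```python
import numpy as np

def line2array_frombuffer(line, field_len):
    # Hypothetical name. Assumes ASCII input whose length is an exact
    # multiple of field_len; np.frombuffer raises ValueError otherwise.
    buf = line.encode("ascii")
    # Reinterpret the raw bytes as fixed-width byte-string fields (no copy).
    fields = np.frombuffer(buf, dtype="S%d" % field_len)
    # Let NumPy parse every field as a float in one call.
    return fields.astype(float)

def line2array_listcomp(line, field_len):
    # Pure-Python reference, in the spirit of the list-comprehension version.
    return np.array([float(line[i:i + field_len])
                     for i in range(0, len(line), field_len)])

line = "1.502.253.00"  # three 4-character fields: 1.50, 2.25, 3.00
print(line2array_frombuffer(line, 4))
print(line2array_listcomp(line, 4))
```

Both functions should return the same float64 array for the same input. Which one is faster on a given machine is worth checking with %timeit rather than assuming, since the quoted timings suggest the string-to-float parsing dominates either way.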
