On 9/2/11 2:45 PM, Christopher Jordan-Squire wrote:
> It doesn't have to parse the entire file to determine the dtypes. It
> builds up a regular expression for what it expects to see, in terms of
> dtypes. Then it just loops over the lines, only parsing if the regular
> expression doesn't match. It seems that a regex match is fast, but a
> regex fail is expensive.

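For readers following along, here is a rough sketch of the regex-first idea described in the quoted text: compile one pattern for the expected column types, convert lines that match directly, and set aside lines that do not match for slower handling. The names `PATTERNS`, `build_line_regex`, and `read_rows` are hypothetical, and the per-type patterns are illustrative assumptions, not the ones used in the reader under discussion.

```python
import re

# Hypothetical per-type patterns (assumptions for illustration only).
PATTERNS = {
    int: r"[+-]?\d+",
    float: r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?",
    str: r"[^,]*",
}


def build_line_regex(types, delimiter=","):
    """Compile a single regex matching one full line of the expected types."""
    fields = [r"\s*(" + PATTERNS[t] + r")\s*" for t in types]
    return re.compile(re.escape(delimiter).join(fields) + r"$")


def read_rows(lines, types, delimiter=","):
    """Convert matching lines directly; collect non-matching lines separately."""
    line_re = build_line_regex(types, delimiter)
    rows, mismatches = [], []
    for lineno, line in enumerate(lines):
        line = line.rstrip("\r\n")
        m = line_re.match(line)
        if m is not None:
            # Fast path: the line has the expected layout, so convert each field.
            rows.append(tuple(t(g) for t, g in zip(types, m.groups())))
        else:
            # Slow path placeholder: a real reader would re-examine the dtypes here.
            mismatches.append((lineno, line))
    return rows, mismatches


rows, mismatches = read_rows(
    ["1,2.5,abc", "2,oops,def", " 3 , 4e2 , x"], [int, float, str]
)
```

In this sketch the second line fails the pattern because `oops` is not a float, so it ends up in `mismatches` rather than being converted.
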
interesting -- I wouldn't have expected a regex to be faster than simple
parsing, but that's why you profile!

> Setting array elements is not as fast for the masked record arrays.
> You must set entire rows at a time, so I have to build up each row as
> a list, and convert to a tuple, and then stuff it in the array.

hmmm -- that is a lot -- I was thinking of a simple "set a value in an
array". I've also done a bunch of this in C, where it's really fast.

However, rather than:

  build a row as a list
  build a row as a tuple
  stuff into array

could you create an empty array scalar, fill that, and then put that in
your array:

In [4]: dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])

In [5]: dt
Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

In [6]: temp = np.empty((), dtype=dt)

In [9]: temp['x'] = 3

In [10]: temp['y'] = 4

In [11]: temp['z'] = 5

In [13]: a = np.zeros((4,), dtype = dt)

In [14]: a[0] = temp

In [15]: a
Out[15]:
array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)],
      dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

(and you could pass the array scalar into the accumulator as well)

maybe it wouldn't be any faster, but with re-using temp, one less
list-to-tuple conversion, and fewer Python-type to numpy-type
conversions, maybe it would.

> it's even slower for the record arrays with missing data because I
> must branch between adding missing data versus adding real data. Might
> that be the reason for the slower performance than you'd expect?

could be -- I haven't thought about the missing data part much.

> I wonder if there are any really important cases where you'd actually
> lose something by simply recasting an entry to another dtype, as Derek
> suggested.

In general, it won't be a simple re-cast -- it will be a copy to a
subset -- which may be hard to write the code for, but would save having
to re-parse the data.

Anyway, you know the issues, this is good stuff either way.

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
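
For anyone wanting to try the suggestion in the message above, here is a minimal sketch that wraps the same steps in a loop: one scratch record (a 0-d structured array, which the message calls an array scalar) is allocated once, filled field by field, and assigned into each row. The function name `fill_with_scratch` and the sample input are hypothetical. Whether this beats the list-to-tuple route is not something the sketch establishes; that would need profiling.

```python
import numpy as np

dt = np.dtype([("x", np.float32), ("y", np.int32), ("z", np.float64)])


def fill_with_scratch(records, dtype):
    """Fill a structured array row by row through one reused scratch record."""
    out = np.zeros(len(records), dtype=dtype)
    temp = np.empty((), dtype=dtype)  # 0-d structured array, allocated once
    names = dtype.names
    for i, rec in enumerate(records):
        for name, value in zip(names, rec):
            temp[name] = value  # set a single field in the scratch record
        out[i] = temp  # copy the whole record into row i
    return out


a = fill_with_scratch([(3, 4, 5), (1.5, 2, 2.5)], dt)
```

Because `out[i] = temp` copies the record's values, reusing `temp` for the next row does not disturb rows already written.
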