On Wed, Sep 7, 2011 at 2:52 PM, Chris.Barker <chris.bar...@noaa.gov> wrote:
> On 9/2/11 2:45 PM, Christopher Jordan-Squire wrote:
>> It doesn't have to parse the entire file to determine the dtypes. It
>> builds up a regular expression for what it expects to see, in terms of
>> dtypes. Then it just loops over the lines, only parsing if the regular
>> expression doesn't match. It seems that a regex match is fast, but a
>> regex fail is expensive.
>
> interesting -- I wouldn't have expected a regex to be faster than simple
> parsing, but that's why you profile!
>
>> Setting array elements is not as fast for the masked record arrays.
>> You must set entire rows at a time, so I have to build up each row as
>> a list, convert it to a tuple, and then stuff it into the array.
>
> hmmm -- that is a lot -- I was thinking of a simple "set a value in an
> array". I've also done a bunch of this in C, where it's really fast.
>
> However, rather than:
>
> build the row as a list
> convert the row to a tuple
> stuff it into the array
>
> could you create an empty array scalar, fill that, and then put it in
> your array:
>
> In [4]: dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])
>
> In [5]: dt
> Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
>
> In [6]: temp = np.empty((), dtype=dt)
>
> In [9]: temp['x'] = 3
>
> In [10]: temp['y'] = 4
>
> In [11]: temp['z'] = 5
>
> In [13]: a = np.zeros((4,), dtype=dt)
>
> In [14]: a[0] = temp
>
> In [15]: a
> Out[15]:
> array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)],
>       dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
>
> (and you could pass the array scalar into the accumulator as well)
>
> maybe it wouldn't be any faster, but with re-using temp, one less
> list-to-tuple conversion, and fewer Python-to-numpy type conversions,
> maybe it would.
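(First, a side note on the regex point above: here's a minimal,
self-contained sketch for timing the "match is fast, fail is expensive"
behavior. The pattern and the sample lines are made up for illustration;
this isn't the actual pattern the loader builds.)

import re
import timeit

# Expected row format for this made-up example: float, int, float
pat = re.compile(r'\s*[-+]?\d+\.\d*\s*,\s*[-+]?\d+\s*,\s*[-+]?\d+\.\d*\s*$')

good = "2.54, 4, 2.3645"   # matches the expected dtypes
bad = "2.54, N/A, 2.3645"  # a missing value forces a match failure

print(timeit.timeit(lambda: pat.match(good), number=100000))
print(timeit.timeit(lambda: pat.match(bad), number=100000))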
I just ran a quick test on my machine of the array-scalar idea. With

dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])
temp = np.empty((), dtype=dt)
temp2 = np.zeros(1, dtype=dt)

In [96]: def f():
    ...:     l = [0] * 3
    ...:     l[0] = 2.54
    ...:     l[1] = 4
    ...:     l[2] = 2.3645
    ...:     j = tuple(l)
    ...:     temp2[0] = j

vs

In [97]: def g():
    ...:     temp['x'] = 2.54
    ...:     temp['y'] = 4
    ...:     temp['z'] = 2.3645
    ...:     temp2[0] = temp

The timing results were 2.73 us for f and 3.43 us for g. So good idea,
but it doesn't appear to be faster. (Though the difference wasn't nearly
as dramatic as I thought it would be, based on Pauli's comment.)

-Chris JS

>> it's even slower for the record arrays with missing data because I
>> must branch between adding missing data and adding real data. Might
>> that be the reason for the slower performance than you'd expect?
>
> could be -- I haven't thought about the missing data part much.
>
>> I wonder if there are any really important cases where you'd actually
>> lose something by simply recasting an entry to another dtype, as Derek
>> suggested.
>
> In general, it won't be a simple re-cast -- it will be a copy into a new
> array, which may be harder code to write, but it would save having to
> re-parse the data.

(I've appended a quick sketch of that copy-into-a-new-dtype idea at the
end of this message.)

> Anyway, you know the issues, this is good stuff either way.
>
> -Chris
>
> --
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA 98115        (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
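P.S. Here's a minimal sketch of what that copy into a new array could
look like -- the dtypes and sizes are made up, and it assumes you only
need to widen a field's type, not restructure the data:

import numpy as np

# Data was parsed with 'x' as int32, but it turns out 'x' should have
# been a float column. Copy field by field into an array with the wider
# dtype instead of re-parsing the whole file.
old_dt = np.dtype([('x', np.int32), ('y', np.float64)])
new_dt = np.dtype([('x', np.float64), ('y', np.float64)])

a = np.zeros(4, dtype=old_dt)    # stands in for the already-parsed data
b = np.empty(a.shape, dtype=new_dt)
for name in old_dt.names:
    b[name] = a[name]            # numpy casts each field on assignment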