> It doesn't have to parse the entire file to determine the dtypes. It
> builds up a regular expression for what it expects to see, in terms of
> dtypes. Then it just loops over the lines, only parsing if the regular
> expression doesn't match. It seems that a regex match is fast, but a
> regex fail is expensive.

interesting -- I wouldn't have expected a regex to be faster than simple 
parsing, but that's why you profile!
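
For anyone following along, here is how I picture the regex fast path. It is only a rough sketch: the per-type patterns and the names (PATTERNS, build_line_regex, lines_needing_parse) are made up for illustration, and the actual loader may well build its expression differently.

```python
import re

# Hypothetical per-type patterns; the real loader's patterns are not shown here.
PATTERNS = {
    int: r"[+-]?\d+",
    float: r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?",
    str: r"[^,]*",
}


def build_line_regex(types, delimiter=","):
    """Compile one regex describing a whole line with the expected column types."""
    parts = [r"\s*(%s)\s*" % PATTERNS[t] for t in types]
    return re.compile("^" + re.escape(delimiter).join(parts) + "$")


def lines_needing_parse(lines, types, delimiter=","):
    """Return the indices of lines that do NOT match the expected types.

    Only these lines would need the slower, full type-inference parse.
    """
    line_re = build_line_regex(types, delimiter)
    return [i for i, line in enumerate(lines)
            if line_re.match(line.rstrip("\n")) is None]
```

With that, something like lines_needing_parse(lines, (int, float, str)) flags only the lines that break the current guess, so the common case costs a single match per line.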

> Setting array elements is not as fast for the masked record arrays.
> You must set entire rows at a time, so I have to build up each row as
> a list, and convert to a tuple, and then stuff it in the array.

hmmm -- that is a lot -- I was thinking of simple "set a value in an 
array". I've also done a bunch of this in C, where it's really fast.

However, rather than:

   build a row as a list
   build a row as a tuple
   stuff into array

could you create an empty 0-d structured array (in effect an array 
scalar), fill that, and then put that in your array:

In [4]: dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])

In [5]: dt
Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

In [6]: temp = np.empty((), dtype=dt)

In [9]: temp['x'] = 3

In [10]: temp['y'] = 4

In [11]: temp['z'] = 5

In [13]: a = np.zeros((4,), dtype = dt)

In [14]: a[0] = temp

In [15]: a
Out[15]:
array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)],
       dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

(and you could pass the array scalar into accumulator as well)

maybe it wouldn't be any faster, but with re-using temp, one less 
list-to-tuple conversion, and fewer Python-type-to-numpy-type 
conversions, maybe it would.
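
To make the comparison concrete, here are the two fill strategies side by side. This is a sketch under my own assumptions: the function names are hypothetical, rows are assumed to be already-parsed Python values, and I have not timed it against your loader.

```python
import numpy as np

dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])


def fill_via_tuples(rows, dt):
    """Build each row as a tuple and assign it (the list -> tuple route)."""
    out = np.zeros(len(rows), dtype=dt)
    for i, row in enumerate(rows):
        out[i] = tuple(row)
    return out


def fill_via_scratch(rows, dt):
    """Fill one reusable 0-d scratch record field by field, then copy it in."""
    out = np.zeros(len(rows), dtype=dt)
    temp = np.empty((), dtype=dt)   # allocated once, reused for every row
    for i, row in enumerate(rows):
        for name, value in zip(dt.names, row):
            temp[name] = value
        out[i] = temp
    return out
```

Both should produce the same array, so running each under timeit on your real rows would settle whether the scratch record actually buys anything.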

> it's even slower for the record arrays with missing data because I
> must branch between adding missing data versus adding real data. Might
> that be the reason for the slower performance than you'd expect?

could be -- I haven't thought about the missing data part much.
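
One untested idea, and a different technique from what you describe: keep a plain structured array plus a separate boolean mask array, and only wrap them in a masked array at the end, so the inner loop never touches the masked-array machinery. The function name and the use of None as the missing marker are assumptions for illustration.

```python
import numpy as np

dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])


def load_with_mask(rows, dt):
    """Fill plain data and mask arrays, then combine them once at the end.

    Assumes each row holds already-parsed values, with None marking a
    missing entry (a hypothetical convention, not the loader's own).
    """
    data = np.zeros(len(rows), dtype=dt)
    # Same field names as dt, but every field is boolean.
    mask = np.zeros(len(rows), dtype=np.ma.make_mask_descr(dt))
    for i, row in enumerate(rows):
        for name, value in zip(dt.names, row):
            if value is None:
                mask[name][i] = True
            else:
                data[name][i] = value
    return np.ma.array(data, mask=mask)
```

The branch is still there, but each side of it is just a plain ndarray element assignment.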

> I wonder if there are any really important cases where you'd actually
> lose something by simply recasting an entry to another dtype, as Derek
> suggested.

In general, it won't be a simple re-cast -- it will be a copy to a 
subset -- and the code for that may be hard to write, but it would save 
having to re-parse the data.
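
Roughly what I have in mind, as a sketch only: when one column's guessed type turns out to be too narrow, allocate a new array with the wider dtype and copy the already-parsed fields across. upgrade_field is a hypothetical helper, and a real version would also have to cope with the accumulator and with casts that lose information.

```python
import numpy as np


def upgrade_field(arr, field, new_type):
    """Return a copy of a structured array with one field's dtype replaced.

    Already-parsed values are copied field by field, so the text is not
    parsed again; numpy casts the upgraded field during the copy.
    """
    new_descr = [(name, new_type if name == field else arr.dtype[name])
                 for name in arr.dtype.names]
    out = np.zeros(arr.shape, dtype=new_descr)
    for name in arr.dtype.names:
        out[name] = arr[name]
    return out
```

For example, upgrade_field(a, 'y', np.float64) would turn the int32 'y' column from the session above into float64 while leaving 'x' and 'z' as they are.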

Anyway, you know the issues; this is good stuff either way.


