Pierre GM wrote:
> All,
> Here's the second round of genloadtxt. That's a tad cleaner version than
> the previous one, where I tried to take into account the different
> comments and suggestions that were posted. So, tabs should be supported
> and explicit whitespaces are not collapsed.

Looks pretty good, but there's one breakage against what I had working
with my local copy (with mods). When I added the filtering of names read
from the file using usecols, there was a reason I set a flag and fixed
the names up later: converters specified by name. If we have usecols and
converters specified by name, and we read the names from the file, we
get the following sequence:

1) Read names.
2) Convert usecols names to column numbers.
3) Filter the name list using usecols. Indices into the names list no
   longer map to column numbers.
4) Change converters from mapping names->funcs to mapping col#->func
   using indices from names....OOPS.

It's an admittedly complex combination, but it allows flexible reading
of text files, since you rely only on field names, not column numbers.
Here's a test case:

    def test_autonames_usecols_and_converter(self):
        "Tests names and usecols"
        data = StringIO.StringIO('A B C D\n aaaa 121 45 9.1')
        test = loadtxt(data, usecols=('A', 'C', 'D'), names=True,
                       dtype=None, converters={'C':lambda s: 2 * int(s)})
        control = np.array(('aaaa', 90, 9.1),
                           dtype=[('A', '|S4'), ('C', int), ('D', float)])
        assert_equal(test, control)

This fails with your current implementation, but works for me when I:

1) Set a flag when reading names from the header line in the file.
2) Filter the names from the file using usecols (if the flag is true)
   *after* remapping the converters.

There may be a better approach, but this is the simplest I've come up
with so far.

> FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit
> comparison: same input, no missing data, one with genloadtxt, one with
> np.loadtxt and a last one with matplotlib.mlab.csv2rec.
>
> As you'll see, genloadtxt is roughly twice slower than np.loadtxt, but
> twice faster than csv2rec. One of the explanation for the slowness is
> indeed the use of classes for splitting lines and converting values.
> Instead of a basic function, we use the __call__ method of the class,
> which itself calls another function depending on the attribute values.
> I'd like to reduce this overhead, any suggestion is more than welcome,
> as usual.
>
> Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in
> numpy.ma, with an alias recfromcsv for John, using his defaults. Unless
> somebody comes with a brilliant optimization.

Why only in numpy.ma and not somewhere in core numpy itself (missing
values aside)? You have a pretty good masked-array-agnostic wrapper that
IMO could go in numpy, though maybe not as loadtxt.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
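
The ordering problem described in the reply above (filtering the names
before remapping the converters) can be illustrated with a minimal,
self-contained sketch. The helper resolve_columns and its filter_first
argument are hypothetical names made up for this illustration; this is
not the genloadtxt code, only a model of the four steps under the
assumption that usecols and converter keys may be field names.

```python
def resolve_columns(header_names, usecols, converters, filter_first):
    """Hypothetical helper (not part of genloadtxt): map name-based
    usecols/converters onto column numbers of the original file."""
    names = list(header_names)          # step 1: names read from the header

    # Step 2: convert usecols given as names into column numbers.
    col_numbers = [names.index(c) if isinstance(c, str) else c
                   for c in usecols]

    if filter_first:
        # Step 3 done too early: positions in `names` no longer
        # correspond to column numbers in the file.
        names = [header_names[i] for i in col_numbers]

    # Step 4: remap converters from name -> func to column number -> func.
    by_column = {
        (names.index(key) if isinstance(key, str) else key): func
        for key, func in converters.items()
    }

    if not filter_first:
        # Filtering after the remap keeps the converter keys correct.
        names = [header_names[i] for i in col_numbers]

    return names, col_numbers, by_column


header = ["A", "B", "C", "D"]
double_int = lambda s: 2 * int(s)

_, _, broken = resolve_columns(header, ("A", "C", "D"),
                               {"C": double_int}, filter_first=True)
names, cols, fixed = resolve_columns(header, ("A", "C", "D"),
                                     {"C": double_int}, filter_first=False)

print(sorted(broken))  # [1] -> converter attached to column B by mistake
print(sorted(fixed))   # [2] -> converter attached to column C as intended
```

With filter_first=True the converter for 'C' ends up keyed on column 1,
which is 'B' in the file; remapping before filtering keeps it on
column 2.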