[Numpy-discussion] More loadtxt() changes

Ryan May Tue, 25 Nov 2008 06:47:36 -0800

Hi,

I have a couple more changes to loadtxt() that I'd like to code up in time
for 1.3, but I thought I should run them by the list before doing too much
work.  These are already implemented in some fashion in
matplotlib.mlab.csv2rec(), but the code bases are different enough that
pretty much only the idea can be lifted.  All of these changes would be done
in a manner that is backwards compatible with the current API.


1) Support for setting the names of fields in the returned structured array
without using dtype.  The names could either be passed in as a list or read
from the first line of the file.  Many files have a header line that gives a
name for each column.  Adding this would obviously make loadtxt much more
general and allow for more generic code, IMO.  My current thinking is to add
a *names* keyword parameter that defaults to None, meaning no support for
reading names.  Setting it to True would tell loadtxt() to read the names
from the first line (after skiprows).  The other option would be to set
names to a list of strings.
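
A minimal sketch of how such a keyword might behave.  The helper
resolve_names() and its signature are hypothetical, made up for
illustration; nothing like it exists in loadtxt() today.

```python
def resolve_names(lines, names=None, delimiter=","):
    """Return (field_names, data_lines) for a hypothetical `names` keyword.

    names=None  -> no field names, every line is data
    names=True  -> field names are read from the first line
    names=[...] -> field names are taken from the given list
    """
    lines = iter(lines)
    if names is True:
        # Consume the header line and split it into column names.
        header = next(lines)
        names = [name.strip() for name in header.split(delimiter)]
    elif names is not None:
        names = list(names)
    return names, list(lines)


sample = ["ID,First Name,Last Name\n", "1234,Joe,Smith\n"]
fields, rows = resolve_names(sample, names=True)
```

With names=True the header line is consumed and only the data line is
left in rows; with an explicit list, every line is treated as data.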

2) Support for automatic dtype inference.  Instead of assuming all values
are floats, this would try a list of options until one worked.  For strings,
this would keep track of the longest string within a given field before
setting the dtype.  This would allow reading of files containing a mixture
of types much more easily, without having to go to the trouble of
constructing a full dtype by hand.  This would work alongside any custom
converters one passes in.  My current thinking for the API would be just to
add the option of passing the string 'auto' as the dtype parameter.
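
The "try a list of options until one works" idea could be sketched per
column as below.  The function name infer_column_dtype() and the order of
candidates (int, then float, then string) are assumptions for
illustration, not a description of any existing numpy code.

```python
import numpy as np


def infer_column_dtype(values):
    """Guess a dtype for one column given its raw string values."""
    # Try the numeric candidates in order; the first one that can
    # convert every value in the column wins.
    for caster, dtype in ((int, np.dtype(int)), (float, np.dtype(float))):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue
        return dtype
    # Fall back to a string wide enough for the longest entry.
    width = max(len(value) for value in values)
    return np.dtype("U%d" % width)


name_dtype = infer_column_dtype(["Joe", "Jane"])
score_dtype = infer_column_dtype(["85", "65", "45"])
```

A column of names comes back as a string dtype sized to the longest name,
while a column of whole numbers comes back as an integer dtype.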

3) Better support for missing values.  The docstring mentions a way of
handling missing values by passing in a converter.  The problem with this is
that you have to pass in a converter for *every column* that will contain
missing values.  If you have a text file with 50 columns, writing this
dictionary of converters seems like ugly and needless boilerplate.  I'm
unsure of how best to pass in both what values indicate missing values and
what values to fill in their place.  I'd love suggestions.
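
One possible shape for this, sketched under assumptions: a single set of
"missing" markers and a single fill value, wrapped around an ordinary
conversion function and reused for every column.  make_converter() and
its parameters are hypothetical names, not part of loadtxt().

```python
def make_converter(func, missing=("",), fill=0.0):
    """Wrap `func` so that missing markers map to `fill` instead."""
    def convert(text):
        text = text.strip()
        if text in missing:
            return fill
        return func(text)
    return convert


# One wrapped converter, shared by every column that may have gaps.
to_float = make_converter(float, missing=("", "NA"), fill=float("nan"))
converters = dict((col, to_float) for col in range(3, 8))
```

This still builds a dictionary, but it is generated rather than written
out by hand, which is roughly what loadtxt() could do internally.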

Here's an example of my use case (without 50 columns):

ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
1234,Joe,Smith,85,90,,76,
5678,Jane,Doe,65,99,,78,
9123,Joe,Plumber,45,90,,92,

Currently, reading in this file requires a bit of boilerplate (declaring
dtypes, converters).  While it's nothing I can't write, it still would be
easier to write it once within loadtxt and have it for everyone.
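
For reference, the boilerplate in question looks roughly like the
following.  The short field names, the string width of 10, and the -1
placeholder for empty scores are assumptions chosen for this sketch.

```python
import io

import numpy as np

text = io.StringIO(
    "ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final\n"
    "1234,Joe,Smith,85,90,,76,\n"
    "5678,Jane,Doe,65,99,,78,\n"
    "9123,Joe,Plumber,45,90,,92,\n"
)

# Every field has to be named and typed by hand.
dtype = np.dtype([
    ("id", int), ("first", "U10"), ("last", "U10"),
    ("hw1", float), ("hw2", float), ("quiz1", float),
    ("hw3", float), ("final", float),
])


def score(field):
    # Empty fields become -1.0 (an arbitrary placeholder, assumed here).
    return float(field or -1)


# One converter entry per column that may contain a missing value.
converters = {col: score for col in range(3, 8)}

grades = np.loadtxt(text, delimiter=",", skiprows=1,
                    dtype=dtype, converters=converters)
```

The dtype list and the converter dictionary both grow with the number of
columns, which is the part that becomes painful at 50 columns.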

Any support for *any* of these ideas?  Any suggestions on how the user
should pass in the information?

Thanks,

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
