[Numpy-discussion] Re: Changes to `np.loadtxt(..., max_rows=)`

Sebastian Berg Wed, 29 Jun 2022 11:13:49 -0700

Hi all,

these changes came up in https://github.com/numpy/numpy/issues/21852,
where a user's use case was to look up the line number up to which
they wanted to read a file.

The change is that `max_rows` now:
* Represents the number of *rows* in the result
* Gives a `UserWarning` when empty lines are skipped as a result.

Previously, `max_rows` counted *lines* (excluding those skipped
initially).  The difference shows up for a file formatted like:

   1,2,3
   # comment
   2,3,4
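
As a minimal sketch of the difference, assuming NumPy 1.23 or later and
an in-memory copy of the file above (the `text` and `rows` names are
only for illustration), reading with `max_rows=2` behaves roughly like
this:

```python
import io
import warnings

import numpy as np

# The three-line example file from above, held in memory (illustrative).
text = "1,2,3\n# comment\n2,3,4\n"

with warnings.catch_warnings():
    # Silence any warning about skipped lines for this demonstration.
    warnings.simplefilter("ignore")
    rows = np.loadtxt(io.StringIO(text), delimiter=",", max_rows=2, ndmin=2)

# New meaning: the comment line is not counted, so both data rows are read.
# Under the old meaning only the first data row would have been returned.
print(rows.shape)
```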

The work-around to get the old version back is:

    import itertools
    import numpy as np

    with open("file") as f:
        lines = itertools.islice(f, 0, max_rows)
        result = np.loadtxt(lines, ...)

(This is noted in the release notes and in the `UserWarning`, although
the warning text could be improved.)

There are three possible "actions" I can think of:
1. We can add `max_lines` to do the `itertools` trick for the user.
2. If the change is considered too big, we could revert it.
3. We could revert and deprecate the name in favor of new ones, e.g.
   `nrows` and `nlines`.
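
To make option 1 concrete, here is a rough sketch of what a line-based
limit could look like as a user-level helper.  The name
`loadtxt_max_lines` and its signature are hypothetical, not an existing
or proposed NumPy API; it only wraps the `itertools.islice` trick:

```python
import itertools

import numpy as np


def loadtxt_max_lines(path, max_lines, **kwargs):
    """Hypothetical helper: parse at most `max_lines` physical lines."""
    with open(path) as f:
        # islice stops after `max_lines` lines, whether or not they hold
        # data, which matches the old line-counting meaning.
        return np.loadtxt(itertools.islice(f, max_lines), **kwargs)
```

For the three-line example file, `loadtxt_max_lines(path, 2,
delimiter=",")` would only see the first data line and the comment.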

As an additional point of reference `pandas.read_csv` has `nrows`
matching the new behavior.

I do not have a strong opinion.  I lean towards the new meaning; Chuck
prefers the old one (I think).
One reason for me was that users may currently be reading too little
data because they assume `max_rows` already has the new meaning (i.e.
we fix a bug for them).

Cheers,

Sebastian

On Tue, 2022-02-08 at 08:08 -0600, Sebastian Berg wrote:
> Hi all,
> 
> just a brief heads up that:
> 
>     https://github.com/numpy/numpy/pull/20580
> 
> is now merged.  This moves `np.loadtxt` to C, mainly making it much
> faster.  There are also some other improvements and changes though:
> 
> * It now supports `quotechar='"'` to support Excel dialect CSV.
> * Parsing some numbers is stricter (e.g. removed support for `_`
>   or hex float parsing by default).
> * `max_rows` now actually counts rows and not lines.  A warning
>   is given if this makes a difference (blank lines).
> * Some exceptions will change; parsing failures now (almost) always
>   give an informative `ValueError`.
> * `converters=callable` is now valid to provide a single converter
>   for all columns.
> 
> Please test, and let us know if there is any issue or followup you
> would like to see.  
> 
> We do have possible followups planned:
> * Consider deprecating the `encoding="bytes"` default which exists
>   for Python 2 compatibility.
> * Consider renaming `skiprows` to the more precise `skip_lines`.
> 
> Moving to C unlocks possible further improvement, such as full
> `csv.Dialect` support.  We do not have this on the roadmap, but such
> contributions are possible now.
> Similarly, it should be possible to rewrite `genfromtxt` based on
> this work.
> 
> Cheers,
> 
> Sebastian
> _______________________________________________
> NumPy-Discussion mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: [email protected]
