On Sun, May 24, 2009 at 5:28 PM, Pauli Virtanen <[email protected]> wrote:

> Sun, 24 May 2009 14:29:42 -0600, Charles R Harris wrote:
> > I am trying to put together some rule for parsing text strings/files in
> > fromfile, fromstring so that the two are consistent. Tickets relevant to
> > this are #1116 <http://projects.scipy.org/numpy/ticket/1116> and
> > #883 <http://projects.scipy.org/numpy/ticket/883>. The question here is
> > the interpretation of the separators, not the parsing of the numbers
> > themselves. Below is the current behavior of fromstring, fromfile, and
> > python split for content of "", "1", "1 1", " " respectively.
>
> It should return only the data that's in the file, no extra elements. The
> current behavior is a bug, IMHO, especially so since the default value is
> uninitialized IIRC.
>
> So,
>
>     fromstring("", sep=" ") -> []
>     fromstring(" ", sep=" ") -> []
>     fromstring("1 ", sep=" ") -> [1]
>
> fromfile should behave identically.
>
> Another question is perhaps what to do with malformed input: whether
> to try best-efforts parsing, or bail out. I'd suggest bailing out
> when encountering bad data rather than guessing:
>
>     fromstring("1,2,,3", sep=",") -> [1,2] or ValueError
>
> Currently, something horrible happens:
>
>     >>> np.fromstring('1,2,,3,,,6', sep=',')
>     array([ 1., 2., -1., 3., -1., -1., 6.])
>
>
> Also, on second thoughts, the idea about raising a warning on malformed
> input seems more repulsive the more I think about it. Warnings are a bit
> nasty to catch, spam stderr if uncaught, and IMHO should not be a part of
> "business as usual" code paths. Having malformed input is business as
> usual :)
>
> In some sense, it would be simpler if `fromfile` and `fromstring` would
> be defined so that they read *at most* `count` entries, and return what
> they got by parsing the leftmost valid part. This could be implemented by
> fixing the current bugs and removing the fprintf that currently prints to
> stderr there.
>
> As an addition, a flag could be added that forces them to raise a
> ValueError on malformed input (eg. EOF when `count` was given, or bad
> separator encountered). Ideally, the exceptions flag would be True by
> default both for fromfile and fromstring, but I guess some legacy
> applications might rely on the current behavior...
>
> Also, one could envision a "default" value that would denote a batch of
> malformed input...
>
> ***
>
> So, I see a couple of alternatives (some already suggested):
>
> a) fromstring("1,2,x,4", sep=",") -> [1,2]
>    fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
>
> b) fromstring("1,2,x,4", sep=",") -> [1,2]
>    fromstring("1,2,x,4", sep=",", strict=True) -> ValueError
>    fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5, strict=True) -> ValueError
>
> c) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>    fromstring("1,2,x,4", sep=",", count=5) -> [1,2] + SomeWarning
>
> d) fromstring("1,2,x,4", sep=",") -> [1,2] + SomeWarning
>    fromstring("1,2,x,4", sep=",", default=3) -> [1,2,3,4]
>    fromstring("1,2,x,4", sep=",", default=3, count=5) -> [1,2,3,4] + SomeWarning
>
> e) fromstring("1,2,x,4", sep=",") -> ValueError
>    fromstring("1,2,x,4", sep=",", strict=False) -> [1,2]
>    fromstring("1,2,x,4", sep=",", count=5) -> ValueError
>    fromstring("1,2,x,4", sep=",", count=5, strict=False) -> [1,2]
>
> Fromfile would always behave the same way as `fromstring(file.read())`.

I think a common behavior is basic to whatever we end up with.

> In the above, " " in sep would equal the regexp \w+, and binary data
> implied by sep='' would be interpreted in the same way it would if first
> converted to comma-separated text.
>
> Can you think of any other alternatives? (Let's forget the names of
> the new keyword arguments for the present, and assume they have
> perfectly fitting names.)
>
>
> I'd vote for (e) if the slate was clean, but since it's not:
>
> +1 for (a) or (b)

(a) and (e) are the simplest and just differ in the default, so that would
be the shortest path. OTOH, (b) is the most general and the default is a
nice idea. Hmm...

Chuck
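
To make the comparison concrete, here is a rough pure-Python model of
alternative (b), which contains (a) as the case where no default is given.
This is only a sketch under assumptions, not what NumPy does or will do: the
function name `parse_text` is made up, `strict` and `default` are the
placeholder keyword names from the list above, and the binary `sep=''` case
is not modelled at all. Whitespace separators go through `str.split()`,
which splits on runs of whitespace (`\s+` in regexp terms) and ignores
leading and trailing whitespace.

```python
def parse_text(s, sep=" ", count=-1, strict=False, default=None):
    """Hypothetical model of alternative (b); not NumPy's implementation.

    Returns a list of floats parsed from the leftmost valid part of `s`.
    """
    if s.strip() == "":
        # Empty or whitespace-only input holds no data at all.
        tokens = []
    elif sep.strip() == "":
        # Whitespace separator: any run of whitespace separates entries.
        # (Assumption: binary mode, sep='', is not modelled here.)
        tokens = s.split()
    else:
        # Other separators: split exactly, then ignore padding whitespace.
        tokens = [t.strip() for t in s.split(sep)]

    out = []
    for tok in tokens:
        if count >= 0 and len(out) >= count:
            break  # read at most `count` entries
        try:
            out.append(float(tok))
        except ValueError:
            if default is not None:
                out.append(float(default))  # substitute for the bad entry
            elif strict:
                raise ValueError("malformed entry: %r" % (tok,)) from None
            else:
                break  # keep only the leftmost valid part

    if strict and count >= 0 and len(out) < count:
        raise ValueError("expected %d entries, got %d" % (count, len(out)))
    return out


print(parse_text("1 ", sep=" "))                   # [1.0]
print(parse_text("1,2,x,4", sep=","))              # [1.0, 2.0]
print(parse_text("1,2,x,4", sep=",", default=3))   # [1.0, 2.0, 3.0, 4.0]
```

Flipping the default of `strict` to True in this sketch gives (e), which is
why (a) and (e) look like the shortest path; the `default` branch is the only
extra piece (b) needs.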
_______________________________________________
Numpy-discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
