On Sun, 19 May 2019, Sven Schreiber wrote:

> On 19.05.2019 at 19:13, Riccardo (Jack) Lucchetti wrote:
>>>
>>> Hmm, interesting idea. I think this could be made to work quite
>>> nicely. Internally, nothing prevents us from creating a new, temporary
>>> "hidden" dataset (then turning it into a matrix) without disturbing
>>> the existing dataset or absence of dataset.
>>
>> This would be very nice of course, but in that case I would imagine the
>> job would be less straightforward than it seems, because of the
>> intrinsic differences between the eventual aims.
>
> Given that we have a function for reading a matrix from a file (mread) I
> think the natural aim should be to extend that function eventually to
> read from csv. Either with a new option or perhaps simply by recognizing
> a ".csv" file extension.
> (I'm speaking purely from a user's point of view here.)
> But if that isn't feasible in the short term, maybe a transitory
> function in "extra" could indeed be the solution.
A few points on this.

1) Jack's csv2mat is an outstanding example of accomplishing a lot
with just a few lines of hansl. Of course this is not in the least
unusual from Jack, but for the rest of us it's noteworthy all the
same!

2) I take Jack's point that the "no error" criterion for reading a
dataset from CSV (which we already do) is more restrictive than that
for reading a matrix from CSV -- where we don't have to care about
valid variable names, nor about handling non-numeric values, which we
can just map onto "NA" without further ado.

3) Nonetheless, I find that it's not too difficult to handle the
issues under point 2 in the context of our current CSV importation
code. In current git, you can try out reading CSV into a matrix via
mread() when the filename (or URL) has a ".csv" extension. Two
comments on that: (a) "CSV" really just means delimited text, the
delimiter doesn't have to be comma; and (b) if we want to pursue this
option we could admit some other filename extensions.

4) One point supported by Jack's hansl code that is not supported by
our built-in CSV importer is malformed CSV (e.g. some lines have more
fields than others). I don't think we'd want to support this in our C
code -- and actually I kinda wonder about the wisdom of supporting it
at all.

I'm attaching a sample script that derives from Jack's original
upthread. It requires, and compares results with, Jack's csv2mat.inp.

Allin
include csv2mat.inp

# standard case
open data4-1.gdt --quiet
store test.csv
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X

# malformed
s = sprintf("a;b;c\n1;2;3\n5;6\n7;8;9;1000\n")
outfile test.csv --write
  print s
end outfile
X = csv2mat("test.csv")
print X
printf "malformed input, not handled by mread()\n\n"

# weird delimiter, and NAs interspersed
s = sprintf("1!2!3\n!5!6\n7!NA!9\n")
outfile test.csv --write
  print s
end outfile
X = csv2mat("test.csv", "!")
print X
set csv_delim "!"
X = mread("test.csv")
print X
set csv_delim comma

# from the web
X = csv2mat("https://app.quadstat.net/system/files/datasets/dataset-65863.csv")
print X
X = mread("https://app.quadstat.net/system/files/datasets/dataset-65863.csv")
print X

# simple OK case, without column names
s = sprintf("1;2;3\n;5;6\n7;NA;9\n")
outfile test.csv --write
  print s
end outfile
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X

# simple case, but with a missing colname
s = sprintf("a;;c\n1;2;3\n;5;6\n7;NA;9\n")
outfile test.csv --write
  print s
end outfile
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X