On Sun, 19 May 2019, Sven Schreiber wrote:

> On 19.05.2019 at 19:13, Riccardo (Jack) Lucchetti wrote:
>>> 
>>> Hmm, interesting idea. I think this could be made to work quite
>>> nicely. Internally, nothing prevents us from creating a new, temporary
>>> "hidden" dataset (then turning it into a matrix) without disturbing
>>> the existing dataset or absence of dataset.
>> 
>> This would be very nice of course, but in that case I would imagine the
>> job would be less straightforward than it seems, because of the
>> intrinsic differences between the eventual aims.
>
> Given that we have a function for reading a matrix from a file (mread) I
> think the natural aim should be to extend that function eventually to
> read from csv. Either with a new option or perhaps simply by recognizing
> a ".csv" file extension.
> (I'm speaking purely from a user's point of view here.)
> But if that isn't feasible in the short term, maybe a transitory
> function in "extra" could indeed be the solution.

A few points on this.

1) Jack's csv2mat is an outstanding example of accomplishing a lot 
with just a few lines of hansl. Of course this is not in the least 
unusual from Jack, but for the rest of us it's noteworthy all the 
same!

2) I take Jack's point that the "no error" criterion for reading a 
dataset from CSV (which we already do) is more restrictive than that 
for reading a matrix from CSV -- where we don't have to care about 
valid variable names, nor about handling non-numeric values, which 
we can just map onto "NA" without further ado.

3) Nonetheless, I find that it's not too difficult to handle the 
issues under point 2 in the context of our current CSV importation 
code. In current git, you can try out reading CSV into a matrix via 
mread() when the filename (or URL) has a ".csv" extension. Two 
comments on that: (a) "CSV" really just means delimited text (the 
delimiter doesn't have to be a comma); and (b) if we want to pursue 
this option we could admit some other filename extensions.

4) One point supported by Jack's hansl code that is not supported by 
our built-in CSV importer is malformed CSV (e.g. some lines have 
more fields than others). I don't think we'd want to support this in 
our C code -- and actually I kinda wonder about the wisdom of 
supporting it at all.
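
To make "malformed" in point 4 concrete, here is a minimal hansl sketch, not part of gretl or of Jack's csv2mat.inp, that counts the rows whose number of delimiters differs from that of the first non-empty row. The function name csv_ragged_rows is hypothetical. The sketch assumes that fields are never quoted with embedded delimiters, and that strsplit() accepts a separator as its second argument, as in recent gretl versions.

```hansl
# Hypothetical helper: count the rows whose number of delimiters differs
# from that of the first non-empty row. Quoted fields are NOT handled.
function scalar csv_ragged_rows (string fname, string delim)
    string text = readfile(fname)
    # build a real newline via sprintf rather than relying on escapes
    string nl = sprintf("\n")
    strings rows = strsplit(text, nl)
    scalar n = nelem(rows)
    scalar ref = -1
    scalar bad = 0
    scalar nd = 0
    string row = ""
    loop i = 1..n
        row = rows[i]
        if strlen(row) > 0
            # delimiter count = length lost when delimiters are removed
            nd = strlen(row) - strlen(strsub(row, delim, ""))
            if ref < 0
                ref = nd
            elif nd != ref
                bad += 1
            endif
        endif
    endloop
    return bad
end function

# usage: a non-zero result suggests a file the built-in importer would reject
printf "ragged rows: %d\n", csv_ragged_rows("test.csv", ";")
```

A pre-check along these lines would let a script decide between mread() and a more tolerant reader without the C importer itself having to accept ragged input.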

I'm attaching a sample script that derives from Jack's original 
upthread. It requires, and compares results with, Jack's 
csv2mat.inp.

Allin

include csv2mat.inp

# standard case

open data4-1.gdt --quiet
store test.csv
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X

# malformed

s = sprintf("a;b;c\n1;2;3\n5;6\n7;8;9;1000\n")
outfile test.csv --write
    print s
end outfile
X = csv2mat("test.csv")
print X
printf "malformed input, not handled by mread()\n\n"

# weird delimiter, and NAs interspersed

s = sprintf("1!2!3\n!5!6\n7!NA!9\n")
outfile test.csv --write
    print s
end outfile
X = csv2mat("test.csv", "!")
print X
set csv_delim "!"
X = mread("test.csv")
print X
set csv_delim comma

# from the web

X = csv2mat("https://app.quadstat.net/system/files/datasets/dataset-65863.csv")
print X
X = mread("https://app.quadstat.net/system/files/datasets/dataset-65863.csv")
print X

# simple OK case, without column names

s = sprintf("1;2;3\n;5;6\n7;NA;9\n")
outfile test.csv --write
    print s
end outfile
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X

# simple case, but with a missing colname

s = sprintf("a;;c\n1;2;3\n;5;6\n7;NA;9\n")
outfile test.csv --write
    print s
end outfile
X = csv2mat("test.csv")
print X
X = mread("test.csv")
print X


