Re: [RFC] CSV parser

Robert Jacques Tue, 05 Apr 2011 22:56:15 -0700

On Tue, 05 Apr 2011 12:45:59 -0400, Jesse Phillips<[email protected]> wrote:

Robert Jacques Wrote:
* You should input ranges. It's fine to detect slicing and optimize for
it, but you should support simple input ranges as well.
This implementation only operates with input ranges, of text. I haveanother implementation which works on a slice-able forward range ofanything. I stopped development because it didn't support input ranges,and Phobos must have this.
https://github.com/he-the-great/JPDLibs/tree/csvoptimize

The library should work with both and be efficient with both. i.e. detectat compile-time whether you have a string or an input range, and actaccordingly.

* I'd think being able to retrieve the headings from the csv would be a
good [optional] feature.
Do you think it would be good for the interface to use an associativearray. Maybe the heading could be the first thing to get from aRecordList.
auto records  = csvText(str);
auto heading = records.heading;

or shouldn't it be passed in as an empty array:

string[] heading;
auto records = csvText(str, heading);

Hmm... The latter makes it explicit that there should be headers, whilewith the former, you'd need to call heading before parsing the records,which could be a source of bugs. Then again, records.heading is morenatural. Alternative syntax ideas:


auto records = csvText(str, null);
auto heading = records.heading;

or

auto records = csvText!(csvOptions.headers)(str);
auto heading = records.heading;

* Exposing the tokenizer would be useful.
Good. I'm thinking I will use an output range of type char[] instead ofAppender.

Output range? You mean as a buffer? I'd think Appender is the betterchoice for that internally, but you could allow the user to pass in achar[] buffer to use as the basis of the appender.

* Regarding buffering, it's okay for the tokenizer to expose bufferingin
it's API (and users should be able to supply their own buffers), but I
don't think an unbuffered version of csvText itself is correct;
As I'm thinking of using an output range I believe it has the job ofallowing a user specified buffer or not. I'm not sure if the interfacefor csvText or RecordList should allow for custom buffers.

An output range is not a buffer. A buffer requires a clear method and adata retrieval method. All output ranges provide is put.

csvByRecord or csvText!(T).byRecord would be more appropriate.
You always iterate CSV by record, what other option is there? If youdesire an array for the record:
auto records = csvText(str);

foreach(record: records)
    auto arr = array(record); // ...

Well, some of your example code includedcsvText!MyStruct(str,["my","headers"]); or possibly justcsvText!MyStruct(str). Which would have either provide a forward range ofMyStructs or an array of MyStruct. In either case, duplication of stringswhere appropriate should occur. Also, remember that records might not bemanually processed; the user might forward it to map,filter,etc andtherefore should be safe by default. A user should have to manually statethat they are going to want a lazy buffered version.

And
anyways, since you're only using strings, why is there any bufferinggoingon at all? string values should simply be sliced, not buffered.Buffering
should only come into play with input ranges.
Input ranges do not provide slicing. On top of that you _must_ modifythe returned data at times. So you can do buffering and slicing togetherwhich my other implementation will do, but as I said it won't supportinput ranges.

Why do you need to do any modification? Part of the advantage of csv isthat it's just text: you don't have to deal with escape characters, etc.

* There should be a way to specify other separators; I've started using
tab separated files as ','s show up in a lot of data.
You can use a custom "comma" and "quote" the record break is notmodifiable because it doesn't fit into a single character. I can makethis available through csvText and am not exactly sure why I didn't.
* Any thought of parsing a file into a tuple of arrays? Writing csv?
A static CSV parser? I would hope CTFE would allow for using thisparser, at some point.

By tuple I meant the tuple(5,7) kind, not the TypeTuple!(int,char) kind.Basically,


auto records = csvText!(string,real)(str);
string[] names   = records._0;
real[]   grades  = records._1;

vs

auto records = csvText!(Tuple!(string,"name",real,"grade"))(str);
foreach(record; records) {
        writeln(record.name," got ",record.grade);
}

For writing I made a quick and dirty one to write structures to a file.But I have not considered adding such functionality to the library.Thank you for the comment.


void writeStruct(string file, Layout data) {
    string[] elements;

    foreach(i, U; FieldTypeTuple!Layout) {
        elements ~= to!string(data.tupleof[i]);
    }

    string record;
    foreach(e; elements) {
        if(find(e, `"`, ",", "\n", "\r")[1] == 0)
            record ~= e ~ ",";
        else
            record ~= `"` ~ replace(e, `"`, `""`) ~ `",`;
    }

    std.file.append(file, record[0..$-1] ~ "\n");
}

https://gist.github.com/805392

Re: [RFC] CSV parser

Reply via email to