I'll try to get around to comparing against the DataFrames version and profiling this week. I got stuck trying to figure out the action semantics.
On Tuesday, May 27, 2014 6:58:42 PM UTC-4, John Myles White wrote:
>
> I'd be really interested to see how this parser compares with DataFrames.
> There's a bunch of test files in the DataFrames.jl/test directory.
>
>  -- John
>
> On May 27, 2014, at 3:49 PM, Abe Schneider <abe.sc...@gmail.com> wrote:
>
> I don't know how the speed of the parser will compare to DataFrames --
> I've done absolutely no work to date on profiling the code, but I thought
> writing a CSV parser was a good way to test out code (and helped find a
> bunch of bugs).
>
> I've also committed (under examples/) the CSV parser. The grammar (from
> the RFC) is:
>
> @grammar csv begin
>   start = data
>   data = record + *(crlf + record)
>   record = field + *(comma + field)
>   field = escaped_field | unescaped_field
>   escaped_field = dquote + *(textdata | comma | cr | lf | dqoute2) + dquote
>   unescaped_field = textdata
>   textdata = r"[ !#$%&'()*+\-./0-~]+"
>   cr = '\r'
>   lf = '\n'
>   crlf = cr + lf
>   dquote = '"'
>   dqoute2 = "\"\""
>   comma = ','
> end
>
> and the actions are:
>
> tr["crlf"] = (node, children) -> nothing
> tr["comma"] = (node, children) -> nothing
>
> tr["escaped_field"] = (node, children) -> node.children[2].value
> tr["unescaped_field"] = (node, children) -> node.children[1].value
> tr["field"] = (node, children) -> children
> tr["record"] = (node, children) -> unroll(children)
> tr["data"] = (node, children) -> unroll(children)
> tr["textdata"] = (node, children) -> node.value
>
> Given the data:
>
> parse_data = """1,2,3\r\nthis is,a test,of csv\r\n"these","are","quotes ("")""""
>
> and running the parser:
>
> (node, pos, error) = parse(csv, parse_data)
> result = transform(tr, node)
>
> I get:
>
> {{"1","2","3"},{"this is","a test","of csv"},{"these","are","quotes (\"\")"}}
>
> On Monday, May 26, 2014 3:41:26 AM UTC-4, harven wrote:
>>
>> Nice!
>>
>> If you are interested in testing your library on a concrete problem, you
>> may want to parse comma separated value (csv) files. The BNF is in the
>> specification RFC4180. http://tools.ietf.org/html/rfc4180
>>
>> AFAIK, the readcsv function provided in Base does not handle quotations
>> well whereas the csv parser in DataFrames is slow, so that julia does not
>> yet have a native efficient way to parse csv files.
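
For checking the output of the grammar above against something independent, a hand-written baseline might help. What follows is only a minimal sketch in plain Julia, assuming 1.x syntax and no packages; `parse_csv_baseline` is a made-up name and is not part of the parser library or DataFrames. One difference worth noting: RFC 4180 treats a doubled quote inside a quoted field as a single literal quote, so this sketch collapses "" to ", whereas the escaped_field action quoted above returns the raw text. Which behaviour the actions should have is a design choice.

```julia
# Baseline reader for RFC 4180-style input, written without any packages.
# Hypothetical helper for cross-checking only; assumes Julia 1.x.
function parse_csv_baseline(text::AbstractString)
    rows = Vector{Vector{String}}()
    row = String[]
    field = IOBuffer()
    in_quotes = false
    chars = collect(text)
    n = length(chars)
    i = 1
    while i <= n
        c = chars[i]
        if in_quotes
            if c == '"'
                if i < n && chars[i + 1] == '"'
                    write(field, '"')      # doubled quote -> one literal quote
                    i += 1                 # skip the second quote of the pair
                else
                    in_quotes = false      # closing quote of the field
                end
            else
                write(field, c)            # commas, CR and LF are literal here
            end
        elseif c == '"'
            in_quotes = true               # opening quote of an escaped field
        elseif c == ','
            push!(row, String(take!(field)))
        elseif c == '\r' && i < n && chars[i + 1] == '\n'
            push!(row, String(take!(field)))
            push!(rows, row)               # CRLF ends the current record
            row = String[]
            i += 1                         # skip the LF
        else
            write(field, c)
        end
        i += 1
    end
    # Flush the last record (the sample input has no trailing CRLF).
    push!(row, String(take!(field)))
    push!(rows, row)
    return rows
end

# Same sample input as in the quoted message, written as an ordinary string.
data = "1,2,3\r\nthis is,a test,of csv\r\n\"these\",\"are\",\"quotes (\"\")\""
rows = parse_csv_baseline(data)
for r in rows
    println(r)
end
```

Input that ends with a trailing CRLF would produce an extra record holding one empty field in this sketch, so strip it first if the test files end that way.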