I don't know how the parser's speed will compare to DataFrames. I haven't
profiled the code at all yet, but writing a CSV parser seemed like a good
way to exercise the library (and it helped me find a bunch of bugs).
I've also committed (under examples/) the CSV parser. The grammar (from the
RFC) is:
@grammar csv begin
start = data
data = record + *(crlf + record)
record = field + *(comma + field)
field = escaped_field | unescaped_field
escaped_field = dquote + *(textdata | comma | cr | lf | dquote2) + dquote
unescaped_field = textdata
textdata = r"[ !#$%&'()*+\-./0-~]+"
cr = '\r'
lf = '\n'
crlf = cr + lf
dquote = '"'
dquote2 = "\"\""
comma = ','
end
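As a quick sanity check on the textdata character class, here is a small
sketch that tests the same pattern outside the grammar. It assumes a
current Julia (occursin) and adds ^ and $ anchors, which are my addition
and not part of the grammar, so that the whole string has to match:

```julia
# Same character class as the `textdata` rule, anchored to the whole string.
# The anchors are an addition for this check only.
textdata_only = r"^[ !#$%&'()*+\-./0-~]+$"

# Plain text, including spaces, is accepted.
@assert occursin(textdata_only, "this is")

# Commas and double quotes are excluded, so they can only appear
# inside an escaped (quoted) field.
@assert !occursin(textdata_only, "a,b")
@assert !occursin(textdata_only, "say \"hi\"")
```

The class leaves out exactly the comma and the double quote from the
printable ASCII range, which matches TEXTDATA in the RFC.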
and the actions are:
tr["crlf"] = (node, children) -> nothing
tr["comma"] = (node, children) -> nothing
tr["escaped_field"] = (node, children) -> node.children[2].value
tr["unescaped_field"] = (node, children) -> node.children[1].value
tr["field"] = (node, children) -> children
tr["record"] = (node, children) -> unroll(children)
tr["data"] = (node, children) -> unroll(children)
tr["textdata"] = (node, children) -> node.value
Given the data:
parse_data = """1,2,3\r\nthis is,a test,of csv\r\n"these","are","quotes ("")\""""
and running the parser:
(node, pos, error) = parse(csv, parse_data)
result = transform(tr, node)
I get:
{{"1","2","3"},{"this is","a test","of csv"},{"these","are","quotes (\"\")"}}
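To cross-check that result independently of the grammar, here is a rough
hand-written splitter. It is not the PEG parser and not part of the
library; split_csv is a hypothetical helper written only for this
comparison, and it assumes a current Julia. Like the actions above, it
leaves doubled quotes inside a quoted field untouched:

```julia
# Hand-written splitter for the subset of RFC 4180 used in this post.
# Hypothetical helper for cross-checking only; it is not the PEG parser.
function split_csv(text::AbstractString)
    records = Vector{Vector{String}}()
    fields = String[]
    buf = IOBuffer()
    inquotes = false
    chars = collect(text)
    i = 1
    while i <= length(chars)
        c = chars[i]
        if inquotes
            if c == '"' && i < length(chars) && chars[i + 1] == '"'
                print(buf, "\"\"")    # keep the escaped pair as-is
                i += 1
            elseif c == '"'
                inquotes = false      # closing quote
            else
                print(buf, c)
            end
        elseif c == '"'
            inquotes = true           # opening quote
        elseif c == ','
            push!(fields, String(take!(buf)))
        elseif c == '\r' && i < length(chars) && chars[i + 1] == '\n'
            push!(fields, String(take!(buf)))   # end of record
            push!(records, fields)
            fields = String[]
            i += 1
        else
            print(buf, c)
        end
        i += 1
    end
    # The last record has no trailing CRLF, so flush it here.
    push!(fields, String(take!(buf)))
    push!(records, fields)
    return records
end

data = "1,2,3\r\nthis is,a test,of csv\r\n\"these\",\"are\",\"quotes (\"\")\""
rows = split_csv(data)
@assert rows[3] == ["these", "are", "quotes (\"\")"]
```

It gives the same three records as the transform output, with the ""
pair still present in the last field.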
On Monday, May 26, 2014 3:41:26 AM UTC-4, harven wrote:
>
> Nice!
>
> If you are interested by testing your library on a concrete problem, you
> may want to parse comma separated value (csv) files. The bnf is in the
> specification RFC4180. http://tools.ietf.org/html/rfc4180
>
> AFAIK, the readcsv function provided in Base does not handle quotations
> well whereas the csv parser in DataFrames is slow, so that julia does not
> have yet a native efficient way to parse csv files.
>