I don't know how the parser's speed will compare to DataFrames. I haven't
done any profiling of the code yet, but writing a CSV parser seemed like a
good way to test it out (and it helped find a bunch of bugs).

I've also committed (under examples/) the CSV parser. The grammar (from the 
RFC) is:

@grammar csv begin
  start = data
  data = record + *(crlf + record)
  record = field + *(comma + field)
  field = escaped_field | unescaped_field
  escaped_field = dquote + *(textdata | comma | cr | lf | dquote2) + dquote
  unescaped_field = textdata
  textdata = r"[ !#$%&'()*+\-./0-~]+"
  cr = '\r'
  lf = '\n'
  crlf = cr + lf
  dquote = '"'
  dquote2 = "\"\""
  comma = ','
end
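
As a sanity check, the textdata character class can be compared against the RFC's TEXTDATA rule (%x20-21 / %x23-2B / %x2D-7E). The following is a minimal sketch in current Julia (1.x) that does not depend on the parser library; the names textdata_char and rfc_textdata are made up for illustration:

```julia
# Character class copied from the `textdata` rule in the grammar above.
textdata_char = r"[ !#$%&'()*+\-./0-~]"

# RFC 4180: TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
# (hypothetical helper, not part of the parser library)
rfc_textdata(c::Char) = c in '\x20':'\x21' || c in '\x23':'\x2b' || c in '\x2d':'\x7e'

# Collect every ASCII character on which the two definitions disagree.
mismatches = [Char(b) for b in 0x00:0x7f
              if occursin(textdata_char, string(Char(b))) != rfc_textdata(Char(b))]

println(isempty(mismatches) ? "character class matches TEXTDATA" : mismatches)
```

An empty mismatches list means the class excludes exactly the double quote, the comma, and the control characters, as the RFC requires.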

and the actions are:

tr["crlf"] = (node, children) -> nothing
tr["comma"] = (node, children) -> nothing

tr["escaped_field"] = (node, children) -> node.children[2].value
tr["unescaped_field"] = (node, children) -> node.children[1].value
tr["field"] = (node, children) -> children
tr["record"] = (node, children) -> unroll(children)
tr["data"] = (node, children) -> unroll(children)
tr["textdata"] = (node, children) -> node.value


Given the data:

parse_data = """1,2,3\r\nthis is,a test,of csv\r\n"these","are","quotes ("")""""

and running the parser:

(node, pos, error) = parse(csv, parse_data)
result = transform(tr, node)

I get:

{{"1","2","3"},{"this is","a test","of csv"},{"these","are","quotes (\"\")"}}
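
Note that the last field above still contains the doubled quote exactly as it appears in the input. Under RFC 4180 a pair of double quotes inside an escaped field stands for one literal double quote, so a post-processing step along these lines could collapse them. This is only a sketch in current Julia (1.x); unescape_field is a hypothetical helper, not something in the committed example:

```julia
# Hypothetical helper: collapse each RFC 4180 escaped quote ("") into a single ".
unescape_field(raw::AbstractString) = replace(raw, "\"\"" => "\"")

field = "quotes (\"\")"          # inner text of the last escaped field
println(unescape_field(field))   # prints: quotes (")
```

Something like this could presumably be applied to the value returned by the escaped_field action, assuming that value is the raw text between the enclosing quotes.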





On Monday, May 26, 2014 3:41:26 AM UTC-4, harven wrote:
>
> Nice!
>
> If you are interested in testing your library on a concrete problem, you
> may want to parse comma-separated value (CSV) files. The BNF is in the
> specification, RFC 4180: http://tools.ietf.org/html/rfc4180
>
> AFAIK, the readcsv function provided in Base does not handle quotations
> well, whereas the CSV parser in DataFrames is slow, so Julia does not
> yet have an efficient native way to parse CSV files.
>
