Re: [julia-users] PEG Parser

Abe Schneider Sun, 13 Jul 2014 07:28:19 -0700

I agree, I think PEG grammars are likely not well-suited to parsing large 
amounts of data (especially where a hand-crafted parser can be created). 
That said, I am curious as to why the speed difference exists. In the case 
of CSV, I think the packrat caching mechanism probably doesn't help, but 
disabling that is still slow.


The next logical point of slowness is the number of function-calls 
required. I know that the multi-dispatch has some added penalty over 
single-dispatch, but I'm not sure by how much. It's possible that I could 
rewrite some of the CSV grammar to make less calls by either rewriting some 
of the rules or by using more regular expressions.

Finally, the last likely place for the slow-down is the fact that PEGParser 
has to create the AST as it goes along. This is a huge overhead that the 
DataFrames CSV parser does not need to do since it's a single-purpose 
parser.

I'm going to continue work on the Dot parser as I think it's probably more 
useful (and more difficult to write by hand).

Also, I've created a pull request to add PEGParser to the official package 
list. If it's something that interests other people, it would be great to 
have other people involved.

On Sunday, July 6, 2014 9:08:40 PM UTC-4, John Myles White wrote:ly,, 

> Thanks for looking into this, Abe. That’s too bad that the CSV parser is 
> much slower than the hand-crafted one. PEG seems like a great tool for 
> tasks where maximum performance isn’t as important.
>
>  — John
>
> On Jul 4, 2014, at 11:09 AM, Abe Schneider <abe.sc...@gmail.com 
> <javascript:>> wrote:
>
> I got sidetracked by a couple of other things, but the parser is now 
> updated with a bunch of bug fixes. I have a preliminary CSV and graphdot 
> parser (very reduced from the full grammar). I'm starting to put together 
> some more comprehensive tests together.
>
> As for speed comparison to DataFrames for parsing CSV it's much slower. 
> I've spent time trying to optimize things, but I suspect a large part of 
> the speed issue is the overhead of function calls. Also, I suspect it will 
> be hard to come close to the speed of DataFrames as the code looks like 
> it's fairly optimized for reading just CSV files.
>
> After a few more tests are written, I'm getting ready to officially call a 
> version 0.1.
>
> On Thursday, June 5, 2014 6:56:37 AM UTC-4, Abe Schneider wrote:
>>
>> I also forgot to push the changes last night.
>>
>> On Wednesday, June 4, 2014 11:01:33 PM UTC-4, Abe Schneider wrote:
>>>
>>> After playing around with a bunch of alternatives, I think I've come up 
>>> with decent action semantics:
>>>
>>> @transform <name> begin
>>>  <label> = <action>
>>> end
>>>
>>> For example, a simple graph grammar might look like:
>>>
>>> @grammar nodetest begin
>>>   start = +node_def
>>>   node_def = node_label + node_name + lbrace + data + rbrace
>>>   node_name = string_value + space
>>>
>>>   data = *(line + semicolon)
>>>   line = string_value + space
>>>   string_value = r"[_a-zA-Z][_a-zA-Z0-9]*"
>>>
>>>   lbrace = "{" + space
>>>   rbrace = "}" + space
>>>   semicolon = ";" + space
>>>   node_label = "node" + space
>>>   space = r"[ \t\n]*"
>>> end
>>>
>>> with it's actions to create some data structure:
>>>
>>> type MyNode
>>>   name
>>>   values
>>>
>>>   function MyNode(name, values)
>>>     new(name, values)
>>>   end
>>> end
>>>
>>>
>>> with:
>>> @transform tograph begin
>>>   # ignore these
>>>   lbrace = nothing
>>>   rbrase = nothing
>>>   semicolon = nothing
>>>   node_label = nothing
>>>   space = nothing
>>>
>>>   # special action so we don't have to define every label
>>>   default = children
>>>
>>>   string_value = node.value
>>>   value = node.value
>>>   line = children
>>>   data = MyNode("", children)
>>>   node_def = begin
>>>     local name = children[1]
>>>     local cnode = children[2]
>>>     cnode.name = name
>>>     return cnode
>>>   end
>>> end
>>>
>>> and finally, to apply the transform:
>>>
>>> (ast, pos, error) = parse(nodetest, data)
>>> result = apply(tograph, ast)
>>> println(result)    # {MyNode("foo",{"a","b"}),MyNode("bar",{"c","d"})}
>>>
>>> The magic in '@transform' basically just creates the dictionary like 
>>> before, but automatically wraps the expression on the RHS  as an anonymous 
>>> function  (node, children) -> expr.
>>>
>>> I'm currently looking for a better name than 'children', as it's 
>>> potentially confusing and misleading. It's actually the values of the child 
>>> nodes (as opposed to node.children). Maybe cvalues?
>>>
>>> On Sunday, May 25, 2014 10:28:45 PM UTC-4, Abe Schneider wrote:
>>>>
>>>> I wrote a quick PEG Parser for Julia with Packrat capabilities:
>>>>
>>>> https://github.com/abeschneider/PEGParser
>>>>
>>>> It's a first draft and needs a ton of work, testing, etc., but if this 
>>>> is of interest to anyone else, here is a quick description.
>>>>
>>>> Grammars can be defined using most of the standard EBNF syntax. For 
>>>> example, a simple math grammar can be defined as:
>>>>
>>>> @grammar mathgrammar begin
>>>>
>>>>   start = expr
>>>>   number = r"([0-9]+)"
>>>>   expr = (term + op1 + expr) | term
>>>>   term = (factor + op2 + term) | factor
>>>>   factor = number | pfactor
>>>>   pfactor = ('(' + expr + ')')
>>>>   op1 = '+' | '-'
>>>>   op2 = '*' | '/'
>>>> end
>>>>
>>>>
>>>>
>>>> To parse a string with the grammar:
>>>>
>>>> (node, pos, error) = parse(mathgrammar, "5*(2-6)")
>>>>
>>>> This will create an AST which can then be transformed to a value. 
>>>> Currently this is accomplished by doing:
>>>>
>>>> math = Dict()
>>>>
>>>> math["number"] = (node, children) -> float(node.value)
>>>> math["expr"] = (node, children) ->
>>>>     length(children) == 1 ? children : eval(Expr(:call, children[2], 
>>>> children[1], children[3]))
>>>> math["factor"] = (node, children) -> children
>>>> math["pfactor"] = (node, children) -> children[2]
>>>> math["term"] = (node, children) ->
>>>>     length(children) == 1 ? children : eval(Expr(:call, children[2], 
>>>> children[1], children[3]))
>>>> math["op1"] = (node, children) -> symbol(node.value)
>>>> math["op2"] = (node, children) -> symbol(node.value)
>>>>
>>>>
>>>> Ideally, I would like to simplify this to using multi-dispatch on 
>>>> symbols (see previous post), but for now this is the easiest way to define 
>>>> actions based on node attributes.
>>>>
>>>> Finally, to transform the tree:
>>>>
>>>> result = transform(math, node)  # will give the value of 20
>>>>
>>>> Originally I was going to attach the transforms to the rules themselves 
>>>> (similiar to boost::spirit). However, there were two reasons for not doing 
>>>> this:
>>>>
>>>>    1. To implement the packrat part of the parser, I needed to cache 
>>>>    the results which meant building an AST anyways
>>>>    2. It's nice to be apply to get different transforms for the same 
>>>>    grammar (e.g. you may want to transform the result into HTML, LaTeX, 
>>>> etc.)
>>>>
>>>> The downside of the separation is that it adds some more complexity to 
>>>> the process.
>>>>
>>>
>

Re: [julia-users] PEG Parser

Reply via email to