Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Roman D. Boiko

On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:
It's great that you're working on this. We definitely need more 
of this sort of
stuff. However, I would point out that I'm not sure that it 
will be acceptable
for Phobos (it _may_ be, but it's not quite what we've been 
looking for).
There are two types of lexers/parsers that we've been looking 
to add to

Phobos.

1. A direct port of the lexer from dmd's C++ front-end. The API 
would be D-
ified to use ranges, but the internals would be almost 
identical to the C++
front-end so that porting changes back and forth would be very 
easy. It also
would make it easier to replace the C++ front-end with a D one 
later if the D

one is very close to the C++ one.

2. A lexer/parser generator using templates and/or CTFE to 
generate a
lexer/parser of whatever language using a BNF or EBNF grammar - 
probably
similar to what Philippe Sigaud has done to generate a parser 
using PEGs. Such
could be used to generate a lexer/parser for D or any other 
language.


What you seem to be doing is creating a D-specific parser from 
scratch, which
is great, but it's not quite what we've been looking to add to 
Phobos. That
doesn't necessarily mean that we can't or won't add it to 
Phobos once it's
complete (especially if nothing else gets done first), but it 
also may not be
acceptable, because it's not either #1 or #2. That would have 
to be decided
once it's done and ready to be reviewed for inclusion in 
Phobos. I'm not
trying to discourage you at all. I'm just pointing out that 
what you're doing
is quite what we're looking for. Regardless, it _would_ be 
interesting to be
able to compare implementations of #1 and #2 against what 
you're doing and

seeing which performs better.

- Jonathan M Davis
Yes, my project has different goals from #1 or #2. The primary 
need I'm trying to address is making development of tools for D 
development much easier. Tools are needed for D to become more 
widely used in practise. Secondary (a nice-to-have) would be to 
build a compiler on top of that. I plan to do both, but I don't 
intend to be backwards-compatible with DMD backend. I hope that 
at least part of my code will either influence or be reused in 
the reference D compiler (whatever that will be in the future).


It is important to have a reference frontend implementation which 
can be used in many tools, otherwise a lot of effort would be 
completely wasted and no tool would comply with the D 
specification and compilers.


Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Ary Manzana

On 5/12/12 12:17 PM, Roman D. Boiko wrote:

On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:

As deadalnix says, I think you are over-complicating things.

I mean, to store the column and line information it's just:

if (isNewLine(c)) {
line++;
column = 0;
} else {
column++;
}

(I think you need to add that to the SourceRange class. Then copy line
and column to token on the Lexer#lex() method)

Do you really think it's that costly in terms of performance?

I think you are wasting much more memory and performance by storing
all the tokens in the lexer.

Imagine I want to implement a simple syntax highlighter: just
highlight keywords. How can I tell DCT to *not* store all tokens
because I need each one in turn? And since I'll be highlighting in the
editor I will need column and line information. That means I'll have
to do that O(log(n)) operation for every token.

So you see, for the simplest use case of a lexer the performance of
DCT is awful.

Now imagine I want to build an AST. Again, I consume the tokens one by
one, probably peeking in some cases. If I want to store line and
column information I just copy them to the AST. You say the tokens are
discarded but their data is not, and that's why their data is usually
copied.


Would it be possible for you to fork my code and tweak it for
comparison? You will definitely discover more problems this way, and
such feedback would really help me. That doesn't seem to be inadequately
much work to do.

Any other volunteers for this?


Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.


Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread deadalnix

Le 11/05/2012 13:50, Ary Manzana a écrit :

On 5/11/12 4:22 PM, Roman D. Boiko wrote:

What about line and column information?

Indices of the first code unit of each line are stored inside lexer and
a function will compute Location (line number, column number, file
specification) for any index. This way size of Token instance is reduced
to the minimum. It is assumed that Location can be computed on demand,
and is not needed frequently. So column is calculated by reverse walk
till previous end of line, etc. Locations will possible to calculate
both taking into account special token sequences (e.g., #line 3
ab/c.d), or discarding them.


But then how do you do to efficiently (if reverse walk is any efficient)
compute line numbers?

Usually tokens are used and discarded. I mean, somebody that uses the
lexer asks tokens, process them (for example to highlight code or to
build an AST) and then discards them. So you can reuse the same Token
instance. If you want to peek the next token, or have a buffer of token,
you can use a freelist ( http://dlang.org/memory.html#freelists , one of
the many nice things I learned by looking at DMD's source code ).

So adding line and column information is not like wasting a lot of
memory: just 8 bytes more for each token in the freelist.


SDC uses struct Location to store such data.

Every token and AST element have a member data of type location.

As it is value type, no need for free list all thoses sort of thing. 
When the token is discarded, the location is discarded too. The same 
goes for AST nodes.


Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Roman D. Boiko

On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:

Le 11/05/2012 13:50, Ary Manzana a écrit :
Usually tokens are used and discarded. I mean, somebody that 
uses the
lexer asks tokens, process them (for example to highlight code 
or to
build an AST) and then discards them. So you can reuse the 
same Token
instance. If you want to peek the next token, or have a buffer 
of token,
you can use a freelist ( 
http://dlang.org/memory.html#freelists , one of
the many nice things I learned by looking at DMD's source code 
).


So adding line and column information is not like wasting a 
lot of

memory: just 8 bytes more for each token in the freelist.


SDC uses struct Location to store such data.

Every token and AST element have a member data of type location.

As it is value type, no need for free list all thoses sort of 
thing. When the token is discarded, the location is discarded 
too. The same goes for AST nodes.

1. If a struct is a field of heap allocated object, it will be
allocated and garbage collected. Only if it only exists on stack
(i.e., in method body), GC is not used.

2. If struct which is often allocated and deallocated, free list
would allow to skip its initialization, which is especially
important if size of struct is large. This is true for both heap
and stack allocated data.


Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Roman D. Boiko

On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:

1. If a struct is a field of heap allocated object, it will be
allocated and garbage collected. Only if it only exists on 
stack

(i.e., in method body), GC is not used.


As far as I can tell, it won't be allocated on it's own, since 
it is stored in the garbage collected object as a value field. 
So using a freelist would actually increase the overhead. You 
have to manage the freelist and do the allocation of the 
containing object.
Sorry for confusion. In the first point I was trying to clarify 
that using a struct doesn't mean that it is stack allocated, but 
that statement was off-topic for the given context. Currently I 
consider my first point to be useless, and the second one is not 
relevant in *my* context, since I don't do many allocations of 
Location, as well as in a majority of other contexts. As it is 
often the case, performance optimization should be performed only 
when benchmarks show it is needed.




Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Timon Gehr

On 05/12/2012 11:08 PM, Roman D. Boiko wrote:

On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:

1. If a struct is a field of heap allocated object, it will be
allocated and garbage collected. Only if it only exists on stack
(i.e., in method body), GC is not used.


As far as I can tell, it won't be allocated on it's own, since it is
stored in the garbage collected object as a value field. So using a
freelist would actually increase the overhead. You have to manage the
freelist and do the allocation of the containing object.

Sorry for confusion. In the first point I was trying to clarify that
using a struct doesn't mean that it is stack allocated, but that
statement was off-topic for the given context. Currently I consider my
first point to be useless, and the second one is not relevant in *my*
context, since I don't do many allocations of Location, as well as in a
majority of other contexts. As it is often the case, performance
optimization should be performed only when benchmarks show it is needed.



Well yes, but it seems to me that you are deliberately complicating the 
design in order to make it less efficient?


Re: DCT: D compiler as a collection of libraries

2012-05-12 Thread Roman D. Boiko

On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:

On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath 
wrote:
1. If a struct is a field of heap allocated object, it will 
be
allocated and garbage collected. Only if it only exists on 
stack

(i.e., in method body), GC is not used.


As far as I can tell, it won't be allocated on it's own, 
since it is
stored in the garbage collected object as a value field. So 
using a
freelist would actually increase the overhead. You have to 
manage the

freelist and do the allocation of the containing object.
Sorry for confusion. In the first point I was trying to 
clarify that
using a struct doesn't mean that it is stack allocated, but 
that
statement was off-topic for the given context. Currently I 
consider my
first point to be useless, and the second one is not relevant 
in *my*
context, since I don't do many allocations of Location, as 
well as in a
majority of other contexts. As it is often the case, 
performance
optimization should be performed only when benchmarks show it 
is needed.




Well yes, but it seems to me that you are deliberately 
complicating the design in order to make it less efficient?


Do you mean something like what I summarized from Ary:
http://forum.dlang.org/post/pggkobonzmgdwjigy...@forum.dlang.org ?
If that is the case, could you please add your concerns to the
dedicated issue:
https://github.com/roman-d-boiko/dct/issues/15 ?

Otherwise, if by overcomplicating you meant using a free list,
I'd like to clarify that I don't plan to introduce it yet because 
I see no reason to use it in my context.


Thanks