Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 05:41:16 UTC, Jonathan M Davis wrote:
> It's great that you're working on this. We definitely need more of this sort of stuff. However, I would point out that I'm not sure that it will be acceptable for Phobos (it _may_ be, but it's not quite what we've been looking for). There are two types of lexers/parsers that we've been looking to add to Phobos.
>
> 1. A direct port of the lexer from dmd's C++ front-end. The API would be D-ified to use ranges, but the internals would be almost identical to the C++ front-end, so that porting changes back and forth would be very easy. It would also make it easier to replace the C++ front-end with a D one later, if the D one is very close to the C++ one.
>
> 2. A lexer/parser generator using templates and/or CTFE to generate a lexer/parser for whatever language from a BNF or EBNF grammar - probably similar to what Philippe Sigaud has done to generate a parser using PEGs. Such a generator could be used to produce a lexer/parser for D or any other language.
>
> What you seem to be doing is creating a D-specific parser from scratch, which is great, but it's not quite what we've been looking to add to Phobos. That doesn't necessarily mean that we can't or won't add it to Phobos once it's complete (especially if nothing else gets done first), but it also may not be acceptable, because it's neither #1 nor #2. That would have to be decided once it's done and ready to be reviewed for inclusion in Phobos. I'm not trying to discourage you at all; I'm just pointing out that what you're doing isn't quite what we're looking for. Regardless, it _would_ be interesting to compare implementations of #1 and #2 against what you're doing and see which performs better.
>
> - Jonathan M Davis

Yes, my project has different goals from #1 or #2. The primary need I'm trying to address is making the development of tools for D much easier. Tools are needed for D to become more widely used in practice.
Secondary (a nice-to-have) would be to build a compiler on top of that. I plan to do both, but I don't intend to be backwards-compatible with the DMD back-end. I hope that at least part of my code will either influence or be reused in the reference D compiler (whatever that will be in the future). It is important to have a reference front-end implementation which can be used in many tools; otherwise a lot of effort is wasted and no tool fully complies with the D specification and compilers.
Re: DCT: D compiler as a collection of libraries
On 5/12/12 12:17 PM, Roman D. Boiko wrote:
> On Saturday, 12 May 2012 at 03:32:20 UTC, Ary Manzana wrote:
>> As deadalnix says, I think you are over-complicating things. I mean, to store the column and line information it's just:
>>
>> if (isNewLine(c)) { line++; column = 0; } else { column++; }
>>
>> (I think you need to add that to the SourceRange class, then copy line and column to the token in the Lexer#lex() method.) Do you really think it's that costly in terms of performance?
>>
>> I think you are wasting much more memory and performance by storing all the tokens in the lexer. Imagine I want to implement a simple syntax highlighter: just highlight keywords. How can I tell DCT *not* to store all tokens, given that I only need each one in turn? And since I'll be highlighting in the editor, I will need column and line information. That means I'll have to do that O(log(n)) operation for every token. So you see, for the simplest use case of a lexer the performance of DCT is awful.
>>
>> Now imagine I want to build an AST. Again, I consume the tokens one by one, probably peeking in some cases. If I want to store line and column information, I just copy it to the AST. You say the tokens are discarded but their data is not, and that's why their data is usually copied.
>
> Would it be possible for you to fork my code and tweak it for comparison? You will definitely discover more problems this way, and such feedback would really help me. That doesn't seem to be too much work. Any other volunteers for this?

Sure, I'll do it and provide some benchmarks. Thanks for creating the issue.
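[For reference, a minimal self-contained sketch of the per-character line/column tracking Ary describes above. `Position` and `trackPositions` are illustrative names invented for this example, not part of DCT or SDC:]

```d
// Per-code-unit line/column tracking, as sketched by Ary:
// bump the line on '\n', otherwise bump the column.
struct Position { uint line = 1; uint column = 0; }

Position[] trackPositions(string source)
{
    Position pos;      // current position, updated per code unit
    Position[] result; // position of each character, for illustration
    foreach (char c; source)
    {
        result ~= pos;
        if (c == '\n') { pos.line++; pos.column = 0; }
        else           { pos.column++; }
    }
    return result;
}

unittest
{
    auto p = trackPositions("ab\ncd");
    assert(p[0].line == 1 && p[0].column == 0); // 'a'
    assert(p[3].line == 2 && p[3].column == 0); // 'c' opens line 2
}
```

[A real lexer would update a single `pos` as it advances rather than materializing an array; the array here just makes the behavior easy to check.]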
Re: DCT: D compiler as a collection of libraries
On 11/05/2012 13:50, Ary Manzana wrote:
> On 5/11/12 4:22 PM, Roman D. Boiko wrote:
>>> What about line and column information?
>> Indices of the first code unit of each line are stored inside the lexer, and a function will compute a Location (line number, column number, file specification) for any index. This way the size of a Token instance is reduced to the minimum. It is assumed that a Location can be computed on demand and is not needed frequently, so the column is calculated by a reverse walk to the previous end of line, etc. It will be possible to calculate locations either taking into account special token sequences (e.g., #line 3 ab/c.d) or discarding them.
>
> But then how do you efficiently compute line numbers (if a reverse walk is efficient at all)? Usually tokens are used and discarded. I mean, somebody that uses the lexer asks for tokens, processes them (for example to highlight code or to build an AST) and then discards them, so you can reuse the same Token instance. If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code). So adding line and column information is not like wasting a lot of memory: just 8 more bytes for each token in the freelist.

SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a freelist or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.
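[A rough sketch of the value-type Location approach deadalnix describes for SDC. The field and enum names here are illustrative, not SDC's actual definitions:]

```d
// Location as a plain value type: embedded directly in each token,
// so it is allocated and discarded together with the token itself.
struct Location
{
    string file;
    uint line;
    uint column;
}

enum TokenType { identifier, keyword, literal, eof }

struct Token
{
    TokenType type;
    string text;
    Location location; // by value: no separate allocation, no freelist needed
}

unittest
{
    auto t = Token(TokenType.keyword, "import", Location("main.d", 1, 0));
    assert(t.location.line == 1 && t.location.file == "main.d");
}
```

[Because `Location` lives inside the `Token` value, copying the token copies the location, and discarding the token discards it — exactly the lifetime coupling the post argues for.]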
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 13:24:11 UTC, deadalnix wrote:
> On 11/05/2012 13:50, Ary Manzana wrote:
>> Usually tokens are used and discarded. [...] If you want to peek at the next token, or have a buffer of tokens, you can use a freelist ( http://dlang.org/memory.html#freelists , one of the many nice things I learned by looking at DMD's source code). So adding line and column information is not like wasting a lot of memory: just 8 more bytes for each token in the freelist.
>
> SDC uses a struct Location to store such data. Every token and AST element has a member of type Location. As it is a value type, there is no need for a freelist or anything of that sort. When the token is discarded, the location is discarded too. The same goes for AST nodes.

1. If a struct is a field of a heap-allocated object, it will be allocated and garbage collected. The GC is not used only if the struct exists solely on the stack (i.e., in a method body).

2. If a struct is often allocated and deallocated, a free list would allow skipping its initialization, which is especially important if the struct is large. This is true for both heap- and stack-allocated data.
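[A minimal freelist in the style of the dlang.org article linked above, applied to a hypothetical heap-allocated `Token` class purely for illustration (the names and fields are invented for this sketch):]

```d
// Freelist of discarded tokens: released tokens are chained through
// their own `next` field, so a later allocation can reuse one instead
// of asking the GC for fresh memory.
class Token
{
    Token next; // freelist link, only meaningful while the token is free
    int type;
}

Token freelist; // head of the freelist (null when empty)

Token allocateToken()
{
    if (freelist !is null)
    {
        auto t = freelist;  // reuse a discarded token: no GC allocation,
        freelist = t.next;  // and its fields keep their last values
        return t;
    }
    return new Token;       // freelist empty: fall back to the GC
}

void releaseToken(Token t)
{
    t.next = freelist;      // push the token back onto the freelist
    freelist = t;
}
```

[Note this reuse is exactly why the freelist can skip re-initialization, as point 2 above says — and also why it buys nothing for a small value-type `Location` that is never allocated on its own.]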
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 20:28:34 UTC, Tobias Pankrath wrote:
>> 1. If a struct is a field of a heap-allocated object, it will be allocated and garbage collected. The GC is not used only if the struct exists solely on the stack (i.e., in a method body).
>
> As far as I can tell, it won't be allocated on its own, since it is stored in the garbage-collected object as a value field. So using a freelist would actually increase the overhead: you have to manage the freelist and still do the allocation of the containing object.

Sorry for the confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack-allocated, but that statement was off-topic in the given context. I now consider my first point useless, and the second is not relevant in *my* context (nor in a majority of other contexts), since I don't do many allocations of Location. As is often the case, performance optimization should be performed only when benchmarks show it is needed.
Re: DCT: D compiler as a collection of libraries
On 05/12/2012 11:08 PM, Roman D. Boiko wrote:
> Sorry for the confusion. In the first point I was trying to clarify that using a struct doesn't mean that it is stack-allocated, but that statement was off-topic in the given context. [...] As is often the case, performance optimization should be performed only when benchmarks show it is needed.

Well yes, but it seems to me that you are deliberately complicating the design in a way that makes it less efficient?
Re: DCT: D compiler as a collection of libraries
On Saturday, 12 May 2012 at 22:19:55 UTC, Timon Gehr wrote:
> Well yes, but it seems to me that you are deliberately complicating the design in a way that makes it less efficient?

Do you mean something like what I summarized from Ary's feedback: http://forum.dlang.org/post/pggkobonzmgdwjigy...@forum.dlang.org ? If that is the case, could you please add your concerns to the dedicated issue: https://github.com/roman-d-boiko/dct/issues/15 ? Otherwise, if by over-complicating you meant using a free list, I'd like to clarify that I don't plan to introduce one yet, because I see no reason to use it in my context. Thanks.