On 9 May 2009, at 00:35, Michael Widenius wrote:
Hi!
I fixed the maria-developers address so that everyone else can follow
and participate in the discussion.
jim> jim> Monty,
jim> jim>
jim> jim> My T-SQL parser is now complete and a number of people are
using
jim> jim> the Java, C# and C version. Would you like to have a look
at it?
jim>
jim> Monty> Yes, I would be intrested to know more about this!
jim>
jim> OK - I have my demo online now. If you visit http://www.temporal-wave.com
jim>
jim> and take the Online Demos->T-SQL menu option you can enter T-SQL
jim> batches and see the parse results. Over the next day or so, I am
jim> adding demos of my C# and VB.Net parsers, so if the site goes
up and
jim> down then just wait a few minutes and try again :-) Once I have
jim> evaluated some of the things you mention below then I will let
you
jim> have a scan at the source code.
ok. I will take a look as soon as I have got through the emails that
has batched up during the last 3 weeks.
jim> jim> I think that I could steal a lot of common syntax from
this for a
jim> new MySQL parser. There are 1100 regression tests culled
from the
jim> T-SQL language spec and user examples; on my Linux system
the parse
jim> time, including building and walking the AST is:
jim> jim>
jim> jim> jimi(50_64)-clean: wc -l *.sql
jim> jim> ...
jim> jim> 12923 total
jim> jim>
jim> jim> jimi(50_64)-clean: time tsqlc . >out
jim> jim> 0.15user 0.19system 0:00.34elapsed 100%CPU (0avgtext
+0avgdata
jim> 0maxresident)k
jim> jim> 0inputs+232outputs (0major+102772minor)pagefaults 0swaps
jim>
jim> jim> Which includes the time taken to read each of the 1100
files from
jim> the file system and to print a message for each saying the
parse was
jim> successful of course.
jim> jim> Anyway, if you want to take a look at the code, just
let me know :-)
jim>
jim> Monty> The big question is how much work it would be to
jim> replace the MySQLparser with your T-SQL parser.
jim> Indeed. Obviously syntactical changes are just slog really and
jim> most of that is ensuring that there is enough coverage in
regression
jim> tests to exercise all the syntactic differences between T-SQL and
jim> MySql, or rather, all the syntax of MySQL. A lot of commonality
but
jim> still a fair bit of work. What is the definitive language
reference
jim> for MySQL Marina? I can locate reference manuals and ANSI specs
of
jim> course, but if you have one particular document in mind, then
it would
jim> be useful to work off that so I can evaluate the syntactical
changes I
jim> would need to make.
Unfortunately there is no other spec than the sql_yacc.yy file that
comes with the MySQL source code.
I have a tool somewhere which can build .dot files from the .yy files
in the same way how ANTLRworks draws its syntax diagrams.... Copies of
the tool exist somewhere on the old MySQL internal repositories but I
am sure I have a backup copy somewhere.
jim>
jim> I would like to do this in two steps:
jim>
jim> a) Replace the mysql_yacc.yy file, with as few changes as
possible in
jim> the rest of the MySQL source.
jim>
jim>
jim> ANTLR is essentially the same gestalt as bison: one or more
source files define the grammar and contain actions in C, triggered
at the parse points; These grammar definition files are converted
into C source by invoking an external program (in the case of ANTLR
it is a Java program) and the resulting .c/.h files are compiled
with the rest of the code.
jim>
jim> I wrote the C output to be machine independent. The idea being
that you can generate the .c files on one machine, check that source
code in and use it on any other machine without re-generating. This
can be useful when targeting systems without Java machines (or ones
that don't work very well). So, like many other systems you can have
a 'build completely from scratch' option and a 'use pre-generated
files option'.
jim>
jim> In the main then, I suspect that the bigger changes to the
MySQL base would be build changes. I say this based upon the
assumption that any new parser should build the same AST for a query
as the yacc based version does. This means the new parser makes the
same calls as the old one and thus the code external to the parser
would probably require little, if anything in the way of change (but
see b below).
That sounds very good.
Not possible to build the same AST as the yacc/bison based parser
because there is no AST.
The execution structures are directly built during the parsing phase.
jim> b) Optimze MySQL to be even faster, with the new parser at
a base.
jim>
jim> Obviously there are two aspects, being the speed of the parser
itself, which I don't think will be an issue, and then what kind of
transformations are done on the resulting AST for optimization
purposes. If we use exactly the same AST as MySQL does now, then
improving MySQL performance is the same problem domain as it is now.
However, perhaps there is opportunity for a new parser to provide
more information about the query that would aid optimization.
Possibilities here include modifications to the AST, external
information structures that current code can use when walking the
AST, or even a completely new AST. The latter would probably have
quite an effect on the front end of the parse->AST->execution phase
of course, so I would think that the first step is to try to achieve
your step a with the existing AST as output.
Agree
You will have to implement an all-new AST which then can be used to
'compile' the execution structures.
(this was a minor sticking part of an earlier attempt to replace the
parser - there were objections to building an AST and adding an
additional stage to the execution)
jim> Of course maintaining and changing an ANTLR based grammar is
somewhat easier than changing bison/yacc, if for no other reason
than ANTLR doesn't need much in the way of code assists for correct
parsing, whereas Bison quite often does.
jim> The a) option is quite important to be able to do easy
merges with the
jim> main MySQL tree.
jim>
jim> I have time this week to take a look at this. Mechanically I
know it is easy enough, but I need to look at the calls that are
made from the current yacc/bison parser and see if there is anything
in there that would be awkward. The only things I could think of
would be if the existing code relies on internal structures of bison
rather than more generic structures. That should be easy to define.
The MySQL code doesn't rely on bison structures at all (as far as I
know).
AFAIK, MySQL does not use any Bison internals as such. A past project
from many years ago, I put into MySQL a PCCTS (pre-ANTLR) LL parser,
basically keeping MySQL's custom lexer but it generated Token objects
instead suitable for the PCCTS generated parser. The project was never
completed because integration of Stored Procedures became the
objective for MySQL 5.0 instead of implementation of a new parser.
I have asked Stewart Smith to see if he can help make public that old
source repository as a Bazaar repository as people may be able to
learn from it how to replace the parser without major pain.
Did you get time to do the analyse?
jim>
jim> A few other things that may be of note:
jim>
jim> 1) The ANTLR generated code is free threading so long as any
action code that you place withing the parser is also free threading.
In other words, same as with bison
The ANTLR runtime does, of course, require Java. So developers who
wish to extend the grammar do have to have the Java Runtime installed
as well as the ANTLR jars.
jim> 2) The only system calls it uses is {m|c|re}alloc/free, which
the C runtime component can re-direct (say to the MySQL overrides).
For completely free threading, those calls also need to be free
threading and so they are the only limit to threading of course.
Sounds perfect.
I would *really* like to see a new parser for MariaDB. Anything that
we can do to help get this done ?
I too am interested to help if at all possible.
Regards,
Antony,
_______________________________________________
Mailing list: https://launchpad.net/~maria-developers
Post to : [email protected]
Unsubscribe : https://launchpad.net/~maria-developers
More help : https://help.launchpad.net/ListHelp