Hi "ender" Nice to here from you.
Here are some interesting starting points: SQL2 and Oracle PLSQL in visual, hyperlinked BNF http://cui.unige.ch/db-research/Enseignement/analyseinfo/SQL92/BNFindex.html http://cui.unige.ch/db-research/Enseignement/analyseinfo/PLSQL21/BNFindex.html Open source Java DBMS implementations using parser generators for their SQL processing: http://mckoi.com/database/ -- uses JavaCC (has a .jj file you can look at) http://www.quadcap.com/ -- uses ANTLR (has a .g file you can look at) /-- Personally, I would advise against ANTLR. ANTLR is very attractive from the perspective of a power yet simple API and ease of use (it is more natural to Java and is well documented compared to JavaCC). Originally, ANTLR was what I started with when investigating the same subproject proposal (converting HSQLDB to use a parser generator). I had high hopes. And I think ANTLR might be OK for generating a C/C++ Parser, since the generated C/C++ Lexer/Parser may well perform decently. Certainly, the .NET C# generator option is apparently getting good reviews. But unless things have changed radically over the last couple of years, ANTLR still has a __long__ way to go before it can provide acceptable performance. Indeed, I just finished reading through the latest ANTLR source for an hour or so, and it does not look like things have changed in this department. For example, here is the kind of production often produced: else if ((LA(1)==EOF||LA(1)==LITERAL_create||LA(1)==LITERAL_unique||LA(1)==LITERAL_o n||LA(1)==RPAREN||LA(1)==LITERAL_constraint||LA(1)==LITERAL_primary||LA(1)== LITERAL_check||LA(1)==LITERAL_first||LA(1)==LITERAL_after||LA(1)==LITERAL_de fault||LA(1)==LITERAL_as||LA(1)==LITERAL_with||LA(1)==COMMA||LA(1)==LITERAL_ where||LA(1)==LITERAL_from||LA(1)==LITERAL_grant||LA(1)==LITERAL_references| |LA(1)==LITERAL_not||LA(1)==LITERAL_full||LA(1)==LITERAL_initially||LA(1)==L ITERAL_deferrable||LA(1)==LITERAL_for||LA(1)==LITERAL_order||LA(1)==LITERAL_ having||LA(1)==LITERAL_using||LA(1)==LITERAL_group||LA(1)==CONCAT||LA(1)==LI TERAL_or||LA(1)==LITERAL_and||LA(1)==LITERAL_when||LA(1)==LITERAL_then||LA(1 )==LITERAL_else||LA(1)==LITERAL_end||LA(1)==LITERAL_left||LA(1)==LITERAL_rig ht||LA(1)==LITERAL_cross||LA(1)==LITERAL_join||LA(1)==LITERAL_natural||LA(1) ==LITERAL_union||LA(1)==LITERAL_inner||LA(1)==LITERAL_intersect||LA(1)==LITE RAL_except||LA(1)==ID) && (LA(2)==EOF||LA(2)==LITERAL_create||LA(2)==LITERAL_unique||LA(2)==LITERAL_in dex||LA(2)==LITERAL_temporary||LA(2)==LITERAL_table||LA(2)==LITERAL_view||LA (2)==LITERAL_on||LA(2)==LPAREN||LA(2)==RPAREN||LA(2)==LITERAL_constraint||LA (2)==LITERAL_primary||LA(2)==LITERAL_foreign||LA(2)==LITERAL_check||LA(2)==L ITERAL_first||LA(2)==LITERAL_after||LA(2)==LITERAL_default||LA(2)==LITERAL_a s||LA(2)==LITERAL_with||LA(2)==LITERAL_cascaded||LA(2)==LITERAL_update||LA(2 )==COMMA||LA(2)==LITERAL_where||LA(2)==EQ||LA(2)==LITERAL_delete||LA(2)==LIT ERAL_from||LA(2)==LITERAL_global||LA(2)==LITERAL_local||LA(2)==LITERAL_grant ||LA(2)==LITERAL_all||LA(2)==LITERAL_select||LA(2)==LITERAL_insert||LA(2)==L ITERAL_references||LA(2)==LITERAL_usage||LA(2)==LITERAL_not||LA(2)==LITERAL_ null||LA(2)==LITERAL_key||LA(2)==LITERAL_identity||LA(2)==LITERAL_full||LA(2 )==LITERAL_initially||LA(2)==LITERAL_deferred||LA(2)==LITERAL_immediate||LA( 2)==LITERAL_deferrable||LA(2)==LITERAL_bit||LA(2)==LITERAL_varbinary||LA(2)= =LITERAL_binary||LA(2)==LITERAL_blob||LA(2)==LITERAL_clob||LA(2)==LITERAL_ch aracter||LA(2)==LITERAL_char||LA(2)==LITERAL_varchar||LA(2)==LITERAL_int||LA (2)==LITERAL_integer||LA(2)==LITERAL_smallint||LA(2)==LITERAL_tinyint||LA(2) ==LITERAL_bigint||LA(2)==LITERAL_dec||LA(2)==LITERAL_decimal||LA(2)==LITERAL _numeric||LA(2)==LITERAL_real||LA(2)==LITERAL_double||LA(2)==LITERAL_float|| LA(2)==LITERAL_boolean||LA(2)==LITERAL_date||LA(2)==LITERAL_time||LA(2)==LIT ERAL_timestamp||LA(2)==LITERAL_interval||LA(2)==LITERAL_for||LA(2)==LITERAL_ read||LA(2)==LITERAL_order||LA(2)==LITERAL_by||LA(2)==STAR||LA(2)==LITERAL_h aving||LA(2)==LITERAL_corresponding||LA(2)==LITERAL_values||LA(2)==LITERAL_u sing||LA(2)==LITERAL_group||LA(2)==CONCAT||LA(2)==LITERAL_or||LA(2)==LITERAL _and||LA(2)==LITERAL_is||LA(2)==LITERAL_like||LA(2)==LITERAL_between||LA(2)= =LITERAL_in||LA(2)==LITERAL_exists||LA(2)==LITERAL_escape||LA(2)==120||LA(2) ==PLUS||LA(2)==MINUS||LA(2)==STRING_LITERAL||LA(2)==QUESTION||LA(2)==INT||LA (2)==REAL||LA(2)==BINSTR||LA(2)==HEXSTR||LA(2)==LITERAL_case||LA(2)==LITERAL _when||LA(2)==LITERAL_then||LA(2)==LITERAL_else||LA(2)==LITERAL_end||LA(2)== LITERAL_hour||LA(2)==LITERAL_left||LA(2)==LITERAL_minute||LA(2)==LITERAL_mon th||LA(2)==LITERAL_right||LA(2)==LITERAL_second||LA(2)==LITERAL_year||LA(2)= =LITERAL_user||LA(2)==LITERAL_current_user||LA(2)==LITERAL_session_user||LA( 2)==LITERAL_system_user||LA(2)==LITERAL_current_date||LA(2)==LITERAL_current _time||LA(2)==LITERAL_current_timestamp||LA(2)==LITERAL_sql_tsi_frac_second| |LA(2)==LITERAL_sql_tsi_second||LA(2)==LITERAL_sql_tsi_minute||LA(2)==LITERA L_sql_tsi_hour||LA(2)==LITERAL_sql_tsi_day||LA(2)==LITERAL_sql_tsi_week||LA( 2)==LITERAL_sql_tsi_month||LA(2)==LITERAL_sql_tsi_quarter||LA(2)==LITERAL_sq l_tsi_year||LA(2)==LITERAL_cast||LA(2)==LITERAL_true||LA(2)==LITERAL_false|| LA(2)==LITERAL_avg||LA(2)==LITERAL_min||LA(2)==LITERAL_max||LA(2)==LITERAL_s um||LA(2)==LITERAL_count||LA(2)==LT||LA(2)==LE||LA(2)==GT||LA(2)==GE||LA(2)= =NE||LA(2)==SLASH||LA(2)==LITERAL_cross||LA(2)==LITERAL_join||LA(2)==LITERAL _natural||LA(2)==LITERAL_union||LA(2)==LITERAL_inner||LA(2)==LITERAL_outer|| LA(2)==LITERAL_intersect||LA(2)==LITERAL_except||LA(2)==ID)) { } While this is awful in and of itself (close to 100 method calls with high overhead, when it would do jst as well to generate a simple patten like: t1 = LA(1); switch(t1) { case: ... } What doesn't show here is the chain of calls inside LA(1) and LA(2), which have a combined depth/fanout factor of about 5: LA(i) method of QED SQLParser is an abstract method of antler.Parser, which resolves to: antlr.LLkParser # public int LA(int i) throws TokenStreamException { return inputState.input.LA(i); } Where inputState is a ParserSharedInputState and its input field is an antlr.TokenBuffer. TokenBuffer.LA(i) looks like: public final int LA(int i) throws TokenStreamException { fill(i); return queue.elementAt(markerOffset + i - 1).type; } fill(i) looks like: /** Ensure that the token buffer is sufficiently full */ private final void fill(int amount) throws TokenStreamException { syncConsume(); // Fill the buffer sufficiently to hold needed tokens while (queue.nbrEntries < amount + markerOffset) { // Append the next token queue.append(input.nextToken()); } } and queue is an antlr.TokenQueue, with elementAt looking like: /** Fetch a token from the queue by index * @param idx The index of the token to fetch, where zero is the token at the front of the queue */ public final Token elementAt(int idx) { return buffer[(offset + idx) & sizeLessOne]; } So, although it all looks "nice" and readable in the generated SQLParser file, the call graph underlying the generated code is still __terribly__ inefficient. I can't say accurately what the performance metrics are for the latest release (ANTLR 2.7.4), but I did some metrics the last time I was working on this (and gave up, BTW, based on the resulting metrics): After implementing portions of the SQL92 BNF, as pointed to above, in frustration (with performance and also subtle ANTLR weaknesses and difficult to fix "unresolvable grammar ambiguity" reports) I finally decided to write some absolutely trivial grammars and compare the performance to the existing HSQLDB engine (1.6/1.7 at the time). One grammar was something absolutely ridiculous like: sql : sql_stmt sql_stmt : "insert" "into" "test" "values" "(" "1" ")" ";" I found that with an empty implementation (the generated Lexer/Parser did nothing except process the input...no hookup to the HSQLDB engine), the performance was between 10 to 100 times slower than actually fully processing the same statement with HSQLDB (i.e. tokenizing, parsing, resolving elements to database objects, checking authorization, handling transactional aspect, performing operation including index, cache, trigger, integrity constraints and log operations). Basically, IIRC, it was in the 10X slower range when compared to talking with a Server (TCP/IP) connection doing the operation against an HSQLDB CACHED table, while the 100X figure was compared to talking with an in-process connection doing the operation against a TEMP table. Based on: 1.) the metrics, 2.) my experience with the kind of acrobatics required to quel the parser generator into accepting workably complex grammars, 3.) the fact that a generated ANTLR Lexer/Parser code is generally huge and also depends on an ANTLR runtime jar, I dedided to permanently shelve my efforts. My justification at the time was: what's the point of spending a large amount of time and effort to basically no more than reproduce the existing HSQLDB Tokenizer/Parser functionality, resuling in 10/100 times slower performance, while adding another 2-300K to the package, as well as introdudcing external jar dependencies (that may require a JDK higher that the minimum we stive to support). At the time, my evaluation was that the same time and effort could be spent much more profitably on simplifying (modularizing) and refining/extending the existing work, with far greater effect. I did some work with JavaCC at the time as well. IIRC, the performance metrics were more acceptable, and the generated files are standalone, both pluses. However, the work was not as well documented and the API did not seem to be as straight forward. It's been too long to comment properly on the more detailed/complex issues, such as how much work to overcome certain tricky grammar ambiguity issues. Hoping any of this is in the least useful, Campbell ----- Original Message ----- From: "刘欣" <[EMAIL PROTECTED]> To: "hsqldb-developers" <[EMAIL PROTECTED]> Sent: 23 Jul, 2004 8:06 AM Subject: [Hsqldb-developers] Re:store procedure -new lexer and parser > hi,all > i'm liuxin, you can call me "ender", I lived in beijing,china. I'm glad to join hsqldb team > and hope to give some contriutions to hsqldb. > I started to study source code yesterday, especially jdbcConnection.java, jdbcStatement.java,Database.java, Parser.java,DatabaseCommandInterpreter.java... . > I found a parser was build handly,not generated by machine ,it's very fast for parsing and executing sql, but the code is not easy to read and not very clear( maybe it's my problem :-) ), Maintenance would be annoying when store procedures are joined ,as fredt said, a grammar based approach is needed, I'm experienced in JavaCC, the following days i'm going to learn JFlex and Antler, try to find which is the fastest lexer. > I'd like to do some work on grammar and tests. > > > > ender > > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by BEA Weblogic Workshop > FREE Java Enterprise J2EE developer tools! > Get your free copy of BEA WebLogic Workshop 8.1 today. > http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click > _______________________________________________ > hsqldb-developers mailing list > [EMAIL PROTECTED] > https://lists.sourceforge.net/lists/listinfo/hsqldb-developers > ------------------------------------------------------- This SF.Net email is sponsored by BEA Weblogic Workshop FREE Java Enterprise J2EE developer tools! Get your free copy of BEA WebLogic Workshop 8.1 today. http://ads.osdn.com/?ad_idG21&alloc_id040&op=click _______________________________________________ hsqldb-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/hsqldb-developers