Re: [Hsqldb-developers] Re:store procedure -new lexer and parser

Campbell Boucher-Burnet Sat, 24 Jul 2004 03:34:47 -0700

Hi "ender"

Nice to here from you.


Here are some interesting starting points:

SQL2 and Oracle PLSQL in visual, hyperlinked BNF

http://cui.unige.ch/db-research/Enseignement/analyseinfo/SQL92/BNFindex.html

http://cui.unige.ch/db-research/Enseignement/analyseinfo/PLSQL21/BNFindex.html


Open source Java DBMS implementations using parser generators for their SQL
processing:

http://mckoi.com/database/ -- uses JavaCC (has a .jj file you can look at)

http://www.quadcap.com/ -- uses ANTLR (has a .g file you can look at)

/--

Personally, I would advise against ANTLR.

ANTLR is very attractive from the perspective of a power yet simple API and
ease of use (it is more natural to Java and is well documented compared to
JavaCC).

Originally, ANTLR was what I started with when investigating the same
subproject proposal (converting HSQLDB to use a parser generator).

I had high hopes.

And I think ANTLR might be OK for generating a C/C++ Parser, since the
generated C/C++ Lexer/Parser may well perform decently.  Certainly, the .NET
C# generator option is apparently getting good reviews.

But unless things have changed radically over the last couple of years,
ANTLR still has a __long__ way to go before it can provide acceptable
performance.

Indeed, I just finished reading through the latest ANTLR source for an hour
or so, and it does not look like things have changed in this department.

For example, here is the kind of production often produced:

else if
((LA(1)==EOF||LA(1)==LITERAL_create||LA(1)==LITERAL_unique||LA(1)==LITERAL_o
n||LA(1)==RPAREN||LA(1)==LITERAL_constraint||LA(1)==LITERAL_primary||LA(1)==
LITERAL_check||LA(1)==LITERAL_first||LA(1)==LITERAL_after||LA(1)==LITERAL_de
fault||LA(1)==LITERAL_as||LA(1)==LITERAL_with||LA(1)==COMMA||LA(1)==LITERAL_
where||LA(1)==LITERAL_from||LA(1)==LITERAL_grant||LA(1)==LITERAL_references|
|LA(1)==LITERAL_not||LA(1)==LITERAL_full||LA(1)==LITERAL_initially||LA(1)==L
ITERAL_deferrable||LA(1)==LITERAL_for||LA(1)==LITERAL_order||LA(1)==LITERAL_
having||LA(1)==LITERAL_using||LA(1)==LITERAL_group||LA(1)==CONCAT||LA(1)==LI
TERAL_or||LA(1)==LITERAL_and||LA(1)==LITERAL_when||LA(1)==LITERAL_then||LA(1
)==LITERAL_else||LA(1)==LITERAL_end||LA(1)==LITERAL_left||LA(1)==LITERAL_rig
ht||LA(1)==LITERAL_cross||LA(1)==LITERAL_join||LA(1)==LITERAL_natural||LA(1)
==LITERAL_union||LA(1)==LITERAL_inner||LA(1)==LITERAL_intersect||LA(1)==LITE
RAL_except||LA(1)==ID) &&
(LA(2)==EOF||LA(2)==LITERAL_create||LA(2)==LITERAL_unique||LA(2)==LITERAL_in
dex||LA(2)==LITERAL_temporary||LA(2)==LITERAL_table||LA(2)==LITERAL_view||LA
(2)==LITERAL_on||LA(2)==LPAREN||LA(2)==RPAREN||LA(2)==LITERAL_constraint||LA
(2)==LITERAL_primary||LA(2)==LITERAL_foreign||LA(2)==LITERAL_check||LA(2)==L
ITERAL_first||LA(2)==LITERAL_after||LA(2)==LITERAL_default||LA(2)==LITERAL_a
s||LA(2)==LITERAL_with||LA(2)==LITERAL_cascaded||LA(2)==LITERAL_update||LA(2
)==COMMA||LA(2)==LITERAL_where||LA(2)==EQ||LA(2)==LITERAL_delete||LA(2)==LIT
ERAL_from||LA(2)==LITERAL_global||LA(2)==LITERAL_local||LA(2)==LITERAL_grant
||LA(2)==LITERAL_all||LA(2)==LITERAL_select||LA(2)==LITERAL_insert||LA(2)==L
ITERAL_references||LA(2)==LITERAL_usage||LA(2)==LITERAL_not||LA(2)==LITERAL_
null||LA(2)==LITERAL_key||LA(2)==LITERAL_identity||LA(2)==LITERAL_full||LA(2
)==LITERAL_initially||LA(2)==LITERAL_deferred||LA(2)==LITERAL_immediate||LA(
2)==LITERAL_deferrable||LA(2)==LITERAL_bit||LA(2)==LITERAL_varbinary||LA(2)=
=LITERAL_binary||LA(2)==LITERAL_blob||LA(2)==LITERAL_clob||LA(2)==LITERAL_ch
aracter||LA(2)==LITERAL_char||LA(2)==LITERAL_varchar||LA(2)==LITERAL_int||LA
(2)==LITERAL_integer||LA(2)==LITERAL_smallint||LA(2)==LITERAL_tinyint||LA(2)
==LITERAL_bigint||LA(2)==LITERAL_dec||LA(2)==LITERAL_decimal||LA(2)==LITERAL
_numeric||LA(2)==LITERAL_real||LA(2)==LITERAL_double||LA(2)==LITERAL_float||
LA(2)==LITERAL_boolean||LA(2)==LITERAL_date||LA(2)==LITERAL_time||LA(2)==LIT
ERAL_timestamp||LA(2)==LITERAL_interval||LA(2)==LITERAL_for||LA(2)==LITERAL_
read||LA(2)==LITERAL_order||LA(2)==LITERAL_by||LA(2)==STAR||LA(2)==LITERAL_h
aving||LA(2)==LITERAL_corresponding||LA(2)==LITERAL_values||LA(2)==LITERAL_u
sing||LA(2)==LITERAL_group||LA(2)==CONCAT||LA(2)==LITERAL_or||LA(2)==LITERAL
_and||LA(2)==LITERAL_is||LA(2)==LITERAL_like||LA(2)==LITERAL_between||LA(2)=
=LITERAL_in||LA(2)==LITERAL_exists||LA(2)==LITERAL_escape||LA(2)==120||LA(2)
==PLUS||LA(2)==MINUS||LA(2)==STRING_LITERAL||LA(2)==QUESTION||LA(2)==INT||LA
(2)==REAL||LA(2)==BINSTR||LA(2)==HEXSTR||LA(2)==LITERAL_case||LA(2)==LITERAL
_when||LA(2)==LITERAL_then||LA(2)==LITERAL_else||LA(2)==LITERAL_end||LA(2)==
LITERAL_hour||LA(2)==LITERAL_left||LA(2)==LITERAL_minute||LA(2)==LITERAL_mon
th||LA(2)==LITERAL_right||LA(2)==LITERAL_second||LA(2)==LITERAL_year||LA(2)=
=LITERAL_user||LA(2)==LITERAL_current_user||LA(2)==LITERAL_session_user||LA(
2)==LITERAL_system_user||LA(2)==LITERAL_current_date||LA(2)==LITERAL_current
_time||LA(2)==LITERAL_current_timestamp||LA(2)==LITERAL_sql_tsi_frac_second|
|LA(2)==LITERAL_sql_tsi_second||LA(2)==LITERAL_sql_tsi_minute||LA(2)==LITERA
L_sql_tsi_hour||LA(2)==LITERAL_sql_tsi_day||LA(2)==LITERAL_sql_tsi_week||LA(
2)==LITERAL_sql_tsi_month||LA(2)==LITERAL_sql_tsi_quarter||LA(2)==LITERAL_sq
l_tsi_year||LA(2)==LITERAL_cast||LA(2)==LITERAL_true||LA(2)==LITERAL_false||
LA(2)==LITERAL_avg||LA(2)==LITERAL_min||LA(2)==LITERAL_max||LA(2)==LITERAL_s
um||LA(2)==LITERAL_count||LA(2)==LT||LA(2)==LE||LA(2)==GT||LA(2)==GE||LA(2)=
=NE||LA(2)==SLASH||LA(2)==LITERAL_cross||LA(2)==LITERAL_join||LA(2)==LITERAL
_natural||LA(2)==LITERAL_union||LA(2)==LITERAL_inner||LA(2)==LITERAL_outer||
LA(2)==LITERAL_intersect||LA(2)==LITERAL_except||LA(2)==ID)) {
  }

While this is awful in and of itself (close to 100 method calls with high
overhead, when it would do jst as well to generate a simple patten like: t1
= LA(1); switch(t1) { case: ... }

What doesn't show here is the chain of calls inside LA(1) and LA(2), which
have a combined depth/fanout factor of about 5:

LA(i) method of QED SQLParser is an abstract method of antler.Parser, which
resolves to:

antlr.LLkParser # public int LA(int i) throws TokenStreamException {
        return inputState.input.LA(i);
}

Where inputState is a ParserSharedInputState and its input field  is an
antlr.TokenBuffer.

TokenBuffer.LA(i) looks like:

    public final int LA(int i) throws TokenStreamException {
        fill(i);
        return queue.elementAt(markerOffset + i - 1).type;
    }

fill(i) looks like:

    /** Ensure that the token buffer is sufficiently full */
    private final void fill(int amount) throws TokenStreamException {
        syncConsume();
        // Fill the buffer sufficiently to hold needed tokens
        while (queue.nbrEntries < amount + markerOffset) {
            // Append the next token
            queue.append(input.nextToken());
        }
    }

and queue is an antlr.TokenQueue, with elementAt looking like:

    /** Fetch a token from the queue by index
     * @param idx The index of the token to fetch, where zero is the token
at the front of the queue
     */
    public final Token elementAt(int idx) {
        return buffer[(offset + idx) & sizeLessOne];
    }

So, although it all looks "nice" and readable in the generated SQLParser
file, the call graph underlying the generated code is still __terribly__
inefficient.

I can't say accurately what the performance metrics are for the latest
release (ANTLR 2.7.4), but I did some metrics the last time I was working on
this (and gave up, BTW, based on the resulting metrics):

After implementing portions of the SQL92 BNF, as pointed to above, in
frustration (with performance and also subtle ANTLR weaknesses and difficult
to fix "unresolvable grammar ambiguity" reports) I finally decided to write
some absolutely trivial grammars and compare the performance to the existing
HSQLDB engine (1.6/1.7 at the time).

One grammar was something absolutely ridiculous like:

sql : sql_stmt
sql_stmt : "insert" "into" "test" "values" "(" "1" ")" ";"

I found that with an empty implementation (the generated Lexer/Parser did
nothing except process the input...no hookup to the HSQLDB engine), the
performance was between 10 to 100 times slower than actually fully
processing the same statement with HSQLDB (i.e. tokenizing, parsing,
resolving elements to database objects, checking authorization, handling
transactional aspect, performing operation including index, cache, trigger,
integrity constraints and log operations).

Basically, IIRC, it was in the 10X slower range when compared to talking
with a Server (TCP/IP) connection doing the operation against an HSQLDB
CACHED table, while the 100X figure was compared to talking with an
in-process connection doing the operation against a TEMP table.

Based on:

1.) the metrics,
2.) my experience with the kind of acrobatics required to quel the parser
generator into accepting workably complex grammars,
3.) the fact that a generated ANTLR Lexer/Parser code is generally huge and
also depends on an ANTLR runtime jar,

I dedided to permanently shelve my efforts.

My justification at the time was:  what's the point of spending a large
amount of time and effort to basically no more than reproduce the existing
HSQLDB Tokenizer/Parser functionality, resuling in 10/100 times slower
performance, while adding another 2-300K to the package, as well as
introdudcing external jar dependencies (that may require a JDK higher that
the minimum we stive to support).

At the time, my evaluation was that the same time and effort could be spent
much more profitably on simplifying (modularizing) and refining/extending
the existing work, with far greater effect.

I did some work with JavaCC at the time as well.  IIRC, the performance
metrics were more acceptable, and the generated files are standalone, both
pluses.

However, the work was not as well documented and the API did not seem to be
as straight forward.

It's been too long to comment properly on the more detailed/complex issues,
such as how much work to overcome certain tricky grammar ambiguity issues.

Hoping any of this is in the least useful,
Campbell


----- Original Message ----- 
From: "刘欣" <[EMAIL PROTECTED]>
To: "hsqldb-developers" <[EMAIL PROTECTED]>
Sent: 23 Jul, 2004 8:06 AM
Subject: [Hsqldb-developers] Re:store procedure -new lexer and parser


> hi,all
>    i'm liuxin, you can call me "ender", I lived in beijing,china. I'm glad
to join hsqldb team
> and hope to give some contriutions to hsqldb.
>    I started to study source code yesterday, especially
jdbcConnection.java, jdbcStatement.java,Database.java,
Parser.java,DatabaseCommandInterpreter.java... .
>    I found a parser was build handly,not generated by machine ,it's very
fast for parsing and executing sql, but the code is not easy to read and not
very clear( maybe it's my problem :-) ), Maintenance would be annoying when
store procedures are joined ,as fredt said, a grammar based approach is
needed, I'm experienced in JavaCC, the following days i'm going to learn
JFlex and Antler, try to find which is the fastest lexer.
>    I'd like to do some work on grammar and tests.
>
>
>
> ender
>
>
>
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by BEA Weblogic Workshop
> FREE Java Enterprise J2EE developer tools!
> Get your free copy of BEA WebLogic Workshop 8.1 today.
> http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
> _______________________________________________
> hsqldb-developers mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/hsqldb-developers
>




-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop
FREE Java Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_idG21&alloc_id040&op=click
_______________________________________________
hsqldb-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/hsqldb-developers

Re: [Hsqldb-developers] Re:store procedure -new lexer and parser

Reply via email to