xinyiZzz opened a new pull request, #30817: URL: https://github.com/apache/doris/pull/30817
## Proposed changes # 1. Motivation PL-SQL (Stored procedure) is a collection of sql, which is defined and used similarly to functions. It supports conditional judgments, loops and other control statements, supports cursor processing of result sets, and can write business logic in SQL. Hive uses Hplsql to support PL-SQL and is largely compatible with Oracle, Impala, MySQL, Redshift, PostgreSQL, DB2, etc. We support PL-SQL in Doris based on Hplsql to achieve compatibility with Stored procedures of database systems such as Oracle and PostgreSQL. Reference documentation: http://mail.hplsql.org https://docs.oracle.com/en/database/oracle/oracle-database/21/lnpls/plsql-language-fundamentals.html#GUID-640DB3AA-15AF-4825-BD6C-1D4EB5AB7715 # 2. Implementation Take the following case as an example to explain the execution process of the stored procedure. It supports two execution methods: connecting to FE using the Mysql protocol and using scripts. ``` CREATE OR REPLACE PROCEDURE A(IN name STRING, OUT result int) select count(*) from test; select count(*) into result from test where k = name; END declare result INT default = 0; call A(‘xxx’, result); print result; ``` ## 2.1 Use Mysql protocol connect FE to execute  1. Add procedure and persist the Procedure Name and Source (raw SQL) into Doris FE metadata. 2. Call procedure, extract the actual parameter Value and Procedure Name in Call Stmt. Use Procedure Name to find the Source in the metadata, extract the Name and Type of the Procedure parameter, and match them with the actual parameter Value to form a complete variable <Name, Type, Value>. 3. Execute Doris Statement - Use Doris Logical Plan Builder to parse the Doris Statement syntax in Source, replace parameter variables, remove the into variable clause, and generate a Plan Tree that conforms to Doris syntax. - Use stmtExecutor to execute SQL and encapsulate the query result set iterator into QueryResult. - Output the query results to Mysql Channel, or write them into Cursor, parameters, and variables. - Stored Programs compatible with Mysql protocol support multiple statements. 4. Execute PL-SQL Statement - Use Plsql Logical Plan Builder to parse and execute PL-SQL Statement syntax in Source, including Loop, Cursor, IF, Declare, etc., and basically reuse HplSQL. ## 2.2 Script execution Start reading and parsing SQL from the script, connect Doris FE to send Doris Statement through JDBC, process the returned Doris query result set locally, and return the final result. Instructions: 1. Modify the jdbc path and username and password to connect to doris in the fe/conf/plsql-site.xml configuration file: ``` <property> <name>plsql.conn.dorisconn</name> <value>com.mysql.cj.jdbc.Driver;jdbc:mysql://fehost:queryport/database;user;password</value> <description>Doris jdbc connection</description> </property> ``` 2. Execute stored procedure sql through script: ``` sh fe/bin/plsql.sh -f stored_procedure.sql ``` Implementation process:  It is similar to using the Mysql protocol to connect the FE execution process. The difference is that JDBC is used to execute the Doris Statement, and the created Stored procedure will not be persisted to the Doris FE metadata, but will be temporarily saved locally. # 3. TODO 1. Doris Parser supports declare variable. 2. Select Statement supports insert into variable. 3. Parameters and fields have the same name. 4. If Cursor exits halfway, will there be a memory leak? 5. Use getOriginSql(ctx) in syntax parsing LogicalPlanBuilder to obtain the original SQL. Is there any problem with special characters? 6. Supports complex types such as Map and Struct. 7. Test syntax such as Package. 8. Support UDF. 9. Is it compatible with Oracle PL-SQL? # 4. Some questions 1. JDBC does not support the execution of stored procedures that return results. You can only Into the execution results into a variable or write them into a table, because when multiple result sets are returned, JDBC needs to use the prepareCall statement to execute, otherwise the Statemnt of the returned result executes Finalize. Send EOF Packet will report an error; 2. Use PL-SQL Cursor to open multiple Query result set iterators at the same time. Doris BE will cache the intermediate status of these Queries (such as HashTable) and query results until the Query result set iteration is completed. If the Cursor is not available for a long time Being used will result in a lot of memory waste. 3. The current PL-SQL Statement is calculated in Doris FE, such as IF, Loop, Cursor, etc. This may affect other services on Doris FE and is limited by the performance of Doris FE. The creation of stored procedures will Forward Master FE, and execution can In the Observer, in the short term, consider expanding the number of FE Observer nodes to execute PL-SQL Statement. A better solution is to move it to Doris BE for execution. 4. The format of the result returned by Doris Statement is ```xxxx\n, xxxx\n, 2 rows affected (0.03 sec)```. PL-SQL uses Print to print variable values in an unformatted format, and JDBC cannot easily obtain them. Real results. # 5. Some thoughts The above execution of Doris Statement reuses Doris Logical Plan Builder for syntax parsing, parses it from top to bottom into a Plan Tree, and calls stmtExecutor for execution. PL-SQL replacement variables, removal of Into Variable and other operations are coupled in Doris syntax parsing. The advantage is that it is easier to It can be compatible with Doris grammar with a few changes, but the disadvantage is that it will invade the Doris grammar parsing process. HplSQL performs a syntax parsing independently of Hive to implement variable substitution and other operations, and finally outputs a SQL that conforms to Hive syntax. The following is a simple syntax parsing process for select, where, expression, table name, join, The parsing of agg, order and other grammars must be re-implemented. The advantage is that it is completely independent from the original system, but the changes are too complicated.  *** # 1. 动机 PL-SQL(Stored procedure)是一组sql的集合,定义及使用方式类似于函数。支持条件判断、循环等控制语句,支持游标处理结果集,可用sql的方式编写业务逻辑。 Hive 使用 Hplsql 支持PL-SQL,同时很大程度兼容 Oracle, Impala, MySQL, Redshift, PostgreSQL, DB2 等,我们基于 Hplsql 在 Doris 支持PL-SQL,实现对 Oracle、PostgreSQL 等数据库系统Stored procedure的兼容。 参考文档: http://mail.hplsql.org https://docs.oracle.com/en/database/oracle/oracle-database/21/lnpls/plsql-language-fundamentals.html#GUID-640DB3AA-15AF-4825-BD6C-1D4EB5AB7715 # 2. 实现 以下面的 case 为例,解释存储过程的执行流程,支持使用 Mysql 协议连接 FE 和使用脚本两种方式执行。 ``` CREATE OR REPLACE PROCEDURE A(IN name STRING, OUT result int) select count(*) from test; select count(*) into result from test where k = name; END declare result INT default = 0; call A(‘xxx’, result); print result; ``` ## 2.1 使用 Mysql 协议连接 FE 执行 [图片] 1. Add procedure,将 Procedure Name 和 Source(原始SQL)持久化到 Doris FE 元数据中。 2. Call procedure,提取 Call Stmt 中的实参 Value和 Procedure Name。使用 Procedure Name 在元数据中找到 Source,提取 Procedure 形参的 Name 和 Type,和实参 Value 匹配后组成完整变量<Name, Type, Value>。 3. 执行 Doris Statement - 使用 Doris Logical Plan Builder 对 Source 中的 Doris Statement 语法解析,替换参数变量,去除 into variable子句,生成符合 Doris 语法的Plan Tree。 - 使用 stmtExecutor 执行SQL,将查询结果集迭代器封装到 QueryResult 中。 - 将查询结果输出到Mysql Channel,或写入Cursor、出参、变量中。 - 兼容 Mysql 协议的 Stored Programs 支持多语句。 4. 执行 PL-SQL Statement - 使用 Plsql Logical Plan Builder 对 Source 中的 PL-SQL Statement 语法解析并执行,包括 Loop、Cursor、IF、Declare 等,基本复用 HplSQL。 ## 2.2 脚本执行 从脚本启动读取并解析SQL,通过 JDBC 连接 Doris FE 发送 Doris Statement,在本地处理返回的 Doris 查询结果集,并返回最终结果。 使用方法: 1. Modify the jdbc path and username and password to connect to doris in the fe/conf/plsql-site.xml configuration file: ``` <property> <name>plsql.conn.dorisconn</name> <value>com.mysql.cj.jdbc.Driver;jdbc:mysql://fehost:queryport/database;user;password</value> <description>Doris jdbc connection</description> </property> ``` 2. Execute stored procedure sql through script: ``` sh fe/bin/plsql.sh -f stored_procedure.sql ``` 执行流程: [图片] 与使用 Mysql 协议连接 FE 执行流程类似,区别是使用 JDBC 执行 Doris Statement,并且创建的 Stored procedure 不会持久化到 Doris FE 元数据中,只是临时保存在本地。 # 3. TODO 1. Doris Parser support declare variable。 2. Select Statement support insert into variable。 3. 参数和字段重名。 4. 如果Cursor中途退出,是否会内存泄漏。 5. 在语法解析 LogicalPlanBuilder 中使用 getOriginSql(ctx) 获得原始SQL,特殊字符是否有问题。 6. 支持 Map、Struct 等复杂类型。 7. 测试 Package 等语法。 8. 支持 UDF。 9. 是否兼容 Oracle PL-SQL, # 4. 一些问题 1. 不支持 JDBC 执行会返回结果的存储过程,只能把执行结果 Into 到一个变量,或写入表中,因为返回多个结果集时 JDBC 需使用 prepareCall 语句执行,否则返回结果的 Statemnt 执行 Finalize 时 Send EOF Packet 会报错; 2. 使用 PL-SQL Cursor 可以同时打开多个 Query 的结果集迭代器,在 Doris BE 会一直缓存这些 Query 的中间状态(比如 HashTable)和查询结果,直到 Query 结果集迭代完成,如果 Cursor 长时间没有被使用,会导致大量内存浪费。 3. 当前 PL-SQL Statement 在 Doris FE 计算,比如 IF、Loop、Cursor 等,这可能影响 Doris FE 上的其他服务,并受限于 Doris FE 的性能,存储过程的创建会 Forward Master FE,执行可以在 Observer,短期考虑扩展 FE Observer 节点个数执行 PL-SQL Statement,更好的方案是移到 Doris BE 上执行。 4. Doris Statement返回结果的格式是 ```xxxx\n, xxxx\n, 2 rows affected (0.03 sec)```,PL-SQL 使用 Print 打印变量值是无格式的,JDBC 无法方便的拿到真正结果。 # 5. 一些思考 上面执行 Doris Statement 复用 Doris Logical Plan Builder 做语法解析,自顶向下解析成一个 Plan Tree,并调用 stmtExecutor 执行,PL-SQL 替换变量、去除 Into Variable 等操作耦合在 Doris 语法解析中,好处是较少的改动就可以兼容 Doris 语法,坏处是会侵入 Doris 语法解析流程。 HplSQL是独立于 Hive 单独做了一次语法解析,实现变量替换等操作,最终输出一个符合 Hive 语法的 SQL,下面是一个简单的语法解析流程,对 SQL 中 select、 where、expression、table name、join、agg、order 等语法的解析都要重新实现,好处是与原系统完全独立,但改动太复杂。 [图片] -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
