jackylk commented on a change in pull request #3521: [doc_zh_cn] add a commercial inventory DB and carbon data query performance comparison doc chinese doc to carbondata URL: https://github.com/apache/carbondata/pull/3521#discussion_r361911969
########## File path: docs/zh_cn/CarbonData与商业列存DB性能对比.md ########## @@ -0,0 +1,115 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to you under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> + +## CarbonData与商业列存DB性能对比 + +本文主要在于给用户呈现CarbonData在对比某商业列存DB过程中对于该DB的查询性能提升,CarbonData自身的优势和特点,本文的数据仅为基于某领域查询特点框架下SQL的查询结果,只代表该特定查询特点下的性能对比。 + + + + + +## 1.测试环境对比 + +查询时某商业列存DB一台查询节点,配置SSD硬盘。CarbonData6个DataNode,配置STAT硬盘,但是查询队列设置1/6的资源,等同于1台商业DB服务器对比1台CarbonData服务器的查询性能。同时CarbonData使用的服务器的磁盘是STAT盘,成本比某商业列存DB服务器低。 + +| 集群 | 描述 | +| ---------------- | --------------------------------------------------------- | +| 某商业列存DB集群 | 3节点,SSD硬盘 | +| Hadoop集群 | 2个namenode,6个datanode,STAT硬盘,查询队列分配1/6的资源 | + +## 2.查询SQL模型介绍 + +某商业列存DB与CarbonData查询SQL本身存在差异,在执行性能测试之前需要对SQL进行修改。 + +```某商业列存DB的查询SQL模型:``` + +SELECT TOP 5000 SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_C , SUM(COALESCE(COLUMN_A, 0)) AS COLUMN_A_A , SUM(COALESCE(COLUMN_B, 0)) AS COLUMN_B_B , SUM(COALESCE(COLUMN_D, 0)) + SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_F , SUM(COALESCE(COLUMN_D, 0)) AS COLUMN_D_D , SUM(COALESCE(COLUMN_E, 0)) AS COLUMN_E_E , (SUM(COALESCE(COLUMN_A, 0)) + SUM(COALESCE(COLUMN_B, 0))) * delta AS COLUMN_F , SUM(COALESCE(COLUMN_A, 0)) * delta AS COLUMN_G , SUM(COALESCE(COLUMN_B, 0)) * delta AS COLUMN_H , MT."TEMP" AS "TEMP", COUNT(1) OVER () AS countNum FROM ( SELECT COALESCE(SUM("COLUMN_1_A"), 0) AS COLUMN_A , COALESCE(SUM("COLUMN_1_B"), 0) AS COLUMN_B , COALESCE(SUM("COLUMN_1_E"), 0) AS COLUMN_E , COALESCE(SUM("COLUMN_1_D"), 0) AS COLUMN_D , TABLE_A."TEMP" AS "TEMP" FROM TABLE_B LEFT JOIN ( SELECT "COLUMN_CSI" AS "TEMP2" , CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END AS "TEMP" , CASE WHEN "TYPE_ID" = 2 THEN "CLOUMN_NAME" END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY "COLUMN_CSI", CASE WHEN "TYPE_ID" = 2 THEN "COLUMN_CSI" END, CASE WHEN "TYPE_ID" = 2 THEN "CLOUMN_NAME" END ) TABLE_A ON "COLUMN_CSI" = TABLE_A."TEMP2" WHERE TABLE_A.NAME_TEMP IS NOT NULL AND "TIME" < A AND "TIME" >= B GROUP BY TABLE_A."TEMP" ) MT GROUP BY MT."TEMP" ORDER BY COLUMN_C DESC + +其中一个SUM后面称为一个counter + +```Spark的查询SQL模型:``` + +SELECT COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0) AS COLUMN_C , COALESCE(SUM(COLUMN_A), 0) AS COLUMN_A_A , COALESCE(SUM(COLUMN_B), 0) AS COLUMN_B_B , COALESCE(SUM(COLUMN_D), 0) + COALESCE(SUM(COLUMN_E), 0) AS COLUMN_F , COALESCE(SUM(COLUMN_D), 0) AS COLUMN_D_D , COALESCE(SUM(COLUMN_E), 0) AS COLUMN_E_E , (COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0)) * delta AS COLUMN_F , COALESCE(SUM(COLUMN_A), 0) * delta AS COLUMN_G , COALESCE(SUM(COLUMN_B), 0) * delta AS COLUMN_H , MT.`TEMP` AS `TEMP` FROM ( SELECT `COLUMN_1_A` AS COLUMN_A, `COLUMN_1_E` AS COLUMN_E, `COLUMN_1_B` AS COLUMN_B, `COLUMN_1_D` AS COLUMN_D, TABLE_A.`TEMP` AS `TEMP` FROM TABLE_B LEFT JOIN ( SELECT `COLUMN_CSI` AS `TEMP2` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END AS `TEMP` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY `COLUMN_CSI`, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END ) TABLE_A ON `COLUMN_CSI` = TABLE_A.`TEMP2` WHERE TABLE_A.NAME_TEMP IS NOT NULL AND `TIME` >= A AND `TIME` < B ) MT GROUP BY MT.`TEMP` ORDER BY COLUMN_C DESC LIMIT 5000 Review comment: ```suggestion ```SQL SELECT COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0) AS COLUMN_C , COALESCE(SUM(COLUMN_A), 0) AS COLUMN_A_A , COALESCE(SUM(COLUMN_B), 0) AS COLUMN_B_B , COALESCE(SUM(COLUMN_D), 0) + COALESCE(SUM(COLUMN_E), 0) AS COLUMN_F , COALESCE(SUM(COLUMN_D), 0) AS COLUMN_D_D , COALESCE(SUM(COLUMN_E), 0) AS COLUMN_E_E , (COALESCE(SUM(COLUMN_A), 0) + COALESCE(SUM(COLUMN_B), 0)) * delta AS COLUMN_F , COALESCE(SUM(COLUMN_A), 0) * delta AS COLUMN_G , COALESCE(SUM(COLUMN_B), 0) * delta AS COLUMN_H , MT.`TEMP` AS `TEMP` FROM ( SELECT `COLUMN_1_A` AS COLUMN_A, `COLUMN_1_E` AS COLUMN_E, `COLUMN_1_B` AS COLUMN_B, `COLUMN_1_D` AS COLUMN_D, TABLE_A.`TEMP` AS `TEMP` FROM TABLE_B LEFT JOIN ( SELECT `COLUMN_CSI` AS `TEMP2` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END AS `TEMP` , CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END AS NAME_TEMP FROM DIMENSION_TABLE GROUP BY `COLUMN_CSI`, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_CSI` END, CASE WHEN `TYPE_ID` = 2 THEN `COLUMN_NAME` END ) TABLE_A ON `COLUMN_CSI` = TABLE_A.`TEMP2` WHERE TABLE_A.NAME_TEMP IS NOT NULL AND `TIME` >= A AND `TIME` < B ) MT GROUP BY MT.`TEMP` ORDER BY COLUMN_C DESC LIMIT 5000 ``` ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services
