[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429082195



##
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##
@@ -0,0 +1,397 @@
+---
+{
+"title": "Spark Load",
+"language": "zh-CN"
+}
+---  
+
+
+
+# Spark Load
+
+Spark load preprocesses the data to be imported by using Spark, which improves the import performance of large volumes of data into Doris and saves compute resources of the Doris cluster. It is mainly used for initial migration and for importing large volumes of data into Doris.
+
+Spark load is an asynchronous import method. Users need to create a Spark-type import job via the MySQL protocol and check the import result with `SHOW LOAD`.
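For reference, a minimal sketch of checking such a job from the MySQL client (the label below is a placeholder):

```sql
SHOW LOAD WHERE LABEL = "example_spark_load_label";
```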
+
+
+
+## Applicable scenarios
+
+* The source data is in a storage system that Spark can access, such as HDFS.
+* The data volume is at the level of tens of GB to TB.
+
+
+
+## Terminology
+
+1. Frontend (FE): the metadata and scheduling node of the Doris system. During an import it is mainly responsible for scheduling the import job.
+2. Backend (BE): the compute and storage node of the Doris system. During an import it is mainly responsible for writing and storing the data.
+3. Spark ETL: mainly responsible for the ETL of the data during an import, including global dictionary construction (for the BITMAP type), partitioning, sorting, aggregation, and so on.
+4. Broker: an independent, stateless process that encapsulates the file system interface and provides Doris with the ability to read files in remote storage systems.
+
+
+## How it works
+
+### Basic flow
+
+The user submits a Spark-type import job through the MySQL client; the FE records the metadata and returns a successful submission to the user.
+
+The execution of a Spark load job is divided into the following five stages.
+
+1. The FE schedules and submits the ETL job to the Spark cluster for execution.
+2. The Spark cluster runs the ETL job to preprocess the data to be imported, including global dictionary construction (for the BITMAP type), partitioning, sorting, aggregation, and so on.
+3. After the ETL job finishes, the FE obtains the data path of each preprocessed tablet and schedules the relevant BEs to run Push tasks.
+4. The BEs read the data through the Broker and convert it into the Doris storage format.
+5. The FE schedules the effective version and completes the import job.
+
+```
+ +
+ | 0. User create spark load job
++v+
+|   FE|-+
++++ |
+ | 3. FE send push tasks|
+ | 5. FE publish version|
++++ |
+||| |
++---v---++---v---++---v---+ |
+|  BE   ||  BE   ||  BE   | |1. FE submit Spark 
ETL job
++---^---++---^---++---^---+ |
+|4. BE push with broker   | |
++---+---++---+---++---+---+ |
+|Broker ||Broker ||Broker | |
++---^---++---^---++---^---+ |
+||| |
++---+++---+ 2.ETL +-v---+
+|   HDFS  +--->   Spark cluster |
+| <---+ |
++-+   +-+
+
+```
+
+
+
+### Global dictionary
+
+To be added.
+
+
+
+### Data preprocessing (DPP)
+
+To be added.
+
+
+
+## Basic operations
+
+### Configure the ETL cluster
+
+Spark is used in Doris as an external compute resource to do the ETL work. In the future, other external resources may be added to Doris as well, such as Spark/GPU for queries, HDFS/S3 for external storage, and MapReduce for ETL, so we introduce resource management to manage these external resources used by Doris.
+
+Before submitting a Spark import job, you need to configure the Spark cluster that will run the ETL job.
+
+Syntax:
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+( 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
+#### Create a resource
+
+`resource_name` is the name of the Spark resource configured in Doris.
+
+`PROPERTIES` are the parameters related to the Spark resource, as follows:
+
+- `type`: resource type, required. Currently only spark is supported.
+
+- Spark-related parameters are as follows:
+  - `spark.master`: required. Currently yarn and spark://host:port are supported.
+  - `spark.submit.deployMode`: deployment mode of the Spark program, required. Both cluster and client are supported.
+  - `spark.hadoop.yarn.resourcemanager.address`: required when master is yarn.
+  - `spark.hadoop.fs.defaultFS`: required when master is yarn.
+  - Other parameters are optional; see http://spark.apache.org/docs/latest/configuration.html
+- `working_dir`: directory used by the ETL job. Required when Spark is used as an ETL resource. For example: hdfs://host:port/tmp/doris.
+- `broker`: broker name. Required when Spark is used as an ETL resource. It needs to be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+  - `broker.property_key`: authentication information and similar properties the broker needs when reading the intermediate files produced by the ETL job.
+
+Examples:
+
+```sql
+-- yarn cluster mode
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:1",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",
+  "broker" = "broker0",
+  "broker.username" = "user0",
+  "broker.password" = "password0"
+);
+
+-- spark standalone client mode
+CREATE EXTERNAL RESOURCE "spark1"
+PROPERTIES
+(
+  "type" = "spark", 
+  "spark.master" = "spark://127.0.0.1:",
+  "spark.submit.deployMode" = "client",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",
+  "broker" = "broker1"
+);
+```
+
+#### View resources
+
+Ordinary accounts can only see the resources on which they have the USAGE_PRIV privilege.
+
+The root and admin accounts can see all resources.
+
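A small usage sketch of the privilege commands above (the resource and user names are placeholders for an existing resource and user):

```sql
-- grant usage on resource "spark0" to user "user0"@"%"
GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%";
-- user0 can now see spark0 in the resource list
SHOW RESOURCES;
```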
+ 

[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429083102



##
File path: fe/src/main/java/org/apache/doris/mysql/privilege/PaloPrivilege.java
##
@@ -25,7 +25,8 @@
 LOAD_PRIV("Load_priv", 4, "Privilege for loading data into tables"),
 ALTER_PRIV("Alter_priv", 5, "Privilege for alter database or table"),
 CREATE_PRIV("Create_priv", 6, "Privilege for createing database or table"),
-DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table");
+DROP_PRIV("Drop_priv", 7, "Privilege for dropping database or table"),
+USAGE_PRIV("Usage_priv", 8, "Privilege for use resource");

Review comment:
   ok





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] morningman commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429094156



##
File path: fe/src/main/java/org/apache/doris/catalog/ResourceMgr.java
##
@@ -0,0 +1,189 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.analysis.DropResourceStmt;
+import org.apache.doris.catalog.Resource.ResourceType;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.common.proc.ProcNodeInterface;
+import org.apache.doris.common.proc.ProcResult;
+import org.apache.doris.mysql.privilege.PrivPredicate;
+import org.apache.doris.qe.ConnectContext;
+
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.Maps;
+import org.apache.logging.log4j.LogManager;
+import org.apache.logging.log4j.Logger;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.locks.ReentrantLock;
+
+/**
+ * Resource manager is responsible for managing external resources used by Doris.
+ * For example, Spark/MapReduce used for ETL, Spark/GPU used for queries, HDFS/S3 used for external storage.
+ * Now only support Spark.
+ */
+public class ResourceMgr implements Writable {
+private static final Logger LOG = LogManager.getLogger(ResourceMgr.class);
+
+public static final ImmutableList<String> RESOURCE_PROC_NODE_TITLE_NAMES = new ImmutableList.Builder<String>()
+.add("Name").add("ResourceType").add("Key").add("Value")
+.build();
+
+// { resourceName -> Resource}
+private final Map<String, Resource> nameToResource = Maps.newHashMap();
+private final ReentrantLock lock = new ReentrantLock();
+private final ResourceProcNode procNode = new ResourceProcNode();
+
+public ResourceMgr() {
+}
+
+public void createResource(CreateResourceStmt stmt) throws DdlException {
+lock.lock();
+try {
+if (stmt.getResourceType() != ResourceType.SPARK) {
+throw new DdlException("Only support Spark resource.");
+}
+
+String resourceName = stmt.getResourceName();
+if (nameToResource.containsKey(resourceName)) {
+throw new DdlException("Resource(" + resourceName + ") already 
exist");
+}
+
+Resource resource = Resource.fromStmt(stmt);
+nameToResource.put(resourceName, resource);
+// log add
+Catalog.getInstance().getEditLog().logCreateResource(resource);
+LOG.info("create resource success. resource: {}", resource);
+} finally {
+lock.unlock();
+}
+}
+
+public void replayCreateResource(Resource resource) {
+lock.lock();
+try {
+nameToResource.put(resource.getName(), resource);
+} finally {
+lock.unlock();
+}
+}
+
+public void dropResource(DropResourceStmt stmt) throws DdlException {
+lock.lock();
+try {
+String name = stmt.getResourceName();
+if (!nameToResource.containsKey(name)) {
+throw new DdlException("Resource(" + name + ") does not 
exist");
+}
+
+nameToResource.remove(name);
+// log drop
+Catalog.getInstance().getEditLog().logDropResource(name);
+LOG.info("drop resource success. resource name: {}", name);
+} finally {
+lock.unlock();
+}
+}
+
+public void replayDropResource(String name) {
+lock.lock();
+try {
+nameToResource.remove(name);
+} finally {
+lock.unlock();
+}
+}
+
+public boolean containsResource(String name) {
+lock.lock();
+try {
+return nameToResource.containsKey(name);
+} finally {
+lock.unlock();
+}
+}
+
+public Resource getResource(String name) {
+lock.lock();
+try {
+retu

[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429127459



##
File path: fe/src/main/java/org/apache/doris/analysis/ResourcePattern.java
##
@@ -0,0 +1,118 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.analysis;
+
+import org.apache.doris.common.AnalysisException;
+import org.apache.doris.common.FeNameFormat;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.mysql.privilege.PaloAuth.PrivLevel;
+
+import com.google.common.base.Preconditions;
+import com.google.common.base.Strings;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+
+// only the following 2 formats are allowed
+// *
+// resource
+public class ResourcePattern implements Writable {
+private String resourceName;
+boolean isAnalyzed = false;

Review comment:
   removed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429129008



##
File path: fe/src/main/java/org/apache/doris/catalog/Resource.java
##
@@ -0,0 +1,110 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package org.apache.doris.catalog;
+
+import org.apache.doris.analysis.CreateResourceStmt;
+import org.apache.doris.common.DdlException;
+import org.apache.doris.common.io.Text;
+import org.apache.doris.common.io.Writable;
+import org.apache.doris.common.proc.BaseProcResult;
+import org.apache.doris.persist.gson.GsonUtils;
+
+import com.google.gson.annotations.SerializedName;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.util.Map;
+
+public abstract class Resource implements Writable {
+public enum ResourceType {
+UNKNOWN,
+SPARK;
+
+public static ResourceType fromString(String resourceType) {

Review comment:
   Made fromString case-insensitive, and it returns UNKNOWN if the resourceType does not match any known type.
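A minimal sketch of the lookup described, assuming the enum constants shown in the diff (UNKNOWN, SPARK); illustrative only, not necessarily the exact code in the PR:

```java
public static ResourceType fromString(String resourceType) {
    for (ResourceType type : ResourceType.values()) {
        // case-insensitive match against the enum constant name
        if (type.name().equalsIgnoreCase(resourceType)) {
            return type;
        }
    }
    // unrecognized types fall back to UNKNOWN instead of throwing
    return UNKNOWN;
}
```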





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] HappenLee opened a new pull request #3665: Del some useless code and improve the performance of sort

2020-05-22 Thread GitBox


HappenLee opened a new pull request #3665:
URL: https://github.com/apache/incubator-doris/pull/3665


   1. Delete the code of the Sort Node that we do not use now.
   2. Optimize the quick sort by find_the_median and try to reduce the recursion depth of the quick sort.
   
   I have tested sorting 50 million rows of data (ordered, disordered, and reverse-ordered); performance improved by about 10%.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] HappenLee closed pull request #3665: Del some useless code and improve the performance of sort

2020-05-22 Thread GitBox


HappenLee closed pull request #3665:
URL: https://github.com/apache/incubator-doris/pull/3665


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429139485



##
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##
@@ -0,0 +1,397 @@

[GitHub] [incubator-doris] HappenLee opened a new pull request #3666: Del some useless code and improve the performance of sort

2020-05-22 Thread GitBox


HappenLee opened a new pull request #3666:
URL: https://github.com/apache/incubator-doris/pull/3666


   1. Delete the code of the Sort Node that we do not use now.
   2. Optimize the quick sort by find_the_median and try to reduce the recursion depth of the quick sort.
   
   I have tested sorting 50 million rows of data (ordered, disordered, and reverse-ordered); performance improved by about 10%.
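A generic sketch of the two techniques mentioned (median-of-three pivot selection, and recursing only into the smaller partition to cap recursion depth); this is illustrative only, not the Doris sort-node code:

```cpp
#include <algorithm>
#include <vector>

// Median-of-three: order v[lo], v[mid], v[hi] in place and return the middle value as pivot.
template <typename T>
T median_of_three(std::vector<T>& v, int lo, int hi) {
    int mid = lo + (hi - lo) / 2;
    if (v[mid] < v[lo]) std::swap(v[mid], v[lo]);
    if (v[hi] < v[lo]) std::swap(v[hi], v[lo]);
    if (v[hi] < v[mid]) std::swap(v[hi], v[mid]);
    return v[mid];
}

// Quicksort that recurses into the smaller partition and loops on the larger one,
// so the recursion depth stays O(log n) even on ordered or reverse-ordered input.
template <typename T>
void quick_sort(std::vector<T>& v, int lo, int hi) {
    while (lo < hi) {
        T pivot = median_of_three(v, lo, hi);
        int i = lo;
        int j = hi;
        while (i <= j) {
            while (v[i] < pivot) ++i;
            while (pivot < v[j]) --j;
            if (i <= j) std::swap(v[i++], v[j--]);
        }
        if (j - lo < hi - i) {  // left part is smaller: recurse on it, iterate on the right
            quick_sort(v, lo, j);
            lo = i;
        } else {                // right part is smaller: recurse on it, iterate on the left
            quick_sort(v, i, hi);
            hi = j;
        }
    }
}
// Usage: quick_sort(data, 0, static_cast<int>(data.size()) - 1);
```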



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wangbo commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wangbo commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429148045



##
File path: docs/zh-CN/administrator-guide/resource-management.md
##
@@ -0,0 +1,125 @@
+---
+{
+"title": "资源管理",
+"language": "zh-CN"
+}
+---
+
+
+
+# Resource Management
+
+To save compute and storage resources inside the Doris cluster, Doris needs to bring in some external resources to do the related work, such as Spark/GPU for queries, HDFS/S3 for external storage, and Spark/MapReduce for ETL. Therefore we introduce a resource management mechanism to manage these external resources used by Doris.
+
+
+
+## Basic concepts
+
+A resource contains basic information such as its name and type. The name is globally unique. Different types of resources have different properties; see the introduction of each resource for details.
+
+Only users with the `admin` privilege can create and drop resources. A resource belongs to the whole Doris cluster. Users with the `admin` privilege can grant the usage privilege `usage_priv` to ordinary users. See `HELP GRANT` or the privilege documentation.
+
+
+
+## Operations
+
+Resource management mainly involves three commands: `CREATE RESOURCE`, `DROP RESOURCE`, and `SHOW RESOURCES`, which create, drop, and show resources respectively. For the exact syntax of these commands, connect to Doris with a MySQL client and run `HELP cmd`.
+
+1. CREATE RESOURCE
+
+   Syntax
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"  
+ PROPERTIES ("key"="value", ...); 
+   ```
+
+   In the command that creates a resource, the user must provide the following information:
+
+   * `resource_name` is the name of the resource configured in Doris.
+   * `PROPERTIES` are the parameters related to the resource, as follows:
+ * `type`: resource type, required. Currently only spark is supported.
+ * For other parameters, see the introduction of each resource.
+
+2. DROP RESOURCE
+
+   This command drops an existing resource. For details, see `HELP DROP RESOURCE`.
+
+3. SHOW RESOURCES
+
+   This command shows the resources on which the user has the usage privilege. For details, see `HELP SHOW RESOURCES`.
+
+
+
+## Supported resources
+
+Currently only the Spark resource is supported, and it is used to do the ETL work. The examples below all use the Spark resource.
+
+### Spark
+
+#### Parameters
+
+##### Spark-related parameters are as follows:
+
+`spark.master`: required. Currently yarn and spark://host:port are supported.
+
+`spark.submit.deployMode`: deployment mode of the Spark program, required. Both cluster and client are supported.
+
+`spark.hadoop.yarn.resourcemanager.address`: required when master is yarn.
+
+`spark.hadoop.fs.defaultFS`: required when master is yarn.
+
+Other parameters are optional; see http://spark.apache.org/docs/latest/configuration.html.
+
+
+
+##### If Spark is used for ETL, the following parameters also need to be specified:
+
+`working_dir`: directory used by the ETL job. Required when Spark is used as an ETL resource. For example: hdfs://host:port/tmp/doris.
+
+`broker`: broker name. Required when Spark is used as an ETL resource. It needs to be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+
+  * `broker.property_key`: authentication information and similar properties the broker needs when reading the intermediate files produced by the ETL job.
+
+
+
+#### Examples
+
+Create a Spark resource named spark0 in yarn cluster mode.
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:1",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",

Review comment:
   Is "spark.hadoop.fs.defaultFS"  is used for load data from HDFS? Or it 
seems that only ```working_dir```  is enough





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wutiangan opened a new pull request #3667: fix lead() function throw analysis Exception #3254

2020-05-22 Thread GitBox


wutiangan opened a new pull request #3667:
URL: https://github.com/apache/incubator-doris/pull/3667


   fix #3254
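For context, `lead()` is the window function involved; a typical call looks like the sketch below (table and column names are placeholders):

```sql
SELECT k1,
       dt,
       v1,
       -- value of v1 on the next row within each k1 group, or 0 when there is no next row
       lead(v1, 1, 0) OVER (PARTITION BY k1 ORDER BY dt) AS next_v1
FROM example_tbl;
```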



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] HangyuanLiu commented on a change in pull request #3651: Support create materialized view with bitmap or hll

2020-05-22 Thread GitBox


HangyuanLiu commented on a change in pull request #3651:
URL: https://github.com/apache/incubator-doris/pull/3651#discussion_r429155817



##
File path: fe/src/main/java/org/apache/doris/alter/MaterializedViewHandler.java
##
@@ -457,8 +462,23 @@ private RollupJobV2 createMaterializedViewJob(String 
mvName, String baseIndexNam
+ "duplicate table");
 }
 Column newMVColumn = new Column(baseColumn);
+
 newMVColumn.setIsKey(mvColumnItem.isKey());
 newMVColumn.setAggregationType(mvAggregationType, 
mvColumnItem.isAggregationTypeImplicit());
+newMVColumn.setDefineExpr(mvColumnItem.getDefineExpr());
+if (mvColumnItem.getDefineExpr() != null) {
+if (mvAggregationType.equals(BITMAP_UNION)) {
+newMVColumn.setType(Type.BITMAP);
+newMVColumn.setName(MATERIALIZED_VIEW_NAME_PRFIX + 
"bitmap_" + baseColumn.getName());
+} else if (mvAggregationType.equals(HLL_UNION)){
+newMVColumn.setType(Type.HLL);
+newMVColumn.setName(MATERIALIZED_VIEW_NAME_PRFIX + "hll_" 
+ baseColumn.getName());
+} else {
+throw new DdlException("The define expr of column is only 
support bitmap_union or hll_union");
+}
+newMVColumn.setIsKey(false);

Review comment:
   If the aggregate function is computed over the original column, the result should not be a key column, regardless of whether the original column is a key column or a value column.
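For illustration, the kind of statement this code path handles; the table and column names below are placeholders, and the rewritten MV columns become non-key bitmap/HLL value columns as described above:

```sql
-- assumes a duplicate-key table t(k1 INT, v1 INT, v2 VARCHAR)
CREATE MATERIALIZED VIEW mv_uv AS
SELECT k1, bitmap_union(to_bitmap(v1)), hll_union(hll_hash(v2))
FROM t
GROUP BY k1;
```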





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429167261



##
File path: docs/zh-CN/administrator-guide/resource-management.md
##
@@ -0,0 +1,125 @@
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:1",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",

Review comment:
   In yarn cluster deploy mode, "spark.hadoop.fs.defaultFS" is used by the Spark ETL job to store the hdfs://host:port/user/xxx/.sparkStaging/appid/__spark_libs__xxx.zip and hdfs://host:port/user/xxx/.sparkStaging/appid/__spark_conf__.zip files.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] HangyuanLiu commented on a change in pull request #3651: Support create materialized view with bitmap or hll

2020-05-22 Thread GitBox


HangyuanLiu commented on a change in pull request #3651:
URL: https://github.com/apache/incubator-doris/pull/3651#discussion_r429168634



##
File path: 
fe/src/main/java/org/apache/doris/analysis/CreateMaterializedViewStmt.java
##
@@ -167,15 +167,25 @@ private void analyzeSelectClause() throws 
AnalysisException {
 } else if (selectListItem.getExpr() instanceof FunctionCallExpr) {
 FunctionCallExpr functionCallExpr = (FunctionCallExpr) 
selectListItem.getExpr();
 String functionName = 
functionCallExpr.getFnName().getFunction();
+Expr defineExpr = null;
 // TODO(ml): support REPLACE, REPLACE_IF_NOT_NULL only for 
aggregate table, HLL_UNION, BITMAP_UNION
 if (!functionName.equalsIgnoreCase("sum")
 && !functionName.equalsIgnoreCase("min")
-&& !functionName.equalsIgnoreCase("max")) {
+&& !functionName.equalsIgnoreCase("max")
+&& !functionName.equalsIgnoreCase("bitmap_union")
+&& !functionName.equalsIgnoreCase("hll_union")) {
 throw new AnalysisException("The materialized view only 
support the sum, min and max aggregate "
 + "function. Error 
function: " + functionCallExpr.toSqlImpl());
 }
 Preconditions.checkState(functionCallExpr.getChildren().size() 
== 1);
 Expr functionChild0 = functionCallExpr.getChild(0);
+
+if (functionName.equalsIgnoreCase("bitmap_union") || 
functionName.equalsIgnoreCase("hll_union")) {

Review comment:
   I agree with you. But schema change and  expression compute  logical is 
non-universal. so we can only support few function now.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] wangbo commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wangbo commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429174060



##
File path: docs/zh-CN/administrator-guide/resource-management.md
##
@@ -0,0 +1,125 @@
+---
+{
+"title": "资源管理",
+"language": "zh-CN"
+}
+---
+
+
+
+# 资源管理
+
+为了节省Doris集群内的计算、存储资源,Doris需要引入一些其他外部资源来完成相关的工作,如Spark/GPU用于查询,HDFS/S3用于外部存储,Spark/MapReduce用于ETL等,因此我们引入资源管理机制来管理Doris使用的这些外部资源。
+
+
+
+## 基本概念
+
+一个资源包含名字、类型等基本信息,名字为全局唯一,不同类型的资源包含不同的属性,具体参考各资源的介绍。
+
+资源的创建和删除只能由拥有 `admin` 权限的用户进行操作。一个资源隶属于整个Doris集群。拥有 `admin` 
权限的用户可以将使用权限`usage_priv` 赋给普通用户。可参考`HELP GRANT`或者权限文档。
+
+
+
+## 具体操作
+
+资源管理主要有三个命令:`CREATE RESOURCE`,`DROP RESOURCE` 和 `SHOW 
RESOURCES`,分别为创建、删除和查看资源。这三个命令的具体语法可以通过MySQL客户端连接到 Doris 后,执行 `HELP cmd` 
的方式查看帮助。
+
+1. CREATE RESOURCE
+
+   语法
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"  
+ PROPERTIES ("key"="value", ...); 
+   ```
+
+   在创建资源的命令中,用户必须提供以下信息:
+
+   * `resource_name` 为 Doris 中配置的资源的名字。
+   * `PROPERTIES` 是资源相关参数,如下:
+ * `type`:资源类型,必填,目前仅支持 spark。
+ * 其他参数见各资源介绍。
+
+2. DROP RESOURCE
+
+   该命令可以删除一个已存在的资源。具体操作见:`HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   该命令可以查看用户有使用权限的资源。具体操作见:`HELP SHOW RESOURCES`
+
+
+
+## 支持的资源
+
+目前仅支持Spark资源,完成ETL工作。下面的示例都以Spark资源为例。
+
+### Spark
+
+ 参数
+
+# Spark 相关参数如下:
+
+`spark.master`: 必填,目前支持yarn,spark://host:port。
+
+`spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。
+
+`spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。
+
+`spark.hadoop.fs.defaultFS`: master为yarn时必填。
+
+其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html。
+
+
+
+# 如果Spark用于ETL,还需要指定以下参数:
+
+`working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。
+
+`broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 
+
+  * `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。
+
+
+
+ 示例
+
+创建 yarn cluster 模式,名为 spark0 的 Spark 资源。
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:1",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",

Review comment:
   For the use case where the user only has one cluster (one Spark cluster, one HDFS) for ETL, the Spark client may read the ```hdfs-site.xml``` via ```HADOOP_HOME```, so the user need not specify ```defaultFS``` every time a Spark job is submitted.
   In this case, spark.hadoop.fs.defaultFS is not a necessary item; it should be an optional item.
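Under that assumption, a resource definition could then omit it and rely on the client-side Hadoop configuration; a sketch (names are placeholders):

```sql
CREATE EXTERNAL RESOURCE "spark_etl"
PROPERTIES
(
  "type" = "spark",
  "spark.master" = "yarn",
  "spark.submit.deployMode" = "cluster",
  -- fs.defaultFS omitted: resolved from hdfs-site.xml found via HADOOP_HOME
  "working_dir" = "hdfs://host:port/tmp/doris",
  "broker" = "broker0"
);
```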





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] decster opened a new pull request #3668: [Memory Engine] Add TabletType to PartitionInfo and TabletMeta

2020-05-22 Thread GitBox


decster opened a new pull request #3668:
URL: https://github.com/apache/incubator-doris/pull/3668


   This CL adds TabletType to TabletMeta and PartitionInfo. It also adds a create-table property "tablet_type" : "disk/memory" so users can specify the tablet type. This is a temporary entry point for testing only and may change in the future, so this field is not persisted to FE PartitionInfo (that would require upgrading the meta version).
   
   Resolves #3442 
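A sketch of how the temporary entry point described above might be used; the table definition is a placeholder and the property is, as stated, for testing only:

```sql
CREATE TABLE test_mem_tablet
(
    k1 INT,
    v1 INT SUM
)
AGGREGATE KEY(k1)
DISTRIBUTED BY HASH(k1) BUCKETS 8
PROPERTIES
(
    "replication_num" = "1",
    -- property added by this CL: "disk" (default) or "memory"
    "tablet_type" = "memory"
);
```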



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] chaoyli commented on a change in pull request #3637: [Memory Engine] Add MemSubTablet, MemTablet, WriteTx, PartialRowBatch

2020-05-22 Thread GitBox


chaoyli commented on a change in pull request #3637:
URL: https://github.com/apache/incubator-doris/pull/3637#discussion_r429084494



##
File path: be/src/olap/memory/CMakeLists.txt
##
@@ -29,5 +29,8 @@ add_library(Memory STATIC
 delta_index.cpp
 hash_index.cpp
 mem_tablet.cpp
+mem_sub_tablet.cpp
+partial_row_batch.cpp
 schema.cpp
+write_txn.cpp

Review comment:
   write_txn uses the txn abbreviation, so I think you can unify the naming in this pull request.

##
File path: be/src/olap/memory/mem_sub_tablet.cpp
##
@@ -0,0 +1,235 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "olap/memory/mem_sub_tablet.h"
+
+#include "olap/memory/column.h"
+#include "olap/memory/column_reader.h"
+#include "olap/memory/column_writer.h"
+#include "olap/memory/hash_index.h"
+#include "olap/memory/partial_row_batch.h"
+#include "olap/memory/schema.h"
+
+namespace doris {
+namespace memory {
+
+Status MemSubTablet::create(uint64_t version, const Schema& schema,
+std::unique_ptr* ret) {
+std::unique_ptr tmp(new MemSubTablet());
+tmp->_versions.reserve(64);
+tmp->_versions.emplace_back(version, 0);
+tmp->_columns.resize(schema.cid_size());
+for (size_t i = 0; i < schema.num_columns(); i++) {
+// TODO: support storage_type != c.type
+auto& c = *schema.get(i);
+if (!supported(c.type())) {
+return Status::NotSupported("column type not supported");
+}
+tmp->_columns[c.cid()].reset(new Column(c, c.type(), version));
+}
+tmp.swap(*ret);
+return Status::OK();
+}
+
+MemSubTablet::MemSubTablet() : _index(new HashIndex(1 << 16)) {}
+
+MemSubTablet::~MemSubTablet() {}
+
+Status MemSubTablet::get_size(uint64_t version, size_t* size) const {
+std::lock_guard lg(_lock);
+if (version == static_cast(-1)) {
+// get latest
+*size = _versions.back().size;
+return Status::OK();
+}
+if (_versions[0].version > version) {
+return Status::NotFound("get_size failed, version too old");
+}
+for (size_t i = 1; i < _versions.size(); i++) {
+if (_versions[i].version > version) {
+*size = _versions[i - 1].size;
+return Status::OK();
+}
+}
+*size = _versions.back().size;
+return Status::OK();
+}
+
+Status MemSubTablet::read_column(uint64_t version, uint32_t cid,
+ std::unique_ptr* reader) {
+scoped_refptr cl;
+{
+std::lock_guard lg(_lock);
+if (cid < _columns.size()) {
+cl = _columns[cid];
+}
+}
+if (!cl) {
+return Status::NotFound("column not found");
+}
+return cl->create_reader(version, reader);
+}
+
+Status MemSubTablet::get_index_to_read(scoped_refptr* index) {
+*index = _index;
+return Status::OK();
+}
+
+Status MemSubTablet::begin_write(scoped_refptr* schema) {
+_schema = *schema;
+_row_size = latest_size();
+_write_index = _index;
+_writers.clear();
+_writers.resize(_columns.size());
+// precache key columns
+for (size_t i = 0; i < _schema->num_key_columns(); i++) {
+uint32_t cid = _schema->get(i)->cid();
+if (!_writers[cid]) {
+RETURN_IF_ERROR(_columns[cid]->create_writer(&_writers[cid]));
+}
+}
+_temp_hash_entries.reserve(8);
+
+// setup stats
+_write_start = GetMonoTimeSecondsAsDouble();
+_num_insert = 0;
+_num_update = 0;
+_num_update_cell = 0;
+return Status::OK();
+}
+
+Status MemSubTablet::apply_partial_row(const PartialRowReader& row) {
+DCHECK_GE(row.cell_size(), 1);
+const ColumnSchema* dsc;
+const void* key;
+// get key column and find in hash index
+// TODO: support multi-column row key
+row.get_cell(0, &dsc, &key);
+ColumnWriter* keyw = _writers[1].get();
+// find candidate rowids, and check equality
+uint64_t hashcode = keyw->hashcode(key, 0);
+_temp_hash_entries.clear();
+uint32_t newslot = _write_index->find(hashcode, &_temp_hash_entries);
+uint32_t rid = -1;
+for (size_t i = 0; i < _temp_hash_entries.size(); i++) {
+uint32_t test_rid

[GitHub] [incubator-doris] decster commented on a change in pull request #3637: [Memory Engine] Add MemSubTablet, MemTablet, WriteTx, PartialRowBatch

2020-05-22 Thread GitBox


decster commented on a change in pull request #3637:
URL: https://github.com/apache/incubator-doris/pull/3637#discussion_r429252915



##
File path: be/src/olap/memory/CMakeLists.txt
##
@@ -29,5 +29,8 @@ add_library(Memory STATIC
 delta_index.cpp
 hash_index.cpp
 mem_tablet.cpp
+mem_sub_tablet.cpp
+partial_row_batch.cpp
 schema.cpp
+write_txn.cpp

Review comment:
   fixed





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org
For additional commands, e-mail: commits-h...@doris.apache.org



[GitHub] [incubator-doris] decster commented on a change in pull request #3637: [Memory Engine] Add MemSubTablet, MemTablet, WriteTx, PartialRowBatch

2020-05-22 Thread GitBox


decster commented on a change in pull request #3637:
URL: https://github.com/apache/incubator-doris/pull/3637#discussion_r429256669



##
File path: be/src/olap/memory/mem_sub_tablet.cpp
##
@@ -0,0 +1,235 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "olap/memory/mem_sub_tablet.h"
+
+#include "olap/memory/column.h"
+#include "olap/memory/column_reader.h"
+#include "olap/memory/column_writer.h"
+#include "olap/memory/hash_index.h"
+#include "olap/memory/partial_row_batch.h"
+#include "olap/memory/schema.h"
+
+namespace doris {
+namespace memory {
+
+Status MemSubTablet::create(uint64_t version, const Schema& schema,
+std::unique_ptr* ret) {
+std::unique_ptr tmp(new MemSubTablet());
+tmp->_versions.reserve(64);
+tmp->_versions.emplace_back(version, 0);
+tmp->_columns.resize(schema.cid_size());
+for (size_t i = 0; i < schema.num_columns(); i++) {
+// TODO: support storage_type != c.type
+auto& c = *schema.get(i);
+if (!supported(c.type())) {
+return Status::NotSupported("column type not supported");
+}
+tmp->_columns[c.cid()].reset(new Column(c, c.type(), version));
+}
+tmp.swap(*ret);
+return Status::OK();
+}
+
+MemSubTablet::MemSubTablet() : _index(new HashIndex(1 << 16)) {}
+
+MemSubTablet::~MemSubTablet() {}
+
+Status MemSubTablet::get_size(uint64_t version, size_t* size) const {
+std::lock_guard lg(_lock);
+if (version == static_cast(-1)) {
+// get latest
+*size = _versions.back().size;
+return Status::OK();
+}
+if (_versions[0].version > version) {
+return Status::NotFound("get_size failed, version too old");
+}
+for (size_t i = 1; i < _versions.size(); i++) {
+if (_versions[i].version > version) {
+*size = _versions[i - 1].size;
+return Status::OK();
+}
+}
+*size = _versions.back().size;
+return Status::OK();
+}
+
+Status MemSubTablet::read_column(uint64_t version, uint32_t cid,
+ std::unique_ptr* reader) {
+scoped_refptr cl;
+{
+std::lock_guard lg(_lock);
+if (cid < _columns.size()) {
+cl = _columns[cid];
+}
+}
+if (!cl) {
+return Status::NotFound("column not found");
+}
+return cl->create_reader(version, reader);
+}
+
+Status MemSubTablet::get_index_to_read(scoped_refptr* index) {
+*index = _index;
+return Status::OK();
+}
+
+Status MemSubTablet::begin_write(scoped_refptr* schema) {
+_schema = *schema;
+_row_size = latest_size();
+_write_index = _index;
+_writers.clear();
+_writers.resize(_columns.size());
+// precache key columns
+for (size_t i = 0; i < _schema->num_key_columns(); i++) {
+uint32_t cid = _schema->get(i)->cid();
+if (!_writers[cid]) {
+RETURN_IF_ERROR(_columns[cid]->create_writer(&_writers[cid]));
+}
+}
+_temp_hash_entries.reserve(8);
+
+// setup stats
+_write_start = GetMonoTimeSecondsAsDouble();
+_num_insert = 0;
+_num_update = 0;
+_num_update_cell = 0;
+return Status::OK();
+}
+
+Status MemSubTablet::apply_partial_row(const PartialRowReader& row) {
+DCHECK_GE(row.cell_size(), 1);
+const ColumnSchema* dsc;
+const void* key;
+// get key column and find in hash index
+// TODO: support multi-column row key
+row.get_cell(0, &dsc, &key);
+ColumnWriter* keyw = _writers[1].get();
+// find candidate rowids, and check equality
+uint64_t hashcode = keyw->hashcode(key, 0);
+_temp_hash_entries.clear();
+uint32_t newslot = _write_index->find(hashcode, &_temp_hash_entries);
+uint32_t rid = -1;
+for (size_t i = 0; i < _temp_hash_entries.size(); i++) {
+uint32_t test_rid = _temp_hash_entries[i];
+if (keyw->equals(test_rid, key, 0)) {
+rid = test_rid;
+break;
+}
+}
+// if rowkey not found, do insertion/append
+if (rid == -1) {
+_num_insert++;
+rid = _row_size;
+// add all columns
+//DLOG(INFO) << StringPrintf"insert rid=%u", rid);
+for (size_t i 

[GitHub] [incubator-doris] decster commented on a change in pull request #3637: [Memory Engine] Add MemSubTablet, MemTablet, WriteTx, PartialRowBatch

2020-05-22 Thread GitBox


decster commented on a change in pull request #3637:
URL: https://github.com/apache/incubator-doris/pull/3637#discussion_r429258167



##
File path: be/src/olap/memory/mem_sub_tablet.cpp
##
@@ -0,0 +1,247 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "olap/memory/mem_sub_tablet.h"
+
+#include "olap/memory/column.h"
+#include "olap/memory/column_reader.h"
+#include "olap/memory/column_writer.h"
+#include "olap/memory/hash_index.h"
+#include "olap/memory/partial_row_batch.h"
+#include "olap/memory/schema.h"
+
+namespace doris {
+namespace memory {
+
+Status MemSubTablet::create(uint64_t version, const Schema& schema,
+std::unique_ptr* ret) {
+std::unique_ptr tmp(new MemSubTablet());
+tmp->_versions.reserve(64);
+tmp->_versions.emplace_back(version, 0);
+tmp->_columns.resize(schema.cid_size());
+for (size_t i = 0; i < schema.num_columns(); i++) {
+// TODO: support storage_type != c.type
+auto& c = *schema.get(i);
+if (!supported(c.type())) {
+return Status::NotSupported("column type not supported");
+}
+tmp->_columns[c.cid()].reset(new Column(c, c.type(), version));
+}
+tmp.swap(*ret);
+return Status::OK();
+}
+
+MemSubTablet::MemSubTablet() : _index(new HashIndex(1 << 16)) {}
+
+MemSubTablet::~MemSubTablet() {}
+
+Status MemSubTablet::get_size(uint64_t version, size_t* size) const {
+std::lock_guard lg(_lock);
+if (version == static_cast(-1)) {
+// get latest
+*size = _versions.back().size;
+return Status::OK();
+}
+if (_versions[0].version > version) {
+return Status::NotFound("get_size failed, version too old");
+}
+for (size_t i = 1; i < _versions.size(); i++) {
+if (_versions[i].version > version) {
+*size = _versions[i - 1].size;
+return Status::OK();
+}
+}
+*size = _versions.back().size;
+return Status::OK();
+}
+
+Status MemSubTablet::read_column(uint64_t version, uint32_t cid,
+ std::unique_ptr* reader) {
+scoped_refptr cl;
+{
+std::lock_guard lg(_lock);
+if (cid < _columns.size()) {
+cl = _columns[cid];
+}
+}
+if (!cl) {
+return Status::NotFound("column not found");
+}
+return cl->create_reader(version, reader);
+}
+
+Status MemSubTablet::get_index_to_read(scoped_refptr* index) {
+*index = _index;
+return Status::OK();
+}
+
+Status MemSubTablet::begin_write(scoped_refptr* schema) {
+_schema = *schema;
+_row_size = latest_size();
+_write_index = _index;
+_writers.clear();
+_writers.resize(_columns.size());
+// precache key columns
+for (size_t i = 0; i < _schema->num_key_columns(); i++) {
+uint32_t cid = _schema->get(i)->cid();
+if (!_writers[cid]) {
+RETURN_IF_ERROR(_columns[cid]->create_writer(&_writers[cid]));
+}
+}
+_temp_hash_entries.reserve(8);
+
+// setup stats
+_write_start = GetMonoTimeSecondsAsDouble();
+_num_insert = 0;
+_num_update = 0;
+_num_update_cell = 0;
+return Status::OK();
+}
+
+Status MemSubTablet::apply_partial_row_batch(PartialRowBatch* batch) {
+while (true) {
+bool has_row = false;
+RETURN_IF_ERROR(batch->next_row(&has_row));
+if (!has_row) {
+break;
+}
+RETURN_IF_ERROR(apply_partial_row(*batch));
+}
+return Status::OK();
+}
+
+Status MemSubTablet::apply_partial_row(const PartialRowBatch& row) {
+DCHECK_GE(row.cur_row_cell_size(), 1);
+const ColumnSchema* dsc;
+const void* key;
+// get key column and find in hash index
+// TODO: support multi-column row key
+row.cur_row_get_cell(0, &dsc, &key);
+ColumnWriter* keyw = _writers[1].get();
+// find candidate rowids, and check equality
+uint64_t hashcode = keyw->hashcode(key, 0);
+_temp_hash_entries.clear();
+uint32_t newslot = _write_index->find(hashcode, &_temp_hash_entries);
+uint32_t rid = -1;
+for (size_t i = 0; i < _temp_hash_entries.size(); i++) {
+uint32_t test_rid = _temp_hash_entries[i];
+   

[GitHub] [incubator-doris] decster commented on a change in pull request #3637: [Memory Engine] Add MemSubTablet, MemTablet, WriteTx, PartialRowBatch

2020-05-22 Thread GitBox


decster commented on a change in pull request #3637:
URL: https://github.com/apache/incubator-doris/pull/3637#discussion_r429260157



##
File path: be/src/olap/memory/partial_row_batch.h
##
@@ -0,0 +1,172 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include "olap/memory/common.h"
+#include "olap/memory/schema.h"
+
+namespace doris {
+namespace memory {
+
+// A chunk of memory that stores a batch of serialized partial rows
+// User can iterate through all the partial rows, get each partial row's cells.
+//
+// Serialization format for a batch:
+// 4 byte len | serialized partial row
+// 4 byte len | serialized partial row
+// ...
+// 4 byte len | serialized partial row
+//
+// Serialization format for a partial row
+// bit vector(se + null) byte size (2 byte) |
+// bit vector mark set cells |
+// bit vector mark nullable cells' null value |
+// 8bit padding
+// serialized not null cells
+//
+// Example usage:
+// PartialRowBatch rb(&schema);
+// rb.load(buffer);
+// while (true) {
+// bool has;
+// rb.next(&has);
+// if (!has) break;
+// for (size_t j=0; j < reader.cell_size(); j++) {
+// const ColumnSchema* cs = nullptr;
+// const void* data = nullptr;
+// // get column cell type and data
+// rb.get_cell(j, &cs, &data);
+// }
+// }
+//
+// Note: currently only fixed length column types are supported. All length and scalar types store
+// in native byte order (little endian in x86-64).
+//
+// Note: The serialization format is simple, it only provides basic functionalities
+// so we can quickly complete the whole create/read/write pipeline. The format may change
+// as the project evolves.
+class PartialRowBatch {
+public:
+explicit PartialRowBatch(scoped_refptr<Schema>* schema);
+~PartialRowBatch();
+
+const Schema& schema() const { return *_schema.get(); }
+
+// Load from a serialized buffer
+Status load(std::vector<uint8_t>&& buffer);
+
+// Return row count in this batch
+size_t row_size() const { return _row_size; }
+
+// Iterate to next row, mark has_row to false if there is no more rows
+Status next_row(bool* has_row);
+
+// Get row operation cell count
+size_t cur_row_cell_size() const { return _cells.size(); }
+// Get row operation cell by index idx, return ColumnSchema and data pointer
+Status cur_row_get_cell(size_t idx, const ColumnSchema** cs, const void** data) const;
+
+private:
+scoped_refptr<Schema> _schema;
+
+bool _delete = false;
+size_t _bit_set_size = 0;
+struct CellInfo {
+CellInfo(uint32_t cid, const void* data)
+: cid(cid), data(reinterpret_cast<const uint8_t*>(data)) {}
+uint32_t cid = 0;
+const uint8_t* data = nullptr;
+};
+vector<CellInfo> _cells;
+
+size_t _next_row = 0;
+size_t _row_size = 0;
+const uint8_t* _pos = nullptr;
+std::vector<uint8_t> _buffer;
+};
+
+// Writer for PartialRowBatch
+//
+// Example usage:
+// scoped_refptr<Schema> sc;
+// Schema::create("id int,uv int,pv int,city tinyint null", &sc);
+// PartialRowWriter writer(*sc.get());
+// writer.start_batch();
+// for (auto& row : rows) {
+// writer.start_row();
+// writer.set("column_name", value);
+// ...
+// writer.set(column_id, value);
+// writer.end_row();
+// }
+// vector<uint8_t> buffer;
+// writer.end_batch(&buffer);
+class PartialRowWriter {
+public:
+static const size_t DEFAULT_BYTE_CAPACITY = 1 << 20;
+static const size_t DEFAULT_ROW_CAPACIT = 1 << 16;
+
+explicit PartialRowWriter(scoped_refptr<Schema>* schema);
+~PartialRowWriter();
+
+Status start_batch(size_t row_capacity = DEFAULT_ROW_CAPACIT,
+   size_t byte_capacity = DEFAULT_BYTE_CAPACITY);
+
+// Start writing a new row
+Status start_row();
+
+// Set cell value by column name
+// param data's memory must remain valid before calling build
+Status set(const string& col, const void* data);
+
+// Set cell value by column id
+// param data's memory must remain valid before calling build
+Status set(uint32_t cid, const void* data);
+
+// Mark this row as a delete operation
+Status set_delete();
+
+// Finish writing a row
+Status end_row();
+
+// F
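
To make the layout described above concrete, here is a minimal, self-contained sketch of how one serialized partial row could be walked. It is only an illustration of the format in the header comment, not the actual PartialRowBatch implementation; the helper name `decode_partial_row`, the `DecodedCell` struct, and the assumption that the 2-byte size field gives the byte length of each bit vector are all made up for the example.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical decoded cell: cid is the 1-based column id, data == nullptr
// means the cell was explicitly set to NULL.
struct DecodedCell {
    uint32_t cid;
    const uint8_t* data;
};

// Walk one partial row laid out as:
//   2-byte bit-vector size | "set" bits | "null" bits | padding | fixed-length cell values
// `cell_sizes` stands in for the per-column sizes the real code takes from the Schema.
// Returns the position just past this row.
const uint8_t* decode_partial_row(const uint8_t* pos,
                                  size_t num_columns,
                                  const std::vector<size_t>& cell_sizes,
                                  std::vector<DecodedCell>* cells) {
    uint16_t bit_vec_size = 0;
    std::memcpy(&bit_vec_size, pos, sizeof(bit_vec_size));  // assumed: bytes per bit vector
    pos += sizeof(bit_vec_size);
    const uint8_t* set_bits = pos;                  // which cells are present in this row
    const uint8_t* null_bits = pos + bit_vec_size;  // which present cells carry NULL
    pos += 2 * bit_vec_size;                        // both bit vectors are byte padded
    for (uint32_t cid = 1; cid <= num_columns; ++cid) {
        size_t idx = cid - 1;
        if (!(set_bits[idx >> 3] & (1u << (idx & 7)))) continue;  // cell not set
        if (null_bits[idx >> 3] & (1u << (idx & 7))) {
            cells->push_back({cid, nullptr});                     // explicit NULL
        } else {
            cells->push_back({cid, pos});                         // fixed-length value
            pos += cell_sizes[idx];
        }
    }
    return pos;
}
```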

[GitHub] [incubator-doris] morningman closed issue #3646: [Enhancement][Txn] Add more info to find way publish failed.

2020-05-22 Thread GitBox


morningman closed issue #3646:
URL: https://github.com/apache/incubator-doris/issues/3646


   






[incubator-doris] branch master updated: [Enhancement] Add detail msg to show the reason of publish failure. (#3647)

2020-05-22 Thread morningman
This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git


The following commit(s) were added to refs/heads/master by this push:
 new 1124808  [Enhancement] Add detail msg to show the reason of publish failure. (#3647)
1124808 is described below

commit 1124808fbc22d1544cc4b83fd7cfee23af0c01fb
Author: Mingyu Chen 
AuthorDate: Fri May 22 22:59:53 2020 +0800

[Enhancement] Add detail msg to show the reason of publish failure. (#3647)

Add 2 new columns `PublishTime` and `ErrMsg` to show the publish version time
and the errors that happen during the transaction process. They can be seen by executing:

`SHOW PROC "/transactions/dbId/";`
or
`SHOW TRANSACTION WHERE ID=xx;`

Currently it only records errors that happen in the publish phase, which can help us
to find out which txn is blocked.

Fix #3646
---
 .../org/apache/doris/common/proc/TransProcDir.java |  2 ++
 .../doris/transaction/DatabaseTransactionMgr.java  | 32 +++---
 .../doris/transaction/PublishVersionDaemon.java|  4 ++-
 .../apache/doris/transaction/TransactionState.java | 19 -
 .../transaction/DatabaseTransactionMgrTest.java|  9 +++---
 5 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/fe/src/main/java/org/apache/doris/common/proc/TransProcDir.java b/fe/src/main/java/org/apache/doris/common/proc/TransProcDir.java
index 0cef447..118cc85 100644
--- a/fe/src/main/java/org/apache/doris/common/proc/TransProcDir.java
+++ b/fe/src/main/java/org/apache/doris/common/proc/TransProcDir.java
@@ -35,11 +35,13 @@ public class TransProcDir implements ProcDirInterface {
 .add("LoadJobSourceType")
 .add("PrepareTime")
 .add("CommitTime")
+.add("PublishTime")
 .add("FinishTime")
 .add("Reason")
 .add("ErrorReplicasCount")
 .add("ListenerId")
 .add("TimeoutMs")
+.add("ErrMsg")
 .build();
 
 public static final int MAX_SHOW_ENTRIES = 2000;
diff --git a/fe/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java b/fe/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java
index fef801f..ac2764e 100644
--- a/fe/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java
+++ b/fe/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java
@@ -17,13 +17,6 @@
 
 package org.apache.doris.transaction;
 
-import com.google.common.annotations.VisibleForTesting;
-import com.google.common.base.Joiner;
-import com.google.common.base.Preconditions;
-import com.google.common.collect.Lists;
-import com.google.common.collect.Maps;
-import com.google.common.collect.Sets;
-import org.apache.commons.collections.CollectionUtils;
 import org.apache.doris.catalog.Catalog;
 import org.apache.doris.catalog.Database;
 import org.apache.doris.catalog.MaterializedIndex;
@@ -44,9 +37,9 @@ import org.apache.doris.common.LabelAlreadyUsedException;
 import org.apache.doris.common.LoadException;
 import org.apache.doris.common.MetaNotFoundException;
 import org.apache.doris.common.Pair;
+import org.apache.doris.common.UserException;
 import org.apache.doris.common.util.DebugUtil;
 import org.apache.doris.common.util.TimeUtils;
-import org.apache.doris.common.UserException;
 import org.apache.doris.common.util.Util;
 import org.apache.doris.metric.MetricRepo;
 import org.apache.doris.mysql.privilege.PrivPredicate;
@@ -59,6 +52,15 @@ import org.apache.doris.task.ClearTransactionTask;
 import org.apache.doris.task.PublishVersionTask;
 import org.apache.doris.thrift.TTaskType;
 import org.apache.doris.thrift.TUniqueId;
+
+import com.google.common.annotations.VisibleForTesting;
+import com.google.common.base.Joiner;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Sets;
+
+import org.apache.commons.collections.CollectionUtils;
 import org.apache.logging.log4j.LogManager;
 import org.apache.logging.log4j.Logger;
 
@@ -231,11 +233,13 @@ public class DatabaseTransactionMgr {
 info.add(txnState.getSourceType().name());
 info.add(TimeUtils.longToTimeString(txnState.getPrepareTime()));
 info.add(TimeUtils.longToTimeString(txnState.getCommitTime()));
+info.add(TimeUtils.longToTimeString(txnState.getPublishVersionTime()));
 info.add(TimeUtils.longToTimeString(txnState.getFinishTime()));
 info.add(txnState.getReason());
 info.add(String.valueOf(txnState.getErrorReplicas().size()));
 info.add(String.valueOf(txnState.getCallbackId()));
 info.add(String.valueOf(txnState.getTimeoutMs()));
+info.add(txnState.getErrMsg());
 }
 
 public long beginTransaction(List<Long> tableIdList, String label, TUniqueId requestId,
@@ -579,8 +583,8 @@ public

[GitHub] [incubator-doris] morningman merged pull request #3647: [Enhancement] Add detail msg to show the reason of publish failure.

2020-05-22 Thread GitBox


morningman merged pull request #3647:
URL: https://github.com/apache/incubator-doris/pull/3647


   






[incubator-doris] branch master updated (1124808 -> ef9c716)

2020-05-22 Thread morningman
This is an automated email from the ASF dual-hosted git repository.

morningman pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git.


from 1124808  [Enhancement] Add detail msg to show the reason of publish failure. (#3647)
 add 74fb1b8  [Bug] Fix bug that missing OP_SET_REPLICA_STATUS when reading journal
 new ef9c716  [Bug] Fix bug that missing OP_SET_REPLICA_STATUS when reading journal (#3662)

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 fe/src/main/java/org/apache/doris/journal/JournalEntity.java | 6 ++
 1 file changed, 6 insertions(+)





[incubator-doris] 01/01: [Bug] Fix bug that missing OP_SET_REPLICA_STATUS when reading journal (#3662)

2020-05-22 Thread morningman
This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git

commit ef9c716682dd943cac5aa48618f108c9b33c9efb
Merge: 1124808 74fb1b8
Author: Mingyu Chen 
AuthorDate: Fri May 22 23:04:47 2020 +0800

[Bug] Fix bug that missing OP_SET_REPLICA_STATUS when reading journal (#3662)

 fe/src/main/java/org/apache/doris/journal/JournalEntity.java | 6 ++
 1 file changed, 6 insertions(+)





[GitHub] [incubator-doris] morningman closed issue #3663: [Bug] FE crash after executing ADMIN SET REPLICA STATUS

2020-05-22 Thread GitBox


morningman closed issue #3663:
URL: https://github.com/apache/incubator-doris/issues/3663


   






[GitHub] [incubator-doris] morningman merged pull request #3662: [Bug] Fix bug that missing OP_SET_REPLICA_STATUS when reading journal

2020-05-22 Thread GitBox


morningman merged pull request #3662:
URL: https://github.com/apache/incubator-doris/pull/3662


   






[GitHub] [incubator-doris] morningman commented on a change in pull request #3584: [OUTFILE] Support `INTO OUTFILE` to export query result

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3584:
URL: https://github.com/apache/incubator-doris/pull/3584#discussion_r429319050



##
File path: docs/zh-CN/administrator-guide/outfile.md
##
@@ -0,0 +1,183 @@
+---
+{
+"title": "导出查询结果集",
+"language": "zh-CN"
+}
+---
+
+
+
+# Export Query Result Set
+
+This document describes how to use the `SELECT INTO OUTFILE` command to export query results.
+
+## Syntax
+
+The `SELECT INTO OUTFILE` statement exports query results to files. Currently, results can only be exported to remote storage such as HDFS, S3, or BOS through the Broker process. The syntax is as follows:
+
+```
+query_stmt
+INTO OUTFILE "file_path"
+[format_as]
+WITH BROKER `broker_name`
+[broker_properties]
+[other_properties]
+```
+
+* `file_path`
+
+`file_path` specifies the storage path and the file name prefix, e.g. `hdfs://path/to/my_file`.
+
+The final file names consist of `my_file`, a file sequence number, and the format suffix. The sequence number starts from 0, and the number of files equals the number of pieces the result is split into. For example:
+
+```
+my_file_0.csv
+my_file_1.csv
+my_file_2.csv
+```
+
+* `[format_as]`
+
+```
+FORMAT AS CSV
+```
+
+Specifies the export format. Defaults to CSV.
+
+* `[broker_properties]`
+
+```
+("broker_prop_key" = "broker_prop_val", ...)
+``` 
+
+Broker-related parameters, such as HDFS authentication information. See the [Broker documentation](./broker.html) for details.
+
+* `[other_properties]`
+
+```
+("key1" = "val1", "key2" = "val2", ...)
+```
+
+Other properties. Currently the following are supported:
+
+* `column_separator`: column separator, only applicable to the CSV format. Defaults to `\t`.
+* `line_delimiter`: line delimiter, only applicable to the CSV format. Defaults to `\n`.
+* `max_file_size_bytes`: maximum size of a single file. Defaults to 1GB. The valid range is 5MB to 2GB. Files exceeding this size will be split.
+
+1. Example 1
+
+Export the result of a simple query to the file `hdfs:/path/to/result.txt`. Specify CSV as the export format. Use `my_broker` and set Kerberos authentication information. Set the column separator to `,` and the line delimiter to `\n`.
+
+```
+SELECT * FROM tbl
+INTO OUTFILE "hdfs:/path/to/result"
+FORMAT AS CSV
+WITH BROKER "my_broker"
+(
+"hadoop.security.authentication" = "kerberos",
+"kerberos_principal" = "do...@your.com",
+"kerberos_keytab" = "/home/doris/my.keytab"
+)
+PROPERTIES
+(
+"column_separator" = ",",
+"line_delimiter" = "\n",
+"max_file_size_bytes" = "100MB"

Review comment:
   OK

##
File path: docs/zh-CN/administrator-guide/outfile.md
##
@@ -0,0 +1,183 @@
+---
+{
+"title": "导出查询结果集",
+"language": "zh-CN"
+}
+---
+
+
+
+# Export Query Result Set
+
+This document describes how to use the `SELECT INTO OUTFILE` command to export query results.
+
+## Syntax
+
+The `SELECT INTO OUTFILE` statement exports query results to files. Currently, results can only be exported to remote storage such as HDFS, S3, or BOS through the Broker process. The syntax is as follows:
+
+```
+query_stmt
+INTO OUTFILE "file_path"
+[format_as]
+WITH BROKER `broker_name`
+[broker_properties]
+[other_properties]
+```
+
+* `file_path`
+
+`file_path` specifies the storage path and the file name prefix, e.g. `hdfs://path/to/my_file`.
+
+The final file names consist of `my_file`, a file sequence number, and the format suffix. The sequence number starts from 0, and the number of files equals the number of pieces the result is split into. For example:
+
+```
+my_file_0.csv

Review comment:
   OK

##
File path: docs/zh-CN/administrator-guide/outfile.md
##
@@ -0,0 +1,183 @@
+---
+{
+"title": "导出查询结果集",
+"language": "zh-CN"
+}
+---
+
+
+
+# Export Query Result Set
+
+This document describes how to use the `SELECT INTO OUTFILE` command to export query results.
+
+## Syntax
+
+The `SELECT INTO OUTFILE` statement exports query results to files. Currently, results can only be exported to remote storage such as HDFS, S3, or BOS through the Broker process. The syntax is as follows:
+
+```
+query_stmt
+INTO OUTFILE "file_path"
+[format_as]
+WITH BROKER `broker_name`

Review comment:
   OK








[GitHub] [incubator-doris] morningman commented on a change in pull request #3584: [OUTFILE] Support `INTO OUTFILE` to export query result

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3584:
URL: https://github.com/apache/incubator-doris/pull/3584#discussion_r429319147



##
File path: docs/zh-CN/administrator-guide/outfile.md
##
@@ -0,0 +1,183 @@
+---
+{
+"title": "导出查询结果集",
+"language": "zh-CN"
+}
+---
+
+
+
+# Export Query Result Set
+
+This document describes how to use the `SELECT INTO OUTFILE` command to export query results.
+
+## Syntax
+
+The `SELECT INTO OUTFILE` statement exports query results to files. Currently, results can only be exported to remote storage such as HDFS, S3, or BOS through the Broker process. The syntax is as follows:
+
+```
+query_stmt
+INTO OUTFILE "file_path"
+[format_as]
+WITH BROKER `broker_name`

Review comment:
   OK, the new syntax has been updated in the proposal.








[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

2020-05-22 Thread GitBox


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r429501802



##
File path: docs/zh-CN/administrator-guide/resource-management.md
##
@@ -0,0 +1,125 @@
+---
+{
+"title": "资源管理",
+"language": "zh-CN"
+}
+---
+
+
+
+# Resource Management
+
+To save compute and storage resources inside the Doris cluster, Doris needs to bring in external resources to do related work, such as Spark/GPU for queries, HDFS/S3 for external storage, and Spark/MapReduce for ETL. Therefore we introduce a resource management mechanism to manage these external resources used by Doris.
+
+
+
+## Basic Concepts
+
+A resource contains basic information such as its name and type. The name is globally unique, and different types of resources have different properties; see the introduction of each resource for details.
+
+Only users with the `admin` privilege can create and drop resources. A resource belongs to the whole Doris cluster. A user with the `admin` privilege can grant the usage privilege `usage_priv` to ordinary users. See `HELP GRANT` or the privilege documentation.
+
+
+
+## Operations
+
+Resource management mainly provides three commands: `CREATE RESOURCE`, `DROP RESOURCE`, and `SHOW RESOURCES`, which create, drop, and show resources respectively. After connecting to Doris with a MySQL client, you can view the detailed syntax of these commands by executing `HELP cmd`.
+
+1. CREATE RESOURCE
+
+   Syntax
+
+   ```sql
+   CREATE [EXTERNAL] RESOURCE "resource_name"  
+ PROPERTIES ("key"="value", ...); 
+   ```
+
+   In the command to create a resource, the user must provide the following information:
+
+   * `resource_name` is the name of the resource configured in Doris.
+   * `PROPERTIES` are the resource-related parameters, as follows:
+ * `type`: resource type, required. Currently only spark is supported.
+ * For other parameters, see the introduction of each resource.
+
+2. DROP RESOURCE
+
+   This command drops an existing resource. For details, see: `HELP DROP RESOURCE`
+
+3. SHOW RESOURCES
+
+   This command shows the resources on which the user has usage privilege. For details, see: `HELP SHOW RESOURCES`
+
+
+
+## Supported Resources
+
+Currently only the Spark resource is supported, used to do the ETL work. The examples below all use the Spark resource.
+
+### Spark
+
+ Parameters
+
+# Spark-related parameters are as follows:
+
+`spark.master`: required. Currently yarn and spark://host:port are supported.
+
+`spark.submit.deployMode`: deployment mode of the Spark program, required. Supports cluster and client.
+
+`spark.hadoop.yarn.resourcemanager.address`: required when master is yarn.
+
+`spark.hadoop.fs.defaultFS`: required when master is yarn.
+
+Other parameters are optional; see http://spark.apache.org/docs/latest/configuration.html.
+
+
+
+# If Spark is used for ETL, the following parameters also need to be specified:
+
+`working_dir`: directory used by ETL. Required when Spark is used as an ETL resource. For example: hdfs://host:port/tmp/doris.
+
+`broker`: broker name. Required when Spark is used as an ETL resource. It needs to be configured in advance with the `ALTER SYSTEM ADD BROKER` command.
+
+  * `broker.property_key`: authentication information and other properties that need to be specified when the broker reads the intermediate files generated by ETL.
+
+
+
+ Example
+
+Create a Spark resource named spark0 in yarn cluster mode.
+
+```sql
+CREATE EXTERNAL RESOURCE "spark0"
+PROPERTIES
+(
+  "type" = "spark",
+  "spark.master" = "yarn",
+  "spark.submit.deployMode" = "cluster",
+  "spark.jars" = "xxx.jar,yyy.jar",
+  "spark.files" = "/tmp/aaa,/tmp/bbb",
+  "spark.executor.memory" = "1g",
+  "spark.yarn.queue" = "queue0",
+  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:",
+  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:1",
+  "working_dir" = "hdfs://127.0.0.1:1/tmp/doris",

Review comment:
   Not all users have only one cluster, so I think we can't load configurations from a single `HADOOP_HOME` source.
   Now users only need to specify `defaultFS` once, when creating a new Spark resource.








[GitHub] [incubator-doris] chaoyli commented on a change in pull request #3661: [Optimize] Using sorted schema change processing to merge the data

2020-05-22 Thread GitBox


chaoyli commented on a change in pull request #3661:
URL: https://github.com/apache/incubator-doris/pull/3661#discussion_r429507733



##
File path: be/src/olap/schema_change.cpp
##
@@ -1859,6 +1859,29 @@ OLAPStatus SchemaChangeHandler::_parse_request(TabletSharedPtr base_tablet,
 // If the reference order of the key columns is changed, a re-sort is needed
 int num_default_value = 0;
 
+// A, B, C are keys(sort keys), D is value
+// The following cases are not changing the order, no need to resort:
+// (sort keys keep in same order)
+//  old keys:A   B   C   D
+//  new keys:A   X   B   C   D
+//
+//  old keys:A   B   C   D
+//  new keys:X   A   B   C   D
+//
+//  old keys:A   B   C   D
+//  new keys:A   B   C
+//
+//  old keys:A   B   C   D
+//  new keys:A   B

Review comment:
   After this pull request, the case
   old keys: A B C D
   new keys: A B C
   will be resorted.
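
   For reference, the "sort keys keep in same order" condition being discussed can be pictured with a small, self-contained check: it passes only when the old key columns that are still referenced among the new keys keep their relative order. This is just a sketch of that idea, not the logic in `_parse_request`; as this thread points out, the real decision also has to resort in some of the listed cases (for example when key columns are dropped), and it works on column mappings rather than names. The function name and the use of plain column names are assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Returns true if every old sort-key column that is still referenced by the new
// sort keys appears in the same relative order as before; brand-new key columns
// may be inserted anywhere without breaking that order.
bool keys_keep_relative_order(const std::vector<std::string>& old_keys,
                              const std::vector<std::string>& new_keys) {
    size_t old_pos = 0;  // next old key we expect to see
    for (const std::string& key : new_keys) {
        size_t found = old_pos;
        while (found < old_keys.size() && old_keys[found] != key) ++found;
        if (found == old_keys.size()) {
            // Not among the remaining old keys: either a brand-new column (fine)
            // or an old key that has been moved ahead of its turn (order broken).
            for (size_t i = 0; i < old_pos; ++i) {
                if (old_keys[i] == key) return false;  // moved out of order -> resort
            }
            continue;
        }
        old_pos = found + 1;  // matched in order, keep scanning from here
    }
    return true;
}
```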








[GitHub] [incubator-doris] chaoyli commented on pull request #3666: Del some useless code and impove the performance of sort

2020-05-22 Thread GitBox


chaoyli commented on pull request #3666:
URL: https://github.com/apache/incubator-doris/pull/3666#issuecomment-632975491


   It's better to post the performance result.






[GitHub] [incubator-doris] morningman commented on pull request #3604: (#3464) fix Query failed when fact table has no data in join case

2020-05-22 Thread GitBox


morningman commented on pull request #3604:
URL: https://github.com/apache/incubator-doris/pull/3604#issuecomment-632976533


   > 1 make HASH_TBL_SPACE_OVERHEAD a configurable param,but it seems not a 
common solution, because I'm not sure what factors can affect the real hash 
table size.
   > 2 maybe we can disable broadcast when dim table exceeds 100M
   
   These 2 solutions look similar in that both require manual parameter adjustment. This is no different from directly adding hints manually.






[GitHub] [incubator-doris] morningman commented on a change in pull request #3661: [Optimize] Using sorted schema change processing to merge the data

2020-05-22 Thread GitBox


morningman commented on a change in pull request #3661:
URL: https://github.com/apache/incubator-doris/pull/3661#discussion_r429508808



##
File path: be/src/olap/schema_change.cpp
##
@@ -1859,6 +1859,29 @@ OLAPStatus SchemaChangeHandler::_parse_request(TabletSharedPtr base_tablet,
 // If the reference order of the key columns is changed, a re-sort is needed
 int num_default_value = 0;
 
+// A, B, C are keys(sort keys), D is value
+// The following cases are not changing the order, no need to resort:
+// (sort keys keep in same order)
+//  old keys:A   B   C   D
+//  new keys:A   X   B   C   D
+//
+//  old keys:A   B   C   D
+//  new keys:X   A   B   C   D
+//
+//  old keys:A   B   C   D
+//  new keys:A   B   C
+//
+//  old keys:A   B   C   D
+//  new keys:A   B

Review comment:
   Sorry, this PR is not ready yet; I will consider more situations.








[GitHub] [incubator-doris] morningman commented on pull request #3584: [OUTFILE] Support `INTO OUTFILE` to export query result

2020-05-22 Thread GitBox


morningman commented on pull request #3584:
URL: https://github.com/apache/incubator-doris/pull/3584#issuecomment-632980410


   Hi @imay, please review.






[GitHub] [incubator-doris] HappenLee commented on pull request #3666: Del some useless code and impove the performance of sort

2020-05-22 Thread GitBox


HappenLee commented on pull request #3666:
URL: https://github.com/apache/incubator-doris/pull/3666#issuecomment-632982262


   @chaoyli 
   OK! Here is the performance result.
   
   ### ENV
   
  CPU : Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz 6cores * 2 
  RAM : 96GB
   
   ### Data
   Random numbers from 1 to 50 million
   
   ### Test Result
   
   Order of arrangement | origin |  find_the_median | Percentage increase
   :--:|:--:|:--:|:--:
   Order | 41s985ms | 39s389ms | 6.18%
   Reverse | 36s854ms | 34s376ms | 6.67%
   Random Shuffle |36s520ms | 33s888ms | 7.2%
   
   Doris's optimizer is not powerful enough yet. I tested the scenario where the ORDER BY column cardinality is small and the repetition rate is high. If we could intelligently choose three-way quicksort, there would be five to ten times better performance. What a pity.
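
   For context on the two techniques mentioned above, the sketch below shows generic median-of-three pivot selection ("find_the_median" style) combined with three-way, Dutch-national-flag partitioning. It is not the Doris sorter code, only an illustration of why grouping all copies of the pivot in one pass pays off when the ORDER BY column has low cardinality: the middle partition never enters another recursion level.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Three-way quicksort with a median-of-three pivot. After partitioning,
// [lo, lt) < pivot, [lt, gt] == pivot, (gt, hi] > pivot.
void quicksort3(std::vector<int>& v, int lo, int hi) {
    if (lo >= hi) return;
    int mid = lo + (hi - lo) / 2;
    // median of v[lo], v[mid], v[hi] as the pivot value
    int pivot = std::max(std::min(v[lo], v[mid]),
                         std::min(std::max(v[lo], v[mid]), v[hi]));
    int lt = lo, i = lo, gt = hi;
    while (i <= gt) {
        if (v[i] < pivot)      std::swap(v[lt++], v[i++]);
        else if (v[i] > pivot) std::swap(v[i], v[gt--]);
        else                   ++i;
    }
    quicksort3(v, lo, lt - 1);   // strictly-less part
    quicksort3(v, gt + 1, hi);   // strictly-greater part; pivot duplicates are done
}
```

   With many duplicate keys the middle partition absorbs every copy of the pivot at once, which is where the large gains for low-cardinality columns would come from.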
   
   
   
   
   






[GitHub] [incubator-doris] decster opened a new issue #3669: [Memory Engine] MemTablet creation and compatibility handling in BE

2020-05-22 Thread GitBox


decster opened a new issue #3669:
URL: https://github.com/apache/incubator-doris/issues/3669


   After adding meta support, BE can now create a MemTablet and put it into TabletManager, but MemTablet is not compatible with Tablet, so a lot of code may break. I propose the following changes:
   
   Step 1: when TabletManager::get_tablet is called, only return Tablet; if the underlying tablet is a MemTablet, return an error. This keeps the initial code changes small.
   
   Step 2: for the methods/functionalities that Tablet and MemTablet both have, refactor/extract them into BaseTablet and change the call sites to TabletManager::get_base_tablet. This will gradually remove the errors introduced by step 1; it is a long-running process.
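
   A minimal sketch of what that two-step interface could look like, assuming hypothetical names for everything not mentioned in this issue (the simplified Status, the lookup helper): get_tablet keeps returning only the classic Tablet and reports an error for a MemTablet, while get_base_tablet exposes the shared BaseTablet that refactored call sites migrate to.

```cpp
#include <cstdint>
#include <memory>

class BaseTablet {                       // shared methods get extracted here over time
public:
    virtual ~BaseTablet() = default;
};
class Tablet : public BaseTablet {};     // existing disk-based tablet
class MemTablet : public BaseTablet {};  // new memory-engine tablet

struct Status {                          // stand-in for the real Status type
    bool ok = true;
    static Status OK() { return {true}; }
    static Status NotSupported() { return {false}; }
};

class TabletManager {
public:
    // Step 1: callers that only understand Tablet keep using this; a MemTablet
    // underneath is reported as an error instead of being silently mishandled.
    Status get_tablet(int64_t tablet_id, std::shared_ptr<Tablet>* out) {
        auto tablet = std::dynamic_pointer_cast<Tablet>(lookup(tablet_id));
        if (!tablet) return Status::NotSupported();
        *out = tablet;
        return Status::OK();
    }

    // Step 2: refactored call sites move to this and work against BaseTablet.
    Status get_base_tablet(int64_t tablet_id, std::shared_ptr<BaseTablet>* out) {
        *out = lookup(tablet_id);
        return Status::OK();
    }

private:
    // Placeholder for the real tablet map lookup.
    std::shared_ptr<BaseTablet> lookup(int64_t /*tablet_id*/) { return nullptr; }
};
```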
   






[GitHub] [incubator-doris] caiconghui opened a new pull request #3670: Allow user to set thrift_client_timeout_ms config for thrift server

2020-05-22 Thread GitBox


caiconghui opened a new pull request #3670:
URL: https://github.com/apache/incubator-doris/pull/3670


The value of thrift_client_timeout_ms should be set larger than zero to prevent hang-up problems in java.net.SocketInputStream.socketRead0






[GitHub] [incubator-doris] caiconghui commented on pull request #3670: Allow user to set thrift_client_timeout_ms config for thrift server

2020-05-22 Thread GitBox


caiconghui commented on pull request #3670:
URL: https://github.com/apache/incubator-doris/pull/3670#issuecomment-632994065


   for #3671 






[GitHub] [incubator-doris] caiconghui opened a new issue #3671: Many thrift-server-pool threads stuck in java.net.SocketInputStream.socketRead0

2020-05-22 Thread GitBox


caiconghui opened a new issue #3671:
URL: https://github.com/apache/incubator-doris/issues/3671


   Currently, the connection timeout and socket timeout are set to 0, which may cause many threads to get stuck in java.net.SocketInputStream.socketRead0 and never be reused.
   The problem looks like the following:
   "thrift-server-pool-2493" #2999 daemon prio=5 os_prio=0 
tid=0x7f77341e2000 nid=0x3b5c runnable [0x7f750501e000]
  java.lang.Thread.State: RUNNABLE
   at java.net.SocketInputStream.socketRead0(Native Method)
   at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
   at java.net.SocketInputStream.read(SocketInputStream.java:171)
   at java.net.SocketInputStream.read(SocketInputStream.java:141)
   at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
   at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
   at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
   - locked <0x00071d55df60> (a java.io.BufferedInputStream)
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
   at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
   at 
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
   at 
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
   at 
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
   at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   
   When a thrift client crashes or encounters a network problem, the thrift server pool thread may get stuck in java.net.SocketInputStream.socketRead0, which causes a connection leak, so we need a thrift_client_timeout_ms config to prevent this problem.
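
   The actual fix is on the FE side (configuring the thrift server in Java), but the underlying mechanism is simply a bounded blocking read. As a generic illustration using POSIX sockets (an assumption for the example, not the Doris change itself), giving the socket a receive timeout makes a read on a dead peer fail instead of blocking the worker thread forever:

```cpp
#include <sys/socket.h>
#include <sys/time.h>

// Set a receive timeout on a connected socket so that a blocking read returns
// with EAGAIN/EWOULDBLOCK after timeout_ms instead of hanging indefinitely.
bool set_read_timeout_ms(int fd, long timeout_ms) {
    timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) == 0;
}
```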


