This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch dev-1.0.1
in repository https://gitbox.apache.org/repos/asf/incubator-doris.git
commit a6787055fedf2d18511346677555fa1df8364706
Author: Mingyu Chen <[email protected]>
AuthorDate: Sat Mar 19 15:45:17 2022 +0800

    [doc] Fix some typo about spark load and broker load (#8520)
    
    1. add hive-bitmap-udf link
    2. modify preceding-filter
---
 docs/en/administrator-guide/load-data/spark-load-manual.md   |  8 +++++---
 .../sql-statements/Data Manipulation/BROKER LOAD.md          | 12 ++++++------
 .../zh-CN/administrator-guide/load-data/spark-load-manual.md | 10 +++-------
 .../sql-statements/Data Manipulation/BROKER LOAD.md          | 12 ++++++------
 4 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/docs/en/administrator-guide/load-data/spark-load-manual.md b/docs/en/administrator-guide/load-data/spark-load-manual.md
index 00c5922..062e182 100644
--- a/docs/en/administrator-guide/load-data/spark-load-manual.md
+++ b/docs/en/administrator-guide/load-data/spark-load-manual.md
@@ -92,8 +92,6 @@ The implementation of spark load task is mainly divided into the following five
 
 ```
 
-
-
 ## Global dictionary
 
 ### Applicable scenarios
@@ -132,6 +130,10 @@ In the existing Doris import process, the data structure of global dictionary is
 
 6. Subsequent brokers will pull the files in HDFS and import them into Doris be.
 
+## Hive Bitmap UDF
+
+Spark supports loading hive-generated bitmap data directly into Doris, see [hive-bitmap-udf documentation](../../extending-doris/hive-bitmap-udf.md)
+
 ## Basic operation
 
 ### Configure ETL cluster
@@ -627,4 +629,4 @@ If `spark_resource_path` is not set correctly. An error `file XXX/jars/spark-2x.
 
 * When using spark load `yarn_client_path` does not point to a executable file of yarn.
 
-If `yarn_client_path` is not set correctly. An error `yarn client does not exist in path: XXX/yarn-client/hadoop/bin/yarn` will be reported.
\ No newline at end of file
+If `yarn_client_path` is not set correctly. An error `yarn client does not exist in path: XXX/yarn-client/hadoop/bin/yarn` will be reported.
diff --git a/docs/en/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md b/docs/en/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
index 8e062b2..abebcca 100644
--- a/docs/en/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
+++ b/docs/en/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
@@ -69,8 +69,8 @@ under the License.
     [COLUMNS TERMINATED BY "column_separator"]
     [FORMAT AS "file_type"]
     [(column_list)]
-    [PRECEDING FILTER predicate]
     [SET (k1 = func(k2))]
+    [PRECEDING FILTER predicate]
     [WHERE predicate]
     [DELETE ON label=true]
     [read_properties]
@@ -110,10 +110,6 @@ under the License.
     syntax:
     (col_name1, col_name2, ...)
 
-    PRECEDING FILTER predicate:
-
-    Used to filter original data. The original data is the data without column mapping and transformation. The user can filter the data before conversion, select the desired data, and then perform the conversion.
-
     SET:
 
     If this parameter is specified, a column of the source file can be transformed according to a function, and then the transformed result can be loaded into the table. The grammar is `column_name = expression`. Some examples are given to help understand.
@@ -122,6 +118,10 @@ under the License.
 
     Example 2: There are three columns "year, month, day" in the table. There is only one time column in the source file, in the format of "2018-06-01:02:03". Then you can specify columns (tmp_time) set (year = year (tmp_time), month = month (tmp_time), day = day (tmp_time)) to complete the import.
 
+    PRECEDING FILTER predicate:
+
+    Used to filter original data. The original data is the data without column mapping and transformation. The user can filter the data before conversion, select the desired data, and then perform the conversion.
+
     WHERE:
 
     After filtering the transformed data, data that meets where predicates can be loaded. Only column names in tables can be referenced in WHERE statements.
@@ -534,8 +534,8 @@ under the License.
     INTO TABLE `tbl1`
     COLUMNS TERMINATED BY ","
     (k1,k2,v1,v2)
-    PRECEDING FILTER k1 > 2
     SET (k1 = k1 +1)
+    PRECEDING FILTER k1 > 2
     WHERE k1 > 3
     ) with BROKER "hdfs" ("username"="user", "password"="pass");
 
diff --git a/docs/zh-CN/administrator-guide/load-data/spark-load-manual.md b/docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
index c5185a5..4f16662 100644
--- a/docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
+++ b/docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
@@ -30,15 +30,11 @@ Spark load 通过外部的 Spark 资源实现对导入数据的预处理,提
 
 Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 Spark 类型导入任务,并通过 `SHOW LOAD` 查看导入结果。
 
-
-
 ## 适用场景
 
 * 源数据在 Spark 可以访问的存储系统中,如 HDFS。
 * 数据量在 几十 GB 到 TB 级别。
 
-
-
 ## 名词解释
 
 1. Frontend(FE):Doris 系统的元数据和调度节点。在导入流程中主要负责导入任务的调度工作。
@@ -47,7 +43,6 @@ Spark load 是一种异步导入方式,用户需要通过 MySQL 协议创建 S
 4. Broker:Broker 为一个独立的无状态进程。封装了文件系统接口,提供 Doris 读取远端存储系统中文件的能力。
 5. 全局字典: 保存了数据从原始值到编码值映射的数据结构,原始值可以是任意数据类型,而编码后的值为整型;全局字典主要应用于精确去重预计算的场景。
 
-
 ## 基本原理
 
 ### 基本流程
@@ -87,8 +82,6 @@ Spark load 任务的执行主要分为以下5个阶段。
 
 ```
 
-
-
 ## 全局字典
 ### 适用场景
 目前Doris中Bitmap列是使用类库```Roaringbitmap```实现的,而```Roaringbitmap```的输入数据类型只能是整型,因此如果要在导入流程中实现对于Bitmap列的预计算,那么就需要将输入数据的类型转换成整型。
@@ -111,6 +104,9 @@
 5. 每次完成聚合计算后,会对数据根据`bucket_id`进行分桶然后写入HDFS中。
 6. 后续broker会拉取HDFS中的文件然后导入Doris Be中。
 
+## Hive Bitmap UDF
+
+Spark 支持将 hive 生成的 bitmap 数据直接导入到 Doris。详见 [hive-bitmap-udf 文档](../../extending-doris/hive-bitmap-udf.md)
 
 ## 基本操作
 
diff --git a/docs/zh-CN/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md b/docs/zh-CN/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
index e3b9f44..52613df 100644
--- a/docs/zh-CN/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
+++ b/docs/zh-CN/sql-reference/sql-statements/Data Manipulation/BROKER LOAD.md
@@ -68,8 +68,8 @@ under the License.
     [COLUMNS TERMINATED BY "column_separator"]
    [FORMAT AS "file_type"]
     [(column_list)]
-    [PRECEDING FILTER predicate]
     [SET (k1 = func(k2))]
+    [PRECEDING FILTER predicate]
     [WHERE predicate]
     [DELETE ON label=true]
     [ORDER BY source_sequence]
@@ -106,10 +106,6 @@ under the License.
    语法:
    (col_name1, col_name2, ...)
 
-    PRECEDING FILTER predicate:
-
-    用于过滤原始数据。原始数据是未经列映射、转换的数据。用户可以在对转换前的数据前进行一次过滤,选取期望的数据,再进行转换。
-
    SET:
 
    如果指定此参数,可以将源文件某一列按照函数进行转化,然后将转化后的结果导入到table中。语法为 `column_name` = expression。举几个例子帮助理解。
@@ -117,6 +113,10 @@ under the License.
    例2: 表中有3个列“year, month, day"三个列,源文件中只有一个时间列,为”2018-06-01 01:02:03“格式。
    那么可以指定 columns(tmp_time) set (year = year(tmp_time), month=month(tmp_time), day=day(tmp_time)) 完成导入。
 
+    PRECEDING FILTER predicate:
+
+    用于过滤原始数据。原始数据是未经列映射、转换的数据。用户可以在对转换前的数据前进行一次过滤,选取期望的数据,再进行转换。
+
    WHERE:
 
    对做完 transform 的数据进行过滤,符合 where 条件的数据才能被导入。WHERE 语句中只可引用表中列名。
@@ -549,8 +549,8 @@ under the License.
     INTO TABLE `tbl1`
     COLUMNS TERMINATED BY ","
     (k1,k2,v1,v2)
-    PRECEDING FILTER k1 > 2
     SET (k1 = k1 +1)
+    PRECEDING FILTER k1 > 2
     WHERE k1 > 3
     ) with BROKER "hdfs" ("username"="user", "password"="pass");
 
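The BROKER LOAD hunks only move `PRECEDING FILTER` after `SET` in the syntax listing and examples; semantically, PRECEDING FILTER still runs on the raw rows before the SET transforms, and WHERE runs on the rows after transformation. A minimal sketch that assembles the patch's own example clauses into a complete load statement; the label, HDFS path, and broker settings are illustrative placeholders, not taken from the docs:

```sql
-- Hypothetical wrapper around the clauses shown in the patch.
LOAD LABEL example_db.label_demo
(
    DATA INFILE("hdfs://host:port/input/file.csv")
    INTO TABLE `tbl1`
    COLUMNS TERMINATED BY ","
    (k1,k2,v1,v2)
    SET (k1 = k1 + 1)          -- transform, applied between the two filters
    PRECEDING FILTER k1 > 2    -- evaluated first, against the original k1
    WHERE k1 > 3               -- evaluated last, against the transformed k1
)
WITH BROKER "hdfs" ("username"="user", "password"="pass");
```

Note that for integer k1 this particular WHERE is redundant: any row passing `PRECEDING FILTER k1 > 2` has k1 >= 3, so k1 + 1 >= 4 > 3. The example is there to illustrate clause ordering, not filtering logic.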
