wutiangan commented on a change in pull request #3867:
URL: https://github.com/apache/incubator-doris/pull/3867#discussion_r440105196
##########
File path: docs/en/extending-doris/doris-on-es.md
##########
@@ -26,62 +26,314 @@ under the License.
# Doris On ES
-Doris-On-ES combines Doris's distributed query planning capability with ES
(Elastic search)'s full-text search capability to provide a more complete OLAP
scenario solution:
+Doris-On-ES not only take advantage of Doris's distributed query planning
capability but also ES (Elastic search)'s full-text search capability, provide
a more complete OLAP scenario solution:
1. Multi-index Distributed Join Query in ES
2. Joint Query of Tables in Doris and ES, More Complex Full-Text Retrieval and
Filtering
-3. Aggregated queries for fields of ES keyword type: suitable for frequent
changes in index, tens of millions or more of single fragmented documents, and
the cardinality of the field is very large
This document mainly introduces the realization principle and usage of this
function.
-## Noun Interpretation
+## Glossary
+
+### Noun in Doris
* FE: Frontend, the front-end node of Doris. Responsible for metadata
management and request access.
* BE: Backend, Doris's back-end node. Responsible for query execution and data
storage.
-* Elastic search (ES): The most popular open source distributed search engine.
+
+### Noun in ES
+
* DataNode: The data storage and computing node of ES.
* MasterNode: The Master node of ES, which manages metadata, nodes, data
distribution, etc.
* scroll: The built-in data set cursor feature of ES for streaming scanning
and filtering of data.
+* _source: contains the original JSON document body that was passed at index
time
+* doc_values: store the same values as the _source but in a column-oriented
fashion
+* keyword: string datatype in ES, but the content not analyzed by analyzer
+* text: string datatype in ES, the content analyzed by analyzer
-## How to use it
+## How To Use
-### Create appearance
+### Create ES Index
```
-CREATE EXTERNAL TABLE `es_table` (
- `id` bigint(20) COMMENT "",
+PUT test
+{
+ "settings": {
+ "index": {
+ "number_of_shards": "1",
+ "number_of_replicas": "0"
+ }
+ },
+ "mappings": {
+ "doc": { // ES 7.x版本之后创建索引时不需要指定type,会有一个默认且唯一的`_doc` type
+ "properties": {
+ "k1": {
+ "type": "long"
+ },
+ "k2": {
+ "type": "date"
+ },
+ "k3": {
+ "type": "keyword"
+ },
+ "k4": {
+ "type": "text",
+ "analyzer": "standard"
+ },
+ "k5": {
+ "type": "float"
+ }
+ }
+ }
+ }
+}
+```
+
+### Add JSON documents to ES index
+
+```
+POST /_bulk
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4":
"Trying out Elasticsearch", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out
Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES",
"k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
+{"index":{"_index":"test","_type":"doc"}}
+{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
+```
+
+### Create external ES table
+
+```
+CREATE EXTERNAL TABLE `test` (
+ `k1` bigint(20) COMMENT "",
+ `k2` datetime COMMENT "",
+ `k3` varchar(20) COMMENT "",
+ `k4` varchar(100) COMMENT "",
+ `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH // ENGINE必须是Elasticsearch
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
+"index" = "test”,
+"type" = "doc",
+
+"user" = "root",
+"password" = "root"
+);
+```
+
+The following parameters are accepted by ES table:
+
+参数 | 说明
+---|---
+**hosts** | ES Cluster Connection Address, maybe one or more node,
load-balance is also accepted
+**index** | the related ES index name
+**type** | the type for this index,If not specified, `_doc` will be used
+**user** | username for ES
+**password** | password for the user
+
+* For clusters before 7.x, please pay attention to choosing the correct type
when building the table
+* The authentication method only supports Http Bastic authentication, and the
fully open source ES cluster user and password do not need to be specified,Need
to ensure that the user has access to /_cluster/state and other path
permissions and read permissions to index
+* The column names in the Doris table need to exactly match the field names in
the ES, and the field types should be as consistent as possible
+* **ENGINE** must be: **Elasticsearch**
+
+##### Filter to push down
+
+An important ability of `Doris On ES` is the push-down of filter conditions:
The filtering conditions are pushed to ES, so that only the data that really
meets the conditions will be returned, which can significantly improve query
performance and reduce CPU, memory, and IO utilization of Doris and ES
+
+The following operators (Operators) will be optimized to the following ES
Query:
+
+| SQL syntax | ES 5.x+ syntax |
+|-------|:---:|
+| = | term query|
+| in | terms query |
+| > , < , >= , ⇐ | range query |
+| and | bool.filter |
+| or | bool.should |
+| not | bool.must_not |
+| not in | bool.must_not + terms query |
+| is\_not\_null | exists query |
+| is\_null | bool.must_not + exists query |
+| esquery | QueryDSL in ES native json form |
+
+##### Data type mapping
+
+Doris\ES | byte | short | integer | long | float | double| keyword | text |
date
+------------- | ------------- | ------ | ---- | ----- | ---- | ------ |
----| --- | --- |
+tinyint | √ | | | | | | | |
+smallint | √ | √ | | | | | | |
+int | √ | √ | √ | | | | | |
+bigint | √ | √ | √ | √ | | | | |
+float | | | | | √ | | | |
+double | | | | | | √ | | |
+char | | | | | | | √ | √ |
+varchar | | | | | | | √ | √ |
+date | | | | | | | | | √|
+datetime | | | | | | | | | √|
+
+
+### Enable column scan to optimize query speed(enable\_docvalue\_scan=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
-PARTITION BY RANGE(`id`)
-()
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
+"index" = "test”,
+"type" = "doc",
"user" = "root",
"password" = "root",
-"index" = "tindex”,
-"type" = "doc"
+
+"enable_docvalue_scan" = "true"
);
```
-Description of parameters:
+Parameter Description:
-Parameter | description
+Parameter | Description
---|---
-Hosts | ES Cluster Connection Address, which can specify one or more, through
which Doris obtains the share distribution information of ES version number and
index
-User | Open the user name of the ES cluster authenticated by basic, you need
to ensure that the user has access to: / cluster / state / nodes / HTTP and
other path permissions and read permissions for index
-Password | corresponding user's password information
-The index name of the ES corresponding to the table in index | Doris can be
alias
-Type | Specifies the type of index, defaulting to _doc
-Transport | Internal reservation, default to http
+**enable\_docvalue\_scan** | whether to enable ES/Lucene column storage to get
the value of the query field, the default is false
+
+Doris obtains data from ES following the following two principles:
+
+* **Best effort**: Automatically detect whether the column to be read has
column storage enabled (doc_value: true).If all the fields obtained have column
storage, Doris will obtain the values of all fields from the column
storage(doc_values)
+* **Automatic downgrade**: If the field to be obtained has one or more field
that is not have doc_value, the values of all fields will be parsed from the
line store `_source`
+
+##### Advantage:
+
+By default, Doris On ES will get all the required columns from the row
storage, which is `_source`, and the storage of `_source` is the origin json
format document,Inferior to column storage in batch read performance,Especially
obvious when only a few columns are needed,When only a few columns are
obtained, the performance of docvalue is about ten times that of _source
-### Query
+##### Tip
+1. Fields of type `text` are not column-stored in ES, so if the value of the
field to be obtained has a field of type `text`, it will be automatically
downgraded to get from `_source`
+2. In the case of too many fields obtained (`>= 25`), the performance of
getting field values from `docvalue` will be basically the same as getting
field values from `_source`
-#### Basic Conditions Filtration
+
+### Detect keyword type field(enable\_keyword\_sniff=true)
+
+```
+CREATE EXTERNAL TABLE `test` (
+ `k1` bigint(20) COMMENT "",
+ `k2` datetime COMMENT "",
+ `k3` varchar(20) COMMENT "",
+ `k4` varchar(100) COMMENT "",
+ `k5` float COMMENT ""
+) ENGINE=ELASTICSEARCH
+PROPERTIES (
+"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
+"index" = "test”,
+"type" = "doc",
+"user" = "root",
+"password" = "root",
+
+"enable_keyword_sniff" = "true"
+);
+```
+
+参数说明:
+
Review comment:
change it to english
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]