This is an automated email from the ASF dual-hosted git repository.
kunalkapoor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 483e7da [CARBONDATA-3772] Update index documents
483e7da is described below
commit 483e7da5c5394251c9a04cefe9924393af72f39f
Author: Gampa Shreelekhya <[email protected]>
AuthorDate: Tue Apr 14 18:40:03 2020 +0530
[CARBONDATA-3772] Update index documents
Why is this PR needed?
update index documentation to comply with recent changes
What changes were proposed in this PR?
Does this PR introduce any user interface change?
No
Yes. (please explain the change and update document)
Is any new testcase added?
No
Yes
This closes #3708
---
README.md | 10 +--
docs/faq.md | 37 -----------
docs/index-developer-guide.md | 17 ++---
docs/index/bloomfilter-index-guide.md | 107 +++++++++++++++---------------
docs/index/index-management.md | 119 +++++++++++++++-------------------
docs/index/lucene-index-guide.md | 91 +++++++++++++-------------
docs/language-manual.md | 3 +-
7 files changed, 166 insertions(+), 218 deletions(-)
diff --git a/README.md b/README.md
index c7f935d..b1a712c 100644
--- a/README.md
+++ b/README.md
@@ -53,11 +53,11 @@ CarbonData is built using Apache Maven, to [build
CarbonData](https://github.com
* [CarbonData Data Manipulation
Language](https://github.com/apache/carbondata/blob/master/docs/dml-of-carbondata.md)
* [CarbonData Streaming
Ingestion](https://github.com/apache/carbondata/blob/master/docs/streaming-guide.md)
* [Configuring
CarbonData](https://github.com/apache/carbondata/blob/master/docs/configuration-parameters.md)
- * [DataMap Developer
Guide](https://github.com/apache/carbondata/blob/master/docs/datamap-developer-guide.md)
+ * [Index Developer
Guide](https://github.com/apache/carbondata/blob/master/docs/index-developer-guide.md)
* [Data
Types](https://github.com/apache/carbondata/blob/master/docs/supported-data-types-in-carbondata.md)
-* [CarbonData DataMap
Management](https://github.com/apache/carbondata/blob/master/docs/datamap/datamap-management.md)
- * [CarbonData BloomFilter
DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/bloomfilter-datamap-guide.md)
- * [CarbonData Lucene
DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/lucene-datamap-guide.md)
+* [CarbonData Index
Management](https://github.com/apache/carbondata/blob/master/docs/index/index-management.md)
+ * [CarbonData BloomFilter
Index](https://github.com/apache/carbondata/blob/master/docs/index/bloomfilter-index-guide.md)
+ * [CarbonData Lucene
Index](https://github.com/apache/carbondata/blob/master/docs/index/lucene-index-guide.md)
* [CarbonData MV
DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/mv-datamap-guide.md)
* [Carbondata Secondary
Index](https://github.com/apache/carbondata/blob/master/docs/index/secondary-index-guide.md)
* [SDK
Guide](https://github.com/apache/carbondata/blob/master/docs/sdk-guide.md)
@@ -70,7 +70,7 @@ CarbonData is built using Apache Maven, to [build
CarbonData](https://github.com
## Integration
* [Hive](https://github.com/apache/carbondata/blob/master/docs/hive-guide.md)
-*
[Presto](https://github.com/apache/carbondata/blob/master/docs/presto-guide.md)
+*
[Presto](https://github.com/apache/carbondata/blob/master/docs/prestodb-guide.md)
*
[Alluxio](https://github.com/apache/carbondata/blob/master/docs/alluxio-guide.md)
## Other Technical Material
diff --git a/docs/faq.md b/docs/faq.md
index 45607f4..f3f0a6d 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -25,7 +25,6 @@
* [What is Carbon Lock Type?](#what-is-carbon-lock-type)
* [How to resolve Abstract Method
Error?](#how-to-resolve-abstract-method-error)
* [How Carbon will behave when execute insert operation in abnormal
scenarios?](#how-carbon-will-behave-when-execute-insert-operation-in-abnormal-scenarios)
-* [Why aggregate query is not fetching data from aggregate
table?](#why-aggregate-query-is-not-fetching-data-from-aggregate-table)
* [Why all executors are showing success in Spark UI even after Dataload
command failed at Driver
side?](#why-all-executors-are-showing-success-in-spark-ui-even-after-dataload-command-failed-at-driver-side)
* [Why different time zone result for select query output when query SDK
writer
output?](#why-different-time-zone-result-for-select-query-output-when-query-sdk-writer-output)
* [How to check LRU cache memory
footprint?](#how-to-check-lru-cache-memory-footprint)
@@ -162,42 +161,6 @@ INSERT INTO TABLE carbon_table SELECT id, city FROM
source_table;
When the column type in the carbon table is different from the column specified in
the select statement, the insert operation will still succeed, but you may get NULL
in the result, because NULL is the substitute value when type conversion fails.
-## Why aggregate query is not fetching data from aggregate table?
-Following are the aggregate queries that won't fetch data from aggregate table:
-
-- **Scenario 1** :
-When SubQuery predicate is present in the query.
-
-Example:
-
-```
-create table gdp21(cntry smallint, gdp double, y_year date) stored as
carbondata;
-create datamap ag1 on table gdp21 using 'preaggregate' as select cntry,
sum(gdp) from gdp21 group by cntry;
-select ctry from pop1 where ctry in (select cntry from gdp21 group by cntry);
-```
-
-- **Scenario 2** :
-When aggregate function along with 'in' filter.
-
-Example:
-
-```
-create table gdp21(cntry smallint, gdp double, y_year date) stored as
carbondata;
-create datamap ag1 on table gdp21 using 'preaggregate' as select cntry,
sum(gdp) from gdp21 group by cntry;
-select cntry, sum(gdp) from gdp21 where cntry in (select ctry from pop1) group
by cntry;
-```
-
-- **Scenario 3** :
-When aggregate function having 'join' with equal filter.
-
-Example:
-
-```
-create table gdp21(cntry smallint, gdp double, y_year date) stored as
carbondata;
-create datamap ag1 on table gdp21 using 'preaggregate' as select cntry,
sum(gdp) from gdp21 group by cntry;
-select cntry,sum(gdp) from gdp21,pop1 where cntry=ctry group by cntry;
-```
-
## Why all executors are showing success in Spark UI even after Dataload
command failed at Driver side?
The Spark executor shows a task as failed after the maximum number of retry
attempts. But a load whose data has bad records, with BAD_RECORDS_ACTION
(carbon.bad.records.action) set to "FAIL", is attempted only once; it sends a
failure signal to the driver instead of throwing an exception to retry, as there
is no point in retrying once a bad record is found and BAD_RECORDS_ACTION is set
to fail. Hence the Spark executor displays this one attempt as successful but
the command has actually failed to [...]
diff --git a/docs/index-developer-guide.md b/docs/index-developer-guide.md
index 106bf39..198c3e9 100644
--- a/docs/index-developer-guide.md
+++ b/docs/index-developer-guide.md
@@ -18,17 +18,18 @@
# Index Developer Guide
### Introduction
-DataMap is a data structure that can be used to accelerate certain query of
the table. Different DataMap can be implemented by developers.
-Currently, there are two types of DataMap supported:
-1. IndexDataMap: DataMap that leverages index to accelerate filter query.
Lucene DataMap and BloomFiler DataMap belong to this type of DataMaps.
-2. MVDataMap: DataMap that leverages Materialized View to accelerate olap
style query, like SPJG query (select, predicate, join, groupby). Preaggregate,
timeseries and mv DataMap belong to this type of DataMaps.
+Index is a data structure that can be used to accelerate certain queries on the
table. Different Indexes can be implemented by developers.
+Currently, Carbondata supports three types of Indexes:
+1. BloomFilter Index: A space-efficient probabilistic data structure that is
used to test whether an element is a member of a set.
+2. Lucene Index: A high performance, full-featured text search engine.
+3. Secondary Index: Secondary index tables that hold blocklets are created as
indexes and managed as child tables internally by Carbondata.
### Index Provider
-When user issues `CREATE INDEX index_name ON TABLE main AS 'provider'`, the
corresponding DataMapProvider implementation will be created and initialized.
+When user issues `CREATE INDEX index_name ON TABLE main AS 'provider'`, the
corresponding IndexProvider implementation will be created and initialized.
Currently, the provider string can be:
-1. class name IndexDataMapFactory implementation: Developer can implement new
type of IndexDataMap by extending IndexDataMapFactory
+1. class name of an IndexFactory implementation: developers can implement a new
type of Index by extending IndexFactory
-When user issues `DROP INDEX index_name ON TABLE main`, the corresponding
DataMapProvider interface will be called.
+When user issues `DROP INDEX index_name ON TABLE main`, the corresponding
IndexFactory class will be called.
-Click for more details about [DataMap
Management](./index/index-management.md#index-management) and supported
[DSL](./index/index-management.md#overview).
+Click for more details about [Index
Management](./index/index-management.md#index-management) and supported
[DSL](./index/index-management.md#overview).
diff --git a/docs/index/bloomfilter-index-guide.md
b/docs/index/bloomfilter-index-guide.md
index 264cf0b..85f284a 100644
--- a/docs/index/bloomfilter-index-guide.md
+++ b/docs/index/bloomfilter-index-guide.md
@@ -15,59 +15,59 @@
limitations under the License.
-->
-# CarbonData BloomFilter DataMap
+# CarbonData BloomFilter Index
-* [DataMap Management](#datamap-management)
-* [BloomFilter Datamap Introduction](#bloomfilter-datamap-introduction)
+* [Index Management](#index-management)
+* [BloomFilter Index Introduction](#bloomfilter-index-introduction)
* [Loading Data](#loading-data)
* [Querying Data](#querying-data)
-* [Data Management](#data-management-with-bloomfilter-datamap)
+* [Data Management](#data-management-with-bloomfilter-index)
* [Useful Tips](#useful-tips)
-#### DataMap Management
-Creating BloomFilter DataMap
+#### Index Management
+Creating BloomFilter Index
```
- CREATE DATAMAP [IF NOT EXISTS] datamap_name
- ON TABLE main_table
- USING 'bloomfilter'
- DMPROPERTIES ('index_columns'='city, name', 'BLOOM_SIZE'='640000',
'BLOOM_FPP'='0.00001')
+ CREATE INDEX [IF NOT EXISTS] index_name
+ ON TABLE main_table (city,name)
+ AS 'bloomfilter'
+ PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001')
```
-Dropping Specified DataMap
+Dropping Specified Index
```
- DROP DATAMAP [IF EXISTS] datamap_name
+ DROP INDEX [IF EXISTS] index_name
ON TABLE main_table
```
-Showing all DataMaps on this table
+Showing all Indexes on this table
```
- SHOW DATAMAP
+ SHOW INDEXES
ON TABLE main_table
```
-Disable DataMap
-> The datamap by default is enabled. To support tuning on query, we can
disable a specific datamap during query to observe whether we can gain
performance enhancement from it. This is effective only for current session.
+Disable Index
+> The index is enabled by default. To support query tuning, we can disable
a specific index during query to observe whether we can gain a performance
enhancement from it. This is effective only for the current session.
```
// disable the index
- SET carbon.index.visible.dbName.tableName.dataMapName = false
+ SET carbon.index.visible.dbName.tableName.indexName = false
// enable the index
- SET carbon.index.visible.dbName.tableName.dataMapName = true
+ SET carbon.index.visible.dbName.tableName.indexName = true
```
-## BloomFilter DataMap Introduction
+## BloomFilter Index Introduction
A Bloom filter is a space-efficient probabilistic data structure that is used
to test whether an element is a member of a set.
-Carbondata introduced BloomFilter as an index datamap to enhance the
performance of querying with precise value.
+Carbondata introduced BloomFilter as an index to enhance the performance of
queries with precise values.
It is well suited for queries that do a precise match on high cardinality
columns (such as Name/ID).
Internally, CarbonData maintains a BloomFilter per blocklet for each index
column to indicate that whether a value of the column is in this blocklet.
-Just like the other datamaps, BloomFilter datamap is managed along with main
tables by CarbonData.
-User can create BloomFilter datamap on specified columns with specified
BloomFilter configurations such as size and probability.
+Just like the other indexes, BloomFilter index is managed along with main
tables by CarbonData.
+User can create BloomFilter index on specified columns with specified
BloomFilter configurations such as size and probability.
-For instance, main table called **datamap_test** which is defined as:
+For instance, main table called **index_test** which is defined as:
```
- CREATE TABLE datamap_test (
+ CREATE TABLE index_test (
id string,
name string,
age int,
@@ -83,24 +83,25 @@ since `id` is in the sort_columns and it is ordered,
query on it will be fast because CarbonData can skip all the irrelevant
blocklets.
But queries on `name` may be bad since the blocklet minmax may not help,
because in each blocklet the range of the value of `name` may be the same --
all from A* to z*.
-In this case, user can create a BloomFilter DataMap on column `name`.
-Moreover, user can also create a BloomFilter DataMap on the sort_columns.
+In this case, user can create a BloomFilter Index on column `name`.
+Moreover, user can also create a BloomFilter Index on the sort_columns.
This is useful if the user has too many segments and the range of the values of
the sort_columns is almost the same.
-User can create BloomFilter DataMap using the Create DataMap DDL:
+User can create BloomFilter Index using the Create Index DDL:
```
- CREATE DATAMAP dm
- ON TABLE datamap_test
- USING 'bloomfilter'
- DMPROPERTIES ('INDEX_COLUMNS' = 'name,id', 'BLOOM_SIZE'='640000',
'BLOOM_FPP'='0.00001', 'BLOOM_COMPRESS'='true')
+ CREATE INDEX dm
+ ON TABLE index_test (name,id)
+ AS 'bloomfilter'
+ PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001',
'BLOOM_COMPRESS'='true')
```
-**Properties for BloomFilter DataMap**
+Here, (name,id) are INDEX_COLUMNS. Carbondata will generate BloomFilter index
on these columns. Queries on these columns are usually like 'COL = VAL'.
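As an illustrative sketch (using the **index_test** table from above; the literal values are hypothetical), the kind of point queries this index targets look like:

```sql
-- Hypothetical point lookups on the BloomFilter-indexed columns of index_test;
-- simple equality and IN filters are the patterns the BloomFilter can prune on.
SELECT id, name, age FROM index_test WHERE name = 'Alice';
SELECT id, name FROM index_test WHERE id IN ('id001', 'id002');
```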
+
+**Properties for BloomFilter Index**
| Property | Is Required | Default Value | Description |
|-------------|----------|--------|---------|
-| INDEX_COLUMNS | YES | | Carbondata will generate BloomFilter index on these
columns. Queries on these columns are usually like 'COL = VAL'. |
| BLOOM_SIZE | NO | 640000 | This value is internally used by BloomFilter as
the number of expected insertions, it will affect the size of BloomFilter
index. Since each blocklet has a BloomFilter here, so the default value is the
approximate distinct index values in a blocklet assuming that each blocklet
contains 20 pages and each page contains 32000 records. The value should be an
integer. |
| BLOOM_FPP | NO | 0.00001 | This value is internally used by BloomFilter as
the False-Positive Probability, it will affect the size of bloomfilter index as
well as the number of hash functions for the BloomFilter. The value should be
in the range (0, 1). In one test scenario, a 96GB TPCH customer table with
bloom_size=320000 and bloom_fpp=0.00001 will result in 18 false positive
samples. |
| BLOOM_COMPRESS | NO | true | Whether to compress the BloomFilter index
files. |
@@ -108,41 +109,41 @@ User can create BloomFilter DataMap using the Create
DataMap DDL:
## Loading Data
When loading data to main table, BloomFilter files will be generated for all
the
-index_columns given in DMProperties which contains the blockletId and a
BloomFilter for each index column.
-These index files will be written inside a folder named with DataMap name
+index_columns provided in the CREATE statement; the files contain the blockletId
and a BloomFilter for each index column.
+These index files will be written inside a folder named with the Index name
inside each segment folder.
## Querying Data
-User can verify whether a query can leverage BloomFilter DataMap by executing
`EXPLAIN` command,
-which will show the transformed logical plan, and thus user can check whether
the BloomFilter DataMap can skip blocklets during the scan.
-If the DataMap does not prune blocklets well, you can try to increase the
value of property `BLOOM_SIZE` and decrease the value of property `BLOOM_FPP`.
+User can verify whether a query can leverage BloomFilter Index by executing
`EXPLAIN` command,
+which will show the transformed logical plan, and thus user can check whether
the BloomFilter Index can skip blocklets during the scan.
+If the Index does not prune blocklets well, you can try to increase the value
of property `BLOOM_SIZE` and decrease the value of property `BLOOM_FPP`.
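A minimal sketch of this check (the query value is hypothetical; `enable.query.statistics` is described in the Index Related Commands section):

```sql
-- Show the pruning statistics for a precise-match query on an indexed column.
SET enable.query.statistics = true;
EXPLAIN SELECT * FROM index_test WHERE name = 'Alice';
```

The profiler section of the output reports how many blocks and blocklets were skipped.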
-## Data Management With BloomFilter DataMap
-Data management with BloomFilter DataMap has no difference with that on Lucene
DataMap.
-You can refer to the corresponding section in `CarbonData Lucene DataMap`.
+## Data Management With BloomFilter Index
+Data management with BloomFilter Index has no difference with that on Lucene
Index.
+You can refer to the corresponding section in [CarbonData Lucene
Index](https://github.com/apache/carbondata/blob/master/docs/index/lucene-index-guide.md)
## Useful Tips
-+ BloomFilter DataMap is suggested to be created on the high cardinality
columns.
++ BloomFilter Index is suggested to be created on the high cardinality columns.
Query conditions on these columns are always simple `equal` or `in`,
such as 'col1=XX', 'col1 in (XX, YY)'.
-+ We can create multiple BloomFilter DataMaps on one table,
- but we do recommend you to create one BloomFilter DataMap that contains
multiple index columns,
++ We can create multiple BloomFilter Indexes on one table,
+ but we recommend creating one BloomFilter Index that contains
multiple index columns,
because the data loading and query performance will be better.
+ `BLOOM_FPP` is only the expected number from the user; the actual FPP may be
worse.
- If the BloomFilter DataMap does not work well,
+ If the BloomFilter Index does not work well,
you can try to increase `BLOOM_SIZE` and decrease `BLOOM_FPP` at the same
time.
Notice that bigger `BLOOM_SIZE` will increase the size of index file
and smaller `BLOOM_FPP` will increase runtime calculation while performing
query.
-+ '0' skipped blocklets of BloomFilter DataMap in explain output indicates that
- BloomFilter DataMap does not prune better than Main DataMap.
- (For example since the data is not ordered, a specific value may be contained
in many blocklets. In this case, bloom may not work better than Main DataMap.)
++ '0' skipped blocklets of BloomFilter Index in explain output indicates that
+ BloomFilter Index does not prune better than Main Index.
+ (For example since the data is not ordered, a specific value may be contained
in many blocklets. In this case, bloom may not work better than Main Index.)
If this occurs very often, it means that current BloomFilter is useless. You
can disable or drop it.
- Sometimes we cannot see any pruning result about BloomFilter DataMap in the
explain output,
- this indicates that the previous DataMap has pruned all the blocklets and
there is no need to continue pruning.
-+ In some scenarios, the BloomFilter DataMap may not enhance the query
performance significantly
+ Sometimes we cannot see any pruning result about BloomFilter Index in the
explain output,
+ this indicates that the previous Index has pruned all the blocklets and there
is no need to continue pruning.
++ In some scenarios, the BloomFilter Index may not enhance the query
performance significantly
but if it can reduce the number of spark tasks,
- there is still a chance that BloomFilter DataMap can enhance the performance
for concurrent query.
-+ Note that BloomFilter DataMap will decrease the data loading performance and
may cause slightly storage expansion (for DataMap index file).
+ there is still a chance that BloomFilter Index can enhance the performance
for concurrent query.
++ Note that BloomFilter Index will decrease the data loading performance and
may cause slight storage expansion (for the index file).
diff --git a/docs/index/index-management.md b/docs/index/index-management.md
index 01f3604..6b4b6ec 100644
--- a/docs/index/index-management.md
+++ b/docs/index/index-management.md
@@ -18,124 +18,107 @@
# CarbonData Index Management
- [Overview](#overview)
-- [DataMap Management](#datamap-management)
+- [Index Management](#index-management)
- [Automatic Refresh](#automatic-refresh)
- [Manual Refresh](#manual-refresh)
-- [DataMap Catalog](#datamap-catalog)
-- [DataMap Related Commands](#datamap-related-commands)
+- [Index Related Commands](#index-related-commands)
- [Explain](#explain)
- - [Show DataMap](#show-datamap)
+ - [Show Index](#show-index)
## Overview
-DataMap can be created using following DDL
+Index can be created using following DDL
```
-CREATE DATAMAP [IF NOT EXISTS] datamap_name
-[ON TABLE main_table]
-USING "datamap_provider"
-[WITH DEFERRED REBUILD]
-DMPROPERTIES ('key'='value', ...)
-AS
- SELECT statement
+CREATE INDEX [IF NOT EXISTS] index_name
+ON TABLE [db_name.]table_name (column_name, ...)
+AS carbondata/bloomfilter/lucene
+[WITH DEFERRED REFRESH]
+[PROPERTIES ('key'='value')]
```
-Currently, there are 5 DataMap implementations in CarbonData.
+Currently, there are 3 Index implementations in CarbonData.
-| DataMap Provider | Description | DMPROPERTIES
| Management |
-| ---------------- | ---------------------------------------- |
---------------------------------------- | ---------------- |
-| mv | multi-table pre-aggregate table | No DMPROPERTY
is required | Manual/Automatic |
-| lucene | lucene indexing for text column | index_columns
to specifying the index columns | Automatic |
-| bloomfilter | bloom filter for high cardinality column, geospatial
column | index_columns to specifying the index columns | Automatic |
+| Index Provider | Description
| Management |
+| ---------------- |
--------------------------------------------------------------------------------
| --------- |
+| secondary-index | secondary-index tables to hold blocklets as indexes and
managed as child tables | Automatic |
+| lucene | lucene indexing for text column
| Automatic |
+| bloomfilter | bloom filter for high cardinality column, geospatial
column | Automatic |
-## DataMap Management
+## Index Management
-There are two kinds of management semantic for DataMap.
+There are two kinds of management semantic for Index.
-1. Automatic Refresh: Create datamap without `WITH DEFERRED REBUILD` in the
statement, which is by default.
-2. Manual Refresh: Create datamap with `WITH DEFERRED REBUILD` in the statement
+1. Automatic Refresh: Create index without `WITH DEFERRED REFRESH` in the
statement, which is the default.
+2. Manual Refresh: Create index with `WITH DEFERRED REFRESH` in the statement
### Automatic Refresh
-When user creates a datamap on the main table without using `WITH DEFERRED
REBUILD` syntax, the datamap will be managed by system automatically.
-For every data load to the main table, system will immediately trigger a load
to the datamap automatically. These two data loading (to main table and
datamap) is executed in a transactional manner, meaning that it will be either
both success or neither success.
+When user creates an index on the main table without using `WITH DEFERRED
REFRESH` syntax, the index will be managed by the system automatically.
+For every data load to the main table, the system will immediately trigger a load
to the index automatically. These two data loads (to main table and index) are
executed in a transactional manner, meaning that either both succeed or neither
does.
-The data loading to datamap is incremental based on Segment concept, avoiding
a expensive total rebuild.
+The data loading to the index is incremental based on the Segment concept,
avoiding an expensive total rebuild.
If the user performs any of the following commands on the main table, the system
will return failure (reject the operation):
1. Data management command: `UPDATE/DELETE/DELETE SEGMENT`.
2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE
DATATYPE`,
`ALTER TABLE RENAME`. Note that adding a new column is supported, and for
dropping columns and
- change datatype command, CarbonData will check whether it will impact the
pre-aggregate table, if
+ change datatype command, CarbonData will check whether it will impact the
index table, if
not, the operation is allowed, otherwise operation will be rejected by
throwing exception.
3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`.
-If user do want to perform above operations on the main table, user can first
drop the datamap, perform the operation, and re-create the datamap again.
+If user does want to perform the above operations on the main table, the user
can first drop the index, perform the operation, and then re-create the index.
-If user drop the main table, the datamap will be dropped immediately too.
+If user drops the main table, the index will be dropped immediately too.
-We do recommend you to use this management for index datamap.
+We recommend using this management mode for indexes.
### Manual Refresh
-When user creates a datamap specifying manual refresh semantic, the datamap is
created with status *disabled* and query will NOT use this datamap until user
can issue REBUILD DATAMAP command to build the datamap. For every REBUILD
DATAMAP command, system will trigger a full rebuild of the datamap. After
rebuild is done, system will change datamap status to *enabled*, so that it can
be used in query rewrite.
+When user creates an index specifying manual refresh semantics, the index is
created with status *disabled* and queries will NOT use this index until the user
issues the REFRESH INDEX command to build the index. For every REFRESH INDEX
command, the system will trigger a full rebuild of the index. After the rebuild is
done, the system will change the index status to *enabled*, so that it can be used
in query rewrite.
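The manual refresh workflow can be sketched as follows (index, table, and column names are hypothetical, and the exact REFRESH INDEX syntax shown here is an assumption to be checked against the DDL reference):

```sql
-- Create the index in deferred mode; it starts with status *disabled*.
CREATE INDEX idx_city
ON TABLE main_table (city)
AS 'bloomfilter'
WITH DEFERRED REFRESH;

-- Trigger the full rebuild; afterwards the index status becomes *enabled*.
REFRESH INDEX idx_city ON TABLE main_table;
```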
-For every new data loading, data update, delete, the related datamap will be
made *disabled*,
-which means that the following queries will not benefit from the datamap
before it becomes *enabled* again.
+For every new data load, data update, or delete, the related index will be
made *disabled*,
+which means that the following queries will not benefit from the index before
it becomes *enabled* again.
-If the main table is dropped by user, the related datamap will be dropped
immediately.
+If the main table is dropped by user, the related index will be dropped
immediately.
**Note**:
-+ If you are creating a datamap on external table, you need to do manual
management of the datamap.
-+ For index datamap such as BloomFilter datamap, there is no need to do manual
refresh.
++ If you are creating an index on an external table, you need to do manual
management of the index.
++ For indexes such as the BloomFilter index, there is no need to do manual refresh.
By default it is automatic refresh,
- which means its data will get refreshed immediately after the datamap is
created or the main table is loaded.
- Manual refresh on this datamap will has no impact.
+ which means its data will get refreshed immediately after the index is
created or the main table is loaded.
+ Manual refresh on this index will have no impact.
-
-
-## DataMap Catalog
-
-Currently, when user creates a datamap, system will store the datamap metadata
in a configurable *system* folder in HDFS or S3.
-
-In this *system* folder, it contains:
-
-- DataMapSchema file. It is a json file containing schema for one datamap. Ses
DataMapSchema class. If user creates 100 datamaps (on different tables), there
will be 100 files in *system* folder.
-- DataMapStatus file. Only one file, it is in json format, and each entry in
the file represents for one datamap. Ses DataMapStatusDetail class
-
-There is a DataMapCatalog interface to retrieve schema of all datamap, it can
be used in optimizer to get the metadata of datamap.
-
-
-
-## DataMap Related Commands
+## Index Related Commands
### Explain
-How can user know whether datamap is used in the query?
+How can a user know whether an index is used in a query?
User can set enable.query.statistics = true and use the EXPLAIN command to find
out; it will print out something like
```text
== CarbonData Profiler ==
-Hit mv DataMap: datamap1
-Scan Table: default.datamap1_table
+Table Scan on default.main
++- total: 1 blocks, 1 blocklets
+- filter:
-+- pruning by CG DataMap
-+- all blocklets: 1
- skipped blocklets: 0
++- pruned by CG Index
+ - name: index1
+ - provider: lucene
+ - skipped: 0 blocks, 0 blocklets
```
-### Show DataMap
+### Show Index
-There is a SHOW DATAMAPS command, when this is issued, system will read all
datamap from *system* folder and print all information on screen. The current
information includes:
+There is a SHOW INDEXES command; when this is issued, the system will read all
indexes from the carbon table and print their information on screen. The current
information includes:
-- DataMapName
-- DataMapProviderName like mv
-- Associated Table
-- DataMap Properties
-- DataMap status (ENABLED/DISABLED)
-- Sync Status - which displays Last segment Id of main table synced with
datamap table and its load
- end time (Applicable only for mv datamap)
+- Name
+- Provider like lucene
+- Indexed Columns
+- Properties
+- Status (ENABLED/DISABLED)
+- Sync Info - which displays Last segment Id of main table synced with index
table and its load
+ end time
diff --git a/docs/index/lucene-index-guide.md b/docs/index/lucene-index-guide.md
index d12aa47..c811ec3 100644
--- a/docs/index/lucene-index-guide.md
+++ b/docs/index/lucene-index-guide.md
@@ -15,46 +15,47 @@
limitations under the License.
-->
-# CarbonData Lucene DataMap (Alpha Feature)
+# CarbonData Lucene Index (Alpha Feature)
-* [DataMap Management](#datamap-management)
-* [Lucene Datamap](#lucene-datamap-introduction)
+* [Index Management](#index-management)
+* [Lucene Index](#lucene-index-introduction)
* [Loading Data](#loading-data)
* [Querying Data](#querying-data)
-* [Data Management](#data-management-with-lucene-datamap)
+* [Data Management](#data-management-with-lucene-index)
-#### DataMap Management
-Lucene DataMap can be created using following DDL
+#### Index Management
+Lucene Index can be created using following DDL
```
- CREATE DATAMAP [IF NOT EXISTS] datamap_name
- ON TABLE main_table
- USING 'lucene'
- DMPROPERTIES ('index_columns'='city, name', ...)
+ CREATE INDEX [IF NOT EXISTS] index_name
+ ON TABLE main_table (index_columns)
+ AS 'lucene'
+ [PROPERTIES ('key'='value')]
```
+index_columns is the list of string columns on which lucene creates indexes.
-DataMap can be dropped using following DDL:
+Index can be dropped using following DDL:
```
- DROP DATAMAP [IF EXISTS] datamap_name
+ DROP INDEX [IF EXISTS] index_name
ON TABLE main_table
```
-To show all DataMaps created, use:
+To show all Indexes created, use:
```
- SHOW DATAMAP
+ SHOW INDEXES
ON TABLE main_table
```
-It will show all DataMaps created on main table.
+It will show all Indexes created on main table.
-## Lucene DataMap Introduction
+## Lucene Index Introduction
Lucene is a high performance, full featured text search engine. Lucene is
integrated to carbon as
- an index datamap and managed along with main tables by CarbonData. User can
create lucene datamap
+ an index and managed along with main tables by CarbonData. User can create a
lucene index
to improve query performance on string columns which have long content. So,
user can
search for a tokenized word or a pattern of it using a lucene query on the text content.
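As a hedged sketch of such a search (assuming the TEXT_MATCH filter UDF with lucene query syntax; the column value is hypothetical):

```sql
-- Search the lucene-indexed string column `name` for tokens matching a pattern.
SELECT * FROM index_test WHERE TEXT_MATCH('name:n10*');
```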
- For instance, main table called **datamap_test** which is defined as:
+ For instance, main table called **index_test** which is defined as:
```
- CREATE TABLE datamap_test (
+ CREATE TABLE index_test (
name string,
age int,
city string,
@@ -62,28 +63,26 @@ It will show all DataMaps created on main table.
STORED AS carbondata
```
- User can create Lucene datamap using the Create DataMap DDL:
+ User can create a Lucene index using the CREATE INDEX DDL:
```
- CREATE DATAMAP dm
- ON TABLE datamap_test
- USING 'lucene'
- DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
+ CREATE INDEX dm
+ ON TABLE index_test (name,country)
+ AS 'lucene'
```
-**DMProperties**
-1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
-2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
+**Properties**
+1. FLUSH_CACHE: size of the cache to maintain in the Lucene writer; if specified, it tries to
 aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low cardinality dimensions.
-3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
+2. SPLIT_BLOCKLET: when set to true, the data is stored blocklet-wise in Lucene; it means a new
 folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and also it makes lucene small chunks of data.
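+
+As a sketch, both properties can be supplied at index creation time. The property names come from the list above; the index name and values below are illustrative placeholders, not recommended defaults:
+```
+ CREATE INDEX dm_cached
+ ON TABLE index_test (name)
+ AS 'lucene'
+ PROPERTIES ('flush_cache'='8192', 'split_blocklet'='true')
+```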
## Loading data
When loading data to main table, lucene index files will be generated for all the
-index_columns(String Columns) given in DMProperties which contains information
about the data
location of index_columns. These index files will be written inside a folder
named with datamap name
+index_columns (string columns) given in the CREATE INDEX statement; these files contain information about the data
+location of index_columns. These index files will be written inside a folder named with the index name
inside each segment folders.
A system level configuration carbon.lucene.compression.mode can be added for best compression of
@@ -99,7 +98,7 @@ fired, two jobs are fired. The first job writes the temporary files in folder cr
which contains lucene's search results and these files will be read in second job to give faster
results. These temporary files will be cleared once the query finishes.
-User can verify whether a query can leverage Lucene datamap or not by
executing `EXPLAIN`
+User can verify whether a query can leverage the Lucene index by executing `EXPLAIN`
command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
filter is applied on query or not.
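+
+For instance, a plan can be inspected as follows (illustrative query; the exact plan text depends on the Spark version used):
+```
+ EXPLAIN SELECT * FROM index_test WHERE TEXT_MATCH('name:n10')
+```
+If the index can be leveraged, the TEXT_MATCH() filter appears in the transformed logical plan.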
@@ -109,50 +108,50 @@ filter condition like 'AND','OR' must be in upper case.
Ex:
```
- select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
+ select * from index_test where TEXT_MATCH('name:*10 AND name:*n*')
```
2. Query supports only one TEXT_MATCH udf for filter condition and not multiple udfs.
The following query is supported:
```
- select * from datamap_test where TEXT_MATCH('name:*10 AND name:*n*')
+ select * from index_test where TEXT_MATCH('name:*10 AND name:*n*')
```
The following query is not supported:
```
- select * from datamap_test where TEXT_MATCH('name:*10) AND TEXT_MATCH(name:*n*')
+ select * from index_test where TEXT_MATCH('name:*10) AND TEXT_MATCH(name:*n*')
```
The like queries below can be converted to TEXT_MATCH queries as follows:
```
-select * from datamap_test where name='n10'
+select * from index_test where name='n10'
-select * from datamap_test where name like 'n1%'
+select * from index_test where name like 'n1%'
-select * from datamap_test where name like '%10'
+select * from index_test where name like '%10'
-select * from datamap_test where name like '%n%'
+select * from index_test where name like '%n%'
-select * from datamap_test where name like '%10' and name not like '%n%'
+select * from index_test where name like '%10' and name not like '%n%'
```
Lucene TEXT_MATCH Queries:
```
-select * from datamap_test where TEXT_MATCH('name:n10')
+select * from index_test where TEXT_MATCH('name:n10')
-select * from datamap_test where TEXT_MATCH('name:n1*')
+select * from index_test where TEXT_MATCH('name:n1*')
-select * from datamap_test where TEXT_MATCH('name:*10')
+select * from index_test where TEXT_MATCH('name:*10')
-select * from datamap_test where TEXT_MATCH('name:*n*')
+select * from index_test where TEXT_MATCH('name:*n*')
-select * from datamap_test where TEXT_MATCH('name:*10 -name:*n*')
+select * from index_test where TEXT_MATCH('name:*10 -name:*n*')
```
**Note:** For lucene queries and syntax, refer to
[lucene-syntax](http://www.lucenetutorial.com/lucene-query-syntax.html)
-## Data Management with lucene datamap
-Once there is lucene datamap is created on the main table, following command
on the main
+## Data Management with lucene index
+Once a Lucene index is created on the main table, the following command on the main
table
is not supported:
1. Data management command: `UPDATE/DELETE`.
diff --git a/docs/language-manual.md b/docs/language-manual.md
index d8f30b0..9a4a79b 100644
--- a/docs/language-manual.md
+++ b/docs/language-manual.md
@@ -27,7 +27,8 @@ CarbonData has its own parser, in addition to Spark's SQL Parser, to parse and p
- [Index](./index/index-management.md)
- [Bloom](./index/bloomfilter-index-guide.md)
- [Lucene](./index/lucene-index-guide.md)
- - Materialized Views (MV)
+ - [Secondary-index](./index/secondary-index-guide.md)
+ - [Materialized Views (MV)](./index/mv-guide.md)
- [Streaming](./streaming-guide.md)
- Data Manipulation Statements
- [DML:](./dml-of-carbondata.md) [Load](./dml-of-carbondata.md#load-data),
[Insert](./dml-of-carbondata.md#insert-data-into-carbondata-table),
[Update](./dml-of-carbondata.md#update), [Delete](./dml-of-carbondata.md#delete)