This is an automated email from the ASF dual-hosted git repository.
akashrn5 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/carbondata.git
The following commit(s) were added to refs/heads/master by this push:
new 3ea6b18 [CARBONDATA-3791] Correct spelling, link and ddl in SI and MV
Documentation
3ea6b18 is described below
commit 3ea6b181b41b0f9a6de348574d166df8ff7019f6
Author: Indhumathi27 <[email protected]>
AuthorDate: Sun May 3 17:17:06 2020 +0530
[CARBONDATA-3791] Correct spelling, link and ddl in SI and MV Documentation
Why is this PR needed?
Correct spelling, link and ddl in SI and MV Documentation
What changes were proposed in this PR?
Fixed spelling, link and ddl in SI and MV Documentation
This closes #3735
---
docs/configuration-parameters.md | 2 +-
docs/index/bloomfilter-index-guide.md | 15 +++--
docs/index/index-management.md | 37 +++++------
docs/index/lucene-index-guide.md | 30 ++++-----
docs/index/secondary-index-guide.md | 76 +++++++++++-----------
docs/mv-guide.md | 58 ++++++++---------
.../CarbonDataFileMergeTestCaseOnSI.scala | 2 +-
7 files changed, 109 insertions(+), 111 deletions(-)
diff --git a/docs/configuration-parameters.md b/docs/configuration-parameters.md
index 486b133..4627cac 100644
--- a/docs/configuration-parameters.md
+++ b/docs/configuration-parameters.md
@@ -116,7 +116,7 @@ This section provides the details of all the configurations
required for the Car
| carbon.compaction.prefetch.enable | false | Compaction operation is similar
to Query + data load where in data from qualifying segments are queried and
data loading performed to generate a new single segment. This configuration
determines whether to query ahead data from segments and feed it for data
loading. **NOTE:** This configuration is disabled by default as it needs extra
resources for querying extra data. Based on the memory availability on the
cluster, user can enable it to imp [...]
| carbon.merge.index.in.segment | true | Each CarbonData file has a companion
CarbonIndex file which maintains the metadata about the data. These CarbonIndex
files are read and loaded into the driver and are used subsequently for pruning of
data during queries. These CarbonIndex files are very small in size(few KB) and
are many. Reading many small files from HDFS is not efficient and leads to slow
IO performance. Hence these CarbonIndex files belonging to a segment can be
combined into a sin [...]
| carbon.enable.range.compaction | true | To configure Ranges-based Compaction
to be used or not for RANGE_COLUMN. If true, the data will remain organized in
ranges even after compaction. |
-| carbon.si.segment.merge | false | Making this true degrade the LOAD
performance. When the number of small files increase for SI segments(it can
happen as number of columns will be less and we store position id and reference
columns), user an either set to true which will merge the data files for
upcoming loads or run SI rebuild command which does this job for all segments.
(REBUILD INDEX <index_table>) |
+| carbon.si.segment.merge | false | Making this true degrades the LOAD
performance. When the number of small files increases for SI segments (it can
happen as the number of columns will be less and we store position id and
reference columns), the user can either set this to true, which will merge the
data files for upcoming loads, or run the SI refresh command, which does this
job for all segments. (REFRESH INDEX <index_table>) |
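For illustration, a minimal sketch of the refresh path this property describes,
reusing the `sales_index` name from the secondary-index guide (the index name
is an assumption here):
```
-- merge the small data files in every segment of the SI table
REFRESH INDEX sales_index
-- or merge only within a specified segment
REFRESH INDEX sales_index WHERE SEGMENT.ID IN(1)
```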
## Query Configuration
diff --git a/docs/index/bloomfilter-index-guide.md
b/docs/index/bloomfilter-index-guide.md
index 85f284a..03085f1 100644
--- a/docs/index/bloomfilter-index-guide.md
+++ b/docs/index/bloomfilter-index-guide.md
@@ -36,14 +36,15 @@ Creating BloomFilter Index
Dropping Specified Index
```
DROP INDEX [IF EXISTS] index_name
- ON TABLE main_table
+ ON [TABLE] main_table
```
Showing all Indexes on this table
```
SHOW INDEXES
- ON TABLE main_table
+ ON [TABLE] main_table
```
+> NOTE: Keywords given inside `[]` are optional.
Disable Index
> The index by default is enabled. To support tuning on query, we can disable
> a specific index during query to observe whether we can gain performance
> enhancement from it. This is effective only for the current session.
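A minimal sketch, assuming the session-level visibility property takes the form
`carbon.index.visible.<dbName>.<tableName>.<indexName>` (the names below are
hypothetical):
```
-- assumption: visibility property pattern carbon.index.visible.<db>.<table>.<index>
SET carbon.index.visible.default.main_table.bloom_index = false
```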
@@ -59,7 +60,7 @@ Disable Index
## BloomFilter Index Introduction
A Bloom filter is a space-efficient probabilistic data structure that is used
to test whether an element is a member of a set.
Carbondata introduced BloomFilter as an index to enhance the performance of
querying with precise value.
-It is well suitable for queries that do precise match on high cardinality
columns(such as Name/ID).
+It is well suited for queries that do precise matching on high cardinality
columns (such as Name/ID).
Internally, CarbonData maintains a BloomFilter per blocklet for each index
column to indicate whether a value of the column is in this blocklet.
Just like the other indexes, BloomFilter index is managed along with main
tables by CarbonData.
User can create BloomFilter index on specified columns with specified
BloomFilter configurations such as size and probability.
@@ -79,7 +80,7 @@ For instance, main table called **index_test** which is
defined as:
In the above example, `id` and `name` are high cardinality columns
and we always query on `id` and `name` with precise value.
-since `id` is in the sort_columns and it is orderd,
+since `id` is in the sort_columns and it is ordered,
query on it will be fast because CarbonData can skip all the irrelative
blocklets.
But queries on `name` may be bad since the blocklet minmax may not help,
because in each blocklet the range of the value of `name` may be the same --
all from A* to z*.
@@ -96,7 +97,7 @@ User can create BloomFilter Index using the Create Index DDL:
PROPERTIES ('BLOOM_SIZE'='640000', 'BLOOM_FPP'='0.00001',
'BLOOM_COMPRESS'='true')
```
-Here, (name,id) are INDEX_COLUMNS. Carbondata will generate BloomFilter index
on these columns. Queries on these columns are usually like 'COL = VAL'.
+Here, (name,id) are INDEX_COLUMNS. Carbondata will generate BloomFilter index
on these columns. Queries on these columns are usually like `'COL = VAL'`.
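As a hedged illustration (the literal values are hypothetical), queries of this
shape are the ones the BloomFilter can prune:
```
-- precise match on the bloom index columns
SELECT * FROM index_test WHERE id = 1
SELECT * FROM index_test WHERE name = 'n10'
```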
**Properties for BloomFilter Index**
@@ -131,7 +132,7 @@ You can refer to the corresponding section in [CarbonData
Lucene Index](https://
+ We can create multiple BloomFilter Indexes on one table,
but we do recommend creating one BloomFilter Index that contains multiple index
columns,
because the data loading and query performance will be better.
-+ `BLOOM_FPP` is only the expected number from user, the actually FPP may be
worse.
++ `BLOOM_FPP` is only the expected number from user, the actual FPP may be
worse.
If the BloomFilter Index does not work well,
you can try to increase `BLOOM_SIZE` and decrease `BLOOM_FPP` at the same
time.
Notice that bigger `BLOOM_SIZE` will increase the size of index file
@@ -145,5 +146,5 @@ You can refer to the corresponding section in [CarbonData
Lucene Index](https://
+ In some scenarios, the BloomFilter Index may not enhance the query
performance significantly
but if it can reduce the number of spark tasks,
there is still a chance that the BloomFilter Index can enhance the performance
for concurrent queries.
-+ Note that BloomFilter Index will decrease the data loading performance and
may cause slightly storage expansion (for index file).
++ Note that BloomFilter Index will decrease the data loading performance and
may cause slight storage expansion (for index file).
diff --git a/docs/index/index-management.md b/docs/index/index-management.md
index 6b4b6ec..7bd9c75 100644
--- a/docs/index/index-management.md
+++ b/docs/index/index-management.md
@@ -51,54 +51,51 @@ Currently, there are 3 Index implementations in CarbonData.
There are two kinds of management semantics for Index.
-1. Automatic Refresh: Create index without `WITH DEFERRED REBUILD` in the
statement, which is by default.
-2. Manual Refresh: Create index with `WITH DEFERRED REBUILD` in the statement
+1. Automatic Refresh
+2. Manual Refresh
### Automatic Refresh
-When user creates a index on the main table without using `WITH DEFERRED
REFRESH` syntax, the index will be managed by system automatically.
-For every data load to the main table, system will immediately trigger a load
to the index automatically. These two data loading (to main table and index) is
executed in a transactional manner, meaning that it will be either both success
or neither success.
+When a user creates an index on the main table without using `WITH DEFERRED
REFRESH` syntax, the index will be managed by the system automatically.
+For every data load to the main table, the system will immediately trigger a
load to the index automatically. These two data loads (to the main table and
the index) are executed in a transactional manner, meaning that either both
succeed or neither does.
-The data loading to index is incremental based on Segment concept, avoiding a
expensive total rebuild.
+The data loading to the index is incremental, based on the Segment concept,
avoiding an expensive full refresh.
-If user perform following command on the main table, system will return
failure. (reject the operation)
+If a user performs any of the following commands on the main table, the system
will return failure (reject the operation):
1. Data management command: `UPDATE/DELETE/DELETE SEGMENT`.
2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE
DATATYPE`,
`ALTER TABLE RENAME`. Note that adding a new column is supported, and for
dropping columns and
change datatype command, CarbonData will check whether it will impact the
index table, if
- not, the operation is allowed, otherwise operation will be rejected by
throwing exception.
+ not, the operation is allowed; otherwise the operation will be rejected by
throwing an exception.
3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`.
-If user do want to perform above operations on the main table, user can first
drop the index, perform the operation, and re-create the index again.
+If a user does want to perform the above operations on the main table, the
user can first drop the index, perform the operation, and then re-create the
index.
-If user drop the main table, the index will be dropped immediately too.
+If a user drops the main table, the index will be dropped immediately too.
-We do recommend you to use this management for index.
+We recommend using this management approach for indexes.
### Manual Refresh
-When user creates a index specifying manual refresh semantic, the index is
created with status *disabled* and query will NOT use this index until user can
issue REFRESH INDEX command to build the index. For every REFRESH INDEX
command, system will trigger a full rebuild of the index. After rebuild is
done, system will change index status to *enabled*, so that it can be used in
query rewrite.
+When a user creates an index on the main table using `WITH DEFERRED REFRESH`
syntax, the index will be created with status *disabled* and queries will NOT
use this index until the user issues the `REFRESH INDEX` command to build the
index. For every `REFRESH INDEX` command, the system will trigger a full
refresh of the index. Once the refresh operation is finished, the system will
change the index status to *enabled*, so that it can be used in query rewrite.
For every new data loading, data update, delete, the related index will be
made *disabled*,
which means that the following queries will not benefit from the index before
it becomes *enabled* again.
-If the main table is dropped by user, the related index will be dropped
immediately.
+If the main table is dropped by the user, the related index will be dropped
immediately.
**Note**:
-+ If you are creating a index on external table, you need to do manual
management of the index.
-+ For index such as BloomFilter index, there is no need to do manual refresh.
- By default it is automatic refresh,
- which means its data will get refreshed immediately after the index is
created or the main table is loaded.
- Manual refresh on this index will has no impact.
++ If you are creating an index on an external table, you need to do manual
management of the index.
++ Currently, all types of indexes supported by carbon will be automatically
refreshed by default, which means their data will get refreshed immediately
after the index is created or the main table is loaded. Manual refresh on these
indexes is not supported.
## Index Related Commands
### Explain
-How can user know whether index is used in the query?
+How can users know whether an index is used in the query?
-User can set enable.query.statistics = true and use EXPLAIN command to know,
it will print out something like
+Users can set `enable.query.statistics = true` and use the `EXPLAIN` command
to find out; it will print out something like
```text
== CarbonData Profiler ==
@@ -113,7 +110,7 @@ Table Scan on default.main
### Show Index
-There is a SHOW INDEXES command, when this is issued, system will read all
index from the carbon table and print all information on screen. The current
information includes:
+There is a SHOW INDEXES command; when it is issued, the system will read all
indexes from the carbon table and print all their information on screen. The
current information includes:
- Name
- Provider like lucene
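For instance, a minimal sketch reusing the `main_table` name from the guides
above:
```
SHOW INDEXES ON TABLE main_table
```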
diff --git a/docs/index/lucene-index-guide.md b/docs/index/lucene-index-guide.md
index c811ec3..87f840a 100644
--- a/docs/index/lucene-index-guide.md
+++ b/docs/index/lucene-index-guide.md
@@ -36,14 +36,15 @@ index_columns is the list of string columns on which lucene
creates indexes.
Index can be dropped using the following DDL:
```
DROP INDEX [IF EXISTS] index_name
- ON TABLE main_table
+ ON [TABLE] main_table
```
To show all Indexes created, use:
```
SHOW INDEXES
- ON TABLE main_table
+ ON [TABLE] main_table
```
-It will show all Indexes created on main table.
+It will show all Indexes created on the main table.
+> NOTE: Keywords given inside `[]` are optional.
## Lucene Index Introduction
@@ -83,28 +84,28 @@ It will show all Indexes created on main table.
When loading data to the main table, lucene index files will be generated for
all the index_columns (String Columns) given in the CREATE statement, which
contain information about the data location of the index_columns. These index
files will be written inside a folder named with the index name
-inside each segment folders.
+inside each segment folder.
-A system level configuration carbon.lucene.compression.mode can be added for
best compression of
+A system level configuration `carbon.lucene.compression.mode` can be added for
best compression of
lucene index files. The default value is `speed`, which favors index writing
speed. If the value is `compression`, the index files will be compressed to a
smaller size.
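A hedged sketch of the corresponding carbon.properties entry (placing it in
carbon.properties is an assumption based on "system level configuration"
above):
```
carbon.lucene.compression.mode=compression
```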
## Querying data
As a technique for query acceleration, Lucene indexes cannot be queried
directly.
-Queries are to be made on main table. when a query with TEXT_MATCH('name:c10')
or
+Queries are to be made on the main table. When a query with
TEXT_MATCH('name:c10') or
TEXT_MATCH_WITH_LIMIT('name:n10',10) [the second parameter represents the
number of results to be returned; if the user does not specify this value, all
results will be returned without any limit] is
-fired, two jobs are fired. The first job writes the temporary files in folder
created at table level
-which contains lucene's seach results and these files will be read in second
job to give faster
+fired, two jobs will be launched. The first job writes the temporary files in
a folder created at the table level,
+which contains lucene's search results, and these files will be read in the
second job to give faster
results. These temporary files will be cleared once the query finishes.
-User can verify whether a query can leverage Lucene index or not by executing
`EXPLAIN`
+Users can verify whether a query can leverage the Lucene index or not by
executing the `EXPLAIN`
command, which will show the transformed logical plan, and thus the user can
check whether the TEXT_MATCH()
filter is applied on the query or not.
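For example, a minimal sketch reusing the `index_test` table from the examples
below:
```
EXPLAIN SELECT * FROM index_test WHERE TEXT_MATCH('name:n10')
```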
**Note:**
- 1. The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must be always
in lower case and
-filter condition like 'AND','OR' must be in upper case.
+ 1. The filter columns in TEXT_MATCH or TEXT_MATCH_WITH_LIMIT must always be
in lowercase and
+filter conditions like 'AND','OR' must be in uppercase.
Ex:
```
@@ -124,7 +125,7 @@ filter condition like 'AND','OR' must be in upper case.
```
-Below like queries can be converted to text_match queries as following:
+Below `like` queries can be converted to text_match queries as follows:
```
select * from index_test where name='n10'
@@ -151,9 +152,8 @@ select * from index_test where TEXT_MATCH('name:*10
-name:*n*')
**Note:** For lucene queries and syntax, refer to
[lucene-syntax](http://www.lucenetutorial.com/lucene-query-syntax.html)
## Data Management with lucene index
-Once there is lucene index is created on the main table, following command on
the main
-table
-is not supported:
+Once a lucene index is created on the main table, the following commands on
the main
+table are not supported:
1. Data management command: `UPDATE/DELETE`.
2. Schema management command: `ALTER TABLE DROP COLUMN`, `ALTER TABLE CHANGE
DATATYPE`,
`ALTER TABLE RENAME`.
diff --git a/docs/index/secondary-index-guide.md
b/docs/index/secondary-index-guide.md
index e588ed9..1d86b82 100644
--- a/docs/index/secondary-index-guide.md
+++ b/docs/index/secondary-index-guide.md
@@ -30,34 +30,36 @@ Start spark-sql in terminal and run the following queries,
```
CREATE TABLE maintable(a int, b string, c string) stored as carbondata;
insert into maintable select 1, 'ab', 'cd';
-CREATE index inex1 on table maintable(c) AS 'carbondata';
+CREATE index index1 on table maintable(c) AS 'carbondata';
SELECT a from maintable where c = 'cd';
// NOTE: run explain query and check if query hits the SI table from the plan
EXPLAIN SELECT a from maintable where c = 'cd';
```
## Secondary Index Introduction
- Sencondary index tables are created as a indexes and managed as child tables
internally by
- Carbondata. Users can create secondary index based on the column position in
main table(Recommended
+ Secondary index tables are created as indexes and managed as child tables
internally by
+ Carbondata. Users can create a secondary index based on the column position
in the main table (recommended
for right columns) and the queries should have a filter on that column to
improve the filter query
performance.
- SI tables will always be loaded non-lazy way. Once SI table is created,
Carbondata's
+ Data refresh to the secondary index is always automatic. Once SI table is
created, Carbondata's
CarbonOptimizer with the help of `CarbonSITransformationRule`, transforms
the query plan to hit the
SI table based on the filter condition or set of filter conditions present
in the query.
- So first level of pruning will be done on SI table as it stores blocklets
and main table/parent
+ So the first level of pruning will be done on the SI table as it stores
blocklets and main table/parent
table pruning will be based on the SI output, which helps in giving faster
query results with
better pruning.
- Secondary Index table can be create with below syntax
+ Secondary Index table can be created with the below syntax
```
CREATE INDEX [IF NOT EXISTS] index_name
ON TABLE maintable(index_column)
AS
'carbondata'
- [TBLPROPERTIES('table_blocksize'='1')]
+ [PROPERTIES('table_blocksize'='1')]
```
+> NOTE: Keywords given inside `[]` are optional.
+
For instance, consider a main table called **sales** which is defined as
```
@@ -78,16 +80,16 @@ EXPLAIN SELECT a from maintable where c = 'cd';
ON TABLE sales(user_id)
AS
'carbondata'
- TBLPROPERTIES('table_blocksize'='1')
+ PROPERTIES('table_blocksize'='1')
```
#### How SI tables are selected
-When a user executes a filter query, during query planning phase, CarbonData
with help of
+When a user executes a filter query, during the query planning phase,
CarbonData with the help of
`CarbonSITransformationRule`, checks if there are any index tables present on
the filter column of
-query. If there are any, then filter query plan will be transformed such a way
that, execution will
-first hit the corresponding SI table and give input to main table for further
pruning.
+query. If there are any, then the filter query plan will be transformed in
such a way that execution will
+first hit the corresponding SI table and give input to the main table for
further pruning.
For the main table **sales** and SI table **index_sales** created above,
the following queries
@@ -105,27 +107,27 @@ will be transformed by CarbonData's
`CarbonSITransformationRule` to query agains
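As a hedged illustration (the filter value is hypothetical), a query of the
following shape qualifies for this rewrite, and the plan can be checked with
`EXPLAIN`:
```
EXPLAIN SELECT country, sex FROM sales WHERE user_id = 'xxx'
```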
### Loading data to Secondary Index table(s).
-*case1:* When SI table is created and the main table does not have any data.
In this case every
-consecutive load will load to SI table once main table data load is finished.
+*case1:* When the SI table is created and the main table does not have any
data. In this case, every
+consecutive load to the main table will load data to the SI table once the
main table data load is finished.
-*case2:* When SI table is created and main table already contains some data,
then SI creation will
-also load to SI table with same number of segments as main table. There after,
consecutive load to
-main table will load to SI table also.
+*case2:* When the SI table is created and the main table already contains some
data, then SI creation will
+also load data to the SI table with the same number of segments as the main
table. Thereafter, consecutive load to
+the main table will also load data to the SI table.
**NOTE**:
- * In case of data load failure to SI table, then we make the SI table disable
by setting a hive serde
+ * In case of data load failure to the SI table, then we make the SI table
disabled by setting a hive serde
property. The subsequent main table load will load the old failed loads along
with the current load and
makes the SI table enabled and available for query.
## Querying data
-Direct query can be made on SI tables to see the data present in position
reference columns.
-When a filter query is fired, if the filter column is a secondary index
column, then plan is
-transformed accordingly to hit SI table first to make better pruning with main
table and in turn
+Direct query can be made on SI tables to check the data present in position
reference columns.
+When a filter query is fired, and if the filter column is a secondary index
column, then the plan is
+transformed accordingly to hit the SI table first to make better pruning with
the main table, which in turn
helps for faster query results.
-User can verify whether a query can leverage SI table or not by executing
`EXPLAIN`
-command, which will show the transformed logical plan, and thus user can check
whether SI table
-table is selected.
+Users can verify whether a query can leverage the SI table or not by executing
the `EXPLAIN`
+command, which will show the transformed logical plan, and thus users can
check whether the SI table
+is selected.
## Compacting SI table
@@ -133,33 +135,33 @@ table is selected.
### Compacting the SI table through Main Table compaction
Running the Compaction command (`ALTER TABLE COMPACT`) [COMPACTION TYPE ->
MINOR/MAJOR] on the main table will
automatically delete all the old segments of SI and create a new segment with
the same name as the main
-table compacted segmet and loads data to it.
+table compacted segment and loads data to it.
-### Compacting SI table's individual segment(s) through REBUILD command
-Where there are so many small files present in the SI table, then we can use
REBUILD command to
+### Compacting SI table's individual segment(s) through REFRESH INDEX command
+When there are many small files present in the SI table, we can use
the REFRESH INDEX command to
compact the files within an SI segment to avoid many small files.
```
- REBUILD INDEX sales_index
+ REFRESH INDEX sales_index
```
-This command merges data files in each segment of SI table.
+This command merges data files in each segment of the SI table.
```
- REBUILD INDEX sales_index WHERE SEGMENT.ID IN(1)
+ REFRESH INDEX sales_index WHERE SEGMENT.ID IN(1)
```
-This command merges data files within specified segment of SI table.
+This command merges data files within a specified segment of the SI table.
## How to skip Secondary Index?
-When Secondary indexes are created on a table(s), always data fetching happens
from secondary
+When secondary indexes are created on a table, data fetching happens from
the secondary
indexes created on the main tables for better performance. But sometimes, data
fetching from the
-secondary index might degrade query performance in case where the data is
sparse and most of the
+secondary index might degrade query performance in cases where the data is
sparse and most of the
blocklets need to be scanned. So to avoid such secondary indexes, we use NI as
a function on filters
-with in WHERE clause.
+within WHERE clause.
```
SELECT country, sex from sales where NI(user_id = 'xxx')
```
-The above query ignores column user_id from secondary index and fetch data
from main table.
+The above query ignores column `user_id` from the secondary index and fetches
data from the main table.
## DDLs on Secondary Index
@@ -168,7 +170,7 @@ This command is used to get information about all the
secondary indexes on a tab
Syntax
```
- SHOW INDEXES on [db_name.]table_name
+ SHOW INDEXES ON [TABLE] [db_name.]table_name
```
### Drop index Command
@@ -176,7 +178,7 @@ This command is used to drop an existing secondary index on
a table
Syntax
```
- DROP INDEX [IF EXISTS] index_name on [db_name.]table_name
+ DROP INDEX [IF EXISTS] index_name ON [TABLE] [db_name.]table_name
```
### Register index Command
@@ -185,5 +187,5 @@ where we have old stores.
Syntax
```
- REGISTER INDEX TABLE index_name ON [db_name.]table_name
+ REGISTER INDEX TABLE index_name ON [TABLE] [db_name.]table_name
```
\ No newline at end of file
diff --git a/docs/mv-guide.md b/docs/mv-guide.md
index 9902e1c..24e38b1 100644
--- a/docs/mv-guide.md
+++ b/docs/mv-guide.md
@@ -35,17 +35,17 @@
INSERT INTO maintable SELECT 1, 'ab', 2;
CREATE MATERIALIZED VIEW view1 AS SELECT a, sum(b) FROM maintable GROUP
BY a;
SELECT a, sum(b) FROM maintable GROUP BY a;
- // NOTE: run explain query and check if query hits the Index table from
the plan
+ // NOTE: run explain query and check if query hits the mv table from the
plan
EXPLAIN SELECT a, sum(b) FROM maintable GROUP BY a;
```
-## Introductions
+## Introduction
- Materialized views are created as tables from queries. User can create
limitless materialized view
+ Materialized views are created as tables from queries. Users can create
limitless materialized views
to improve query performance provided the storage requirements and loading
time are acceptable.
Materialized view can be refreshed on commit or on manual. Once materialized
views are created,
- CarbonData's MVRewriteRule helps to select the most efficient materialized
view based on
+ CarbonData's `MVRewriteRule` helps to select the most efficient materialized
view based on
the user query and rewrite the SQL to select the data from materialized view
instead of
fact tables. Since the data size of materialized view is smaller and data is
pre-processed,
user queries are much faster.
@@ -63,7 +63,7 @@
STORED AS carbondata
```
- User can create materialized view using the CREATE MATERIALIZED VIEW
statement.
+ Users can create a materialized view using the CREATE MATERIALIZED VIEW
statement.
```
CREATE MATERIALIZED VIEW agg_sales
@@ -75,7 +75,7 @@
```
**NOTE**:
- * Group by and Order by columns has to be provided in projection list while
creating materialized view.
+ * Group by and Order by columns have to be provided in the projection list
while creating a materialized view.
* If only a single fact table is involved in materialized view creation, then
the TableProperties of
the fact table (if not present in an aggregate function like sum(col)) listed
below will be
inherited by the materialized view.
@@ -93,7 +93,7 @@
* Creating a materialized view with a select query containing only a
projection of all columns of the fact
table is unsupported.
**Example:**
- If table 'x' contains columns 'a,b,c', then creating MV Index with
below queries is not supported.
+ If table 'x' contains columns 'a,b,c', then creating MV with below
queries is not supported.
1. ```SELECT a,b,c FROM x```
2. ```SELECT * FROM x```
* TableProperties can be provided in Properties excluding
LOCAL_DICTIONARY_INCLUDE,
@@ -107,9 +107,9 @@
#### How materialized views are selected
- When a user query is submitted, during query planning phase, CarbonData will
collect modular plan
- candidates and process the the ModularPlan based on registered summary data
sets. Then,
- materialized view for this query will be selected among the candidates.
+ When a user query is submitted, during the query planning phase, CarbonData
will collect modular plan
+ candidates and process the ModularPlan based on registered summary data sets.
Then,
+ a materialized view for this query will be selected among the candidates.
For the fact table **sales** and materialized view **agg_sales** created
above, the following queries
```
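-- hedged example of such a query (the projection and grouping are assumed
-- from the agg_sales definition earlier in this guide):
SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex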
@@ -140,7 +140,7 @@
view will be triggered by the CREATE MATERIALIZED VIEW statement when user
creates the materialized
view.
- For incremental loads to fact table, data to materialized view will be loaded
once the
+ For incremental loads to the fact table, data to materialized view will be
loaded once the
corresponding fact table load is completed.
### Loading data on manual
@@ -148,7 +148,7 @@
In case of WITH DEFERRED REFRESH, data load to materialized view will be
triggered by the refresh
command. Materialized view will be in DISABLED state in below scenarios.
- * when materialized view is created.
+ * when a materialized view is created.
* when data of fact table and materialized view are not in sync.
User should fire REFRESH MATERIALIZED VIEW command to sync all segments of
fact table with
@@ -163,27 +163,27 @@
During load to the fact table, if any one of the loads to a materialized view
fails, then that
corresponding materialized view will be DISABLED and load to other
materialized views mapped
- to fact table will continue.
+ to the fact table will continue.
Users can fire the REFRESH MATERIALIZED VIEW command to sync, or else the
subsequent table load
will load the old failed loads along with the current load and enable the
disabled materialized view.
**NOTE**:
* In case of InsertOverwrite/Update operation on fact table, all segments
of materialized view
- will be MARKED_FOR_DELETE and reload to Index table will happen by
REFRESH MATERIALIZED VIEW,
+ will be MARKED_FOR_DELETE and reload to mv table will happen by REFRESH
MATERIALIZED VIEW,
in case of materialized view which refresh on manual and once the
InsertOverwrite/Update
operation on fact table is finished, in case of materialized view which
refresh on commit.
* In case of full scan query, Data Size and Index Size of fact table and
materialized view
- will not the same, as fact table and materialized view has different
column names.
+ will not be the same, as fact table and materialized view have different
column names.
## Querying data
- Queries are to be made on fact table. While doing query planning, internally
CarbonData will check
+ Queries are to be made on the fact table. While doing query planning,
internally CarbonData will check
for the materialized views which are associated with the fact table, and do
query plan
transformation accordingly.
- User can verify whether a query can leverage materialized view or not by
executing `EXPLAIN` command,
- which will show the transformed logical plan, and thus user can check whether
materialized view
+ Users can verify whether a query can leverage materialized view or not by
executing the `EXPLAIN` command,
+ which will show the transformed logical plan, and thus the user can check
whether a materialized view
is selected.
## Compacting
@@ -207,7 +207,7 @@
materialized view; if not, the operation is allowed, otherwise the operation
will be rejected by
throwing an exception.
3. Partition management command: `ALTER TABLE ADD/DROP PARTITION`. Note
that dropping a partition
- will be allowed only if partition is participating in all indexes
associated with fact table.
+ will be allowed only if the partition column of fact table is
participating in all of the table's materialized views.
Drop Partition is not allowed, if any materialized view is associated
with more than one
fact table. Drop Partition directly on materialized view is not allowed.
4. Complex Datatypes for materialized view are not supported.
@@ -215,7 +215,7 @@
However, there is still a way to support these operations on the fact table;
in the current CarbonData
release, the user can do as follows:
- 1. Remove the materialized by `DROP MATERIALIZED VIEW` command.
+ 1. Remove the materialized view by `DROP MATERIALIZED VIEW` command.
2. Carry out the data management operation on fact table.
3. Create the materialized view again by `CREATE MATERIALIZED VIEW` command.
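A minimal sketch of this three-step flow (the view definition is assumed from
the agg_sales example above):
```
DROP MATERIALIZED VIEW agg_sales
-- carry out the data management operation on the fact table here
CREATE MATERIALIZED VIEW agg_sales AS
  SELECT country, sex, sum(quantity), avg(price) FROM sales GROUP BY country, sex
```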
@@ -273,14 +273,14 @@
GROUP BY timeseries(order_time, 'minute')
```
And execute the below query to check time series data. In this example, a
materialized view of
- aggregated table on price column will be created, which will be aggregated on
every one minute.
+ the aggregated table on the price column will be created, which will be
aggregated every one minute.
```
SELECT timeseries(order_time,'minute'), avg(price)
FROM sales
GROUP BY timeseries(order_time,'minute')
```
- Find below the result of above query aggregated over minute.
+ Find below the result of the above query aggregated over a minute.
```
+---------------------------------------+----------------+
@@ -300,19 +300,17 @@
granularity provided during creation and stored on each segment.
**NOTE**:
- 1. Single select statement cannot contain time series udf(s) neither with
different granularity
- nor with different timestamp/date columns.
- 2. Retention policies for time series is not supported yet.
+ 1. Retention policies for time series are not supported yet.
## Time Series RollUp Support
- Time series queries can be rolled up from existing materialized view.
+ Time series queries can be rolled up from an existing materialized view.
### Query RollUp
Consider an example where the query is on hour level granularity, but the
materialized view
with hour level granularity is not present but materialized view with minute
level granularity is
- present, then we can get the data from minute level and the aggregate the
hour level data and
+ present, then we can get the data from the minute level, aggregate it to the
hour level, and
give output. This is called query rollup.
Consider if the user creates the below time series materialized view,
@@ -334,10 +332,10 @@
```
Then, the above query can be rolled up from materialized view 'agg_sales', by
adding hour
- level time series aggregation on minute level aggregation. Users can fire
explain command
- to check if query is rolled up from existing materialized view.
+ level time series aggregation on minute level aggregation. Users can fire the
`EXPLAIN` command
+ to check if a query is rolled up from an existing materialized view.
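For instance, a hedged sketch (the hour-level query shape follows the
minute-level example above):
```
EXPLAIN SELECT timeseries(order_time,'hour'), avg(price)
FROM sales
GROUP BY timeseries(order_time,'hour')
```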
**NOTE**:
- 1. Queries cannot be rolled up, if filter contains time series function.
+ 1. Queries cannot be rolled up if the filter contains a time series
function.
2. Roll up is not yet supported for queries having join clause or order by
functions.
\ No newline at end of file
diff --git
a/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/mergedata/CarbonDataFileMergeTestCaseOnSI.scala
b/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/mergedata/CarbonDataFileMergeTestCaseOnSI.scala
index 9eced78..00c7d4a 100644
---
a/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/mergedata/CarbonDataFileMergeTestCaseOnSI.scala
+++
b/index/secondary-index/src/test/scala/org/apache/carbondata/spark/testsuite/mergedata/CarbonDataFileMergeTestCaseOnSI.scala
@@ -142,7 +142,7 @@ class CarbonDataFileMergeTestCaseOnSI
checkAnswer(sql("""Select count(*) from nonindexmerge where
name='n164419'"""), rows)
}
- test("Verify command of REBUILD INDEX command with invalid segments") {
+ test("Verify command of REFRESH INDEX command with invalid segments") {
CarbonProperties.getInstance()
.addProperty(CarbonCommonConstants.CARBON_SI_SEGMENT_MERGE, "false")
sql("DROP TABLE IF EXISTS nonindexmerge")