Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2215#discussion_r183617702
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -0,0 +1,180 @@
+# CarbonData Lucene DataMap
+
+* [Quick Example](#quick-example)
+* [DataMap Management](#datamap-management)
+* [Lucene DataMap Introduction](#lucene-datamap-introduction)
+* [Loading Data](#loading-data)
+* [Querying Data](#querying-data)
+* [Data Management](#data-management-with-pre-aggregate-tables)
+
+## Quick example
+Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
+
+Package the carbon jar, and copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
+```shell
+mvn clean package -DskipTests -Pspark-2.2
+```
+
+Start spark-shell in a new terminal, type :paste, then copy and run the following code.
+```scala
+ import java.io.File
+ import org.apache.spark.sql.SparkSession
+ import org.apache.spark.sql.CarbonSession._
+
+ val warehouse = new File("./warehouse").getCanonicalPath
+ val metastore = new File("./metastore").getCanonicalPath
+
+ val spark = SparkSession
+ .builder()
+ .master("local")
+ .appName("luceneDataMapExample")
+ .config("spark.sql.warehouse.dir", warehouse)
+ .getOrCreateCarbonSession(warehouse, metastore)
+
+ spark.sparkContext.setLogLevel("ERROR")
+
+ // drop table if exists previously
+ spark.sql(s"DROP TABLE IF EXISTS datamap_test")
+
+ // Create main table
+ spark.sql(
+ s"""
+ |CREATE TABLE datamap_test (
+ |name string,
+ |age int,
+ |city string,
+ |country string)
+ |STORED BY 'carbondata'
+ """.stripMargin)
+
+ // Create lucene datamap on the main table
+ spark.sql(
+ s"""
+ |CREATE DATAMAP dm
+ |ON TABLE datamap_test
+ |USING "lucene"
+ |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ """.stripMargin)
+
+ import spark.implicits._
+ import org.apache.spark.sql.SaveMode
+
+ // Load data into the main table. If writing the
+ // lucene index fails, the datamap is disabled
+ // for queries.
+ spark.sparkContext.parallelize(1 to 10)
+ .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
+ .toDF("name", "age", "city", "country")
+ .write
+ .format("carbondata")
+ .option("tableName", "datamap_test")
+ .option("compress", "true")
+ .mode(SaveMode.Append)
+ .save()
+
+ spark.sql(
+ s"""
+ |SELECT *
+ |FROM datamap_test
+ |WHERE TEXT_MATCH('name:c10')
+ """.stripMargin).show
+
+ spark.stop
+```
+
+## DataMap Management
+A Lucene DataMap can be created using the following DDL:
+ ```
+ CREATE DATAMAP [IF NOT EXISTS] datamap_name
+ ON TABLE main_table
+ USING "lucene"
+ DMPROPERTIES ('text_columns'='city, name', ...)
+ ```
+
+A DataMap can be dropped using the following DDL:
+ ```
+ DROP DATAMAP [IF EXISTS] datamap_name
+ ON TABLE main_table
+ ```
+To show all DataMaps created, use:
+ ```
+ SHOW DATAMAP
+ ON TABLE main_table
+ ```
+It will show all DataMaps created on the main table.
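+
+As a minimal sketch, the same statements can be issued from the `spark` session of the quick example. It assumes that session is still open (that is, it runs before `spark.stop`) and that the `dm` datamap on `datamap_test` exists. `.show(false)` is only used here to print untruncated output.
+```scala
+ // List the DataMaps on the main table (assumes the quick example session)
+ spark.sql("SHOW DATAMAP ON TABLE datamap_test").show(false)
+
+ // Drop the lucene datamap created in the quick example
+ spark.sql("DROP DATAMAP IF EXISTS dm ON TABLE datamap_test")
+```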
+
+
+## Lucene DataMap Introduction
+ Lucene datamaps are created as index DataMaps and are managed along with the main table by CarbonData.
+ Users can create as many lucene datamaps as required to improve query performance,
+ provided the storage requirements and loading speeds are acceptable.
+
+ Once lucene datamaps are created, the indexes generated by lucene are read to prune the data down to
+ row level for a filter query by launching a spark datamap job. Only the pruned data is then read,
+ which returns the correct result faster.
+
+ For instance, consider a main table called **datamap_test**, which is defined as
+
+ ```
+ CREATE TABLE datamap_test (
+ name string,
+ age int,
+ city string,
+ country string)
+ STORED BY 'carbondata'
+ ```
+
+ Users can create a Lucene datamap using the CREATE DATAMAP DDL
+
+ ```
+ CREATE DATAMAP dm
+ ON TABLE datamap_test
+ USING "lucene"
+ DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
+ ```
+
+## Loading data
+When data is loaded to the main table, CarbonData checks whether any lucene datamaps are present. If so,
+lucene index files are generated for all the text_columns (String columns) given in
+DMPROPERTIES. These index files contain the blocklet_id, page_id and row_id for all the
+data of the text_columns. The index files are written inside a folder named after the datamap, inside
+each segment directory.
+
+## Querying data
+As a technique for query acceleration, Lucene indexes cannot be queried directly.
+Queries are to be made on the main table. A UDF called TEXT_MATCH is registered in the spark session, so
+when a query with TEXT_MATCH() is fired, TEXT_MATCH is treated as a pushed filter during query planning.
+CarbonData checks all the lucene datamaps and fires a job for pruning. For each
+blocklet, a temporary file is generated which holds information down to row level, but the prune itself
+finally returns blocklets.
+
+When the query reaches the executor side, the temporary files are read and bitset groups are
+formed to return the query result.
+
+Users can verify whether a query can leverage a Lucene datamap by executing the `EXPLAIN`
+command, which shows the transformed logical plan. The user can then check whether the TEXT_MATCH()
+filter is applied to the query.
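+
+A minimal sketch of that check, assuming the `spark` session and the `datamap_test` table from the quick example are still available (run it before `spark.stop`). The exact plan text depends on the CarbonData and Spark versions, so no output is shown here.
+```scala
+ // Print the plan, then look for the TEXT_MATCH filter in the output
+ spark.sql(
+ """
+ |EXPLAIN
+ |SELECT * FROM datamap_test
+ |WHERE TEXT_MATCH('name:c10')
+ """.stripMargin).show(false)
+```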
+
+
+## Data Management with pre-aggregate tables
+Once there is lucene datamap is created on the main table, following
command on the main
--- End diff --
Once lucene datamap is created on the main table, following command on the
main table is not supported:
---