[GitHub] carbondata pull request #2215: [wip]add documentation for lucene datamap

sgururajshetty Mon, 23 Apr 2018 23:37:59 -0700

Github user sgururajshetty commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2215#discussion_r183617702
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -0,0 +1,180 @@
    +# CarbonData Lucene DataMap
    +  
    +* [Quick Example](#quick-example)
    +* [DataMap Management](#datamap-management)
    +* [Lucene Datamap](#lucene-datamap-introduction)
    +* [Loading Data](#loading-data)
    +* [Querying Data](#querying-data)
    +* [Data Management](#data-management-with-pre-aggregate-tables)
    +
    +## Quick example
    +Download and unzip spark-2.2.0-bin-hadoop2.7.tgz, and export $SPARK_HOME
    +
    +Package the carbon jar with the command below, then copy assembly/target/scala-2.11/carbondata_2.11-x.x.x-SNAPSHOT-shade-hadoop2.7.2.jar to $SPARK_HOME/jars
    +```shell
    +mvn clean package -DskipTests -Pspark-2.2
    +```
    +
    +Start spark-shell in a new terminal, type :paste, then copy and run the following code.
    +```scala
    + import java.io.File
    + import org.apache.spark.sql.{CarbonEnv, SparkSession}
    + import org.apache.spark.sql.CarbonSession._
    + import org.apache.spark.sql.streaming.{ProcessingTime, StreamingQuery}
    + import org.apache.carbondata.core.util.path.CarbonStorePath
    + 
    + val warehouse = new File("./warehouse").getCanonicalPath
    + val metastore = new File("./metastore").getCanonicalPath
    + 
    + val spark = SparkSession
    +   .builder()
    +   .master("local")
    +   .appName("preAggregateExample")
    +   .config("spark.sql.warehouse.dir", warehouse)
    +   .getOrCreateCarbonSession(warehouse, metastore)
    +
    + spark.sparkContext.setLogLevel("ERROR")
    +
    + // drop table if exists previously
    + spark.sql(s"DROP TABLE IF EXISTS datamap_test")
    + 
    + // Create main table
    + spark.sql(
    +   s"""
    +      |CREATE TABLE datamap_test (
    +      |name string,
    +      |age int,
    +      |city string,
    +      |country string)
    +      |STORED BY 'carbondata'
    +    """.stripMargin)
    + 
    + // Create lucene datamap on the main table
    + spark.sql(
    +   s"""
    +      |CREATE DATAMAP dm
    +      |ON TABLE datamap_test
    +      |USING "lucene"
    +      |DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +    """.stripMargin)
    +
    +  import spark.implicits._
    +  import org.apache.spark.sql.SaveMode
    +  import scala.util.Random
    + 
    +  // Load data into the main table. If
    +  // Lucene index writing fails, the datamap
    +  // will be disabled for queries
    +  val r = new Random()
    +  spark.sparkContext.parallelize(1 to 10)
    +   .map(x => ("c1" + x % 8, x % 8, "city" + x % 50, "country" + x % 60))
    +   .toDF("name", "age", "city", "country")
    +   .write
    +   .format("carbondata")
    +   .option("tableName", "datamap_test")
    +   .option("compress", "true")
    +   .mode(SaveMode.Append)
    +   .save()
    +      
    +  spark.sql(
    +    s"""
    +       |SELECT *
    +       |from datamap_test where
    +       |TEXT_MATCH('name:c10')
    +     """.stripMargin).show
    +
    +  spark.stop
    +```
    +
    +## DataMap Management
    +A Lucene DataMap can be created using the following DDL:
    +  ```
    +  CREATE DATAMAP [IF NOT EXISTS] datamap_name
    +  ON TABLE main_table
    +  USING "lucene"
    +  DMPROPERTIES ('text_columns'='city, name', ...)
    +  ```
    +
    +A DataMap can be dropped using the following DDL:
    +  ```
    +  DROP DATAMAP [IF EXISTS] datamap_name
    +  ON TABLE main_table
    +  ```
    +To show all DataMaps created, use:
    +  ```
    +  SHOW DATAMAP 
    +  ON TABLE main_table
    +  ```
    +This lists every DataMap created on the main table.
    +
    +
    +## Lucene DataMap Introduction
    +  Lucene datamaps are created as index DataMaps and are managed along with the main tables by CarbonData.
    +  Users can create as many Lucene datamaps as required to improve query performance,
    +  provided the storage requirements and loading speeds are acceptable.
    +  
    +  Once Lucene datamaps are created, a Spark datamap job is launched for each filter query. The job
    +  reads the indexes generated by Lucene to prune the data down to row level. Only the pruned data
    +  is then read, which returns the correct result faster.
    +    
    +  For instance, consider a main table called **datamap_test**, which is defined as:
    +  
    +  ```
    +  CREATE TABLE datamap_test (
    +    name string,
    +    age int,
    +    city string,
    +    country string)
    +  STORED BY 'carbondata'
    +  ```
    +  
    +  Users can create a Lucene datamap using the CREATE DATAMAP DDL:
    +  
    +  ```
    +  CREATE DATAMAP dm
    +  ON TABLE datamap_test
    +  USING "lucene"
    +  DMPROPERTIES ('TEXT_COLUMNS' = 'name, country')
    +  ```
    +
    +## Loading data
    +When data is loaded into the main table, the load process checks whether any Lucene datamaps are present.
    +If so, Lucene index files are generated for all the text_columns (string columns) given in
    +DMPROPERTIES. The index files contain the data of the text_columns along with the blocklet_id,
    +page_id and row_id of each entry. They are written to a folder named after the datamap inside
    +each segment directory.
    +
    +## Querying data
    +As a technique for query acceleration, Lucene indexes cannot be queried directly.
    +Queries are made on the main table. A UDF called TEXT_MATCH is registered in the Spark session, so
    +when a query with TEXT_MATCH() is fired, TEXT_MATCH is treated as a pushed filter during query
    +planning. All the Lucene datamaps are checked and a job is fired for pruning. For each
    +blocklet, a temporary file is generated that holds information down to row level, but the prune
    +finally returns blocklets.
    +
    +When the query reaches the executor side, the temporary files are read and bitset groups are
    +formed to return the query result.
    +
    +Users can verify whether a query leverages a Lucene datamap by executing the `EXPLAIN`
    +command, which shows the transformed logical plan. From the plan, users can check whether the
    +TEXT_MATCH() filter is applied to the query.
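    +
    +As a minimal sketch of this check, the statement below reuses the `datamap_test` table and the
    +TEXT_MATCH query from the quick example. The plan output is not shown because it depends on the
    +CarbonData and Spark versions in use.
    +```sql
    +-- Prefix the TEXT_MATCH query from the quick example with EXPLAIN
    +-- to print its plan instead of running it.
    +EXPLAIN
    +SELECT * FROM datamap_test
    +WHERE TEXT_MATCH('name:c10')
    +```
    +From spark-shell, the same statement can be passed to `spark.sql(...)`, and the plan is easier to read with `.show(false)`.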
    +
    +
    +## Data Management with pre-aggregate tables
    +Once there is lucene datamap is created on the main table, following 
command on the main
    --- End diff --
    
    Once lucene datamap is created on the main table, following command on the 
main table is not supported:
