[GitHub] carbondata pull request #2632: [CARBONDATA-2206] Enhanced document on Lucene...

sraghunandan Thu, 16 Aug 2018 20:47:49 -0700

Github user sraghunandan commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2632#discussion_r210798420
  
    --- Diff: docs/datamap/lucene-datamap-guide.md ---
    @@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
       USING 'lucene'
       DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
       ```
    -
    -**DMProperties**
    -1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
    -2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
    -   aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
    -   cardinality dimensions.
    -3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
    -   folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
    -   also it makes lucene small chunks of data.
    +**Properties for Lucene DataMap**
    +
    +| Property | Is Required | Default Value | Description |
    +|-------------|----------|--------|---------|
    +| INDEX_COLUMNS | YES |  | Carbondata will generate Lucene index on these string columns. |
    +| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
    +| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
    +
    +**Folder Structure for lucene datamap:**
    +  * Location of index files when Split BlockletId is TRUE: 
    +    
    +    tablePath/dataMapName/SegmentID/blockName/blockletID/..
    +
    +  * Location of index files when Split BlockletId is FALSE:
    +    
    +    tablePath/dataMapName/SegmentID/blockName/..
        
     ## Loading data
    -When loading data to main table, lucene index files will be generated for all the
    -index_columns(String Columns) given in DMProperties which contains information about the data
    -location of index_columns. These index files will be written inside a folder named with datamap name
    -inside each segment folders.
    +When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
     
    -A system level configuration carbon.lucene.compression.mode can be added for best compression of
    -lucene index files. The default value is speed, where the index writing speed will be more. If the
    -value is compression, the index file size will be compressed.
    +A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
     
     ## Querying data
     As a technique for query acceleration, Lucene indexes cannot be queried directly.
    -Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
    -TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
    -returned, if user does not specify this value, all results will be returned without any limit] is
    -fired, two jobs are fired.The first job writes the temporary files in folder created at table level
    -which contains lucene's seach results and these files will be read in second job to give faster
    -results. These temporary files will be cleared once the query finishes.
    +Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be returned, if user does not specify this value, all results will be returned without any limit] is fired, two jobs are fired. The first job performs pruning based on filter values and writes the lucene search results into temporary files in the dataMap folder created at table level. These files will be read during the second job (filter execution) to give faster results. These temporary files will be cleared once the query finishes.
    +
    +User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN` command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH() filter is applied on query or not.
     
    -User can verify whether a query can leverage Lucene datamap or not by executing `EXPLAIN`
    -command, which will show the transformed logical plan, and thus user can check whether TEXT_MATCH()
    -filter is applied on query or not.
    +**NOTE:** Temporary files will contain blockletId, pageId, and rowId of filter query.
    --- End diff --
    
    What's the use of this note?
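
For readers following the folder structure quoted in the diff, here is a minimal sketch of how the two documented layouts differ. It is only an illustration of the path patterns in the doc text, not CarbonData code: the function name `lucene_index_dir`, its parameters, and the sample values are all hypothetical.

```python
from pathlib import PurePosixPath


def lucene_index_dir(table_path, datamap_name, segment_id, block_name,
                     blocklet_id=None, split_blocklet=True):
    """Build the index directory described in the doc (hypothetical helper).

    split_blocklet=True  -> tablePath/dataMapName/SegmentID/blockName/blockletID
    split_blocklet=False -> tablePath/dataMapName/SegmentID/blockName
    """
    base = PurePosixPath(table_path) / datamap_name / str(segment_id) / block_name
    if not split_blocklet:
        # One folder per block; blocklet ids are not part of the path.
        return base
    if blocklet_id is None:
        raise ValueError("blocklet_id is required when split_blocklet is True")
    # One folder per blocklet, so the blocklet id is encoded in the path itself.
    return base / str(blocklet_id)


# Sample values below are made up for illustration.
print(lucene_index_dir("/store/db/t1", "dm", 0, "part-0-0", blocklet_id=3))
print(lucene_index_dir("/store/db/t1", "dm", 0, "part-0-0", split_blocklet=False))
```

With SPLIT_BLOCKLET enabled the blocklet is identified by the directory, which matches the doc's statement that only pageID and rowID then need to be stored in the index.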
