GitHub user sraghunandan commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2632#discussion_r210798390
--- Diff: docs/datamap/lucene-datamap-guide.md ---
@@ -70,42 +66,38 @@ It will show all DataMaps created on main table.
USING 'lucene'
DMPROPERTIES ('INDEX_COLUMNS' = 'name, country',)
```
-
-**DMProperties**
-1. INDEX_COLUMNS: The list of string columns on which lucene creates indexes.
-2. FLUSH_CACHE: size of the cache to maintain in Lucene writer, if specified then it tries to
- aggregate the unique data till the cache limit and flush to Lucene. It is best suitable for low
- cardinality dimensions.
-3. SPLIT_BLOCKLET: when made as true then store the data in blocklet wise in lucene , it means new
- folder will be created for each blocklet, thus, it eliminates storing blockletid in lucene and
- also it makes lucene small chunks of data.
+**Properties for Lucene DataMap**
+
+| Property | Is Required | Default Value | Description |
+|-------------|----------|--------|---------|
+| INDEX_COLUMNS | YES | | Carbondata will generate Lucene index on these string columns. |
+| FLUSH_CACHE | NO | -1 | It defines the size of the cache to maintain in Lucene writer. If specified, it tries to aggregate the unique data till the cache limit and then flushes to Lucene. It is recommended to define FLUSH_CACHE for low cardinality dimensions.|
+| SPLIT_BLOCKLET | NO | TRUE | When SPLIT_BLOCKLET is defined as "TRUE", folders are created per blocklet by using the blockletID. This eliminates indexing blockletID by lucene by storing only pageID and rowID, thus reducing the size of indexes created by lucene. |
+
+**Folder Structure for lucene datamap:**
+ * Location of index files when Split BlockletId is TRUE:
+
+ tablePath/dataMapName/SegmentID/blockName/blockletID/..
+
+ * Location of index files when Split BlockletId is FALSE:
+
+ tablePath/dataMapName/SegmentID/blockName/..
## Loading data
-When loading data to main table, lucene index files will be generated for all the
-index_columns(String Columns) given in DMProperties which contains information about the data
-location of index_columns. These index files will be written inside a folder named with datamap name
-inside each segment folders.
+When loading data to main table, lucene index files will be generated for all the index_columns(String Columns) given in DMProperties which contains information about the data location of index_columns. These index files will be written into the path mentioned above.
-A system level configuration carbon.lucene.compression.mode can be added for best compression of
-lucene index files. The default value is speed, where the index writing speed will be more. If the
-value is compression, the index file size will be compressed.
+A system level configuration carbon.lucene.compression.mode can be added for best compression of lucene index files. The default value is speed, where the index writing speed will be more. If the value is compression, the index file size will be compressed.
## Querying data
As a technique for query acceleration, Lucene indexes cannot be queried directly.
-Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or
-TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be
-returned, if user does not specify this value, all results will be returned without any limit] is
-fired, two jobs are fired.The first job writes the temporary files in folder created at table level
-which contains lucene's seach results and these files will be read in second job to give faster
-results. These temporary files will be cleared once the query finishes.
+Queries are to be made on main table. when a query with TEXT_MATCH('name:c10') or TEXT_MATCH_WITH_LIMIT('name:n10',10)[the second parameter represents the number of result to be returned, if user does not specify this value, all results will be returned without any limit] is fired, two jobs are fired. The first job performs pruning based on filter values and writes the lucene search results into temporary files in the dataMap folder created at table level. These files will be read during the second job (filter execution) to give faster results. These temporary files will be cleared once the query finishes.
--- End diff --
The sentence could be rewritten to name the Lucene UDFs that we use as the means of firing queries to Lucene.
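
For example, the guide could show queries along these lines. This is only a sketch: `main_table` is a placeholder table name, and the two UDF calls are the ones already quoted in the diff above.

```sql
-- Return all rows whose indexed column 'name' matches the Lucene term c10
SELECT * FROM main_table WHERE TEXT_MATCH('name:c10');

-- Same kind of match on n10, with the second argument capping the result at 10 rows
SELECT * FROM main_table WHERE TEXT_MATCH_WITH_LIMIT('name:n10', 10);
```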
---