[GitHub] [incubator-druid] RestfulBlue commented on issue #6189: Lucene indexing for free form text

GitHub Wed, 22 Aug 2018 00:09:02 -0700

Hi, multivalue dimensions will work only in simple, generic cases, for 
example where logs are plain space-separated words. But even with this kind 
of data they need external preprocessing, which will grow over time. For 
example, at first we just split on spaces, then we realize we also want to 
split on all special characters, then we realize we also want to search by 
part of a word, so we add k-skip-n-grams, etc. That way the external 
preprocessing slowly turns into what Lucene already does. Also, with this 
approach we can't simply get the source text back, e.g. for 
select * from table limit 100, because the data in a multivalue column is 
split and optimized for search. So this requires denormalizing the data and 
costs additional space.
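
To make the growth of that preprocessing concrete, here is a minimal sketch in plain Java (no Druid or Lucene APIs). The class name `TokenizeSketch`, its method names, and the sample log line are all made up for illustration, and plain character n-grams stand in for k-skip-n-grams to keep it short:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Druid or Lucene: each method is one more
// preprocessing stage that would have to live outside the database.
class TokenizeSketch {

    // Stage 1: split on whitespace only.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Stage 2: lowercase, then split on every character that is not a letter or digit.
    static List<String> wordTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Stage 3: character n-grams of one token, so that parts of a word can be matched.
    // (A simplification: real k-skip-n-grams would also allow gaps.)
    static List<String> charNGrams(String token, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= token.length(); i++) {
            out.add(token.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "ERROR conn-timeout: host=db01.example.com"; // made-up log line
        System.out.println(whitespaceTokens(line));
        System.out.println(wordTokens(line));
        System.out.println(charNGrams("timeout", 3));
        // Joining the stage-2 tokens no longer reproduces the original line,
        // which is why the source text would need to be stored separately.
        System.out.println(String.join(" ", wordTokens(line)).equals(line));
    }
}
```

Each stage here is roughly what a Lucene analyzer chain already provides, and the last line of `main` shows why the original text cannot be rebuilt from the tokens alone.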


Simple Lucene indexing looks like this:

```java
    // Classes used here come from org.apache.lucene.analysis(.standard), .document,
    // .index, .queryparser.classic, .search and .store; assertEquals is JUnit's.
    Analyzer analyzer = new StandardAnalyzer();

    // Store the index in memory:
    Directory directory = new RAMDirectory();
    // To store an index on disk, use this instead (FSDirectory.open takes a Path):
    // Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    IndexWriter iwriter = new IndexWriter(directory, config);
    Document doc = new Document();
    String text = "This is the text to be indexed.";
    // TYPE_STORED keeps the original text retrievable in addition to indexing it.
    doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
    iwriter.addDocument(doc);
    iwriter.close();

    // Now search the index:
    DirectoryReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    // Parse a simple query that searches for "text":
    QueryParser parser = new QueryParser("fieldname", analyzer);
    Query query = parser.parse("text");
    // Return at most 1000 hits (the older overload with a null Filter is gone in recent Lucene).
    ScoreDoc[] hits = isearcher.search(query, 1000).scoreDocs;
    assertEquals(1, hits.length);
    // Iterate through the results:
    for (int i = 0; i < hits.length; i++) {
      Document hitDoc = isearcher.doc(hits[i].doc);
      assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
    }
    ireader.close();
    directory.close();
```

I think adding it as a new column would be great. The main reason is that 
Lucene is heavier than simple token indexing. Mixing columns with indexing 
disabled, tokenized columns, and Lucene columns in one table can greatly 
reduce the total disk space required compared to full Lucene indexing.

[ Full content available at: 
https://github.com/apache/incubator-druid/issues/6189 ]
