limitation on token-length for KeywordAnalyzer?

Andreas Brandl Sun, 26 Jan 2014 08:49:50 -0800

Hi,

I'm trying to get a RegexpQuery to work properly with Lucene 4.6. However, it
consistently fails once a document grows beyond 32 KB: such a document never
shows up in the search results, even when it matches.


Is there some limitation on the length of fields? How do I get around this?

I've attached some simplified code to demonstrate the behaviour with
differently sized documents. I would expect all three documents to show up in
the results; however, the actual output is:

<snip>
small-doc
>16k-doc
</snip>

(so the '>32k-doc' is missing)

My overall goal is to index text files of arbitrary size and run regular
expression searches over them using Lucene's RegexpQuery. I suspect that the
KeywordAnalyzer causes the inconsistent behaviour - is this the right analyzer
to use for a RegexpQuery?

Thanks a lot.

Regards,
Andreas

import java.io.IOException;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneRegexpTest {

  public static void main(String[] args) throws IOException {

    Directory directory = new RAMDirectory();

    // KeywordAnalyzer emits the entire field value as one single token.
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new KeywordAnalyzer());

    try (IndexWriter writer = new IndexWriter(directory, config)) {
      addDoc(writer, "small-doc", "hello world");
      addDoc(writer, ">32k-doc", makeBigger("hello world", 32 * 1024));
      addDoc(writer, ">16k-doc", makeBigger("hello world", 16 * 1024));
      writer.commit();
    }

    try (DirectoryReader reader = DirectoryReader.open(directory)) {
      IndexSearcher isearcher = new IndexSearcher(reader);

      String regex = ".*hello world.*";
      RegexpQuery query = new RegexpQuery(new Term("content", regex));

      // doesn't work either:
      // WildcardQuery query = new WildcardQuery(new Term("content",
      // "*hello world*"));

      ScoreDoc[] scoreDocs = isearcher.search(query, Integer.MAX_VALUE).scoreDocs;

      for (ScoreDoc scoreDoc : scoreDocs) {
        Document hitDoc = isearcher.doc(scoreDoc.doc);
        System.out.println(hitDoc.get("identifier"));
      }
    }

  }

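  // Repeats 'in' until the result is at least 'bytes' chars long (one byte per char for ASCII input).
  private static String makeBigger(String in, int bytes) {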
    StringBuilder sb = new StringBuilder();
    while (sb.length() < bytes) {
      sb.append(in);
    }
    return sb.toString();
  }

  private static void addDoc(IndexWriter writer, String identifier, String content) throws IOException {
    Document ldoc = new Document();
    // "identifier" is indexed verbatim; "content" is passed through the configured analyzer.
    ldoc.add(new StringField("identifier", identifier, Store.YES));
    ldoc.add(new TextField("content", content, Store.YES));
    writer.addDocument(ldoc);
  }

}
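
For reference, the sizes involved can be checked without Lucene. The following
is only a minimal sketch: it reuses the makeBigger loop from the program above
and compares the UTF-8 length of each content string against 32766 bytes, which
as far as I can tell is the value of IndexWriter.MAX_TERM_LENGTH. The class
name TermSizeCheck and the constant ASSUMED_MAX_TERM_BYTES are made up for this
illustration, and the limit is hard-coded as an assumption rather than read
from Lucene.

```java
import java.nio.charset.StandardCharsets;

public class TermSizeCheck {

  // Assumption: mirrors IndexWriter.MAX_TERM_LENGTH (32766 bytes). Hard-coded
  // so that this sketch runs without Lucene on the classpath.
  public static final int ASSUMED_MAX_TERM_BYTES = 32766;

  // Same loop as makeBigger in the test program: repeat 'in' until the
  // result is at least 'bytes' chars long.
  public static String makeBigger(String in, int bytes) {
    StringBuilder sb = new StringBuilder();
    while (sb.length() < bytes) {
      sb.append(in);
    }
    return sb.toString();
  }

  // Length of the string once encoded as UTF-8.
  public static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // True if the whole string would fit into one term under the assumed limit.
  public static boolean fitsInOneTerm(String s) {
    return utf8Length(s) <= ASSUMED_MAX_TERM_BYTES;
  }

  public static void main(String[] args) {
    String[] names = {"small-doc", ">32k-doc", ">16k-doc"};
    String[] contents = {
      "hello world",
      makeBigger("hello world", 32 * 1024),
      makeBigger("hello world", 16 * 1024)
    };
    // Print the encoded size of each document and whether it fits.
    for (int i = 0; i < names.length; i++) {
      System.out.println(names[i] + ": " + utf8Length(contents[i])
          + " bytes, fits in one term: " + fitsInOneTerm(contents[i]));
    }
  }
}
```

The three content strings come out at 11, 32769 and 16390 bytes, so only the
'>32k-doc' value exceeds the assumed limit. Since KeywordAnalyzer turns the
whole field value into a single token, that would be consistent with exactly
this document going missing, but I have not confirmed that this is the cause.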

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
