Re: How to use RegexTermEnum

Erick Erickson Fri, 03 Jul 2009 12:03:55 -0700

WARNING: I haven't actually tried using RegexTermEnum in a
long time, but...


I *think* that the constructor positions you at the first term that
matches, without calling next(). At least there's nothing I saw
in the documentation that indicates you need to call next() before
calling term().

Assuming that's true, I think you're skipping the first term: your
while (rte.next()) loop advances past it before you ever read term()
or increment your count.

At least it's worth a try <G>....
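
A minimal sketch of the suspected behaviour, using a hypothetical stand-in
class (MatchingEnum, not Lucene code) whose constructor already positions it
on the first match. It assumes that term() returns null once the enum is
exhausted; that is an assumption about the stand-in, not a statement about
RegexTermEnum itself. The short strings are made-up terms, not the real urls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PrePositionedEnumSketch {

    /** Hypothetical stand-in for a term enum that is positioned on the
     *  first matching term as soon as it is constructed. */
    static final class MatchingEnum {
        private final List<String> terms;
        private final Pattern pattern;
        private int pos = -1;
        private String current;

        MatchingEnum(List<String> terms, String regex) {
            this.terms = terms;
            this.pattern = Pattern.compile(regex);
            next(); // assumed behaviour: the constructor moves to the first match
        }

        /** Current term, or null when there are no more matches. */
        String term() {
            return current;
        }

        /** Advances to the next matching term; returns false when none remain. */
        boolean next() {
            while (++pos < terms.size()) {
                if (pattern.matcher(terms.get(pos)).matches()) {
                    current = terms.get(pos);
                    return true;
                }
            }
            current = null;
            return false;
        }
    }

    /** The loop shape from the test: calls next() before reading the first term. */
    static int countSkippingFirst(MatchingEnum e) {
        int count = 0;
        while (e.next()) {
            count++;
        }
        return count;
    }

    /** Reads the current term first, then advances. */
    static int countAll(MatchingEnum e) {
        int count = 0;
        do {
            if (e.term() == null) {
                break; // nothing matched, or the enum is exhausted
            }
            count++;
        } while (e.next());
        return count;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList(
                "p=1&sid=aaa#1", "p=1&sid=bbb#1", "p=2&sid=ccc#2");
        // Two matching terms: the while loop sees 1, the do/while loop sees 2.
        System.out.println(countSkippingFirst(new MatchingEnum(terms, "p=1&.*#1")));
        System.out.println(countAll(new MatchingEnum(terms, "p=1&.*#1")));
    }
}
```

With one matching term the first loop counts 0 and with two it counts 1,
which is the same off-by-one you describe. If RegexTermEnum behaves like the
stand-in, switching your loops to the do/while shape should fix the counts.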

Best
Erick

On Fri, Jul 3, 2009 at 12:27 PM, Raf <r.ventag...@gmail.com> wrote:

> Hi,
> I am trying to solve the following problem:
> In my index I have a "url" field added as Field.Store.YES,
> Field.Index.NOT_ANALYZED and I must use this field as a "key" to identify a
> document.
>
> The problem is that sometimes two urls differ only because they contain
> a different session id.
> For example, I would like to recognize that
>
> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879
> and
>
> http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879
> are the same document!
>
> So I have tried using a regular expression, to ignore the sid and match both
> documents:
> "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879".
>
> At this point, I would like to retrieve all terms that satisfy my regex so I
> tried to use a RegexTermEnum, but it returns to me only one of the two
> documents.
> Actually, it seems to me that it does not return the "first" match.
> So, if I have only one match in my index, RegexTermEnum returns nothing, if
> I have two matches, it returns one doc, and so on.
>
> Here you can find a simple test that shows the problem (both asserts fail):
>
> <code>
> package it.celi.search;
>
> import static org.junit.Assert.assertEquals;
>
> import java.io.IOException;
>
> import org.apache.lucene.analysis.KeywordAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.index.IndexWriter.MaxFieldLength;
> import org.apache.lucene.search.regex.JakartaRegexpCapabilities;
> import org.apache.lucene.search.regex.RegexTermEnum;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
> import org.junit.After;
> import org.junit.Before;
> import org.junit.Test;
>
> public class RegexLuceneTest {
>
>    private Directory directory;
>
>    @Before
>    public void setUp() throws Exception {
>
>        this.directory = new RAMDirectory();
>        this.addDocsToIndex();
>    }
>
>    @After
>    public void tearDown() throws Exception {
>    }
>
>    @Test
>    public void test() throws IOException {
>
>        IndexReader reader = IndexReader.open(this.directory);
>        System.out.println("Num docs: " + reader.numDocs());
>
>        JakartaRegexpCapabilities regexpCapabilities = new JakartaRegexpCapabilities();
>
>        String urlToSearch =
>                "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432889\\&.*#3432889";
>        RegexTermEnum rte = new RegexTermEnum(reader, new Term("url", urlToSearch),
>                regexpCapabilities);
>        int count = 0;
>        while (rte.next()) {
>            System.out.println(rte.term() + " " + rte.docFreq());
>            count++;
>        }
>        assertEquals(1, count);
>
>        urlToSearch =
>                "http://digiland\\.libero\\.it/forum/viewtopic\\.php\\?p=3432879\\&.*#3432879";
>        rte = new RegexTermEnum(reader, new Term("url", urlToSearch),
>                regexpCapabilities);
>        count = 0;
>        while (rte.next()) {
>            System.out.println(rte.term() + " " + rte.docFreq());
>            count++;
>        }
>        assertEquals(2, count);
>
>    }
>
>    private void addDocsToIndex() throws IOException {
>
>        IndexWriter writer = new IndexWriter(directory, new KeywordAnalyzer(),
>                true, MaxFieldLength.UNLIMITED);
>
>        Document doc = new Document();
>        doc.add(new Field("url",
>                "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab02505591827a90fe5010f45c#3432879",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        doc.add(new Field("contents", "contenuto documento 1",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        writer.addDocument(doc);
>
>        doc = new Document();
>        doc.add(new Field("url",
>                "http://digiland.libero.it/forum/viewtopic.php?p=3432889&sid=16c7ea74d98a8229c1ddd4800a2738ec#3432889",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        doc.add(new Field("contents", "contenuto documento 2",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        writer.addDocument(doc);
>
>        doc = new Document();
>        doc.add(new Field("url",
>                "http://digiland.libero.it/forum/viewtopic.php?p=3432879&sid=70acaeab505d98a8229c10fe5010f45c#3432879",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        doc.add(new Field("contents", "contenuto documento 3",
>                Field.Store.YES, Field.Index.NOT_ANALYZED));
>        writer.addDocument(doc);
>
>        writer.optimize();
>        writer.close();
>    }
>
> }
> </code>
>
> What am I missing?
> Thanks.
>
> Bye,
> Raf
>
