Hello community, dear Grant
I have built a JUnit test case that illustrates the problem (appended below) -
there, I try to cut out the right substring with the offset values given by
Lucene - and fail :(
A few remarks:
In this example, the 'é' from 'Bosé' causes the '\w' pattern not to match -
unlike in StandardAnalyzer, it is treated as a delimiter character.
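A quick check of this regex behaviour shows why - Java's default '\w' matches
only [a-zA-Z0-9_], so 'é' would need a Unicode letter class like '\p{L}':

    System.out.println("é".matches("\\w"));    // false - '\w' is ASCII-only by default
    System.out.println("é".matches("\\p{L}")); // true - Unicode letter class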
Analysis: It seems that Lucene calculates the offset values by adding a virtual
delimiter between every field value. But Lucene forgets the last characters of a
field value when these are analyzer-specific delimiter chars. (I assume this
because of DocumentWriter, line 245: 'if(lastToken != null) offset +=
lastToken.endOffset() + 1;')
With this line of code, only the end offset of the last token is considered,
forgetting potential trimmed delimiter chars.
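To illustrate what I mean, a minimal sketch of that accumulation (with
hypothetical names - this is not the real DocumentWriter code):

    int offset = 0;
    for (String strFieldValue : fieldValues)  // fieldValues: hypothetical
    {
        // index the tokens of this value, shifted by the current base offset
        Token lastToken = invertValue(strFieldValue, offset);  // hypothetical helper
        // the next value starts right after the last token plus one virtual
        // delimiter - any delimiter chars trimmed after that token are lost here
        if (lastToken != null) offset += lastToken.endOffset() + 1;
    }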
Thus, a solution would be:
1. Add a single delimiter char between the field values
2. Subtract (from the Lucene offset) the count of analyzer-specific delimiter
chars that stand at the end of all field values before the match (a sketch of
this correction is included near the end of the test case below)
For this, one needs to know what counts as a delimiter for a specific analyzer.
The other possibility, of course, is to change the behaviour inside Lucene,
because the current offset values are more or less useless / hard to use (I
currently have no idea how to get the analyzer-specific delimiter chars).
For me, this looks like a bug - am I wrong?
I would be very happy about any ideas/hints/remarks :)
Greetings
Christian
Grant Ingersoll schrieb:
> Hi Christian,
>
> Is there any way you can post a complete, self-contained example,
> preferably as a JUnit test? I think it would be useful to know more
> about how you are indexing (i.e. what Analyzer, etc.).
> The offsets should be taken from whatever is set on the Token during
> Analysis. I, too, am trying to remember where in the code this is
> taking place.
>
> Also, what version of Lucene are you using?
>
> -Grant
>
> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
>
> Hello,
>
> I have an index with an 'actor' field; for each actor there exists a
> single field value entry, e.g.
>
> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition<movie_actors>
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
> movie_actors:Miguel Bosé
> movie_actors:Anna Lizaran (as Ana Lizaran)
> movie_actors:Raquel Sanchís
> movie_actors:Angelina Llongueras
>
> I try to get the term offset, e.g. for 'angelina' with
>
> termPositionVector = (TermPositionVector)
>     reader.getTermFreqVector(docNumber, "movie_actors");
> int iTermIndex = termPositionVector.indexOf("angelina");
> TermVectorOffsetInfo[] termOffsets =
>     termPositionVector.getOffsets(iTermIndex);
>
>
> I get one TermVectorOffsetInfo for the field - with offset numbers
> that are bigger than the length of any single field entry.
> I guessed that Lucene gives the offset numbers for the situation where
> all values are concatenated, which would be the single (virtual) string:
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>
> This fits in almost no situation, so my second guess was that Lucene
> adds a virtual delimiter between the single field entries for offset
> calculation. I added a delimiter, so the result would be:
>
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna
> Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
> (note the ' ' between each actor name)
>
> ...this also does not fit every situation - there are too many
> delimiters now. So I further guessed that Lucene does not add a
> delimiter in every situation, and added one only when the last
> character of an entry was a word character ('\w'), with:
>
> StringBuilder strbAttContent = new StringBuilder();
> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
> {
>     strbAttContent.append(strAttValue);
>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>         strbAttContent.append(' ');
> }
>
> where I get the resulting (virtual) entry:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>
> This fits for ~96% of all my queries... but it is still not 100% the
> way Lucene calculates the offset value for fields with multiple value
> entries.
>
>
> Maybe the problem is that there are special characters inside my
> database (e.g. the 'é' in 'Bosé') that my '\w' does not match.
> I have looked at this specific situation, but handling this one
> character does not solve the problem.
>
>
> How does Lucene calculate these offsets? I also searched the source
> code, but could not find the right place.
>
>
> Thanks in advance!
>
> Christian Reuschling
>
>
>
>
>
> --
> ______________________________________________________________________________
>
> Christian Reuschling, Dipl.-Ing.(BA)
> Software Engineer
>
> Knowledge Management Department
> German Research Center for Artificial Intelligence DFKI GmbH
> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>
> Phone: +49.631.20575-125
> mailto:[EMAIL PROTECTED] http://www.dfki.uni-kl.de/~reuschling/
>
> ______________________________________________________________________________
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
package org.dynaq.index;
import static org.junit.Assert.assertTrue;
import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class TokenizerTest
{
    IndexReader m_indexReader;

    Analyzer m_analyzer = new StandardAnalyzer();

    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");

        // store them into a single document with multiple values for one field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS);
            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);
        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }
    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {
        // first, we search for 'Angelina'
        String strSearchTerm = "Angelina";
        Searcher searcher = new IndexSearcher(m_indexReader);
        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);
        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);
        String[] straValues = resultDoc.getValues("movie_actors");

        // now we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);

            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');

            // third try: we add a single delimiter, but only if the last char of the value was a word char
            strbWithDelimitersAfterAlphaNumChar.append(strActorName);
            String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
            if (strLastChar.matches("\\w")) strbWithDelimitersAfterAlphaNumChar.append('$');
        }

        // this is the offset value from Lucene. It should point at 'Angelina' in one of the concatenated value strings above
        TermPositionVector termPositionVector = (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset values given by Lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
        String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("with delimiters:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for offset: '" + strSubString2 + "'");
        System.out.println();
        System.out.println("with delimiter after word character:");
        System.out.println(strbWithDelimitersAfterAlphaNumChar);
        System.out.println("SubString for offset: '" + strSubString3 + "'");
        // is the offset value correct for one of the concatenated strings?
        // this fails in all situations
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm) || strSubString3.equals(strSearchTerm));

        /*
         * Comments: In this example, the 'é' from 'Bosé' causes the '\w' pattern not to match; unlike in
         * StandardAnalyzer, it is treated as a delimiter character.
         *
         * Analysis: It seems that Lucene calculates the offset values by adding a virtual delimiter between every
         * field value, but forgets the last characters of a field value when these are analyzer-specific delimiter
         * chars. (I assume this because of DocumentWriter, line 245: 'if(lastToken != null) offset +=
         * lastToken.endOffset() + 1;') With this line of code, only the end offset of the last token is considered,
         * forgetting potential trimmed delimiter chars.
         *
         * Thus, a solution would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the count of analyzer-specific delimiter chars at the end of all
         *    field values before the match
         *
         * For this, one needs to know what counts as a delimiter for a specific analyzer.
         */
    }

    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }
}