Re: [CLucene-dev] Indexing a document

Veit Jahns Fri, 27 May 2011 03:08:51 -0700

Hi Emerson!

2011/5/26 Emerson Espínola <emersonespin...@gmail.com>
>
> I figured it out. In my Index class's constructor I created the
> StandardAnalyzer, but when the constructor finishes this variable is
> lost, so at the point where I need the analyzer it's not null, but
> it's pointing to nowhere.
>
> I moved the analyzer to class scope and then everything worked fine.


Good to hear.
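
For the archive, a minimal sketch of that fix in plain C++. It makes no
CLucene calls: Analyzer, Writer, and Index are hypothetical stand-ins, and
the sketch assumes the writer only borrows the analyzer pointer rather than
owning it.

```cpp
#include <string>

// Hypothetical stand-in for an analyzer type; not the CLucene class.
struct Analyzer {
    std::string name;
};

// Hypothetical stand-in for a writer that only borrows the analyzer.
class Writer {
public:
    explicit Writer(const Analyzer* analyzer) : analyzer_(analyzer) {}
    const std::string& analyzerName() const { return analyzer_->name; }
private:
    const Analyzer* analyzer_;  // non-owning: the analyzer must outlive the writer
};

class Index {
public:
    // The analyzer is a member declared before the writer, so it is
    // constructed first and destroyed last.
    Index() : analyzer_{"standard"}, writer_(&analyzer_) {}

    // Copying would leave the copy's writer pointing at the original's
    // analyzer, so copying is disabled.
    Index(const Index&) = delete;
    Index& operator=(const Index&) = delete;

    const std::string& analyzerName() const { return writer_.analyzerName(); }
private:
    Analyzer analyzer_;
    Writer writer_;
};
```

Had the analyzer been a local variable in the constructor, the pointer kept
by the writer would be non-null but dangling once the constructor returned.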

> I'd like to take the advantage of this email to ask two other things:
>
> 1. When I get my documents in the Hits object, how do I know how similar
> they are? Using the score() method from Hits? Because I passed the same text
> that I indexed before and the score of the document was 0.58. I thought it
> was strange because I passed the same text used to index it. I'm asking this
> because I want to return only documents which are at least 75% similar.

The score states how similar the document (vector) is to the query
(vector). It is based on the frequency of each query term in the
document and in the index, as well as on the overall number of terms
in the document and the index. You can find the details in the Java
Lucene documentation [1]. As I understand it, the score is more like a
distance between the document and the query than a percentage of the
query matched, so a document identical to the query need not score
1.0. But you can still define a threshold.
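
A threshold of this kind could be sketched as follows. This is plain C++
with a hypothetical ScoredDoc record and filterByScore helper, not the
CLucene API; with CLucene the scores would come from the Hits object. The
cut-off applies to the raw score, so a suitable value has to be found by
experiment and should not be read as "75% similar".

```cpp
#include <vector>

// Hypothetical record for one search hit (document id plus its score).
struct ScoredDoc {
    int id;
    float score;
};

// Keep only the hits whose score is at least `threshold`.
std::vector<ScoredDoc> filterByScore(const std::vector<ScoredDoc>& hits,
                                     float threshold) {
    std::vector<ScoredDoc> kept;
    for (const ScoredDoc& hit : hits) {
        if (hit.score >= threshold) kept.push_back(hit);
    }
    return kept;
}
```

Calling filterByScore(hits, 0.5f) on hits scored 0.58, 0.91, and 0.40 would
keep the first two and drop the third.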

> 2. I'm indexing Portuguese documents. Does it matter?

It depends! The StandardAnalyzer tokenizes the text at whitespace
characters, changes the case to lower case, and removes diacritics.
E.g., "Importação" will become "importacao" in the index. Thus,
searching for the terms "Importação", "importacao", "importacão", or
"IMPORTACAO" will find the same document. If this is what you want,
then the language doesn't matter. But if these differences matter to
your use case, or if you want to use a stopword list or stemming, then
the language does matter.
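
To make the normalization concrete, here is a rough sketch in plain C++. It
does not use CLucene; toLower, stripAccent, and normalize are hypothetical
helpers that handle only ASCII and the accented Latin-1 letters common in
Portuguese, and they only imitate the lower-casing and diacritic removal
described above.

```cpp
#include <string>

// Lower-case one code point: ASCII A-Z and the Latin-1 capitals
// U+00C0..U+00DE (except the multiplication sign U+00D7).
char32_t toLower(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;
    return c;
}

// Strip the accent from a lower-case Latin-1 letter used in Portuguese.
char32_t stripAccent(char32_t c) {
    switch (c) {
        case 0xE0: case 0xE1: case 0xE2: case 0xE3: return U'a';  // à á â ã
        case 0xE7: return U'c';                                    // ç
        case 0xE9: case 0xEA: return U'e';                         // é ê
        case 0xED: return U'i';                                    // í
        case 0xF3: case 0xF4: case 0xF5: return U'o';              // ó ô õ
        case 0xFA: case 0xFC: return U'u';                         // ú ü
        default: return c;
    }
}

// Normalize a term: lower-case first, then remove diacritics.
std::u32string normalize(const std::u32string& term) {
    std::u32string out;
    out.reserve(term.size());
    for (char32_t c : term) out.push_back(stripAccent(toLower(c)));
    return out;
}
```

With these helpers all four spellings from the example above normalize to
the same term, which is why they find the same document.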

> If so, how can I tell CLucene that I'm indexing/searching Portuguese
> documents?

By using an appropriate analyzer. Unfortunately, there is no analyzer
for the Portuguese language in CLucene. So you either have to create
your own analyzer from the existing tokenizers and filters, or
port an analyzer for the Portuguese language from Java Lucene. The
current release of CLucene is based on Java Lucene 2.3.2. There you
can find only a BrazilianAnalyzer [2], and I don't know whether the
difference is significant here. A PortugueseAnalyzer can be found in
Java Lucene 3.1.0 [3], but porting it may be difficult due to the
different structure of Lucene 2.3.2 and 3.1.0.
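
The idea of building an analyzer from a tokenizer and filters can be
sketched like this. It is plain C++ and does not use the CLucene classes:
tokenize, lowerCase, removeStopwords, and analyze are hypothetical names,
the lower-casing covers ASCII only, and the stopword list is a tiny
illustrative sample, far shorter than a real Portuguese list.

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Tiny illustrative stopword list; a real Portuguese list is much longer.
const std::set<std::string> kStopwords = {"a", "de", "do", "e", "o", "para", "que"};

// Tokenizer stage: split the text on whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) tokens.push_back(token);
    return tokens;
}

// Filter stage: lower-case ASCII letters.
std::vector<std::string> lowerCase(std::vector<std::string> tokens) {
    for (std::string& t : tokens)
        for (char& ch : t)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    return tokens;
}

// Filter stage: drop tokens that appear in the stopword list.
std::vector<std::string> removeStopwords(const std::vector<std::string>& tokens) {
    std::vector<std::string> kept;
    for (const std::string& t : tokens)
        if (kStopwords.count(t) == 0) kept.push_back(t);
    return kept;
}

// The "analyzer" is just the composition of the stages.
std::vector<std::string> analyze(const std::string& text) {
    return removeStopwords(lowerCase(tokenize(text)));
}
```

For the input "O livro de receitas" this chain keeps only "livro" and
"receitas". A stemming filter would be one more stage appended in the same
way.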

If you decide to port one of the analyzers, I can give you support. I
ported the GermanAnalyzer to CLucene once---it wasn't very difficult.

Kind regards,

Veit

[1] http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/Similarity.html
[2] http://lucene.apache.org/java/2_3_2/api/contrib-analyzers/org/apache/lucene/analysis/br/BrazilianAnalyzer.html
[3] http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/pt/PortugueseAnalyzer.html
