Hi Emerson!

2011/5/26 Emerson Espínola <emersonespin...@gmail.com>
>
> I figure it out. In my class Index's constructor I created the
> StandardAnalyzer, but when the constructor ends its work this variable is
> lost, then in some point that I need the analyzer it's not null, however
> it's pointing to nowhere.
>
> I moved the analyzer to the scope of class and then everything worked fine.

Good to hear.

> I'd like to take the advantage of this email to ask two other things:
>
> 1. When I get my documents in the Hits object, how do I know how much similar
> they are? Using score() method from Hits? Because I passed the same text that
> I indexed before and the score of the document was 0.58. I thought it was
> strange because I passed the same text used to index it. I'm asking this
> because I want to return only documents which are at least 75% similar.

The score expresses how similar the document (vector) is to the query
(vector). It is based on the frequency of the term in the document and in
the index, as well as on the overall number of terms in the document and in
the index. You can find details on this in the Java Lucene documentation [1].
As I understand it, the score is more like a measure of distance between the
document and the query than a percentage of the query, which is why the text
you indexed can come back with 0.58. But you can still define a threshold.

> 2. I'm indexing portuguese documents. Does it matter?

It depends! The StandardAnalyzer tokenizes the text at whitespace characters,
changes the case to lower case, and removes diacritics. E.g., "Importação"
will become "importacao" in the index. Thus, searching for the terms
"Importação", "importacao", "importacão", or "IMPORTACAO" will return the
same document. If this is what you want, then it doesn't matter. But if these
distinctions matter for your use case, or if you want to use a stopword list
or stemming, then the language does matter.

> If so, how can I tell CLucene that I'm indexing/searching portuguese
> documents.

By using an appropriate analyzer. Unfortunately, there is no analyzer for the
Portuguese language in CLucene. You then either have to create your own
analyzer from the existing tokenizers and filters, or port an analyzer for
the Portuguese language from Java Lucene. The current release of CLucene is
based on Java Lucene 2.3.2. There you can find only a BrazilianAnalyzer [2].
But I don't know whether the difference is significant here. A
PortugueseAnalyzer can be found in Java Lucene 3.1.0 [3]. But porting this
may be difficult, due to the different structure of Lucene 2.3.2 and 3.1.0.

If you decide to port one of the analyzers, I can give you support. I ported
the GermanAnalyzer to CLucene once; it wasn't very difficult.

Kind regards,
Veit

[1] http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/Similarity.html
[2] http://lucene.apache.org/java/2_3_2/api/contrib-analyzers/org/apache/lucene/analysis/br/BrazilianAnalyzer.html
[3] http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/pt/PortugueseAnalyzer.html

_______________________________________________
CLucene-developers mailing list
CLucene-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/clucene-developers
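
Regarding the threshold mentioned for question 1, here is a minimal sketch of
the idea. It is plain C++ and does not call CLucene: the ScoredDoc struct and
the filterByScore function are hypothetical names, and it assumes you have
already copied each document id and its score out of the Hits object. Since
the score is not a percentage, a cut-off such as 0.75 will probably need to
be tuned against your own data.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one search result. With CLucene you would fill
// this from the Hits object (a document identifier plus its score).
struct ScoredDoc {
    std::string id;
    float score;
};

// Keep only the results whose score reaches minScore.
// The input order (normally best match first) is preserved.
std::vector<ScoredDoc> filterByScore(const std::vector<ScoredDoc>& hits,
                                     float minScore) {
    std::vector<ScoredDoc> kept;
    for (const ScoredDoc& hit : hits) {
        if (hit.score >= minScore) {
            kept.push_back(hit);
        }
    }
    return kept;
}
```

You would call it as filterByScore(hits, 0.5f) and then only show the
documents that are returned.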