Re: [CLucene-dev] Indexing a document

Emerson Espínola Fri, 27 May 2011 05:46:13 -0700

Hi Veit.

Thank you very much for your answer. Great explanation. You can't imagine how
much you're helping me.


1. I'll try English documents.
2. Ok.
3. Does BrazilianAnalyzer work similarly to StandardAnalyzer? If so, that's
what I want. I'm from Brazil. :) There are not many differences between
Portuguese from Portugal and Portuguese from Brazil. But if there is already
an analyzer for Portuguese from Brazil, that's perfect for me.
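
For illustration, the lowercasing and diacritic removal described in the quoted message below can be sketched in plain C++. This is only a minimal sketch and not CLucene code: the function name `foldTerm` and the small table of Portuguese letters are assumptions made for the example.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of CLucene): lowercases ASCII letters and
// replaces a small, hand-picked set of accented Portuguese letters with
// their unaccented lowercase base letter.
std::u32string foldTerm(const std::u32string& term) {
    static const std::unordered_map<char32_t, char32_t> folded = {
        {U'\u00E0', U'a'}, {U'\u00E1', U'a'}, {U'\u00E2', U'a'}, {U'\u00E3', U'a'},
        {U'\u00C0', U'a'}, {U'\u00C1', U'a'}, {U'\u00C2', U'a'}, {U'\u00C3', U'a'},
        {U'\u00E7', U'c'}, {U'\u00C7', U'c'},
        {U'\u00E9', U'e'}, {U'\u00EA', U'e'}, {U'\u00C9', U'e'}, {U'\u00CA', U'e'},
        {U'\u00ED', U'i'}, {U'\u00CD', U'i'},
        {U'\u00F3', U'o'}, {U'\u00F4', U'o'}, {U'\u00F5', U'o'},
        {U'\u00D3', U'o'}, {U'\u00D4', U'o'}, {U'\u00D5', U'o'},
        {U'\u00FA', U'u'}, {U'\u00FC', U'u'}, {U'\u00DA', U'u'}, {U'\u00DC', U'u'},
    };

    std::u32string out;
    out.reserve(term.size());
    for (char32_t c : term) {
        // Lowercase plain ASCII letters.
        if (c >= U'A' && c <= U'Z') {
            c = static_cast<char32_t>(c - U'A' + U'a');
        }
        // Fold accented letters listed in the table; keep everything else.
        auto it = folded.find(c);
        out.push_back(it != folded.end() ? it->second : c);
    }
    return out;
}
```

With this sketch, "Importação", "importacão" and "IMPORTACAO" all fold to the same term, which is why they would match the same indexed document.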

[]'s
Emerson de Lira Espínola


2011/5/27 Veit Jahns <[email protected]>

> Hi Emerson!
>
> 2011/5/26 Emerson Espínola <[email protected]>
> >
> > I figured it out. In my Index class's constructor I created the
> > StandardAnalyzer, but when the constructor finished, that variable was
> > lost. Later, when I needed the analyzer, it was not null, but it was
> > pointing to nowhere.
> >
> > I moved the analyzer to class scope and then everything worked fine.
>
> Good to hear.
>
> > I'd like to take advantage of this email to ask two other things:
> >
> > 1. When I get my documents in the Hits object, how do I know how
> > similar they are? By using the score() method from Hits? I passed the
> > same text that I had indexed before and the score of the document was
> > 0.58. I thought that was strange, because it was the same text used to
> > index it. I'm asking because I want to return only documents which are
> > at least 75% similar.
>
> The score makes a statement regarding the similarity of the document
> (vector) to the query (vector) based on the frequency of the term in
> the document and in the index, as well as the overall number of terms
> in the document and the index. You can find details on this in the Java
> Lucene documentation [1]. As I understand it, the score is more like
> a distance between the document and the query, and not a percentage
> value in terms of the query. But at least you can define a threshold.
>
> > 2. I'm indexing Portuguese documents. Does it matter?
>
> It depends! The StandardAnalyzer tokenizes the text at whitespace
> characters, changes the case to lower case, and removes diacritics.
> E.g., "Importação" will become "importacao" in the
> index. Thus, searching for the terms "Importação", "importacao",
> "importacão", or "IMPORTACAO" will result in the same document. If
> this is what you want, then it doesn't matter. But if the distinction
> matters to your use case, or if you want to use a stopword list or
> stemming, then the language does matter.
>
> > If so, how can I tell CLucene that I'm indexing/searching Portuguese
> > documents?
>
> By using an appropriate analyzer. Unfortunately, there is no analyzer
> for the Portuguese language in CLucene. You then either have to create
> your own analyzer by using the existing tokenizers and filters, or
> port an analyzer for the Portuguese language from Java Lucene. The
> current release of CLucene is based on Java Lucene 2.3.2. There you
> can find only a BrazilianAnalyzer [2]. But I don't know whether the
> difference is significant here. A PortugueseAnalyzer can be found in
> Java Lucene 3.1.0 [3]. But porting this may be difficult, due to the
> different structure of Lucene 2.3.2 and 3.1.0.
>
> If you decide to port one of the analyzers, I can give you support. I
> ported the GermanAnalyzer to CLucene once---it wasn't very difficult.
>
> Kind regards,
>
> Veit
>
> [1] http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/search/Similarity.html
> [2] http://lucene.apache.org/java/2_3_2/api/contrib-analyzers/org/apache/lucene/analysis/br/BrazilianAnalyzer.html
> [3] http://lucene.apache.org/java/3_1_0/api/contrib-analyzers/org/apache/lucene/analysis/pt/PortugueseAnalyzer.html
>
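
The threshold idea from the quoted message above can be sketched as follows. Since the score is not a percentage, one possible approach (an assumption, not something CLucene prescribes) is to compare each hit against the best score in the result set. The helper name `hitsAboveRelativeThreshold` is hypothetical, and the scores are passed in as a plain vector rather than read from a Hits object.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of CLucene): returns the positions of all
// hits whose score is at least `ratio` times the best score in `scores`.
// `ratio` is expected to lie in [0, 1], e.g. 0.75f.
std::vector<std::size_t> hitsAboveRelativeThreshold(
        const std::vector<float>& scores, float ratio) {
    assert(ratio >= 0.0f && ratio <= 1.0f);

    std::vector<std::size_t> kept;
    if (scores.empty()) {
        return kept;
    }

    const float best = *std::max_element(scores.begin(), scores.end());
    if (best <= 0.0f) {
        return kept;  // no meaningful scores to compare against
    }

    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] >= ratio * best) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

Whether a relative cutoff like this matches the intended "75% similar" is a judgment call; an absolute cutoff on the raw score is the other obvious option.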

_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers
