Re: Multiple terms with the same position in PhraseQuery

2005-11-05 Thread Erik Hatcher

On 4 Nov 2005, at 23:08, Ahmed El-dawy wrote:

BTW, I think there's a newer version of Lucene that I can't get; my
version is 1.4.3 and I didn't find any newer version on the site. For
example, the QueryParser in my version ignores term positions,
and I had to override it myself to support this.
You may be referring to the CVS version, but I want to release my app
with a stable version.


For the record, Subversion trunk (no longer CVS) is stable and being  
used in many production projects already.


The only difference between Subversion trunk and a released version
is the time and effort someone has taken to build it, package it,
sign it, and upload it (and of course a consensus vote authorizing
it).  While I know that many environments demand that such blessing
has occurred, I cannot say that I altogether understand it.  I much
prefer, personally, to be on the trunk and know that any issues I do
happen to encounter can be easily reported, likely fixed if
identified specifically enough, and integrated back into my
projects right away.


I certainly do feel a bit bad that I'm not personally being  
aggressive about pushing a new release, but please don't let my  
insane schedule hold you back from using the latest and best version  
of Lucene.


Erik


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: SpanQuery parser? Update (ugly hack inside...)

2005-11-05 Thread Paul Elschot
On Saturday 05 November 2005 01:29, Erik Hatcher wrote:
 
 On 4 Nov 2005, at 18:32, Sean O'Connor wrote:
  I'm posting this primarily hoping to give back a tiny bit to a very  
  helpful community. More likely however, someone else will open my  
  eyes to an easier approach than what I outline below...
 
  I've come up with a very ugly conversion approach from regular
  Query objects into SpanQuery objects. I then use the converted
  SpanQuery to get span positions (currently both token #, and
  start/end position). In effect, I have highlighting for simple
  queries with a very inefficient approach (yea for me!).
 
 As you and I have talked about on a couple of face to face occasions,  
 this is the approach I am taking on a current consulting project.  My  
 conversion code is slightly different than yours in that I don't  
 rewrite the query, but translate it as-is into comparable SpanQuery  
 subclasses - and this is because I have a RegexQuery and  
 SpanRegexQuery that are comparable.  But rewriting is a good  
 pragmatic way to go for general query types that don't have a  
 comparable SpanQuery subclass.
 
  The goal(s) I am trying to accomplish is rather specific I think,  
  so I imagine the use of my hacking is rather limited (i.e. just to  
  me).
 
  At the moment my code:
 
 * parses the search text (i.e. user entered query)
 
 Are you using QueryParser?  If so, you'll also want to account for  
 BooleanQuery, recursively.

The surround parser can create both boolean queries and span queries.

Sean, as you seem to prefer not to use the surround syntax, do you think
this syntax could be improved somehow? I recall trying to make it simpler,
but when I made it I was not able to do so.

Also, PhraseQuery is more efficient than a combination of SpanTermQuery,
SpanOrQuery and SpanNearQuery, so perhaps PhraseQuery should have
a getSpans() method so it could be used as a SpanQuery, too.

Regards,
Paul Elschot

 
 * rewrites the resulting query to expand wildcards and such against
   index
 * calls a recursive conversion function with very basic conversion
   understanding
   o TermQuery -> SpanTerm
   o PhraseQuery -> SpanNear
   o others in progress as time permits
 
  Currently, I only process simple query strings like:
  blue green yellow = SpanOrQuery
  luce* acti* = SpanOrQuery with wildcards expanded
  e.g.: lucene lucent action acting ... all or'ed together in a
  braindead fashion
  luce* acti* "book rocks" = SpanOrQuery combining SpanTerms and
  SpanNear (no slop)
  er, hopefully you get the picture, I'm not up to showing a
  vector of this one... :-)
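
The conversion steps listed above could be sketched roughly as follows. To keep the example self-contained it does not use Lucene at all: QTerm, QPhrase, QBool and SpanSketch are hypothetical stand-ins for TermQuery, PhraseQuery, BooleanQuery and the converter, and the result is just a string describing the span query that would be built. Mapping a boolean query to a SpanOr of its converted clauses is an assumption based on the examples above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for query types; these are NOT Lucene classes.
abstract class QNode {}

final class QTerm extends QNode {
    final String text;
    QTerm(String text) { this.text = text; }
}

final class QPhrase extends QNode {
    final List<String> words;
    QPhrase(String... words) { this.words = Arrays.asList(words); }
}

final class QBool extends QNode {
    final List<QNode> clauses;
    QBool(QNode... clauses) { this.clauses = Arrays.asList(clauses); }
}

final class SpanSketch {
    // Recursively describes the span query a (rewritten) query would map to.
    static String toSpan(QNode q) {
        if (q instanceof QTerm) {
            // TermQuery -> SpanTerm
            return "spanTerm(" + ((QTerm) q).text + ")";
        }
        if (q instanceof QPhrase) {
            // PhraseQuery -> SpanNear over its terms, in order, no slop
            List<String> parts = new ArrayList<>();
            for (String w : ((QPhrase) q).words) {
                parts.add("spanTerm(" + w + ")");
            }
            return "spanNear([" + String.join(", ", parts) + "], slop=0, inOrder=true)";
        }
        if (q instanceof QBool) {
            // Boolean query -> SpanOr of the converted clauses (assumed mapping)
            List<String> parts = new ArrayList<>();
            for (QNode clause : ((QBool) q).clauses) {
                parts.add(toSpan(clause)); // recurse into nested queries
            }
            return "spanOr([" + String.join(", ", parts) + "])";
        }
        throw new IllegalArgumentException(
            "no span equivalent for " + q.getClass().getSimpleName());
    }
}
```

For example, SpanSketch.toSpan(new QBool(new QTerm("lucene"), new QPhrase("book", "rocks"))) returns spanOr([spanTerm(lucene), spanNear([spanTerm(book), spanTerm(rocks)], slop=0, inOrder=true)]).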
 
  I would be happy to discuss my approach if there is anyone
  interested. I assume I am pretty much alone in finding this
  inefficient approach useful. For me, it is the functionality that
  overrides performance issues.
 
 What is inefficient about it?   The rewrite stuff is the main  
 difference, and perhaps that is the issue you're encountering.  Where  
 do you see the performance issues?
 
 Converting a query, for me at least, is fast - perhaps because there  
 is no rewriting involved.
 
  I have something which can take user search strings and do hit  
  highlighting for the exact hit found. This is really only useful  
  for termA near 'some phrase' at the moment, but might become more  
  advanced in the next 2-3 months.
 
 I'm basically implementing this very thing.  I will likely be  
 enhancing the contrib/highlighter code in the next month to use  
 SpanQuery for highlighting, as well as adding field-aware highlighting.
 
  Erik
 
 



lucene and jsp

2005-11-05 Thread Gaston

Hello,

I know my question is a bit off topic, but I have been trying and trying
to do something without success. I have a very simple application. I
tested this application on my home PC with Tomcat 3.3.2 and it worked,
but on my web hosting company's server it does not work. I put the jar
in the right directory and so on, so I have no idea why it doesn't
work. Perhaps one of you has had the same problem and has a hint for me
about what the reason for the failure could be.


My code:
<%@ page import="java.io.*,javax.servlet.*,javax.servlet.http.*,org.apache.lucene.analysis.Analyzer,org.apache.lucene.analysis.standard.StandardAnalyzer,org.apache.lucene.document.Document,org.apache.lucene.document.Field,org.apache.lucene.index.IndexWriter" %>

<%
   try
   {
       // Two sample texts to index.
       String[] text = { "Indexierung mit Lucene", "Suche mit Lucene" };
       // The index directory lives under the web application's root.
       String indexDir = application.getRealPath("/") + "myindex";
       Analyzer analyzer = new StandardAnalyzer();
       boolean create = true;

       IndexWriter writer = new IndexWriter(indexDir, analyzer, create);

       out.println(indexDir);
       for (int i = 0; i < text.length; i++)
       {
           Document document = new Document();
           document.add(Field.Text("textfeld", text[i]));
           writer.addDocument(document);
           out.println("Es klappt"); // "It works"
       }
       writer.close();
       out.println("hallozwei");
   }
   catch(IOException e)
   {
       e.printStackTrace();
   }
   catch(Exception e)
   {
       e.printStackTrace();
   }
%>

Error:

http://gasizwei.meintestaccount.de:9080/gagamodi/indexaufserver.jsp


Thank you in advance.

Greetings

Gaston

P.S. I asked this in J2EE forums, but the answers I got didn't help me.







Re: Scoring formula

2005-11-05 Thread Karl Koch
Yes, the Similarity class existed in version 1.2, but no description is
included in its JavaDoc. Once somebody can point me to the formula, I
would also like to know whether the formula ensures the score is always
between 0.0 and 1.0 (without any boosting). Is this the case?

Karl

 --- Original Message ---
 From: Otis Gospodnetic [EMAIL PROTECTED]
 To: java-user@lucene.apache.org
 Subject: Re: Scoring formula
 Date: Fri, 4 Nov 2005 12:12:52 -0800 (PST)
 
 The formula should also be in the javadoc for Similarity class, if it
 was there in 1.2.
 
 Otis
 
 
 --- Karl Koch [EMAIL PROTECTED] wrote:
 
  Hello group,
  
  the scoring formula for Lucene is well explained in Lucene in Action.
  However, is this formula also valid for Lucene 1.2 (which I am using)?
  I need to know that for documentation purposes. If not, where can I
  find the correct formula, since I do not want to interpret it from the
  code...
  
  Best Regards to all of you,
  Karl
  



Question about scoring normalisation

2005-11-05 Thread Karl Koch
Hello all,

I am wondering how many of you actually work with your own scoring mechanism
(overriding Lucene's standard scoring) and how many of you work on
normalising this score.

I would like to add a second score on top of Lucene's TF/IDF score. The
resulting score is most likely higher than 1.0. However, the score should be
between 0.0 and 1.0. What is the best way to do that? If Lucene
normalises its score (if no boosting is applied) to a maximum of 1.0, how
is this done (in Lucene 1.2 and/or beyond)?

Regards,
Karl





Re: Scoring formula

2005-11-05 Thread Yonik Seeley
Lucene 1.2 is before my time, but check if the functions are
implemented the same as the current version (they probably are).

Scores are not naturally <= 1, but for most search methods (including
all that return Hits) they are normalized to be between 0 and 1 if the
highest score is greater than 1.
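
To make the rule above concrete, here is a minimal sketch in plain Java. It is not Lucene's code: the class and method names (ScoreNormSketch, normalize) are made up for illustration, and it only assumes what is stated above, namely that scores are scaled down only when the highest raw score exceeds 1.

```java
// Hypothetical helper, not part of Lucene: models the normalization rule
// described in this thread using a plain array of raw scores.
final class ScoreNormSketch {

    // Returns a scaled copy of rawScores. Scores are divided by the highest
    // score only if that highest score is greater than 1.0; otherwise they
    // are returned unchanged.
    static float[] normalize(float[] rawScores) {
        float max = 0.0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float norm = (max > 1.0f) ? 1.0f / max : 1.0f;
        float[] result = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            result[i] = rawScores[i] * norm;
        }
        return result;
    }
}
```

With raw scores {4.0, 2.0, 1.0} this yields {1.0, 0.5, 0.25}; with {0.5, 0.25} the scores come back unchanged, so under this rule the top score is not necessarily 1.0.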

-Yonik



Re: Scoring formula

2005-11-05 Thread Karl Koch
I always thought that a Lucene search always returns a Hits object. On
what occasion would this not be the case?

Karl




Re: Scoring formula

2005-11-05 Thread Otis Gospodnetic
0.0 - 1.0 score - yes.

Otis

--- Karl Koch [EMAIL PROTECTED] wrote:

 Yes, the Similarity class existed in version 1.2, but no description
 is included in its JavaDoc. Once somebody can point me to the
 formula, I would also like to know whether the formula ensures the
 score is always between 0.0 and 1.0 (without any boosting). Is this
 the case?
 
 Karl
 



Re: Scoring formula

2005-11-05 Thread Otis Gospodnetic
TopDocs search()
TopFieldDocs search(...)
...

Just peek at the IndexSearcher.java source.

Otis


--- Karl Koch [EMAIL PROTECTED] wrote:

 I always thought that a Lucene search always returns a Hits object.
 On what occasion would this not be the case?
 
 Karl
 



Re: Question about scoring normalisation

2005-11-05 Thread Chris Lamprecht
Lucene just takes the highest score returned, and divides all scores
by this max_score.  So max_score / max_score = 1.0, and voila.
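
As a sketch of how the same divide-by-maximum idea could be applied to a custom combined score (the case asked about in this thread), the code below adds a second score to each base score and then divides by the largest combined value. The class and method names are hypothetical, adding the two scores is only one possible way to combine them, and all scores are assumed to be non-negative. This is plain Java over arrays, not Lucene API.

```java
// Hypothetical helper, not part of Lucene.
final class CombinedScoreSketch {

    // Adds extraScores[i] to baseScores[i] (an assumed combination rule), then
    // divides every combined score by the largest one so the top score is 1.0.
    // Assumes both arrays have the same length and hold non-negative values.
    static float[] combineAndNormalize(float[] baseScores, float[] extraScores) {
        if (baseScores.length != extraScores.length) {
            throw new IllegalArgumentException("score arrays must have the same length");
        }
        float[] combined = new float[baseScores.length];
        float max = 0.0f;
        for (int i = 0; i < combined.length; i++) {
            combined[i] = baseScores[i] + extraScores[i];
            max = Math.max(max, combined[i]);
        }
        if (max > 0.0f) { // avoid dividing by zero when every score is 0
            for (int i = 0; i < combined.length; i++) {
                combined[i] /= max;
            }
        }
        return combined;
    }
}
```

For base scores {0.5, 0.25} and extra scores {1.5, 0.75}, the combined values {2.0, 1.0} become {1.0, 0.5}. Because every score is divided by the same number, the ranking order is unchanged.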




Re: Question about scoring normalisation

2005-11-05 Thread Sameer Shisodia
If so, the top score should always be 1.0, but it isn't.
Or does boosting multiple individual fields break that?
sameer

On 11/6/05, Chris Lamprecht [EMAIL PROTECTED] wrote:
 Lucene just takes the highest score returned, and divides all scores
 by this max_score.  So max_score / max_score = 1.0, and voila.


--
Sameer Shisodia  Bangalore




Is There Other Ports of Nutch?

2005-11-05 Thread Victor Lee
Hi,
  I know that there are several ports of Lucene, like
cLucene, pLucene, etc.  Are there ports of Nutch to
languages other than Java?

Many thanks.







require new comment for IndexWriter.mergeFactor

2005-11-05 Thread Kerang Lv
Does IndexWriter.mergeFactor still have the same effect on RAM use
after the introduction of IndexWriter.minMergeDocs?

minMergeDocs was introduced into IndexWriter (revision 1.21 in CVS)
in order to control the number of Documents merged in RAMDirectory
independently of mergeFactor (see
http://issues.apache.org/bugzilla/show_bug.cgi?id=23754).
IndexWriter.maybeMergeSegments changed at that point:

@@ -375,7 +385,7 @@
 
   /** Incremental segment merger.  */
   private final void maybeMergeSegments() throws IOException {
-    long targetMergeDocs = mergeFactor;
+    long targetMergeDocs = minMergeDocs;

But the comment on mergeFactor remains the same.

The following is the comment on IndexWriter.mergeFactor in 1.2 RC6:
  /** Determines how often segment indexes are merged by addDocument().  With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indexes are faster, but indexing speed is slower.  With larger
   * values more RAM is used while indexing and searches on unoptimized indexes
   * are slower, but indexing is faster.  Thus larger values (> 10) are best
   * for batched index creation, and smaller values (< 10) for indexes that are
   * interactively maintained.
   *
   * <p>This must never be less than 2.  The default value is 10.*/
  public int mergeFactor = 10;


and now, it's in 1.4.3:
  /** Determines how often segment indices are merged by addDocument().  With
   * smaller values, less RAM is used while indexing, and searches on
   * unoptimized indices are faster, but indexing speed is slower.  With larger
   * values, more RAM is used during indexing, and while searches on unoptimized
   * indices are slower, indexing is faster.  Thus larger values (> 10) are best
   * for batch index creation, and smaller values (< 10) for indices that are
   * interactively maintained.
   *
   * <p>This must never be less than 2.  The default value is 10.*/
  public int mergeFactor = DEFAULT_MERGE_FACTOR;

Does IndexWriter.mergeFactor still have the same effect on RAM use?
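
One way to reason about the question is a small simulation. The sketch below is not Lucene's code: MergeSim and its fields are hypothetical, segments are modelled as plain document counts, and every added document starts as its own one-document segment. The merge loop is assumed to start at minMergeDocs (as in the diff above) and to multiply the target by mergeFactor at each level; that second part is an assumption, and maxMergeDocs is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of incremental segment merging, not Lucene's IndexWriter.
final class MergeSim {
    final int mergeFactor;
    final int minMergeDocs;
    final List<Integer> segments = new ArrayList<>(); // doc count per segment, oldest first
    int buffered = 0;     // one-document segments added since the last merge
    int peakBuffered = 0; // largest value 'buffered' ever reached

    MergeSim(int mergeFactor, int minMergeDocs) {
        this.mergeFactor = mergeFactor;
        this.minMergeDocs = minMergeDocs;
    }

    void addDocument() {
        segments.add(1); // each new document starts as its own small segment
        buffered++;
        peakBuffered = Math.max(peakBuffered, buffered);
        maybeMergeSegments();
    }

    // Assumed loop shape: the first target is minMergeDocs (see the diff),
    // and each further level multiplies the target by mergeFactor.
    private void maybeMergeSegments() {
        long targetMergeDocs = minMergeDocs;
        while (true) {
            // Walk back from the newest segment, summing segments below the target.
            int minSegment = segments.size();
            long mergeDocs = 0;
            while (--minSegment >= 0) {
                int docCount = segments.get(minSegment);
                if (docCount >= targetMergeDocs) {
                    break;
                }
                mergeDocs += docCount;
            }
            if (mergeDocs < targetMergeDocs) {
                break; // not enough small segments to merge yet
            }
            // Replace segments minSegment+1 .. end with one merged segment.
            segments.subList(minSegment + 1, segments.size()).clear();
            segments.add((int) mergeDocs);
            buffered = 0;
            targetMergeDocs *= mergeFactor;
        }
    }

    // Adds 'docs' documents one by one and returns the resulting simulation state.
    static MergeSim run(int mergeFactor, int minMergeDocs, int docs) {
        MergeSim sim = new MergeSim(mergeFactor, minMergeDocs);
        for (int i = 0; i < docs; i++) {
            sim.addDocument();
        }
        return sim;
    }
}
```

In this model, MergeSim.run(2, 10, 1000) and MergeSim.run(50, 10, 1000) both reach peakBuffered == 10: the number of small segments that pile up before the first-level merge follows minMergeDocs, while mergeFactor only changes how the larger segments are grouped afterwards. Whether that matches the real IndexWriter's RAM behaviour depends on how closely these assumptions hold.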


