Re: Sorting by Score

2007-02-28 Thread Peter Keegan

can't you pick any arbitrary marker field name (that's not a real field
name) and use that?


Yes, I could. I guess you're saying that the field name doesn't matter,
except that it's used for caching the comparator, right?


... he wants the bucketing to happen as part of the scoring so that the
secondary sort will determine the ordering within the bucket.


Yes, exactly. Couldn't I just do this rounding in the HitCollector, before
inserting it into the FieldSortedHitQueue?



On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:



: The first part was just to iterate through the TopDocs that's available to
: me and normalize the scores right in the ScoreDocs. Like this...

Won't that be done after Lucene does the hit collecting/sorting? ... he
wants the bucketing to happen as part of the scoring so that the
secondary sort will determine the ordering within the bucket.

(or am i missing something about your description?)




-Hoss






Re: Sorting by Score

2007-02-28 Thread Erick Erickson

Empirically, when I insert the elements in the FieldSortedHitQueue
they get sorted according to the Sort object. The original query
that gives me a TopDocs applied
no secondary sorting, only relevancy. Since I normalized
all the scores into one of only five discrete values, secondary
sorting was applied to all docs with the same score when I inserted
them into the FieldSortedHitQueue.

Now popping things off the FieldSortedHitQueue gives them in the
order I want.

You could just operate on the FieldSortedHitQueue at this point, but
I decided the rest of my code would be simpler if I stuffed them back
into the TopDocs, so there's some explanation below that you can
just skip if I've cleared things up already.

*
The step I left out is moving the documents from the
FieldSortedHitQueue back to topDocs.scoreDocs.
So the steps are as follows:

1. Bucketize the scores. That is, go through the
TopDocs.scoreDocs and adjust each raw score into
one of my buckets. This is made easy by the
existence of topDocs.getMaxScore. TopDocs has
had no sorting other than relevancy applied so far.

2. Assemble the FieldSortedHitQueue by inserting
each element from scoreDocs into it, with a suitable
Sort object whose first field is relevance (SortField.FIELD_SCORE).

3. Pop the entries off the FieldSortedHitQueue, overwriting
the elements in topDocs.scoreDocs.

Step 3 is the one I left out above, although I suppose you could
operate directly on the FieldSortedHitQueue.
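
In code, those three steps might look something like this - a minimal sketch
assuming the Lucene 2.x APIs of the day; the secondary sort field name
("title"), the bucket arithmetic, and the class wrapper are illustrative,
not from the original post:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldSortedHitQueue;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class ScoreBucketizer {
    private static final int NUM_BUCKETS = 5; // five discrete score values

    public static void bucketizeAndResort(IndexReader reader, TopDocs topDocs)
            throws java.io.IOException {
        // Step 1: collapse each raw score into one of NUM_BUCKETS values.
        float max = topDocs.getMaxScore();
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            ScoreDoc sd = topDocs.scoreDocs[i];
            int bucket = (int) ((sd.score / max) * (NUM_BUCKETS - 1));
            sd.score = (float) bucket / (NUM_BUCKETS - 1);
        }

        // Step 2: insert everything into a FieldSortedHitQueue whose first
        // sort field is relevance; ties are broken by the secondary field.
        SortField[] fields = {
            SortField.FIELD_SCORE,
            new SortField("title", SortField.STRING) // hypothetical field
        };
        FieldSortedHitQueue queue =
            new FieldSortedHitQueue(reader, fields, topDocs.scoreDocs.length);
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            queue.insert(topDocs.scoreDocs[i]);
        }

        // Step 3: pop the entries back into scoreDocs; the queue pops
        // worst-first, so fill the array from the end.
        for (int i = queue.size() - 1; i >= 0; i--) {
            topDocs.scoreDocs[i] = (ScoreDoc) queue.pop();
        }
    }
}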

NOTE: in my case, I just put everything back in the
scoreDocs without attempting any efficiencies. If I
needed more performance, I'd only put as many items
back as I needed to display. But as I wrote yesterday,
performance isn't an issue so there's no point. Although
I know one place to look if we need to squeeze more QPS.

How efficient this is is an open question, but it's fast enough
and relatively simple, so I stopped looking for more
efficiencies.

Erick

On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:







RE: document field updates

2007-02-28 Thread Steven Parkes
Are unindexed fields stored separately from the main inverted
index? If so, could one implement the field value change as a
delete and re-add of just that value?

The short answer is that won't work. Field values are stored in a
different data structure than the postings lists, but docids are
consistent across all contents of a segment. Deleting something and
re-adding it is going to put it into a different segment, which is going
to keep this from working. (Not to mention that you want the postings
lists updated if you want it to be searchable...)

Are you aware of some implementation of Lucene that solves this
need well with a second index for 'tags', complete with multi-index
boolean queries?

I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text? I don't know if Solr has anything like this, but
if I remember correctly, Collex has tags; as far as I can tell, it's
not been open sourced (yet?).




RamDirectory vs IndexWriter

2007-02-28 Thread WATHELET Thomas
I don't really understand the difference between using a RAMDirectory
and using an IndexWriter.

What's the difference between using a RAMDirectory and using an
IndexWriter with these properties set:
setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1);



Re: RamDirectory vs IndexWriter

2007-02-28 Thread Nicolas Lalevée
On Wednesday 28 February 2007 16:19, WATHELET Thomas wrote:
 I don't really understand the difference between using the ramDirectory
 and using IndexWriter.

 What's the difference between using ramDirectory instead of using
 IndexWriter with those properties set to:
 setMergeFactor(1000);setMaxMergeDocs(1);setMaxBufferedDocs(1);

The two classes are not designed for the same purpose. The IndexWriter
writes documents into a Directory, and a RAMDirectory is a special
implementation of Directory that holds its data in RAM rather than on a
file system, as FSDirectory does.
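
A minimal sketch of the distinction, assuming Lucene 2.x APIs; the path and
the analyzer are placeholders. The same IndexWriter code works against
either Directory:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryDemo {
    public static void main(String[] args) throws Exception {
        // IndexWriter always writes into some Directory implementation.
        Directory onDisk = FSDirectory.getDirectory("/tmp/index", true); // on disk
        Directory inRam = new RAMDirectory();                            // in RAM

        IndexWriter diskWriter = new IndexWriter(onDisk, new StandardAnalyzer(), true);
        IndexWriter ramWriter = new IndexWriter(inRam, new StandardAnalyzer(), true);

        // ... addDocument() calls would be identical for both writers ...

        diskWriter.close();
        ramWriter.close();
    }
}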

-- 
Nicolas LALEVÉE
Solutions & Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com




Merge Indexes - addIndexes

2007-02-28 Thread DECAFFMEYER MATHIEU
Hi,

I store the Lucene index of my web applications on a file system.
Often, I need to add another index, also stored on a file system,
to this index.

I have three questions:

* What is the best way to do this?
  Open an IndexReader on the newcoming index
  and use addIndexes(IndexReader[] readers), as sketched below?
  (where I will have each time one IndexReader in the array)

* Which files do I need?
  I see that the following is stored in the file system:
  - segments
  - deletable
  - _1.cfs
  - untokenizedFieldNames.txt
  - stopWordList.txt
  - analyzerType.txt

* Can the merge be time consuming?
  What happens when a user runs a query in my search engine while I'm
  merging the indexes with the method addIndexes(IndexReader[] readers)?
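
A minimal sketch of the addIndexes approach from the first question,
assuming Lucene 2.x APIs; the paths and the analyzer are placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexMerger {
    public static void merge(String mainIndexPath, String incomingIndexPath)
            throws Exception {
        // Open the existing main index for appending (create == false).
        IndexWriter writer =
            new IndexWriter(mainIndexPath, new StandardAnalyzer(), false);
        // Open a reader on the newcoming index and merge it in.
        IndexReader incoming = IndexReader.open(incomingIndexPath);
        writer.addIndexes(new IndexReader[] { incoming });
        incoming.close();
        writer.close();
    }
}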


Thank you for any help!

__

   Matt







RE: RamDirectory vs IndexWriter

2007-02-28 Thread WATHELET Thomas
I think I expressed myself badly.
In both cases I use the IndexWriter class, but in one case I use it with a
RAMDirectory and in the other with an FSDirectory (index = new IndexWriter(ram
OR fsdir, analyzer, true)).
If I use the RAMDirectory class, it is to avoid frequent disk access.
But I have noticed that when using FSDirectory and setting
setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1), I get more
or less the same behavior.
Am I on the right track?

-Original Message-
From: Nicolas Lalevée [mailto:[EMAIL PROTECTED] 
Sent: 28 February 2007 16:29
To: java-user@lucene.apache.org
Subject: Re: RamDirectory vs IndexWriter





Re: RamDirectory vs IndexWriter

2007-02-28 Thread Erick Erickson

I guess it depends upon your goal. If you're asking what the difference
is between writing to a RAMDirectory *then* flushing to an FSDirectory,
I don't believe there's much, if any. As I remember (and my memory
isn't always...er...accurate), there's been discussion on this list
by those who know that underneath the covers an FSDirectory
uses a RAMDirectory for a while, then flushes it to disk.

If you're asking what the difference is between a RAMDirectory
and an FSDirectory, that's another story.

Erick


On 2/28/07, WATHELET Thomas [EMAIL PROTECTED] wrote:






Filtering results on a Field

2007-02-28 Thread Ismail Siddiqui

Hey guys,
I want to filter a result set on a particular field..I have code like this

try
   {
   // bQuery declared here for completeness; the original snippet assumed it
   BooleanQuery bQuery = new BooleanQuery();
   PhraseQuery textQuery = new PhraseQuery();
   PhraseQuery titleQuery = new PhraseQuery();
   PhraseQuery catQuery = new PhraseQuery();
   textQuery.setSlop( 20 );
   titleQuery.setSlop( 4 );

   for( int k = 0; k < phrase.length; k++ )
   {
   textQuery.add( new Term( NAME, phrase[k] ) );
   titleQuery.add( new Term( REVIEW, phrase[k] ) );
   }

   bQuery.add( textQuery, BooleanClause.Occur.SHOULD );
   bQuery.add( titleQuery, BooleanClause.Occur.SHOULD );

   if( category != null && !category.equals("") ){
   catQuery.add( new Term( TYPE, category ) );
   bQuery.add( catQuery, BooleanClause.Occur.MUST );
   }
   }
   catch( Exception e )
   {
   throw new RuntimeException( "Unable to make any sense of the query.", e );
   }



Now the problem is it's getting all results for a particular category
regardless of whether the phrase is in the title or text field, which makes
sense as the other two have SHOULD clauses. The problem is I cannot set a
MUST clause on the other two fields, as I need to match either one of them.
So what I want is: either title or text MUST have it, and if category is
not null, it MUST have the category string passed. Any ideas?


Re: Filtering results on a Field

2007-02-28 Thread Erick Erickson

When you have a category, add the pair of clauses as a sub-Boolean query.

Something like...

try
  {
  BooleanQuery bQueryPair = new BooleanQuery();
  BooleanQuery bQueryAll = new BooleanQuery();
  PhraseQuery textQuery = new PhraseQuery();
  PhraseQuery titleQuery = new PhraseQuery();
  PhraseQuery catQuery = new PhraseQuery();
  textQuery.setSlop( 20 );
  titleQuery.setSlop( 4 );

  for( int k = 0; k < phrase.length; k++ )
  {
  textQuery.add( new Term( NAME, phrase[k] ) );
  titleQuery.add( new Term( REVIEW, phrase[k] ) );
  }

  bQueryPair.add( textQuery, BooleanClause.Occur.SHOULD );
  bQueryPair.add( titleQuery, BooleanClause.Occur.SHOULD );

  if( category != null && !category.equals("") ){
  catQuery.add( new Term( TYPE, category ) );
  bQueryAll.add( catQuery, BooleanClause.Occur.MUST );
  bQueryAll.add( bQueryPair, BooleanClause.Occur.MUST );
  } else {
  bQueryAll = bQueryPair;
  }
  }
  catch( Exception e )
  {
  throw new RuntimeException( "Unable to make any sense of the query.", e );
  }


and use bQueryAll in your query.

And please be way more elegant than the code I wrote <G>.

Erick


On 2/28/07, Ismail Siddiqui [EMAIL PROTECTED] wrote:





Re: Filtering results on a Field

2007-02-28 Thread Ismail Siddiqui

thanks a lot

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:




Re: indexing and searching the document title question

2007-02-28 Thread Phillip Rhodes
I found the problem!

I did not realize using a HitCollector would return things in an unsorted order.

I was using the HitCollector to try to maximize performance by only returning 
the documents that I needed (which page of the results, and how many per page).

-Phillip


- Original Message -
From: Daniel Naber [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, February 27, 2007 5:33:01 PM (GMT-0500) America/New_York
Subject: Re: indexing and searching the document title question

On Tuesday 27 February 2007 23:07, Phillip Rhodes wrote:

 NAME:"color me mine"^2.0 (CONTENTS:color CONTENTS:me CONTENTS:mine)

Try a (much) higher boost like 20 or 50, does that help?

Regards
 Daniel

-- 
http://www.danielnaber.de




RE: Soliciting Design Thoughts on Date Searching

2007-02-28 Thread Aigner, Thomas
Walt,
	I am no expert, but it sounds like you need to associate many
dates with a single record.  Can this be handled as you would a synonym?
Basically, add a token at the same offset as the row itself, i.e. you
would have a record that would also have a date field with 3 offsets
treated as a synonym type (basically setPositionIncrement(0)?).

Just thinking out loud..

Tom


-Original Message-
From: Walt Stoneburner [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 28, 2007 2:13 PM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

Been searching http://www.gossamer-threads.com/lists/lucene/java-user/
as Erick suggested; man, is there a wealth of information in the
Lucene archives.

I have found many examples of how to convert text to dates and back,
how to search Date fields for various ranges, and so forth -- but I
don't think this is what I'm looking for.

That material assumes I have a single date, such as last modified
date, and it's stored in a date field, and that I'm searching that
field.

What I'm looking to do is different.

I have generic material that _contain_ dates: historic time lines,
certificates, news articles, forms, deeds, testimonies, and wildly
free form genealogical information.  The dates have no specific
structure, obvious context, nor consistency.

Finding relevant material would be trivial if those dates were easily
cherry picked out and placed in a date field.  But they're not.  A
given document can have any number of embedded dates, provided for any
reason, and I'm interested in locating things which mention any date,
potentially within a range.

The issue isn't in using DateRange on a Date Field, but in knowing if
there is some filter that already exists which extracts dates from a
body of text to put into a Date Field.  If not, the DateTool solution
is a helpful step in building my own filter; I just don't want to
reinvent the wheel if it already exists.

Now this is where my personal knowledge of Lucene breaks down.
Assuming I can extract each date from a source's body and convert it
to a usable format, can a Lucene Date Field hold more than one date?
For example, is it a strict name/value pair, or can the value be an array
of dates, or can I append additional dates under the same name?

Super generalizing, to break the discussion from a date specific
example, suppose I did this:
document.add( Field.Text( "title", "Learning Perl, Fourth Edition" )
); // real title
document.add( Field.Text( "title", "Camel Book" ) );  // my wife knows
it by the cover

Could I do a search for both the long and short title against the title
field?

If the answer is yes, problem solved!  I'll just pile on a ton of
dates as I find them and add them to the document.  (Note, I could
easily have hundreds.)

for ( Date somedate : allDatesFoundInSource ) {
  document.add( Field.Text( "embeddedDates", somedate.toString() ) );  // Right
way to do this?
}


If the answer is no, it better illustrates the problem I face:
searching across an arbitrary collection of dates.


Erick, if I've missed something obvious in the archives, I'll happily
accept my public flogging. Thanks for your help so far.

-wls




Re: optimizing single document searches

2007-02-28 Thread Paul Elschot
On Wednesday 28 February 2007 01:01, Russ wrote:
 I will definitely check it out tomorrow.
 
 I also forgot to mention that I am not interested in the hits themselves, 
only whether or not there was a hit.  Is there something I can use that's 
optimized for this scenario, or should I look into rewriting the search 
method of the IndexSearcher?  Currently I just check hits.size().

For a single document: get the Scorer from the Query via the Weight.
Then check the return value of Scorer.next(); it will indicate whether
the only doc matches the query.
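
A minimal sketch of that suggestion, assuming the Lucene 2.x APIs where
Query.weight(Searcher) builds the Weight and Scorer.next() advances to the
next matching doc:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

public class SingleDocMatcher {
    /** Returns true if the (single-document) index matches the query. */
    public static boolean matches(IndexSearcher searcher, IndexReader reader,
                                  Query query) throws IOException {
        Weight weight = query.weight(searcher); // rewrites/normalizes the query
        Scorer scorer = weight.scorer(reader);
        // With only one doc in the index, a single next() call tells us
        // whether that doc matches; no Hits object is ever built.
        return scorer != null && scorer.next();
    }
}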

Regards,
Paul Elschot.


 
 Russ
 Sent wirelessly via BlackBerry from T-Mobile.  
 
 -Original Message-
 From: Erick Erickson [EMAIL PROTECTED]
 Date: Tue, 27 Feb 2007 18:49:45 
 To:java-user@lucene.apache.org
 Subject: Re: optimizing single document searches
 
 Which is very, very cool. I wound up using it for hit counting and it
 works like a charm
 
 On 2/27/07, karl wettin [EMAIL PROTECTED] wrote:
 
 
  On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
 
   On a single document of 10k characters, doing about 40k searches
   takes about 5 seconds.  This is not bad, but I was wondering if I
   can somehow speed this up.
 
  Your corpus contains only one document? Try contrib/memory, an index
  optimized for that scenario.
 
  --
  karl
 
 
 
 
 
 
 




Re: Sorting by Score

2007-02-28 Thread Peter Keegan

Erick,

Yes, this seems to be the simplest way to implement score 'bucketization',
but wouldn't it be more efficient to do this with a custom ScoreComparator?
That way, you'd do the bucketizing and sorting in one 'step' (compare()).
Maybe the savings aren't measurable, though. A comparator might also allow
one to do more sophisticated rounding or bucketizing, since you'd be
getting 2 scores at a time.
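
A rough sketch of what such a comparator might look like, assuming Lucene
2.x's SortComparatorSource/ScoreDocComparator interfaces; the bucket count
and the assumption that raw scores are pre-normalized to [0,1] are
illustrative:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

public class BucketedScoreComparatorSource implements SortComparatorSource {
    private static final int NUM_BUCKETS = 5;

    public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
            throws IOException {
        return new ScoreDocComparator() {
            // Bucketize both scores at compare() time, so the rounding
            // and the sorting happen in one step.
            public int compare(ScoreDoc i, ScoreDoc j) {
                return bucket(j.score) - bucket(i.score); // higher bucket first
            }
            public Comparable sortValue(ScoreDoc i) {
                return new Integer(bucket(i.score));
            }
            public int sortType() {
                return SortField.CUSTOM;
            }
        };
    }

    private static int bucket(float score) {
        return (int) (score * (NUM_BUCKETS - 1)); // assumes score in [0,1]
    }
}

Per the earlier remark about an arbitrary marker field name, this could then
be used as the first SortField, e.g. new SortField("bucketedScore", new
BucketedScoreComparatorSource()), with the real secondary field after it.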

Peter


On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:



Re: Sorting by Score

2007-02-28 Thread Erick Erickson

It may well be, but as I said, this is efficient enough for my needs
so I didn't pursue it. One of my pet peeves is spending time making
things more efficient when there's no need, and my index isn't
going to grow enough to worry about that now <G>...

Erick

On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote:



Re: Soliciting Design Thoughts on Date Searching

2007-02-28 Thread Peter W.

Hello,

There are a few ways to solve this, but no
date-extraction filter that I know of. Adding
a hundred fields to each Lucene doc
seems bloated.

First, get your text out of the various
source documents (.doc, .pdf, .html) using
the available tools described in the
Lucene in Action book.

It sounds like you know Perl, so next
try regexes to pull the dates out of the
text using java.util.regex, and make sure to
remove extra whitespace.

Put your clean date Strings into a Java TreeMap
or TreeSet collection to eliminate duplicates.

Finally, loop through the collection adding items
to a StringBuffer delimited by commas, then make
one long String (holding all your dates) and add
it to the Lucene doc as one Field.Text.

You might be able to set that Field to indexed, but not
stored to save space.
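
A rough sketch of that recipe, assuming only a simple MM/dd/yyyy shape;
real-world date recognition needs many more patterns, as discussed
elsewhere in this thread:

import java.util.Iterator;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldBuilder {
    private static final Pattern SIMPLE_DATE =
        Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    public static void addDates(Document doc, String text) {
        // A TreeSet eliminates duplicates and keeps the dates ordered.
        TreeSet dates = new TreeSet();
        Matcher m = SIMPLE_DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group().trim());
        }

        // Build one long comma-delimited String holding all the dates.
        StringBuffer sb = new StringBuffer();
        for (Iterator it = dates.iterator(); it.hasNext();) {
            sb.append((String) it.next());
            if (it.hasNext()) sb.append(',');
        }

        // Indexed but not stored, to save space.
        doc.add(new Field("embeddedDates", sb.toString(),
                          Field.Store.NO, Field.Index.TOKENIZED));
    }
}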

Regards,

Peter W.



On Feb 28, 2007, at 11:22 AM, Aigner, Thomas wrote:


Walt,
I am no expert, but it sounds like you need to associate many
dates to a single record.
...

Tom


-Original Message-
From: Walt Stoneburner [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 28, 2007 2:13 PM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

...





Re: Best way to returning hits after search?

2007-02-28 Thread Doron Cohen
Antony Bowesman [EMAIL PROTECTED] wrote on 27/02/2007 17:37:41:

 Doron Cohen wrote:
  The collect() method is going to be invoked once for each document that
  matches the query (having nonzero score). If the index is very large, that
  may turn out to be a very large number of calls. Often, search applications
  only fetch additional data (doc fields) for only a small subset of the
  entire set of documents matching a query - e.g. first page (0-9), second
  page (10-19), etc.  But if your application is going to fetch in an
  exhaustive manner, and especially for a short field like DB_ID, it
  sometimes makes sense to cache in memory the entire field (its values for
  all the docs), for the entire life of the index reader/searcher, and use
  that cached data. The collect method can then use that cached data.

 That's an excellent idea!  We cannot easily change our client
 implementation, so have to support the exhaustive retrieval for now,
 although I do limit the absolute max hits that will be returned.  We are
 hoping to implement paging in a later client version.

 I'm not sure I can cache all the GUIDs though.  A GUID is 20 bytes and
 there are two that need to be cached.  The document count could be up to
 100M, though in most cases 20M.  I am keeping a BitSet filter cache for a
 searcher for each user's mail, so I could extend that to cache all the IDs
 for that user and give that cache a shortish life and/or limit the total
 cache available.  That would really help.

 I'll have a play - thanks for the input.
 Antony

If you decide to cache stored field values in memory, FieldCache may be
useful for this - so you don't have to implement your own cache - you can
access the field values with something like:
   FieldCache fieldCache = FieldCache.DEFAULT;
   String db_id_field[] =
       fieldCache.getStrings(indexReader, DB_ID_FIELD_NAME);
Those values are valid for the lifetime of the index reader. Once a new
index reader is opened, when GC collects the unused old index reader
object, it will also be able to collect (from the cache) the unused field
values.
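
Tying this back to the collect() discussion above, a minimal sketch of a
HitCollector that reads the cached values instead of loading stored fields
per hit; the class and field names are placeholders:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;

public class DbIdCollector extends HitCollector {
    private final String[] dbIds; // one entry per doc, cached per reader

    public DbIdCollector(IndexReader reader, String dbIdFieldName)
            throws IOException {
        this.dbIds = FieldCache.DEFAULT.getStrings(reader, dbIdFieldName);
    }

    public void collect(int doc, float score) {
        String dbId = dbIds[doc]; // array lookup, no stored-field I/O per hit
        // ... hand dbId to the application ...
    }
}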

See also http://www.gossamer-threads.com/lists/lucene/java-user/39352

Doron





ranking/scoring algorithm in details

2007-02-28 Thread Jong Kim
Hi,
 
Does anyone know of a written document that describes in some detail
how Lucene's ranking/scoring algorithm works? I'm safely assuming that
a single consistent algorithm is being used to compute the scores of
each matching document (with or without explicit boost factors in the
query) and rank them accordingly.

I would appreciate any pointer to such information, or your own description
if you happen to know it.
 
Thanks in advance.
/Jong


Re: Soliciting Design Thoughts on Date Searching

2007-02-28 Thread Chris Hostetter

: I have generic material that _contain_ dates: historic time lines,
: certificates, news articles, forms, deeds, testimonies, and wildly
: free form genealogical information.  The dates have no specific
: structure, obvious context, nor consistency.

identifying and extracting dates from bulk text sounds like a pretty
interesting analysis problem ... if you wrote a Tokenizer that could
recognize dates, you could then format them using something like DateTools
to ensure it would be easy to find them ... but Lucene Analyzers can not
currently create terms in multiple fields - so if you wanted a special
date field for each doc, you would have to extract those dates in a
preprocessing step.

if you aren't picky how your index is stored however, there is no reason
why you can't have a single field with your text terms and your date
terms ... you would just have to be careful to know the difference in
searching ... make your analyzer prefix all of your date terms with
something it would never let your regular terms start with (ie: __) and
make sure you bear that structure in mind when creating your RangeFilter
on dates.

: Now this is where my personal knowledge of Lucene breaks down.
: Assuming I can extract each date from a source's body and convert it
: to a usable format, can a Lucene Date Field hold more than one date?

fields can contain as many values as you want -- or none at all.

: If the answer is yes, problem solved!  I'll just pile on a ton of

definitely yes.




-Hoss





Re: ParallelSearcher in multi-node environment

2007-02-28 Thread Chris Hostetter

: I want to execute parallel search over several machines. But
: ParallelSearcher doesn't look perfect. It creates threads and spawns many
: requests to the underlying Searchables (over a network) for a single search.
: Is there a decent implementation of the parallel search over remote indexes
: somewhere?

what would you consider a decent implementation of a parallel search? ...
how could it be done in parallel without spawning separate threads for
each sub-Searchable?



-Hoss





RE: ranking/scoring algorithm in details

2007-02-28 Thread Steven Parkes
http://lucene.apache.org/java/docs/scoring.html
(which you can also find by googling lucene scoring)

-Original Message-
From: Jong Kim [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 28, 2007 2:21 PM
To: java-user@lucene.apache.org
Subject: ranking/scoring algorithm in details





RE: Soliciting Design Thoughts on Date Searching

2007-02-28 Thread Steven Parkes
Yeah, date finding is a little like entity extraction, since dates can
have many formats, depending on how crazy you want to get ("a week from
tomorrow" is 3/8/2007 if you know that this e-mail was written today).
So much so that I went and looked up lingpipe, but they seem not to be
concerned with dates.

Even if you don't get crazy, it's not straightforward: is 3/8/2007 March
8th or August 3rd? Dates can be written many ways. The real challenge is
recognizing dates.

As Chris said, once you have them, you just stick them in the token
stream. In fact, you can emit the date token (as Chris suggested, with
some delimiter that helps you know it's a date) with a position
increment of zero alongside the regular tokens, so that the token
stream will have both, aligned.
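
A minimal sketch of that token-stacking idea, assuming Lucene 2.x's
TokenFilter API; the date test and the normalization below are crude
placeholders for a real recognizer plus DateTools formatting. The stacked
date term is emitted right after the regular token, so its zero position
increment aligns it with that token's position:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DateInjectionFilter extends TokenFilter {
    private Token pendingDate; // stacked date token awaiting emission

    public DateInjectionFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pendingDate != null) {
            Token t = pendingDate;
            pendingDate = null;
            return t;
        }
        Token token = input.next();
        if (token != null && looksLikeDate(token.termText())) {
            // Queue a delimited date term at the same position as the
            // original token (position increment 0), as suggested above.
            Token date = new Token("__" + normalize(token.termText()),
                                   token.startOffset(), token.endOffset());
            date.setPositionIncrement(0);
            pendingDate = date;
        }
        return token;
    }

    private boolean looksLikeDate(String text) {
        return text.matches("\\d{1,2}/\\d{1,2}/\\d{4}"); // placeholder test
    }

    private String normalize(String text) {
        return text.replace('/', '-'); // stand-in for DateTools formatting
    }
}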

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 28, 2007 3:26 PM
To: Lucene Users
Subject: Re: Soliciting Design Thoughts on Date Searching







Re: Best way to returning hits after search?

2007-02-28 Thread Mohammad Norouzi

Hello,
I have implemented an IndexResultSet just like java.sql.ResultSet, with all
its methods. When I call searcher.search(...) I pass the returned Hits to my
IndexResultSet. In the IndexResultSet I have getString(String),
getString(int), getInt(), next(), previous(), absolute(), and all the other
methods of java.sql.ResultSet.
Besides, because I am using MyFaces in my application, I customized
DataModel in order to support pagination, and I keep my reader open, so
pagination works fine.
In addition, I provided a SearcherPool to keep readers open and close them
when the user ends his searching or an idle timeout occurs.


On 3/1/07, Doron Cohen [EMAIL PROTECTED] wrote:







--
Regards,
Mohammad


Performance in having Multiple Index files

2007-02-28 Thread Raaj
hi all,

I have a requirement wherein I create an index file for each XML file. I have
over 100-150 XML files, which are all related.

If I create 100-150 index files and query using these indices, will this
affect the performance of the search operation?
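
A minimal sketch of querying several indexes at once with Lucene's
MultiSearcher, assuming Lucene 2.x APIs and placeholder paths; each extra
index still costs file handles and seeks, so the count does matter:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class MultiIndexSearch {
    public static Hits search(String[] indexPaths, Query query) throws Exception {
        Searchable[] searchables = new Searchable[indexPaths.length];
        for (int i = 0; i < indexPaths.length; i++) {
            searchables[i] = new IndexSearcher(indexPaths[i]);
        }
        // MultiSearcher presents the many indexes as one logical index.
        MultiSearcher searcher = new MultiSearcher(searchables);
        return searcher.search(query);
    }
}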

bye
raaj



 