Re: Sorting by Score
: can't you pick any arbitrary marker field name (that's not a real field name) and use that?

Yes, I could. I guess you're saying that the field name doesn't matter, except that it's used for caching the comparator, right?

: ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket.

Yes, exactly. Couldn't I just do this rounding in the HitCollector, before inserting it into the FieldSortedHitQueue?

On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:
: The first part was just to iterate through the TopDocs that's available to
: me and normalize the scores right in the ScoreDocs. Like this...
Won't that be done after Lucene does the hit collecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am I missing something about your description?)
-Hoss
Re: Sorting by Score
Empirically, when I insert the elements in the FieldSortedHitQueue they get sorted according to the Sort object. The original query that gives me a TopDocs applied no secondary sorting, only relevance. Since I normalized all the scores into one of only 5 discrete values, secondary sorting was applied to all docs with the same score when I inserted them in the FieldSortedHitQueue. Now popping things off the FieldSortedHitQueue gives the ordering I want. You could just operate on the FieldSortedHitQueue at this point, but I decided the rest of my code would be simpler if I stuffed them back into the TopDocs, so there's some explanation below that you can just skip if I've cleared things up already.

* The step I left out is moving the documents from the FieldSortedHitQueue back to topDocs.scoreDocs. So the steps are as follows:

1. Bucketize the scores. That is, go through the TopDocs.scoreDocs and adjust each raw score into one of my buckets. This is made easy by the existence of topDocs.getMaxScore. TopDocs has had no sorting other than relevance applied so far.
2. Assemble the FieldSortedHitQueue by inserting each element from scoreDocs into it, with a suitable Sort object whose first field is relevance (SortField.FIELD_SCORE).
3. Pop the entries off the FieldSortedHitQueue, overwriting the elements in topDocs.scoreDocs.

Step 3 is the one I left out earlier, although I suppose you could operate directly on the FieldSortedHitQueue.

NOTE: in my case, I just put everything back in the scoreDocs without attempting any efficiencies. If I needed more performance, I'd only put as many items back as I needed to display. But as I wrote yesterday, performance isn't an issue so there's no point, although I now know one place to look if we need to squeeze more QPS. How efficient this is is an open question, but it's fast enough and relatively simple, so I stopped looking for more efficiencies.

Erick

On 2/28/07, Chris Hostetter [EMAIL PROTECTED] wrote:
: The first part was just to iterate through the TopDocs that's available to
: me and normalize the scores right in the ScoreDocs. Like this...
Won't that be done after Lucene does the hit collecting/sorting? ... he wants the bucketing to happen as part of the scoring so that the secondary sort will determine the ordering within the bucket. (or am I missing something about your description?)
-Hoss
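A minimal sketch of the three steps Erick describes, assuming the Lucene 2.1-era TopDocs/FieldSortedHitQueue API. The five-bucket rounding and the secondary sort field "title" are placeholders, not his actual code:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldSortedHitQueue;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class ScoreBucketizer {
    public static void bucketizeAndResort(IndexReader reader, TopDocs topDocs)
            throws java.io.IOException {
        ScoreDoc[] docs = topDocs.scoreDocs;
        float max = topDocs.getMaxScore();

        // Step 1: collapse raw scores into 5 discrete buckets (5 = best).
        for (int i = 0; i < docs.length; i++) {
            docs[i].score = (float) Math.ceil(5.0f * docs[i].score / max);
        }

        // Step 2: re-sort by (bucketed score, secondary field).
        SortField[] sortFields = new SortField[] {
            SortField.FIELD_SCORE,
            new SortField("title", SortField.STRING)   // placeholder secondary field
        };
        FieldSortedHitQueue queue =
            new FieldSortedHitQueue(reader, sortFields, docs.length);
        for (int i = 0; i < docs.length; i++) {
            queue.insert(docs[i]);
        }

        // Step 3: pop back into scoreDocs; pop() returns the lowest-ranked
        // entry first, so fill the array from the end.
        for (int i = docs.length - 1; i >= 0; i--) {
            docs[i] = (ScoreDoc) queue.pop();
        }
    }
}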
RE: document field updates
Are unindexed fields stored separately from the main inverted index? If so, then one could implement the field value change as a delete and re-add of just that value?

The short answer is that won't work. Field values are stored in a different data structure than the postings lists, but doc ids are consistent across all contents of a segment. Deleting something and re-adding it is going to put it into a different segment, which is going to keep this from working. (Not to mention that you want the postings lists updated if you want it to be searchable...)

Are you aware of some implementation of Lucene that solves this need well with a second index for 'tags', complete with multi-index boolean queries?

I'm pretty sure this has been done, I'm just not 100% sure where. Does Nutch index link text? I don't know if Solr has anything like this, but if I remember correctly, Collex has tags; as far as I can tell, it's not been open sourced (yet?).
RamDirectory vs IndexWriter
I don't really understand the difference between using the RAMDirectory and using IndexWriter. What's the difference between using a RAMDirectory instead of using IndexWriter with these properties set: setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1);
Re: RamDirectory vs IndexWriter
On Wednesday 28 February 2007 at 16:19, WATHELET Thomas wrote:
: I don't really understand the difference between using the RAMDirectory
: and using IndexWriter. What's the difference between using a RAMDirectory
: instead of using IndexWriter with these properties set:
: setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1);

The two classes are not designed to accomplish the same thing. The IndexWriter writes documents into a Directory, and a RAMDirectory is a special implementation of a Directory which holds the data in RAM rather than on a file system, as the FSDirectory does.

--
Nicolas LALEVÉE
Solutions Technologies
ANYWARE TECHNOLOGIES
Tel : +33 (0)5 61 00 52 90
Fax : +33 (0)5 61 00 51 46
http://www.anyware-tech.com
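To make the relationship concrete, a minimal sketch (Lucene 2.x API): the same IndexWriter code runs against either Directory implementation, and only the storage location differs. The path is a placeholder:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryDemo {
    static void index(Directory dir) throws Exception {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "hello world",
                Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        index(new RAMDirectory());                            // held entirely in memory
        index(FSDirectory.getDirectory("/tmp/index", true));  // persisted on disk
    }
}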
Merge Indexes - addIndexes
Hi,

I store the Lucene index of my web applications in a file system. Often, I need to add to this index another index also stored on a file system. I have three questions:

* What is the best way to do this? Open an IndexReader on this incoming index and use addIndexes(IndexReader[] readers)? (where I will have one IndexReader in the array each time)
* Which files do I need? I see in the file system that the following is stored: segments, deletable, _1.cfs, untokenizedFieldNames.txt, stopWordList.txt, analyzerType.txt
* Can the merge be time consuming? What happens when a user runs a query in my search engine while I'm merging the indexes with the method addIndexes(IndexReader[] readers)?

Thank you for any help!

Matt
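A sketch of the addIndexes(IndexReader[]) route (Lucene 2.x API; the paths are placeholders). Searches running against an already-open IndexSearcher keep seeing the old snapshot of the index until that searcher is reopened:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // false = append to the existing main index rather than recreate it.
        IndexWriter writer = new IndexWriter("/data/main-index",
                new StandardAnalyzer(), false);

        IndexReader incoming = IndexReader.open("/data/incoming-index");
        try {
            // Merges the incoming segments into the main index.
            writer.addIndexes(new IndexReader[] { incoming });
        } finally {
            incoming.close();
            writer.close();
        }
    }
}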
RE: RamDirectory vs IndexWriter
I think I expressed myself badly. In both cases I use the IndexWriter class, but in one case I use it with a RAMDirectory and in the other with an FSDirectory (index = new IndexWriter(ram OR fsdir, analyzer, true)). I use the RAMDirectory class to avoid frequent disk access. But I have noticed that when using FSDirectory and setting setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1), I get more or less the same behaviour. Am I on the right track?

-----Original Message-----
From: Nicolas Lalevée [mailto:[EMAIL PROTECTED]]
Sent: 28 February 2007 16:29
To: java-user@lucene.apache.org
Subject: Re: RamDirectory vs IndexWriter

: The two classes are not designed to accomplish the same thing. The IndexWriter
: writes documents into a Directory, and a RAMDirectory is a special implementation
: of a Directory which holds the data in RAM rather than on a file system, as the
: FSDirectory does.
: --
: Nicolas LALEVÉE
Re: RamDirectory vs IndexWriter
I guess it depends upon your goal. If you're asking what the difference is between writing to a RAMDirectory *then* flushing to an FSDirectory, I don't believe there's much, if any. As I remember (and my memory isn't always...er...accurate), there's been discussion on this thread by those who know that underneath the covers an FSDirectory uses a RAMDirectory for a while, then flushes it to disk.

If you're asking what the difference is between a RAMDirectory and an FSDirectory, that's another story.

Erick

On 2/28/07, WATHELET Thomas [EMAIL PROTECTED] wrote:
: I don't really understand the difference between using the RAMDirectory
: and using IndexWriter. What's the difference between using a RAMDirectory
: instead of using IndexWriter with these properties set:
: setMergeFactor(1000); setMaxMergeDocs(1); setMaxBufferedDocs(1);
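For completeness, the "build in RAM, then move to disk" pattern usually looks something like this sketch (Lucene 2.x API; the path and document array are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class RamThenFlush {
    static void flushToDisk(Document[] docs) throws Exception {
        // Build the index entirely in memory first.
        Directory ram = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
        for (int i = 0; i < docs.length; i++) {
            ramWriter.addDocument(docs[i]);
        }
        ramWriter.close();

        // Merge the in-memory segments into the on-disk index in one pass.
        IndexWriter fsWriter = new IndexWriter("/data/index",
                new StandardAnalyzer(), false);
        fsWriter.addIndexes(new Directory[] { ram });
        fsWriter.close();
    }
}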
Filtering results on a Field
Hey guys, I want to filter a result set on a particular field. I have code like this:

try {
    PhraseQuery textQuery = new PhraseQuery();
    PhraseQuery titleQuery = new PhraseQuery();
    PhraseQuery catQuery = new PhraseQuery();
    textQuery.setSlop( 20 );
    titleQuery.setSlop( 4 );
    for( int k = 0; k < phrase.length; k++ ) {
        textQuery.add( new Term( NAME, phrase[k] ) );
        titleQuery.add( new Term( REVIEW, phrase[k] ) );
    }
    bQuery.add( textQuery, BooleanClause.Occur.SHOULD );
    bQuery.add( titleQuery, BooleanClause.Occur.SHOULD );
    if( category != null && !category.equals("") ) {
        catQuery.add( new Term( TYPE, category ) );
        bQuery.add( catQuery, BooleanClause.Occur.MUST );
    }
} catch( Exception e ) {
    throw new RuntimeException( "Unable to make any sense of the query.", e );
}

Now the problem is it's getting all results for a particular category regardless of whether the phrase is in the title or text field, which makes sense as the other two have SHOULD clauses. The problem is I cannot set a MUST clause on the other two fields, as I need to match either one of them. So what I want is: either title or text MUST have it, and if category is not null it MUST have the category string passed. Any ideas?
Re: Filtering results on a Field
When you have a category, add the pair of clauses as a sub-Boolean query. Something like...

try {
    PhraseQuery textQuery = new PhraseQuery();
    PhraseQuery titleQuery = new PhraseQuery();
    PhraseQuery catQuery = new PhraseQuery();
    textQuery.setSlop( 20 );
    titleQuery.setSlop( 4 );
    bQueryPair = new BooleanQuery();
    bQueryAll = new BooleanQuery();
    for( int k = 0; k < phrase.length; k++ ) {
        textQuery.add( new Term( NAME, phrase[k] ) );
        titleQuery.add( new Term( REVIEW, phrase[k] ) );
    }
    bQueryPair.add( textQuery, BooleanClause.Occur.SHOULD );
    bQueryPair.add( titleQuery, BooleanClause.Occur.SHOULD );
    if( category != null && !category.equals("") ) {
        catQuery.add( new Term( TYPE, category ) );
        bQueryAll.add( catQuery, BooleanClause.Occur.MUST );
        bQueryAll.add( bQueryPair, BooleanClause.Occur.MUST );
    } else {
        bQueryAll = bQueryPair;
    }
} catch( Exception e ) {
    throw new RuntimeException( "Unable to make any sense of the query.", e );
}

and use bQueryAll in your query. And please be waaay more elegant than the code I wrote <G>.

Erick

On 2/28/07, Ismail Siddiqui [EMAIL PROTECTED] wrote:
: Hey guys, I want to filter a result set on a particular field. I have code like this:
: ...
: Now the problem is it's getting all results for a particular category regardless of
: whether the phrase is in the title or text field, which makes sense as the other two
: have SHOULD clauses. The problem is I cannot set a MUST clause on the other two
: fields, as I need to match either one of them. So what I want is: either title or text
: MUST have it, and if category is not null it MUST have the category string passed.
: Any ideas?
Re: Filtering results on a Field
Thanks a lot.

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:
: When you have a category, add the pair of clauses as a sub-Boolean query.
: Something like...
: ...
: and use bQueryAll in your query. And please be waaay more elegant than the
: code I wrote <G>.
: Erick
Re: indexing and searching the document title question
I found the problem! I did not realize using a HitCollector would return things in an unsorted order. I was using the HitCollector to try to maximize performance by only returning the documents that I needed (which page of the results, and how many per page).

-Phillip

----- Original Message -----
From: Daniel Naber [EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Tuesday, February 27, 2007 5:33:01 PM (GMT-0500) America/New_York
Subject: Re: indexing and searching the document title question

On Tuesday 27 February 2007 23:07, Phillip Rhodes wrote:
: NAME:color me mine^2.0 (CONTENTS:color CONTENTS:me CONTENTS:mine)

Try a (much) higher boost like 20 or 50, does that help?

Regards
Daniel

--
http://www.danielnaber.de
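If paging is the only goal, a plain TopDocs search already comes back ranked by score, so something like the following sketch (Lucene 2.x API; the page arithmetic is illustrative only) avoids a raw HitCollector entirely:

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class Paging {
    static Document[] page(IndexSearcher searcher, Query query,
                           int pageNum, int pageSize) throws Exception {
        // Ask for just enough hits to cover the requested page.
        int needed = (pageNum + 1) * pageSize;
        TopDocs top = searcher.search(query, null, needed);   // sorted by score
        int start = pageNum * pageSize;
        int end = Math.min(top.scoreDocs.length, needed);
        Document[] result = new Document[Math.max(0, end - start)];
        for (int i = start; i < end; i++) {
            result[i - start] = searcher.doc(top.scoreDocs[i].doc);
        }
        return result;
    }
}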
RE: Soliciting Design Thoughts on Date Searching
Walt, I am no expert, but it sounds like you need to associate many dates with a single record. Can this be handled as you would a synonym? Basically add a token at the same offset as the row itself, i.e. you would have a record that also has a date field with 3 offsets treated as synonyms (basically setPositionIncrement(0)?). Just thinking out loud.

Tom

-----Original Message-----
From: Walt Stoneburner [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 28, 2007 2:13 PM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

Been searching http://www.gossamer-threads.com/lists/lucene/java-user/ as Erick suggested; man, is there a wealth of information in the Lucene archives. I have found many examples of how to convert text to dates and back, how to search Date fields for various ranges, and so forth -- but I don't think this is what I'm looking for. That material assumes I have a single date, such as a last-modified date, that it's stored in a date field, and that I'm searching that field.

What I'm looking to do is different. I have generic material that _contains_ dates: historic time lines, certificates, news articles, forms, deeds, testimonies, and wildly free-form genealogical information. The dates have no specific structure, obvious context, nor consistency. Finding relevant material would be trivial if those dates were easily cherry-picked out and placed in a date field. But they're not. A given document can have any number of embedded dates, provided for any reason, and I'm interested in locating things which mention any date, potentially within a range.

The issue isn't in using DateRange on a Date field, but in knowing if there is some filter that already exists which extracts dates from a body of text to put into a Date field. If not, the DateTools solution is a helpful step in building my own filter; I just don't want to reinvent the wheel if it already exists.

Now this is where my personal knowledge of Lucene breaks down. Assuming I can extract each date from a source's body and convert it to a usable format, can a Lucene Date field hold more than one date? For example, is it a strict name/value pair, or can the value be an array of dates, or can I append additional dates under the same name?

Super-generalizing, to break the discussion from a date-specific example, suppose I did this:

document.add( Field.Text( "title", "Learning Perl, Fourth Edition" ) ); // real title
document.add( Field.Text( "title", "Camel Book" ) ); // my wife knows it by the cover

Could I do a search for both the long and short title against the title field? If the answer is yes, problem solved! I'll just pile on a ton of dates as I find them and add them to the document. (Note, I could easily have hundreds.)

for ( Date somedate : allDatesFoundInSource ) {
    document.add( Field.Text( "embeddedDates", somedate ) ); // Right way to do this?
}

If the answer is no, it better illustrates the problem I face: searching across an arbitrary collection of dates. Erick, if I've missed something obvious in the archives, I'll happily accept my public flogging. Thanks for your help so far.

-wls
Re: optimizing single document searches
On Wednesday 28 February 2007 01:01, Russ wrote:
: I will definitely check it out tomorrow. I also forgot to mention that I am not
: interested in the hits themselves, only whether or not there was a hit. Is there
: something I can use that's optimized for this scenario, or should I look into
: rewriting the search method of the IndexSearcher? Currently I just check
: hits.size().

For a single document: get the Scorer from the Query via Weight. Then check the return value of Scorer.next(); it will indicate whether the only doc matches the query.

Regards,
Paul Elschot

: Russ
: Sent wirelessly via BlackBerry from T-Mobile.
:
: -----Original Message-----
: From: Erick Erickson [EMAIL PROTECTED]
: Date: Tue, 27 Feb 2007 18:49:45
: To: java-user@lucene.apache.org
: Subject: Re: optimizing single document searches
:
: Which is very, very cool. I wound up using it for hit counting and it works like a charm.
:
: On 2/27/07, karl wettin [EMAIL PROTECTED] wrote:
: : On 28 Feb 2007, at 00:25, Ruslan Sivak wrote:
: : : On a single document of 10k characters, doing about 40k searches takes
: : : about 5 seconds. This is not bad, but I was wondering if I can somehow
: : : speed this up.
: :
: : Your corpus contains only one document? Try contrib/memory, an index
: : optimized for that scenario.
: :
: : -- karl
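For the contrib/memory route Karl mentions, a minimal sketch (Lucene 2.x, org.apache.lucene.index.memory.MemoryIndex; the field name, text, and query string are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class SingleDocMatch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Build a one-document in-memory index.
        MemoryIndex index = new MemoryIndex();
        index.addField("content", "some 10k character document text ...", analyzer);

        Query query = new QueryParser("content", analyzer).parse("document AND text");

        // search() returns a score; only the yes/no answer is needed here.
        boolean matched = index.search(query) > 0.0f;
        System.out.println(matched);
    }
}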
Re: Sorting by Score
Erick,

Yes, this seems to be the simplest way to implement score 'bucketization', but wouldn't it be more efficient to do this with a custom ScoreComparator? That way, you'd do the bucketizing and sorting in one 'step' (compare()). Maybe the savings isn't measurable, though. A comparator might also allow one to do more sophisticated rounding or bucketizing, since you'd be getting 2 scores at a time.

Peter

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:
: Empirically, when I insert the elements in the FieldSortedHitQueue they get
: sorted according to the Sort object. ...
: 1. Bucketize the scores. ...
: 2. Assemble the FieldSortedHitQueue by inserting each element from scoreDocs
: into it, with a suitable Sort object whose first field is relevance (SortField.FIELD_SCORE).
: 3. Pop the entries off the FieldSortedHitQueue, overwriting the elements in
: topDocs.scoreDocs.
: ...
: Erick
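A rough sketch of that idea using the Lucene 2.x SortComparatorSource/ScoreDocComparator interfaces; the bucket count, the maxScore handling, and the comparator ordering convention are assumptions worth double-checking against the javadoc:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

public class BucketedScoreComparatorSource implements SortComparatorSource {
    private final float maxScore;   // e.g. topDocs.getMaxScore(), supplied by the caller

    public BucketedScoreComparatorSource(float maxScore) {
        this.maxScore = maxScore;
    }

    private int bucket(float score) {
        return (int) Math.ceil(5.0f * score / maxScore);   // 5 buckets, 5 = best
    }

    public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
            throws IOException {
        return new ScoreDocComparator() {
            public int compare(ScoreDoc i, ScoreDoc j) {
                // Bucketize and compare in one step; higher bucket ranks first.
                return bucket(j.score) - bucket(i.score);
            }
            public Comparable sortValue(ScoreDoc i) {
                return new Integer(bucket(i.score));
            }
            public int sortType() {
                return SortField.CUSTOM;
            }
        };
    }
}

Assuming the SortField(String, SortComparatorSource) constructor, the Sort could then be built from new SortField("score_bucket", new BucketedScoreComparatorSource(maxScore)) followed by the secondary field, where "score_bucket" is just the arbitrary marker name discussed earlier in the thread.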
Re: Sorting by Score
It may well be, but as I said, this is efficient enough for my needs so I didn't pursue it. One of my pet peeves is spending time making things more efficient when there's no need, and my index isn't going to grow enough to worry about that now <G>...

Erick

On 2/28/07, Peter Keegan [EMAIL PROTECTED] wrote:
: Yes, this seems to be the simplest way to implement score 'bucketization',
: but wouldn't it be more efficient to do this with a custom ScoreComparator?
: That way, you'd do the bucketizing and sorting in one 'step' (compare()).
: Maybe the savings isn't measurable, though.
: Peter
Re: Soliciting Design Thoughts on Date Searching
Hello,

There are a few ways to solve this, but no date-extraction filter I know of. Adding a hundred fields to each Lucene doc seems bloated. First, get your text out of the various source documents (.doc, .pdf, .html) using the available tools described in the Lucene in Action book. It sounds like you know Perl, so next try regexes to pull the dates out of the text using java.util.regex, and make sure to remove extra whitespace. Put your clean date Strings into a java TreeMap or TreeSet collection to eliminate duplicates. Finally, loop through the collection adding items to a StringBuffer delimited by commas, then make one long String (holding all your dates) and add it to the Lucene doc as one Field.Text. You might be able to set that Field to indexed, but not stored, to save space.

Regards,
Peter W.

On Feb 28, 2007, at 11:22 AM, Aigner, Thomas wrote:
: Walt, I am no expert, but it sounds like you need to associate many dates
: with a single record. ...
: Tom
:
: -----Original Message-----
: From: Walt Stoneburner [mailto:[EMAIL PROTECTED]]
: Sent: Wednesday, February 28, 2007 2:13 PM
: To: java-user@lucene.apache.org
: Subject: Re: Soliciting Design Thoughts on Date Searching
: ...
: The issue isn't in using DateRange on a Date field, but in knowing if there
: is some filter that already exists which extracts dates from a body of text
: to put into a Date field. If not, the DateTools solution is a helpful step in
: building my own filter; I just don't want to reinvent the wheel if it already
: exists.
: ...
: Could I do a search for both the long and short title against the title field?
: If the answer is yes, problem solved! I'll just pile on a ton of dates as I find
: them and add them to the document. (Note, I could easily have hundreds.)
: ...
: -wls
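A small sketch of that recipe (Lucene 2.x API). The regex only catches one simple slash-date shape and is a placeholder for real date patterns and normalization:

import java.util.Iterator;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class DateFieldBuilder {
    private static final Pattern SIMPLE_DATE =
            Pattern.compile("\\b\\d{1,2}/\\d{1,2}/\\d{4}\\b");

    static void addDates(Document doc, String text) {
        TreeSet dates = new TreeSet();          // de-duplicates and sorts
        Matcher m = SIMPLE_DATE.matcher(text);
        while (m.find()) {
            dates.add(m.group().trim());        // strip extra whitespace
        }

        // Join all dates into one long, comma-delimited String.
        StringBuffer sb = new StringBuffer();
        for (Iterator it = dates.iterator(); it.hasNext();) {
            if (sb.length() > 0) sb.append(", ");
            sb.append(it.next());
        }

        // Indexed but not stored, as suggested, to save space.
        doc.add(new Field("embeddedDates", sb.toString(),
                Field.Store.NO, Field.Index.TOKENIZED));
    }
}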
Re: Best way to return hits after search?
Antony Bowesman [EMAIL PROTECTED] wrote on 27/02/2007 17:37:41:
: Doron Cohen wrote:
: : The collect() method is going to be invoked once for each document that
: : matches the query (having a nonzero score). If the index is very large, that
: : may turn out to be a very large number of calls. Often, search applications
: : fetch additional data (doc fields) for only a small subset of the entire set
: : of documents matching a query - e.g. first page (0-9), second page (10-19),
: : etc. But if your application is going to fetch in an exhaustive manner, and
: : especially for a short field like DB_ID, it sometimes makes sense to cache
: : the entire field in memory (its values for all the docs), for the entire life
: : of the index reader/searcher, and use that cached data. The collect method
: : can then use that cached data.
:
: That's an excellent idea! We cannot easily change our client implementation,
: so we have to support the exhaustive retrieval for now, although I do limit the
: absolute max hits that will be returned. We are hoping to implement paging in
: a later client version. I'm not sure I can cache all the GUIDs though. A GUID is
: 20 bytes and there are two that need to be cached. The document count could
: be up to 100M, though in most cases 20M. I am keeping a BitSet filter cache for
: a searcher for each user's mail, so I could extend that to cache all the IDs for
: that user and give that cache a shortish life and/or limit the total cache
: available. That would really help. I'll have a play - thanks for the input.
: Antony

If you decide to cache stored field values in memory, FieldCache may be useful for this - so you don't have to implement your own cache - you can access the field values with something like:

FieldCache fieldCache = FieldCache.DEFAULT;
String db_id_field[] = fieldCache.getStrings(indexReader, DB_ID_FIELD_NAME);

Those values are valid for the lifetime of the index reader. Once a new index reader is opened, when GC collects the unused old index reader object, it will also be able to collect (from the cache) the unused field values.

See also http://www.gossamer-threads.com/lists/lucene/java-user/39352

Doron
ranking/scoring algorithm in detail
Hi,

Does anyone know of a written document that describes in some detail how Lucene's ranking/scoring algorithm works? I'm safely assuming that a single consistent algorithm is used to compute the score of each matching document (with or without explicit boost factors in the query) and rank them accordingly. I would appreciate any pointer to such information, or your own description if you happen to know it. Thanks in advance.

/Jong
Re: Soliciting Design Thoughts on Date Searching
: I have generic material that _contains_ dates: historic time lines,
: certificates, news articles, forms, deeds, testimonies, and wildly
: free form genealogical information. The dates have no specific
: structure, obvious context, nor consistency.

Identifying and extracting dates from bulk text sounds like a pretty interesting analysis problem ... if you wrote a Tokenizer that could recognize dates, you could then format them using something like DateTools to ensure it would be easy to find them ... but Lucene Analyzers cannot currently create terms in multiple fields - so if you wanted a special date field for each doc, you would have to extract those dates in a preprocessing step.

If you aren't picky about how your index is stored, however, there is no reason why you can't have a single field with your text terms and your date terms ... you would just have to be careful to know the difference when searching ... make your analyzer prefix all of your date terms with something it would never let your regular terms start with (i.e. "__") and make sure you bear that structure in mind when creating your RangeFilter on dates.

: Now this is where my personal knowledge of Lucene breaks down.
: Assuming I can extract each date from a source's body and convert it
: to a usable format, can a Lucene Date field hold more than one date?

Fields can contain as many values as you want -- or none at all.

: If the answer is yes, problem solved! I'll just pile on a ton of

Definitely yes.

-Hoss
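A minimal illustration of "fields can contain as many values as you want" (Lucene 2.x API), reusing the title example from Walt's message:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class MultiValuedField {
    public static void main(String[] args) {
        Document doc = new Document();
        // Adding the same field name twice gives the field two values.
        doc.add(new Field("title", "Learning Perl, Fourth Edition",
                Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("title", "Camel Book",
                Field.Store.YES, Field.Index.TOKENIZED));
        // A search on title:"camel book" or on title:"learning perl"
        // can now match the same document.
    }
}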
Re: ParallelSearcher in multi-node environment
: I want to execute parallel search over several machines. But
: ParallelSearcher doesn't look perfect. It creates threads and spawns many
: requests to the underlying Searchables (over a network) for a single search.
: Is there a decent implementation of the parallel search over remote indexes
: somewhere?

What would you consider a decent implementation of a parallel search? ... how could it be done in parallel without spawning separate threads for each sub-Searchable?

-Hoss
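For reference, a rough sketch of the pieces that already ship with Lucene 2.x for this: RemoteSearchable exposes a Searchable over RMI, and ParallelMultiSearcher queries each node in its own thread (which is exactly the design being questioned above). The RMI names are placeholders and assume each node has registered a RemoteSearchable:

import java.rmi.Naming;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class RemoteParallelSearch {
    static Hits search(Query query) throws Exception {
        // Look up the remote Searchable stubs published by each node.
        Searchable node1 = (Searchable) Naming.lookup("//search1/index");
        Searchable node2 = (Searchable) Naming.lookup("//search2/index");

        ParallelMultiSearcher searcher =
                new ParallelMultiSearcher(new Searchable[] { node1, node2 });
        return searcher.search(query);   // one thread per remote node
    }
}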
RE: ranking/scoring algorithm in detail
http://lucene.apache.org/java/docs/scoring.html

(which you can also find by googling "lucene scoring")

-----Original Message-----
From: Jong Kim [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 28, 2007 2:21 PM
To: java-user@lucene.apache.org
Subject: ranking/scoring algorithm in detail

: Hi,
: Does anyone know of a written document that describes in some detail how
: Lucene's ranking/scoring algorithm works? ...
: /Jong
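For reference, the practical scoring function that page documents for the default Similarity is roughly the following (restated here, so treat the page itself as authoritative):

\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \big)

where tf and idf are the usual term-frequency and inverse-document-frequency factors, boost(t) is the query-time boost on term t, norm(t,d) folds in index-time field boosts and length normalization, coord(q,d) rewards documents matching more of the query terms, and queryNorm(q) is a query-level normalization factor.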
RE: Soliciting Design Thoughts on Date Searching
Yeah, date finding is a little like entity extraction, since dates can have many formats, depending on how crazy you want to get ("a week from tomorrow" is 3/8/2007 if you know that this e-mail was written today). So much so that I went and looked up LingPipe, but they seem not to be concerned with dates. Even if you don't get crazy, it's not straightforward: is 3/8/2007 March 8th or August 3rd? Dates can be written many ways. The real challenge is recognizing dates. As Chris said, once you have them, you just stick them in the token stream. In fact, you can emit the date token (as Chris suggested, with some delimiter that helps you know it's a date) with a position increment of zero and then emit the regular tokens, so that the token stream has both, aligned.

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 28, 2007 3:26 PM
To: Lucene Users
Subject: Re: Soliciting Design Thoughts on Date Searching

: Identifying and extracting dates from bulk text sounds like a pretty interesting
: analysis problem ... if you wrote a Tokenizer that could recognize dates, you
: could then format them using something like DateTools to ensure it would be
: easy to find them ...
: ...
: -Hoss
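A rough sketch of that "same position" trick as a TokenFilter (Lucene 2.x TokenStream API). It assumes the upstream tokenizer delivers a slash-date as a single token, uses the "__" prefix Hoss suggested, and leaves real normalization (day/month ambiguity, DateTools formatting) as a placeholder:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DateMarkerFilter extends TokenFilter {
    private final LinkedList pending = new LinkedList();

    public DateMarkerFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token t = input.next();
        if (t == null) return null;
        String text = t.termText();
        if (text.matches("\\d{1,2}/\\d{1,2}/\\d{4}")) {
            // Queue a normalized, prefixed date term at the same position
            // as the original token.
            Token dateToken = new Token("__" + normalize(text),
                    t.startOffset(), t.endOffset());
            dateToken.setPositionIncrement(0);
            pending.addLast(dateToken);
        }
        return t;
    }

    // Placeholder: a real filter would resolve day/month ambiguity and
    // format the result with something like DateTools.
    private String normalize(String raw) {
        return raw;
    }
}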
Re: Best way to return hits after search?
Hello,

I have implemented an IndexResultSet, just like java.sql.ResultSet with all its methods. When I call searcher.search(...), I pass the returned Hits to my IndexResultSet. In the IndexResultSet I have getString(String), getString(int), getInt(), next(), previous(), absolute() and all the methods of java.sql.ResultSet. Besides, because I am using MyFaces in my application, I customized DataModel in order to support pagination, and I keep my reader open, so pagination works fine. In addition, I provided a SearcherPool to keep readers open and close them when the user ends searching or an idle timeout occurs.

On 3/1/07, Doron Cohen [EMAIL PROTECTED] wrote:
: If you decide to cache stored field values in memory, FieldCache may be
: useful for this - so you don't have to implement your own cache - you can
: access the field values with something like:
: FieldCache fieldCache = FieldCache.DEFAULT;
: String db_id_field[] = fieldCache.getStrings(indexReader, DB_ID_FIELD_NAME);
: ...
: Doron

--
Regards,
Mohammad
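A rough sketch of the kind of wrapper being described (not the poster's actual code, and covering only a few of the java.sql.ResultSet-style methods):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class IndexResultSet {
    private final Hits hits;
    private int cursor = -1;          // before the first row, like java.sql.ResultSet

    public IndexResultSet(Hits hits) {
        this.hits = hits;
    }

    public boolean next() {
        return ++cursor < hits.length();
    }

    public boolean previous() {
        return --cursor >= 0;
    }

    public boolean absolute(int row) {
        cursor = row - 1;             // ResultSet rows are 1-based
        return cursor >= 0 && cursor < hits.length();
    }

    public String getString(String fieldName) throws java.io.IOException {
        Document doc = hits.doc(cursor);   // lazily loads the stored document
        return doc.get(fieldName);
    }

    public int getInt(String fieldName) throws java.io.IOException {
        return Integer.parseInt(getString(fieldName));
    }
}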
Performance in having Multiple Index files
Hi all,

I have a requirement where I create an index file for each XML file. I have over 100/150 XML files, which are all related. If I create 100/150 index files and query using these indices, will this affect the performance of the search operation?

Bye,
Raaj
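If the indexes are kept separate, one way to query them together is a MultiSearcher over one IndexSearcher per index (Lucene 2.x API; the paths are placeholders). Whether this is fast enough with 100-150 underlying indexes is exactly the open question, so merging them into a single index may still be preferable. A sketch:

import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searchable;

public class SearchManyIndexes {
    static Hits search(String[] indexPaths, Query query) throws Exception {
        // One searcher per on-disk index.
        Searchable[] searchers = new Searchable[indexPaths.length];
        for (int i = 0; i < indexPaths.length; i++) {
            searchers[i] = new IndexSearcher(indexPaths[i]);
        }
        // MultiSearcher presents them as a single logical index.
        return new MultiSearcher(searchers).search(query);
    }
}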