RE: Lucene Help
Hi, thank you for your help. I just downloaded lucene-1.4.3 and I want to run the demo. If you don't mind, please tell me how to run this demo file. The demo folder contains one org folder and the files Search.html and Search.jhtml. Thanking you, Shajahan Shaik. -- View this message in context: http://www.nabble.com/Lucene-Help-t1442764.html#a3927354 Sent from the Lucene - Java Users forum at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Seaches VS. Relational database Queries
On Saturday 15 April 2006 03:36, Jeryl Cook wrote: I'm the co-worker who suggested this to Ananth (I think we have been debating this for 3 days now; from the post it seems he is winning :)...). Anyway, as Ananth stated, I suggested this because I am wondering if Lucene could solve a bottleneck query that is taking a deathly long time to complete (read-only). The original design actually generated 60+ threaded queries on the database to return results per user thread that hit our website for this view. I know that this will kill our server when user load increases. I know that Lucene is built for speed and can handle a very large number of people searching (we are using a singleton Searcher), and

One way to have more queries per second with a singleton Searcher is by merging the retrievals of documents for multiple queries. This will increase query throughput (less disk head movement), but it will also increase the response time for the individual queries.

the (threaded) results will be the hits returned from Lucene. Also, this query will NOT be executed by any user in a text field, but rather in our application code, only when the user selects different parts of the site. If all values in this 1:n relationship we are trying to query are in Lucene, then the application-provided query will return accurate results.

To follow 1:n relationships, avoid using Hits; use your own HitCollector instead. From application code, try to use TermDocs from the index reader.

We are using Quartz, and not creating threads in servlets... FINAL SOLUTION MAYBE?: if our client EVER gives us a requirement that says we must have accurate text searching even if something in our index has a 1:n relationship (e.g. Jason and Jason Black), then we should just simply say we cannot implement this, because a Lucene search will yield inaccurate results. Correct??? Comments?
Assuming I understand the problem correctly, one can solve this by indexing such fields twice: once as a keyword to search for the specific individual, and once with its individual indexed terms to search for name(s). In both fields one could use an extra word from a relational DB, for example a client id. Regards, Paul Elschot. View this message in context: http://www.nabble.com/Lucene-Seaches-VS.-Relational-database-Queries-t1434583.html#a3925693 Sent from the Lucene - Java Users forum at Nabble.com.
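A sketch of Paul's double-indexing idea against the Lucene 1.9 field API; the field names (`name_exact`, `name`, `client_id`) and the client-id term are made-up examples, not anything prescribed by Lucene:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class NameDoc {
    static Document makeDoc(String fullName, String clientId) {
        Document doc = new Document();
        // Untokenized "keyword" copy: matches only the exact individual,
        // so "Jason" will not match a document indexed for "Jason Black".
        doc.add(new Field("name_exact", fullName, Field.Store.NO,
                Field.Index.UN_TOKENIZED));
        // Tokenized copy: matches individual name words ("Jason", "Black").
        doc.add(new Field("name", fullName, Field.Store.YES,
                Field.Index.TOKENIZED));
        // Extra untokenized term tying the document back to the
        // relational DB record, as Paul suggests.
        doc.add(new Field("client_id", clientId, Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        return doc;
    }
}
```

An exact-match query would then be a TermQuery on `name_exact`, while word searches go against `name`; the two fields coexist in the same document.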
Re: Catching BooleanQuery.TooManyClauses
With the warning that I'm not the most experienced Lucene user in the world... I *think* that, rather than searching for each term, it's more efficient to just use IndexReader.termDocs, i.e.:

IndexReader ir = whatever;
TermDocs termDocs = ir.termDocs();
WildcardTermEnum wildEnum = whatever;
for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
    termDocs.seek(term);
    while (termDocs.next()) {
        Document doc = ir.document(termDocs.doc());
    }
}

I know that for loop looks odd, but I just peeked at the source code for the TermEnum classes and see why it works. One warning, as the folks on the board have pointed out to me, is that the Hits object is not entirely efficient when you fetch lots of docs (more than 100 has been mentioned), and you should think about TopDocs or some such. Also, if you can avoid fetching the document (i.e. get everything you want from the index) you'll add efficiency. I have no clue how much you're returning to the user, so I don't know whether that would work for you. Hope this helps. Erick

P.S. I feel kind of odd writing things like this given that Chris, Yonik, Erik etc. are looking over my shoulder, but if I actually offer good advice, maybe I can save them some time since they've certainly helped me out. And if they make alternate suggestions, they'll be doing code reviews for me! Cool!
Re: Lucene Help
What I did was create a project from existing source in Eclipse (gave it the path to the demo folder), imported the Lucene jar file, and ran the application. As far as I can tell, the only required library is the Lucene jar file (I was using 1.9, but that shouldn't matter). I freely admit that the things I don't know about building Java applications are many, but if you're building other Java applications, this should follow a familiar pattern and build easily in whatever your favorite development environment is. Best, Erick
Why is BooleanQuery.maxClauseCount static?
What was the thinking behind making the BooleanQuery maxClauseCount a static? Or, I guess more to the point, why not an instance setting as well? Not trying to point out a flaw, just curious about the original thinking behind the setting. I have a situation where I have a set of BooleanQueries that use a high number of clauses, but another set that needs a low number of clauses (different indexes searched, and efficiencies dictate the high/low clause range.) cheers, jeff
We are looking for Lucene Developer in Pune-India
Hello, we are looking to add a Lucene/J2EE developer to our core engineering team at Betterlabs, Pune, India. Interested candidates can send a resume to [EMAIL PROTECTED]. Regards, Satish
java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock
Hi all, I am very new to Lucene. I am using it in my application to index and search through text files, and my program is more or less similar to the demo provided with the Lucene distribution. Initially everything was working fine, but today while running the application I have been getting this exception whenever I try to read or write to the index: java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock. I am unable to understand why this is happening. Is there some mistake I am making in the code? I haven't changed any code, and it was working smoothly up until today! My version of Lucene is 1.9.1. I deleted the index directory and tried again, and voila, now it works again! But since I am going to be delivering my application, I would really like to know why this was happening, to guard against it. Thanks -- Puneet
Re: java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock
My guess is that you are creating two IndexWriters on the same directory; that is the reason for the problem, and one of them holds the lock. Rgds, Prabhu
Re: Using Lucene for searching tokens, not storing them.
On 14 Apr 2006, at 18.31, Doug Cutting wrote: karl wettin wrote: I would like to store everything in my application rather than using the Lucene persistence mechanism for tokens. I only want the search mechanism. I do not need the IndexReader and IndexWriter, as that will be a natural part of my application. I only want to use the Searchable.

Implement the IndexReader API, overriding all of the abstract methods. That will enable you to search your index using Lucene's search code.

This was not even half as tough as I thought it would be. I'm however not certain about a couple of methods:

1. TermPositions. It returns the next position of *what* in the document? It would make sense to me if it returned a start/end offset, but this just confuses me.

implements TermPositions {
    /** Returns next position in the current document. It is an error to call
     * this more than {@link #freq()} times without calling {@link #next()}.
     * <p>This is invalid until {@link #next()} is called for the first time. */
    public int nextPosition() throws IOException {
        return 0; // todo
    }

2. Norms. I've been looking in other code, but I honestly don't understand what data they are storing, thus it's really hard for me to implement :-) I read it as it contains the boost of each document per field? So what does the byte represent then?

/** Returns the byte-encoded normalization factor for the named field of
 * every document. This is used by the search code to score documents.
 * @see org.apache.lucene.document.Field#setBoost(float) */
public byte[] norms(String field) {
    return null; // todo
}

/** Reads the byte-encoded normalization factor for the named field of every
 * document. This is used by the search code to score documents.
 * @see org.apache.lucene.document.Field#setBoost(float) */
public void norms(String field, byte[] bytes, int offset) throws IOException {
    // todo
}

/** Implements setNorm in subclass. */
protected void doSetNorm(int doc, String field, byte value) throws IOException {
    // todo
}

3. I presume I can just ignore the following methods:

/** Implements deletion of the document numbered <code>docNum</code>.
 * Applications should call {@link #delete(int)} or
 * {@link #delete(org.apache.lucene.index.Term)}. */
protected void doDelete(int docNum) { }

/** Implements actual undeleteAll() in subclass. */
protected void doUndeleteAll() { }

/** Implements commit. */
protected void doCommit() { }

/** Implements close. */
protected void doClose() { }
Re: java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock
Could it just be that the application was not shut down properly? If you dare, check for locks and remove them when you start your application. Note that both IndexReader and IndexWriter can produce a write lock. On 15 Apr 2006, at 18.56, Raghavendra Prabhu wrote: You are creating two IndexWriters on the same directory, I guess; that is the reason for the problem, and one of them holds the lock.
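Karl's "check for locks and remove them" suggestion might look like this against the Lucene 1.9 API (the index path is a placeholder; this is only safe if you are certain no other live process is actually writing to the index):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LockCleaner {
    public static void main(String[] args) throws Exception {
        // Open the existing index directory (false = do not create).
        Directory dir = FSDirectory.getDirectory("/path/to/index", false);
        // If a previous run died without closing its IndexWriter (or a
        // modifying IndexReader), a stale write lock can remain on disk.
        if (IndexReader.isLocked(dir)) {
            // Forcibly release the lock before opening a new writer.
            IndexReader.unlock(dir);
        }
        dir.close();
    }
}
```

Running this at application startup, before constructing the first IndexWriter, would avoid the "Lock obtain timed out" exception after a crash without deleting the whole index.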
Re: Catching BooleanQuery.TooManyClauses
On Saturday 15 April 2006 13:44, Erick Erickson wrote: With the warning that I'm not the most experienced Lucene user in the world... I *think* that, rather than searching for each term, it's more efficient to just use IndexReader.termDocs, i.e.:

IndexReader ir = whatever;
TermDocs termDocs = ir.termDocs();
WildcardTermEnum wildEnum = whatever;
for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
    termDocs.seek(term);

This avoids the buffer space needed for each TermDocs by using each term separately. A BooleanQuery over all the terms will use termDocs.next() and termDocs.doc() for all terms at the same time. It has to, because more terms might match each document, and it has to compute the query score for each document.

    while (termDocs.next()) {
        Document doc = ir.document(termDocs.doc());

The methods termDocs.next() and reader.document() go to different places in the Lucene index (see the index format), so this will send the disk head up and down. It's better to collect the termDocs.doc() values first, for example in a BitSet, and then retrieve the Documents in numerical order. Btw., this is what ConstantScoreRangeQuery does to avoid using all terms at the same time.

    }
}

One warning, as the folks on the board have pointed out to me, is that the Hits object is not entirely efficient when you fetch lots of docs (more than 100 has been mentioned), and you should think about TopDocs or some such. Also, if you can avoid fetching the document (i.e. get everything you want from the index) you'll add efficiency.

In other words, one can use the above BitSet in a Filter later on during an IndexSearcher.search() (or in a ConstantScoreQuery), and use Hits or TopDocs for document retrieval. Regards, Paul Elschot.
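Paul's collect-then-fetch suggestion could be sketched like this against the 1.9 API (the reader and wildcard enum setup is elided, as in Erick's snippet):

```java
import java.util.BitSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.WildcardTermEnum;

public class CollectThenFetch {
    static void run(IndexReader ir, WildcardTermEnum wildEnum) throws Exception {
        BitSet matches = new BitSet(ir.maxDoc());
        TermDocs termDocs = ir.termDocs();
        // Pass 1: walk the postings term by term, recording only doc numbers.
        for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
            termDocs.seek(term);
            while (termDocs.next()) {
                matches.set(termDocs.doc());
            }
        }
        // Pass 2: fetch stored documents in ascending doc-number order,
        // so the reads move through the index files sequentially instead
        // of bouncing between the postings and the stored fields.
        for (int doc = matches.nextSetBit(0); doc >= 0;
                doc = matches.nextSetBit(doc + 1)) {
            Document d = ir.document(doc);
            // ... use d ...
        }
    }
}
```

The same BitSet could instead back a Filter passed to IndexSearcher.search(), as Paul notes, letting Hits or TopDocs handle the retrieval.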
Re: Why is BooleanQuery.maxClauseCount static?
On Saturday 15 April 2006 18:20, Jeff Rodenburg wrote: What was the thinking behind making the BooleanQuery maxClauseCount a static? Or, I guess more to the point, why not an instance setting as well? Not trying to point out a flaw, just curious about the original thinking behind the setting. I have a situation where I have a set of BooleanQueries that use a high number of clauses, but another set that needs a low number of clauses (different indexes searched, and efficiencies dictate the high/low clause range.)

The reason is simplicity in dealing with the case of a single BooleanQuery using many terms. This was done to avoid spurious OutOfMemory problems for queries that happen to expand to a lot of terms, and for that it works well. With nested BooleanQueries it wouldn't even make sense to have an instance setting, because in that case the maximum number of clauses should be associated with the top-level query only. Regards, Paul Elschot.
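To make the static nature concrete, here is a minimal sketch against the 1.9 BooleanQuery API; the value 4096 is an arbitrary example, not a recommendation:

```java
import org.apache.lucene.search.BooleanQuery;

public class MaxClauseDemo {
    public static void main(String[] args) {
        // The default limit is 1024 clauses; exceeding it throws
        // BooleanQuery.TooManyClauses when the query is rewritten/searched.
        int before = BooleanQuery.getMaxClauseCount();

        // Static setter: this affects every BooleanQuery in the JVM,
        // including queries against other indexes, which is exactly
        // why Jeff's mixed high/low-clause scenario is awkward.
        BooleanQuery.setMaxClauseCount(4096);

        System.out.println("limit raised from " + before
                + " to " + BooleanQuery.getMaxClauseCount());
    }
}
```

In a setup like Jeff's, one workaround is to raise the limit once at startup to the largest value any query set needs, since the check exists to bound memory use rather than to tune individual queries.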
Re: Using Lucene for searching tokens, not storing them.
On Saturday 15 April 2006 19:25, karl wettin wrote: [...]

1. TermPositions. It returns the next position of *what* in the document? It would make sense to me if it returned a start/end offset, but this just confuses me.

This enumerates all positions of the term in the document, as produced by the Tokenizer used by the Analyzer (as normally used by IndexWriter). The Tokenizer provides all terms as analyzed, but here only the positions of one term are enumerated. Btw., this is why the index is called an inverted term index.

2. Norms. I've been looking in other code, but I honestly don't understand what data they are storing, thus it's really hard for me to implement :-) I read it as it contains the boost of each document per field? So what does the byte represent then?

What is stored is a byte representing the inverse of the number of indexed terms in a field of a document, as returned by a Tokenizer. Regards, Paul Elschot
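To make the norm byte concrete: in stock Lucene it is a lossy one-byte float encoding of the length norm multiplied by the document and field boosts. A sketch against the 1.9 Similarity API (the field name and term count are illustrative):

```java
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

public class NormDemo {
    public static void main(String[] args) {
        Similarity sim = new DefaultSimilarity();
        // DefaultSimilarity's length norm is 1/sqrt(numTerms) for a field
        // with numTerms indexed terms; document and field boosts multiply
        // into this value at indexing time.
        float norm = sim.lengthNorm("body", 16) * 1.0f /* boosts */;
        // The float is squeezed into a single byte (a small mantissa and
        // exponent), so decoding recovers only an approximation.
        byte encoded = Similarity.encodeNorm(norm);
        float decoded = Similarity.decodeNorm(encoded);
        System.out.println(norm + " -> byte " + encoded + " -> " + decoded);
    }
}
```

So a custom IndexReader's norms(String) just has to return one such byte per document for the field; returning encodeNorm(1.0f) for everything would disable length normalization rather than break scoring.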
Re: Catching BooleanQuery.TooManyClauses
Cool, thanks for the clarification... Erick
Re: Why is BooleanQuery.maxClauseCount static?
Thanks Paul. In my case, I don't have nested queries but rather separate queries running against different indexes -- some with very high clause counts, and some with very low clause counts. These execute in a web environment in the same memory space and process, so concurrency can sometimes cause problems when both types of queries need to run simultaneously. -- j