RE: Lucene Help

2006-04-15 Thread Shajahan

Hi,
Thank you for your help. I just downloaded lucene-1.4.3 and I want to
run the demo. Could you please tell me how to run this demo file?

The demo folder contains an org folder and the files Search.html and Search.jhtml.

Thanking you,
Shajahan Shaik.
--
View this message in context: 
http://www.nabble.com/Lucene-Help-t1442764.html#a3927354
Sent from the Lucene - Java Users forum at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene Seaches VS. Relational database Queries

2006-04-15 Thread Paul Elschot
On Saturday 15 April 2006 03:36, Jeryl Cook wrote:
 
 I'm the co-worker who suggested this to Ananth (I think we have been debating
 this for 3 days now; from the post it seems he is winning :) ...)
 
 Anyway, as Ananth stated, I suggested this because I am wondering if Lucene
 could solve a bottleneck query that is taking a deathly long time to
 complete (read-only). The original design actually generated 60+ threaded
 queries on the database to return results per user thread that hits our
 website for this view... I know that this will kill our server when
 user load increases... I know that Lucene is built for speed and can handle a
 very large number of people searching (we are using a singleton Searcher), and

One way to have more queries per second with a singleton Searcher is
by merging the retrievals of documents for multiple queries.
This will increase query throughput (less disk head movement) but it will
also increase the response time for the individual queries.

 the (threaded)results will be the hits returned from lucene.. , also this
 query will NOT be executed by any user in a text field , but rather in our
 application code only when user selects differnt parts of the site...if all
 values in this 1:n relationship we are trying to query in lucene then the
 application-provided query will return accurate results.  

To follow 1:n relationships, avoid using Hits; use your own HitCollector
instead. From application code, try to use TermDocs from the index
reader.
 
 we are using Quartz, and not creating threads in servlets...
 
 FINAL SOLUTION MAYBE?:
 if our client EVER gives us a requirement that says we must have accurate
 text searching even if something in our index has a 1:n Jason and Jason
 Black relationship, then we should just simply say we cannot implement this
 because the Lucene search will yield inaccurate results, correct???
 
 comments?

Assuming I understand the problem correctly, one can solve this by
indexing such fields twice: once as a keyword to search for the specific
individual, and once with indexed terms to search for name(s).
In both fields one could use an extra word from a relational db,
for example a client id.

Regards,
Paul Elschot






Re: Catching BooleanQuery.TooManyClauses

2006-04-15 Thread Erick Erickson
With the warning that I'm not the most experienced Lucene user in the
world...

I *think* that, rather than searching for each term, it's more efficient to
just use IndexReader.termDocs(), i.e.

IndexReader ir = whatever;
TermDocs termDocs = ir.termDocs();
WildcardTermEnum wildEnum = whatever;

for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
  termDocs.seek(term);
  while (termDocs.next()) {
    Document doc = ir.document(termDocs.doc());
  }
}

I know that for loop looks odd, but I just peeked at the source code for the
TermEnum classes and saw why it works.

One warning, as the folks on the board have pointed out to me, is that the
Hits object is not entirely efficient when you fetch lots of docs (more than
100 has been mentioned); you should think about TopDocs or some such.

Also, if you can avoid fetching the document (i.e. get everything you want
from the index) you'll add efficiency. I have no clue how much you're
returning to the user, so I don't know whether that would work for you.

Hope this helps
Erick

P.S. I feel kind of odd writing things like this given that Chris, Yonik,
Erik  etc. are looking over my shoulder, but if I actually offer good
advice, maybe I can save them some time since they've certainly helped me
out. And if they make alternate suggestions, they'll be doing code reviews
for me! Cool! G


Re: Lucene Help

2006-04-15 Thread Erick Erickson
What I did was create a project from existing source in Eclipse (gave it
the path to the demo folder), imported the Lucene jar file and ran the
application. As far as I can tell, the only required library is the Lucene
jar file (I was using 1.9, but that shouldn't matter).

I freely admit that the things I don't know about building Java applications
are many, but if you're building other Java applications, this should follow
a familiar pattern and build easily in whatever your favorite development
environment is.

Best
Erick


Why is BooleanQuery.maxClauseCount static?

2006-04-15 Thread Jeff Rodenburg
What was the thinking behind making the BooleanQuery maxClauseCount a
static?  Or, I guess more to the point, why not an instance setting as well?

Not trying to point out a flaw, just curious about the original thinking
behind the setting.  I have a situation where I have a set of BooleanQueries
that use a high number of clauses, but another set that needs a low number
of clauses (different indexes searched, and efficiencies dictate the
high/low clause range.)


cheers,
jeff


We are looking for Lucene Developer in Pune-India

2006-04-15 Thread satish
Hello,

We are looking to add a Lucene/J2EE developer to our core engineering
team at Betterlabs, Pune, India. Interested candidates can send a resume to
[EMAIL PROTECTED]

Regards
Satish





java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock

2006-04-15 Thread Puneet Lakhina
Hi all,
I am very new to Lucene. I am using it in my application to index and search
through text files, and my program is more or less similar to the demo
provided with the Lucene distribution.
Initially everything was working fine without any problems, but today while
running the application I have been getting this exception

java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-
dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock

whenever I try to read or write to the index. I am unable to understand why
this is happening. Is there some mistake I am making in the code? I haven't
changed any code, and it was working smoothly up until today!!!

My version of lucene is 1.9.1

I deleted the index directory and tried again, and voila, now it works again!!
But since I am going to be delivering my application, I would really like to
know why this was happening, to guard against it.

Thanks
--
Puneet


Re: java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock

2006-04-15 Thread Raghavendra Prabhu
You are creating two IndexWriters on the same directory.

I guess that is the reason for the problem: one of them holds the lock.

Rgds
Prabhu
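Prabhu's diagnosis can be illustrated without Lucene itself. Lucene 1.x guards an index directory with a write.lock file, and a second IndexWriter on the same directory cannot obtain the lock and eventually throws the "Lock obtain timed out" IOException. The sketch below uses plain java.nio file locking as an analogy (Lucene's own lock is actually a lock file whose existence is the lock, not an NIO FileLock); it only shows the same two-writers-one-lock pattern:

```java
// Analogy only: Lucene 1.x uses a write.lock file rather than java.nio
// locking, but the mutual exclusion is the same in spirit: whichever
// "writer" takes the lock first wins; the second one is refused.
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

public class WriteLockDemo {
    public static void main(String[] args) throws Exception {
        File lockFile = File.createTempFile("lucene-demo-write", ".lock");
        lockFile.deleteOnExit();

        RandomAccessFile first = new RandomAccessFile(lockFile, "rw");
        FileLock held = first.getChannel().tryLock();     // "first IndexWriter"
        System.out.println("first writer holds lock: " + (held != null));

        RandomAccessFile second = new RandomAccessFile(lockFile, "rw");
        FileLock denied;
        try {
            denied = second.getChannel().tryLock();       // "second IndexWriter"
        } catch (OverlappingFileLockException e) {
            denied = null; // within one JVM the overlap is detected immediately
        }
        System.out.println("second writer got lock: " + (denied != null));

        held.release();
        first.close();
        second.close();
    }
}
```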





Re: Using Lucene for searching tokens, not storing them.

2006-04-15 Thread karl wettin


14 apr 2006 kl. 18.31 skrev Doug Cutting:


karl wettin wrote:
I would like to store it all in my application rather than using the
Lucene persistency mechanism for tokens. I only want the search
mechanism. I do not need the IndexReader and IndexWriter, as those
will be a natural part of my application. I only want to use the
Searchable.


Implement the IndexReader API, overriding all of the abstract
methods. That will enable you to search your index using Lucene's
search code.


This was not even half as tough as I thought it would be. I'm however
not certain about a couple of methods:


1. TermPositions. It returns the next position of *what* in the
document? It would make sense to me if it returned a start/end
offset, but this just confuses me.


implements TermPositions {
    /** Returns next position in the current document.  It is an
     *  error to call this more than {@link #freq()} times
     *  without calling {@link #next()}. <p> This is
     *  invalid until {@link #next()} is called for the first time.
     */
    public int nextPosition() throws IOException {
        return 0; // todo
    }


2. Norms. I've been looking in other code, but I honestly don't
understand what data they are storing, thus it's really hard for me
to implement :-) I read it as containing the boost of each document
per field? So what does the byte represent then?


    /** Returns the byte-encoded normalization factor for the named field of
     *  every document.  This is used by the search code to score documents.
     *
     *  @see org.apache.lucene.document.Field#setBoost(float)
     */
    public byte[] norms(String field) {
        return null; // todo
    }

    /** Reads the byte-encoded normalization factor for the named field of
     *  every document.  This is used by the search code to score documents.
     *
     *  @see org.apache.lucene.document.Field#setBoost(float)
     */
    public void norms(String field, byte[] bytes, int offset) throws IOException {
        // todo
    }

    /** Implements setNorm in subclass. */
    protected void doSetNorm(int doc, String field, byte value) throws IOException {
        // todo
    }

3. I presume I can just ignore the following methods:

    /** Implements deletion of the document numbered <code>docNum</code>.
     *  Applications should call {@link #delete(int)} or
     *  {@link #delete(org.apache.lucene.index.Term)}.
     */
    protected void doDelete(int docNum) {
    }

    /** Implements actual undeleteAll() in subclass. */
    protected void doUndeleteAll() {
    }

    /** Implements commit. */
    protected void doCommit() {
    }

    /** Implements close. */
    protected void doClose() {
    }




Re: java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-dcc982e203ef1d2aebb5d8a4b55b3a60-write.lock

2006-04-15 Thread karl wettin
Could it just be that the application was not shut down properly? If  
you dare, check for locks and remove them when you start your  
application.


Note that both IndexReader and IndexWriter can produce a write-lock.
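karl's suggestion can be sketched as a startup step. This assumes you are certain no other process has the index open (otherwise deleting the lock can corrupt the index); the class and method names below are hypothetical, and it relies only on Lucene 1.9's default of keeping lock files named lucene-*-write.lock or lucene-*-commit.lock in java.io.tmpdir:

```java
// Sketch of karl's advice, under the assumption that no other process is
// using the index: remove leftover lucene-*.lock files at startup so a
// stale lock from a crashed run does not block the new IndexWriter.
import java.io.File;

public class StaleLockCleaner {
    /** Deletes lucene-*.lock files in the given directory; returns how many were removed. */
    public static int cleanStaleLocks(File lockDir) {
        int removed = 0;
        File[] files = lockDir.listFiles();
        if (files == null) return 0;
        for (File f : files) {
            String name = f.getName();
            if (name.startsWith("lucene-") && name.endsWith(".lock") && f.delete()) {
                removed++;
            }
        }
        return removed;
    }

    public static void main(String[] args) throws Exception {
        // Demo against a throwaway directory, not a real lock dir.
        File dir = new File(System.getProperty("java.io.tmpdir"),
                            "lockdemo-" + System.nanoTime());
        dir.mkdirs();
        new File(dir, "lucene-abc123-write.lock").createNewFile();
        new File(dir, "notalock.txt").createNewFile();  // must survive
        System.out.println("removed: " + cleanStaleLocks(dir));
    }
}
```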





Re: Catching BooleanQuery.TooManyClauses

2006-04-15 Thread Paul Elschot
On Saturday 15 April 2006 13:44, Erick Erickson wrote:
 With the warning that I'm not the most experienced Lucene user in the
 world...
 
 I *think* that, rather than searching for each term, it's more efficient to
 just use IndexReader.termDocs(), i.e.
 
 IndexReader ir = whatever;
 TermDocs termDocs = ir.termDocs();
 WildcardTermEnum wildEnum = whatever;
 
 for (Term term = null; (term = wildEnum.term()) != null; wildEnum.next()) {
   termDocs.seek(term);

This avoids the buffer space needed for each TermDocs by using each term
separately. A BooleanQuery over all the terms will use termDocs.next() and
termDocs.doc() for all terms at the same time. It has to, because more than
one term might match each document, and it has to compute the query score
for each document.

   while (termDocs.next()) {
     Document doc = ir.document(termDocs.doc());

The methods termDocs.next() and reader.document()
go to different places in the Lucene index (see the index format),
so this will send the disk head up and down.
It's better to collect the termDocs.doc() values first, for example in a
BitSet, and then retrieve the Documents in numerical order.
By the way, this is what ConstantScoreRangeQuery does to avoid using all
terms at the same time.
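The collect-then-fetch pattern Paul describes can be sketched with plain java.util.BitSet (the hard-coded ids stand in for doc ids coming back from TermDocs; no Lucene calls here):

```java
// Sketch of collect-then-fetch: matching doc ids go into a BitSet first
// (in whatever order the terms yield them), then documents are visited in
// ascending id order, which reads the index sequentially instead of
// bouncing the disk head around.
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class CollectThenFetch {
    public static void main(String[] args) {
        int[] hitsInTermOrder = {42, 7, 42, 19, 7};   // ids may repeat across terms
        BitSet matches = new BitSet();
        for (int doc : hitsInTermOrder) {
            matches.set(doc);                          // dedupes for free
        }
        // Second pass: visit ids in ascending order.
        List<Integer> fetchOrder = new ArrayList<>();
        for (int doc = matches.nextSetBit(0); doc >= 0; doc = matches.nextSetBit(doc + 1)) {
            fetchOrder.add(doc);                       // reader.document(doc) would go here
        }
        System.out.println(fetchOrder);                // prints [7, 19, 42]
    }
}
```

The BitSet also doubles as the input to a Filter, which is exactly the follow-up Paul makes below about reusing it during IndexSearcher.search().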

   }
 }
 
 I know that for loop looks odd, but I just peeked at the source code for the
 TermEnum classes and saw why it works.
 
 One warning, as the folks on the board have pointed out to me, is that the
 Hits object is not entirely efficient when you fetch lots of docs (more than
 100 has been mentioned); you should think about TopDocs or some such.
 
 Also, if you can avoid fetching the document (i.e. get everything you want
 from the index) you'll add efficiency. I have no clue how much you're
 returning to the user, so I don't know whether that would work for you.

In other words, one can use the above BitSet in a Filter later on
during an IndexSearcher.search() (or in a ConstantScoreQuery),
and use Hits or TopDocs for document retrieval.

Regards,
Paul Elschot.




Re: Why is BooleanQuery.maxClauseCount static?

2006-04-15 Thread Paul Elschot
On Saturday 15 April 2006 18:20, Jeff Rodenburg wrote:
 What was the thinking behind making the BooleanQuery maxClauseCount a
 static?  Or, I guess more to the point, why not an instance setting as well?
 
 Not trying to point out a flaw, just curious about the original thinking
 behind the setting.  I have a situation where I have a set of BooleanQueries
 that use a high number of clauses, but another set that needs a low number
 of clauses (different indexes searched, and efficiencies dictate the
 high/low clause range.)

The reason is simplicity in dealing with the case of a single
BooleanQuery using many terms. This was done to avoid spurious
OutOfMemory problems for queries that happen to expand to a lot
of terms, and for that it works well.

With nested BooleanQuerys it wouldn't even make sense to have an
instance setting, because in that case the maximum number of clauses
should be associated with the top-level query only.

Regards,
Paul Elschot.




Re: Using Lucene for searching tokens, not storing them.

2006-04-15 Thread Paul Elschot
On Saturday 15 April 2006 19:25, karl wettin wrote:
 
 14 apr 2006 kl. 18.31 skrev Doug Cutting:
 
  karl wettin wrote:
  I would like to store it all in my application rather than using the
  Lucene persistency mechanism for tokens. I only want the search
  mechanism. I do not need the IndexReader and IndexWriter, as those
  will be a natural part of my application. I only want to use the
  Searchable.
 
  Implement the IndexReader API, overriding all of the abstract  
  methods. That will enable you to search your index using Lucene's  
  search code.
 
 This was not even half as tough as I thought it would be. I'm however
 not certain about a couple of methods:
 
 1. TermPositions. It returns the next position of *what* in the
 document? It would make sense to me if it returned a start/end
 offset, but this just confuses me.
 
 implements TermPositions {
     /** Returns next position in the current document.  It is an
      *  error to call this more than {@link #freq()} times
      *  without calling {@link #next()}. <p> This is
      *  invalid until {@link #next()} is called for the first time.
      */
     public int nextPosition() throws IOException {
         return 0; // todo
     }

This enumerates all positions of the term in the document,
as returned by the Tokenizer used by the Analyzer (as normally
used by IndexWriter). The Tokenizer provides all terms as
analyzed, but here only the positions of one term are enumerated.
By the way, this is why the index is called an inverted term index.

 
 
 2. Norms. I've been looking in other code, but I honestly don't  
 understand what data they are storing, thus it's really hard for me  
 to implement :-) I read it as it contains the boost of each document  
 per field? So what does the byte represent then?

What is stored is a byte representing the inverse of the number of
indexed terms in a field of a document, as returned by a Tokenizer.
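As a rough illustration of what that byte holds: the default length norm is 1/sqrt(numberOfIndexedTerms), optionally multiplied by the field boost, then squeezed into one byte. The encoding below is a simplified stand-in, not Lucene's actual Similarity.encodeNorm (which uses a 3-bit-mantissa/5-bit-exponent minifloat):

```java
// Illustrative sketch: longer fields get smaller norms, so each term match
// in a long field contributes less to the score. The one-byte encoding here
// is deliberately crude; Lucene's real encoding is Similarity.encodeNorm.
public class NormSketch {
    /** Default-style length norm: 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Crude 0..255 encoding, for illustration only. */
    static int encodeToByteValue(float norm) {
        return Math.min(255, Math.round(norm * 255));
    }

    public static void main(String[] args) {
        System.out.println("4-term field norm:   " + lengthNorm(4));    // 0.5
        System.out.println("100-term field norm: " + lengthNorm(100));  // 0.1
        System.out.println("0.5 encodes to byte value " + encodeToByteValue(lengthNorm(4)));
    }
}
```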

Regards,
Paul Elschot




Re: Catching BooleanQuery.TooManyClauses

2006-04-15 Thread Erick Erickson
Cool, thanks for the clarification...

Erick


Re: Why is BooleanQuery.maxClauseCount static?

2006-04-15 Thread Jeff Rodenburg
Thanks Paul.  In my case, I don't have nested queries but rather separate
queries running against different indexes -- some with very high clause
counts, and some with very low clause counts.  These are executing in a web
environment with the same memory space and process, so concurrency can
sometimes cause problems when both types of queries need to execute
simultaneously.

-- j
