Re: status of LARM project

2004-04-28 Thread Otis Gospodnetic
Kelvin is correct.
A few years ago there were no quality open source crawlers available. 
There are now a number of very good ones.  Archive.org's crawler is
available, there is Larbin, Nutch, etc.
LARM works, it's just not maintained any more.

Otis

--- Kelvin Tan [EMAIL PROTECTED] wrote:
 As far as I know, LARM is defunct. I read somewhere, perhaps
 apocryphal, that
 Clemens got a job which wasn't supportive of his continued
 development on LARM.
 AFAIK there aren't any other active developers of LARM (at least at
 the time it
 branched off to SF).
 
 Otis recently posted to use Nutch instead of LARM.
 
 Kelvin
 
 On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
  Hi
 
  I have looked at the LARM websites and I get conflicting results.
 
  http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
  It says that development has stopped for this project.
 
  LARM hosted on sourceforge.
  The last message in the mailing list was dated 2003. Is it still
  supported and active?
 
  LARM hosted on apache.
  It says the project has moved to sourceforge.
 
  Can anyone here who is active in LARM comment on its status?
 
  Regards
 
  Sebastian Ho
 
 
 
 -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 
 
 





RE: ArrayIndexOutOfBoundsException

2004-04-28 Thread Phil brunet
Hi.

I had this problem when I transferred a Lucene index by FTP in ASCII mode.
Using binary mode, I never had such a problem.

Philippe

From: James Dunn [EMAIL PROTECTED]
Reply-To: Lucene Users List [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: ArrayIndexOutOfBoundsException
Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)
Hello all,

I have a web site whose search is driven by Lucene
1.3.  I've been doing some load testing using JMeter
and occasionally I will see the exception below when
the search page is under heavy load.
Has anyone seen similar errors during load testing?

I've seen some posts with similar exceptions and the
general consensus is that this error means that the
index is corrupt.  I'm not sure my index is corrupt
however.  I can run all the queries I use for load
testing under normal load and I don't appear to get
this error.
Is there any way to verify that a Lucene index is
corrupt or not?
Thanks,

Jim

java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
at java.util.Vector.elementAt(Vector.java:431)
at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
at
org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
at
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
at
org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
at
org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
at
org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
at
org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
at
org.apache.lucene.search.Hits.doc(Hits.java:130)







Re: Segments file get deleted?!

2004-04-28 Thread Surya Kiran
Hi, thanks for the reply. I got that error in my previous build; now I don't
see it at all.
Also, I wasn't able to retain the log. I will definitely come back if I see
it again.
Anyway below is my machine config:

Windows XP Personal Ed., 512MB, P4.
My app server is Resin 2.1.12

I will definitely come up with more details when I get it again. Thanks again.

Surya

- Original Message - 
From: Nader S. Henein [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, April 26, 2004 12:42 PM
Subject: RE: Segments file get deleted?!


Can you give us a bit of background? We've been using Lucene since the first
stable release 2 years ago, and I've never had segments disappear on me.
First, can you provide some background on your setup? Second, when you say
a certain period of time, how much time are we talking about here, and does
that interval coincide with your indexing schedule? You may have the create
flag on the IndexWriter set to true, so it simply recreates the index at
every update and deletes whatever was there. Of course, if there are no
files to index at any point, it will just give you a blank index.


Nader Henein

-Original Message-
From: Surya Kiran [mailto:[EMAIL PROTECTED]
Sent: Monday, April 26, 2004 7:48 AM
To: [EMAIL PROTECTED]
Subject: Segments file get deleted?!


Hi all, we have implemented our portal search using Lucene. It works fine,
but after a certain period of time the Lucene segments file gets deleted.
Eventually all searches fail. Can anyone guess where the error could be?

Thanks a lot.

Regards
Surya.








Count for a keyword occurance in a file

2004-04-28 Thread hemal bhatt
Hi,

How can I get at the counts behind the score given by Hits.score()?
That is, I want to know how many times a keyword occurs in a file.
Any help on this would be appreciated.
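If the field was indexed normally (with term frequencies), the per-document counts can be read directly from the index rather than derived from the score. A minimal sketch against the Lucene 1.3-era API; the field name "contents" and the index path are assumptions for illustration:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermFreqExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        // Walk the posting list for the term; freq() is the number of
        // times the term occurs in each matching document.
        TermDocs termDocs = reader.termDocs(new Term("contents", "keyword"));
        while (termDocs.next()) {
            System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq());
        }
        termDocs.close();
        reader.close();
    }
}
```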
  
regards
Hemal Bhatt




[Lucene] XML Indexing

2004-04-28 Thread Samuel Tang
XMLIndexingDemo seems unable to index traditional Chinese characters. I can only
search for English text, not Chinese. In fact, my XML document contains both
Chinese and English text. How can I fix this problem? Is it necessary for me to
convert the Chinese characters from BIG5 to UTF-8 before doing the file indexing?
If it is, then how can we do it? This problem doesn't happen when indexing
bilingual HTML files (Chinese & English) with the Lucene demo HTML parser.
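If the source files are in BIG5, one approach is to decode them with Java's "Big5" charset and re-encode as UTF-8 before indexing (or feed the decoded Reader straight to the analyzer). A minimal stdlib-only sketch; the class and file names are placeholders:

```java
import java.io.*;

public class Big5ToUtf8 {
    // Convert a Big5-encoded text file to UTF-8 so downstream code that
    // assumes UTF-8 (or the platform default) sees the characters intact.
    public static void convert(File in, File out) throws IOException {
        Reader r = new BufferedReader(
            new InputStreamReader(new FileInputStream(in), "Big5"));
        Writer w = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(out), "UTF-8"));
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            w.write(buf, 0, n);
        }
        r.close();
        w.close();
    }
}
```

Note that even with correct decoding, the analyzer's handling of CJK text determines what is searchable; a CJK-aware Analyzer may still be needed.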



Combining text search + relational search

2004-04-28 Thread Mike_Belasco




I need to somehow allow users to do a text search and query relational
database attributes at the same time. The attributes are basically metadata
about the documents that the text search will be performed on. I have the
text of the documents indexed in Lucene. Does anyone have any advice or
examples? I also need to make sure I don't use up all the memory on our
server.

Thanks
Mike





RE: Re-associate a token with its source

2004-04-28 Thread Olaia Vázquez Sánchez
Thank you, but I think I didn't explain my problem clearly enough.

I have four positions (top, bottom, right and left) for each of the words
of the document, so I would have to store in the index the content of
the page with the positions interleaved:

org.apache.lucene.document.Field#UnIndexed("content", "house 1142 1231 3212
2214 dog 2213 2432 3214 2134 ...")

In order to get the values after a search, I would need to parse the returned
document to find the positions next to the searched word. I have seen that
the Token class has four properties (beginColumn, beginLine, endColumn and
endLine), and I don't know if it is possible to use them to store the
position I want for each token.

I think this approach is not the correct one so any help on this would be
appreciated.


Olaia.

-Original Message-
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 27 April 2004 21:46
To: Lucene Users List
Subject: Re: Re-associate a token with its source

When indexing, use UnIndexed fields to store this data in your document.

org.apache.lucene.document.Field#UnIndexed(String name, String value) 

Add the fields using:
org.apache.lucene.document.Document.add(Field)

After your search, you can get the field value from:
Document Hits.doc(int)

You can retrieve your store values using 
String Document.get(String name) 

HTH,
sv
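Putting those steps together, a rough sketch (field names, paths, and the query are illustrative assumptions, not from the original mail):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class StoredFieldExample {
    public static void main(String[] args) throws Exception {
        // Index one document: the text is indexed, the position data
        // is stored verbatim (UnIndexed = stored but not searchable).
        IndexWriter writer =
            new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Text("contents", "house dog"));
        doc.add(Field.UnIndexed("positions", "1142 1231 3212 2214"));
        writer.addDocument(doc);
        writer.close();

        // Search, then read the stored value back from the hit.
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Hits hits = searcher.search(new TermQuery(new Term("contents", "house")));
        if (hits.length() > 0) {
            System.out.println(hits.doc(0).get("positions"));
        }
        searcher.close();
    }
}
```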

On Tue, 27 Apr 2004, Olaia Vázquez Sánchez wrote:

 Hello
 
  
 
 I have documents in XML in which, for each word, I have four positions (top,
 bottom, left and right) that would let me highlight that word in a JPEG
 image. I want to index these XML documents and highlight the query results
 in the image, so I need to store these positions for each word inside the
 index.
 
 I was researching how I can use the Token fields to store these
 attributes, but I didn't find any example where these fields were used.
 
  
 
 Thanks,
 
  
 
 Olaia Vázquez
 
 







Re: Combining text search + relational search

2004-04-28 Thread Stephane James Vaucher
I'm a bit confused why you want this.

As far as I know, relational DB searches will return exact
matches without a measure of relevancy. To measure relevancy, you need a
search engine. For your results to be coherent, you would have to put
everything in the Lucene index.

As for memory consumption: for searching, if the index is on disk, then
the memory footprint depends on the type of queries you use. For indexing,
it depends on whether you use a temporary RAMDirectory to do merges;
otherwise, memory consumption is minimal.

HTH
sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:


 I need to somehow allow users to do a text search and query relational
 database attributes at the same time. The attributes are basically metadata
 about the documents that the text search will be performed on. I have the
 text of the documents indexed in Lucene. Does anyone have any advice or
 examples? I also need to make sure I don't use up all the memory on our
 server.

 Thanks
 Mike








Re: Read past EOF and negative bufferLength problem (1.4 rc2)

2004-04-28 Thread Joe Berkovitz
Daniel,

Everything works fine with the latest CVS version of Lucene. It looks
like the bug I hit was the one you referenced in your email, which
is now fixed.

Thanks for your help.

.   ..  . ...joe

Daniel Naber wrote:

Am Dienstag, 27. April 2004 21:00 schrieb Joe Berkovitz:

 

Using Lucene 1.4 rc2 I've run into a fatal problem:
   

Could you try with the latest version from CVS? Several severe problems have 
been fixed, but I'm not sure if yours was one of them. Also see
http://issues.apache.org/bugzilla/show_bug.cgi?id=27587
 





Re: Combining text search + relational search

2004-04-28 Thread Mike_Belasco




Basically I want to limit the results of the text search by the rows that
are returned in a relational search of other attribute data related to the
document. The text of the document is just like any other attribute; it just
needs to be queried differently. Does that make sense?

Thanks
Mike






   
From: Stephane James Vaucher <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Date: 04/28/2004 10:38 AM
Subject: Re: Combining text search + relational search




I'm a bit confused why you want this.

As far as I know, relational DB searches will return exact
matches without a measure of relevancy. To measure relevancy, you need a
search engine. For your results to be coherent, you would have to put
everything in the Lucene index.

As for memory consumption: for searching, if the index is on disk, then
the memory footprint depends on the type of queries you use. For indexing,
it depends on whether you use a temporary RAMDirectory to do merges;
otherwise, memory consumption is minimal.

HTH
sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:


 I need to somehow allow users to do a text search and query relational
 database attributes at the same time. The attributes are basically metadata
 about the documents that the text search will be performed on. I have the
 text of the documents indexed in Lucene. Does anyone have any advice or
 examples? I also need to make sure I don't use up all the memory on our
 server.

 Thanks
 Mike













Re: Combining text search + relational search

2004-04-28 Thread Otis Gospodnetic
Create a Lucene index from the data in the DB, and make sure to include PKs
in one of the fields (use Field.Keyword).
Then query your RDBMS and get back the ResultSet.
Then take the PK from each ResultSet row and use it to construct a Lucene
BooleanQuery, which should include your original query string AND the
returned PKs combined with OR.

That is, if I understand what you are trying to do :)

Otis
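A sketch of the query combination Otis describes, using the Lucene 1.3-era add(query, required, prohibited) API; the "pk" field name is an assumption:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PkFilterQuery {
    // Build: textQuery AND (pk1 OR pk2 OR ...)
    public static Query combine(Query textQuery, String[] pks) {
        BooleanQuery pkClause = new BooleanQuery();
        for (int i = 0; i < pks.length; i++) {
            // optional, not prohibited: any one PK may match (OR)
            pkClause.add(new TermQuery(new Term("pk", pks[i])), false, false);
        }
        BooleanQuery combined = new BooleanQuery();
        combined.add(textQuery, true, false); // required (AND)
        combined.add(pkClause, true, false);  // required (AND)
        return combined;
    }
}
```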


--- [EMAIL PROTECTED] wrote:
 
 
 
 
 Basically I want to limit the results of the text search by the rows
 that
 are returned in a relational search of other attribute data related
 to the
 document. The text of the document is just like any other attribute
 it just
 needs to be queried differently. Does that make sense?
 
 Thanks
 Mike
 
 
 
 
 
 
  
  
 From: Stephane James Vaucher <[EMAIL PROTECTED]>
 To: Lucene Users List <[EMAIL PROTECTED]>
 Date: 04/28/2004 10:38 AM
 Subject: Re: Combining text search + relational search
 
 
 
 
 I'm a bit confused why you want this.
 
 As far as I know, relational DB searches will return exact
 matches without a measure of relevancy. To measure relevancy, you need a
 search engine. For your results to be coherent, you would have to put
 everything in the Lucene index.
 
 As for memory consumption: for searching, if the index is on disk, then
 the memory footprint depends on the type of queries you use. For indexing,
 it depends on whether you use a temporary RAMDirectory to do merges;
 otherwise, memory consumption is minimal.
 
 HTH
 sv
 
 On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:
 
 
  I need to somehow allow users to do a text search and query relational
  database attributes at the same time. The attributes are basically metadata
  about the documents that the text search will be performed on. I have the
  text of the documents indexed in Lucene. Does anyone have any advice or
  examples? I also need to make sure I don't use up all the memory on our
  server.
 
  Thanks
  Mike
 
 
 
 
 
 
 
 
 
 
 
 





RE: ArrayIndexOutOfBoundsException

2004-04-28 Thread James Dunn
Philippe, thanks for the reply.  I didn't FTP my index
anywhere, but your response does make it seem that my
index is in fact corrupted somehow.

Does anyone know of a tool that can verify the
validity of a Lucene index, and/or possibly repair it?
If not, does anyone have any idea how difficult it
would be to write one?

Thanks,

Jim 
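In the absence of a dedicated tool, a crude smoke test is to try loading every stored document: corruption in the stored fields typically surfaces as an exception like the one above. A rough sketch against the Lucene 1.3-era API, not a real integrity checker:

```java
import org.apache.lucene.index.IndexReader;

public class IndexCheck {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(args[0]);
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (!reader.isDeleted(i)) {
                // throws if the stored fields for this doc are damaged
                reader.document(i);
            }
        }
        reader.close();
        System.out.println("all documents readable");
    }
}
```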

--- Phil brunet [EMAIL PROTECTED] wrote:
 
 Hi.
 
 I had this problem when I transferred a Lucene index
 by FTP in ASCII mode.
 Using binary mode, I never had such a problem.
 
 Philippe
 
 From: James Dunn [EMAIL PROTECTED]
 Reply-To: Lucene Users List
 [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: ArrayIndexOutOfBoundsException
 Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)
 
 Hello all,
 
 I have a web site whose search is driven by Lucene
 1.3.  I've been doing some load testing using
 JMeter
 and occassionally I will see the exception below
 when
 the search page is under heavy load.
 
 Has anyone seen similar errors during load testing?
 
 I've seen some posts with similar exceptions and
 the
 general consensus is that this error means that the
 index is corrupt.  I'm not sure my index is corrupt
 however.  I can run all the queries I use for load
 testing under normal load and I don't appear to get
 this error.
 
 Is there any way to verify that a Lucene index is
 corrupt or not?
 
 Thanks,
 
 Jim
 
 java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
  at
 java.util.Vector.elementAt(Vector.java:431)
  at

org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
  at

org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
  at

org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
  at

org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
  at

org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
  at

org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
  at

org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
  at
 org.apache.lucene.search.Hits.doc(Hits.java:130)
 
 
 
 
 
 
 

 
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 









'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
I've noticed this really strange problem on one of our boxes.  It's
happened twice already.

We have indexes where, when Lucene starts, it says 'Lock obtain timed out'
... however NO locks exist for the directory.

There are no other processes present and no locks in the index dir or /tmp.

Is there any way to figure out what's going on here?

Looking at the index it seems just fine... but this is only a brief
glance.  I was hoping that if it was corrupt (which I don't think it is)
Lucene would give me a better error than 'Lock obtain timed out'.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





RE: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread ANarayan
It is possible that a previous operation on the index left the lock behind.
Leaving the IndexWriter or IndexReader open without closing them (in a
finally block) could cause this.

Anand
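The pattern Anand describes, sketched (the index path and create flag are assumptions):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class SafeIndexing {
    // Always release the writer (and with it the write lock) in a
    // finally block, even if addDocument throws.
    public static void addDoc(String indexDir, Document doc) throws Exception {
        IndexWriter writer = null;
        try {
            writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            writer.addDocument(doc);
        } finally {
            if (writer != null) {
                writer.close(); // releases the write lock
            }
        }
    }
}
```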

-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 28, 2004 2:57 PM
To: Lucene Users List
Subject: 'Lock obtain timed out' even though NO locks exist... 

I've noticed this really strange problem on one of our boxes.  It's 
happened twice already.

We have indexes where, when Lucene starts, it says 'Lock obtain timed out'
... however NO locks exist for the directory. 

There are no other processes present and no locks in the index dir or /tmp.

Is there any way to figure out what's going on here?

Looking at the index it seems just fine... But this is only a brief 
glance.  I was hoping that if it was corrupt (which I don't think it is) 
that lucene would give me a better error than Lock obtain timed out

Kevin






Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread James Dunn
Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  

You can also pass lucene a system property to increase
the lock timeout interval, like so:

-Dorg.apache.lucene.commitLockTimeout=60000

or 

-Dorg.apache.lucene.writeLockTimeout=60000

The above sets the timeout to one minute.

Hope this helps,

Jim

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
 I've noticed this really strange problem on one of
 our boxes.  It's 
 happened twice already.
 
 We have indexes where, when Lucene starts, it says
 'Lock obtain timed out' 
 ... however NO locks exist for the directory. 
 
 There are no other processes present and no locks in
 the index dir or /tmp.
 
 Is there any way to figure out what's going on here?
 
 Looking at the index it seems just fine... But this
 is only a brief 
 glance.  I was hoping that if it was corrupt (which
 I don't think it is) 
 that lucene would give me a better error than Lock
 obtain timed out
 
 Kevin
 
 
 











Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
[EMAIL PROTECTED] wrote:

It is possible that a previous operation on the index left the lock open.
Leaving the IndexWriter or Reader open without closing them ( in a finally
block ) could cause this.
 

Actually this is exactly the problem... I ran some single-index tests
and a single process reads from it fine.

The problem is that we were running under Tomcat with different webapps for
testing and didn't run into this problem before.  We had an 11G index
that just took a while to open, and during this open Lucene was creating
a lock.

I wasn't sure that Tomcat was multithreading this, so maybe it is and
it's just taking longer to obtain the lock in some situations.

Kevin

--

Please reply using PGP.

   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster





lucene applicability and performance

2004-04-28 Thread Greg Conway
Hello.  Apologies if this has come up before, I'm new to the list and
didn't see anything in the archives that exactly matched my situation.

I am considering using Lucene to index and search a large collection of
small documents in a specialized domain -- probably only a few
thousand unique terms spanning anywhere from one million to ten
million small source documents.  I hope to be able to get ranked search
results back in less than 400 msec.

I suspect one issue I may face is index density owing to the large
numbers of documents and relatively small vocabulary.  That, in turn,
may be a drag on query processing.  I am working on strategies to
ameliorate that somewhat but it may be difficult.

In the meantime, I'm looking for some gut reactions from the experts
before I take this to the next stage.  Can Lucene scale well to this
kind of situation?  Can I realistically hope to get anywhere near my
performance targets?  Will I have to distribute pieces of the index
across several machines,  parallelize my retrievals, and merge the
results to do so?  If so, does Lucene already support that or will I
have to develop that logic in house?  (Seems like I saw a reference
somewhere that such a feature was coming soon, but I'm not sure when or
how it will be implemented.)

Any help, tips, references, or advice would be welcome and appreciated.
Thank you!

Regards,

Greg 





Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Kevin A. Burton wrote:

Actually this is exactly the problem... I ran some single index tests 
and a single process seems to read from it.

The problem is that we were running under Tomcat with diff webapps for 
testing and didn't run into this problem before.  We had an 11G index 
that just took a while to open and during this open Lucene was 
creating a lock.
I wasn't sure that Tomcat was multithreading this so maybe it is and 
it's just taking longer to open the lock in some situations.

This is strange... after removing all the webapps (except one), Tomcat
still refuses to allow Lucene to open this index: "Lock obtain timed out".

If I open it up from the console it works just fine.  I'm only doing it
with one index, and ulimit -n is set, so it's not an open-files issue.
Memory is 1G for Tomcat.

If I figure this out I will be sure to send a message to the list.  This
is a strange one...

Kevin






Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
James Dunn wrote:

Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  
 

Yes... 1.3 and I have a script that removes the locks from both dirs... 
This is only one process so it's just fine to remove them.

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  
 

I thought about that too... I have plenty of disk space so that's not an 
issue.  Also did a chmod -R so that should work too.

You can also pass lucene a system property to increase
the lock timeout interval, like so:
-Dorg.apache.lucene.commitLockTimeout=60000

or 

-Dorg.apache.lucene.writeLockTimeout=60000
 

I'll give that a try... good idea.

Kevin






RE: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Gus Kormeier
Not sure if our installation is the same or not, but we are also using
Tomcat.
I had a similar problem last week; it occurred after Tomcat went through a
hard restart and some software errors had the website hammered.

I found the lock file in /usr/local/tomcat/temp/ using locate.
According to the README.txt, this is a directory created for the JVM within
Tomcat.  So it is a system temp directory, just inside Tomcat.

Hope that helps,
-Gus

-Original Message-
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 1:01 PM
To: Lucene Users List
Subject: Re: 'Lock obtain timed out' even though NO locks exist...


James Dunn wrote:

Which version of lucene are you using?  In 1.2, I
believe the lock file was located in the index
directory itself.  In 1.3, it's in your system's tmp
folder.  
  

Yes... 1.3 and I have a script that removes the locks from both dirs... 
This is only one process so it's just fine to remove them.

Perhaps it's a permission problem on either one of
those folders.  Maybe your process doesn't have write
access to the correct folder and is thus unable to
create the lock file?  
  

I thought about that too... I have plenty of disk space so that's not an 
issue.  Also did a chmod -R so that should work too.

You can also pass lucene a system property to increase
the lock timeout interval, like so:

-Dorg.apache.lucene.commitLockTimeout=60000

or 

-Dorg.apache.lucene.writeLockTimeout=60000
  

I'll give that a try... good idea.

Kevin






Bug in Sandbox - Berkeley DB

2004-04-28 Thread Andy Goodell
IndexReader.delete(int docid) doesn't work with the Berkeley DB
implementation of org.apache.lucene.store.Directory

This error message appears when closing an IndexReader which has a deletion:
PANIC: Invalid argument

I get this stack trace:
java.io.IOException: DB_RUNRECOVERY: Fatal error, run database recovery
   at org.apache.lucene.store.db.Block.put(Block.java:128)
   at org.apache.lucene.store.db.DbOutputStream.close(DbOutputStream.java:111)
   at org.apache.lucene.util.BitVector.write(BitVector.java:155)
   at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:162)
   at org.apache.lucene.store.Lock$With.run(Lock.java:148)
   at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:157)
   at org.apache.lucene.index.IndexReader.close(IndexReader.java:422)

Help!

- andy g

code that triggers this:
// dbdir is a working DbDirectory, docid is a search result
IndexReader read = IndexReader.open(dbdir);
read.delete(docid);
read.close();




Re: 'Lock obtain timed out' even though NO locks exist...

2004-04-28 Thread Kevin A. Burton
Gus Kormeier wrote:

Not sure if our installation is the same or not, but we are also using
Tomcat.
I had a similar problem last week, it occurred after Tomcat went through a
hard restart and some software errors had the website hammered.
I found the lock file in /usr/local/tomcat/temp/ using locate.
According to the README.txt this is a directory created for the JVM within
Tomcat.  So it is a system temp directory, just inside Tomcat.
 

Man... you ROCK!  I didn't even THINK of that... Hmm... I wonder if we
should include the name of the lock file in the exception message within
Lucene.  That would probably have saved me a lot of time :)

Either that or we can put this in the wiki

Kevin






Re: lucene applicability and performance

2004-04-28 Thread Ype Kingma
Greg,

On Wednesday 28 April 2004 21:44, Greg Conway wrote:
 Hello.  Apologies if this has come up before, I'm new to the list and
 didn't see anything in the archives that exactly matched my situation.

It has, but each situation is different. Try this:
http://jakarta.apache.org/lucene/docs/benchmarks.html

 I am considering using Lucene to index and search a large collection of
 small documents in a  specialized domain -- probably only a few

 thousand unique terms spanning anywhere from one million to ten
 million small source documents.  I hope to be able to get ranked search
 results back in less than 400 msec.

 I suspect one issue I may face is index density owing to the large
 numbers of documents and relatively small vocabulary.  That, in turn,
 may be a drag on query processing.  I am working on strategies to
 ameliorate that somewhat but it may be difficult.

A text search engine is your best bet in this situation.

 In the meantime, I'm looking for some gut reactions from the experts
 before I take this to the next stage.  Can Lucene scale well to this
 kind of situation?  Can I realistically hope to get anywhere near my

Yes.

 performance targets?  Will I have to distribute pieces of the index

Yes.

 across several machines,  parallelize my retrievals, and merge the

That's more difficult to say. You'll need to try.

 results to do so?  If so, does Lucene already support that or will I

Yes, see RemoteSearchable and MultiSearcher in org.apache.lucene.search.
(See the javadoc on the website)
But first make sure that the Analyzer you use for indexing fits your needs.

 have to develop that logic in house?  (Seems like I saw a reference

No.

 somewhere that such a feature was coming soon, but I'm not sure when or
 how it will be implemented.)

Have fun,
Ype
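A rough sketch of searching several index pieces with MultiSearcher (the paths and field name are illustrative assumptions); remote pieces could be wrapped as RemoteSearchable instances instead of local IndexSearchers:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.TermQuery;

public class DistributedSearch {
    public static void main(String[] args) throws Exception {
        // Each piece of the split index gets its own Searchable;
        // MultiSearcher merges the ranked results.
        Searchable[] pieces = {
            new IndexSearcher("/indexes/part1"),
            new IndexSearcher("/indexes/part2"),
        };
        MultiSearcher searcher = new MultiSearcher(pieces);
        Hits hits = searcher.search(new TermQuery(new Term("contents", "term")));
        System.out.println(hits.length() + " hits");
        searcher.close();
    }
}
```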





Re: lucene applicability and performance

2004-04-28 Thread Ype Kingma
Greg,

 Yes, see RemoteSearchable and MultiSearcher in org.apache.lucene.search.
 (See the javadoc on the website)

I meant ParallelMultiSearcher.

Good night,
Ype





Created LockObtainTimedOut wiki page

2004-04-28 Thread Kevin A. Burton
I just created a LockObtainTimedOut wiki entry... feel free to add.  I 
just entered the Tomcat issue with java.io.tmpdir as well.

http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut  

Peace!



