Re: Document 'Context' Relation to each other

2005-01-22 Thread Erik Hatcher
On Jan 21, 2005, at 10:47 PM, Paul Smith wrote:
As a log4j developer, I've been toying with the idea of what Lucene 
could do for me, maybe as an excuse to play around with Lucene.
First off, let me thank you for your work with log4j!  I've been using 
it at lucenebook.com with the SMTPAppender (once I learned that I 
needed a custom trigger to release e-mails when I wanted, not just on 
errors) and it's been working great.

Now, I could provide a Field to the LoggingEvent Document that has a 
sequence #, and once a user has chosen an appropriate matching event, 
do another search for the documents with a Sequence # between +/- the 
context size.
My question is, is that going to be an efficient way to do this? The 
sequence # would be treated as text, wouldn't it?  Would the range 
search on an int be the most efficient way to do this?

I know from the Hits documentation that one can retrieve the Document 
ID of a matching entry.  What is the contract on this Document ID?  Is 
each Document added to the Index given an increasing number?  Can one 
search an index by Document ID?  Could one search for Document ID's 
between a range?   (Hope you can see where I'm going here).

You wouldn't even need the sequence number.  You'll certainly be adding 
the documents to the index in the proper sequence already (right?).  It 
is easy to random access documents if you know Lucene's document ids.  
Here's the pseudo-code:

	- construct an IndexReader
	- open an IndexSearcher using the IndexReader
	- search, getting Hits back
	- for a hit you want to see the context, get hits.id(hit#)
	- subtract context size from the id, grab documents using 
reader.document(id)

You don't search for a document by id, but rather jump right to it 
with IndexReader.
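The steps above can be sketched in plain Java. Only the id arithmetic is shown here; the `window` helper and all numbers are illustrative, not part of the Lucene API, and the actual fetch would loop `reader.document(id)` over the returned range:

```java
// Sketch of the context-window step: given the Lucene document id of a
// hit, compute the inclusive range of ids to fetch with
// reader.document(id). Ids run from 0 to maxDoc - 1, so the window is
// clamped at both ends of the index.
public class ContextWindow {

    static int[] window(int hitId, int contextSize, int maxDoc) {
        int from = Math.max(0, hitId - contextSize);
        int to = Math.min(maxDoc - 1, hitId + contextSize);
        return new int[] { from, to };
    }

    public static void main(String[] args) {
        int[] mid = window(100, 5, 100000);           // hit in the middle
        System.out.println(mid[0] + ".." + mid[1]);   // 95..105
        int[] edge = window(2, 5, 100000);            // hit near the start
        System.out.println(edge[0] + ".." + edge[1]); // 0..7
    }
}
```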

Many thanks for an excellent API, and kudos to Erik & Otis for a great 
eBook, btw.
Thanks!
Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Paul Elschot
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
 Kevin A. Burton wrote:
 
  We have one large index right now... it's about 60G ... When I open it 
  the Java VM used 940M of memory.  The VM does nothing else besides 
  open this index.
 
 After thinking about it I guess 1.5% of memory per index really isn't 
 THAT bad.  What would be nice is if there were a way to do this from disk 
 and then use a buffer (either via the filesystem or in-VM memory) to 
 access these variables.

It's even documented. From:
http://jakarta.apache.org/lucene/docs/fileformats.html :

The term info index, or .tii file. 
This contains every IndexIntervalth entry from the .tis file, along with its
location in the tis file. This is designed to be read entirely into memory
and used to provide random access to the tis file. 

My guess is that this is what you see happening.
To see the actual .tii file, you need the non-default file format.

Once searching starts you'll also see that the field norms are loaded;
these take one byte per searched field per document.
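As a rough back-of-envelope, the two in-memory costs Paul describes can be estimated like this. All figures below are hypothetical examples (not measurements of Kevin's index), and the per-entry cost is a guess at the size of a cached Term/TermInfo pair:

```java
// Back-of-envelope estimate of the two in-memory costs described above:
// the .tii term index (one entry per IndexInterval-th term) and the
// field norms (one byte per searched field per document).
public class IndexMemoryEstimate {

    // .tii cost: one in-memory entry per IndexInterval-th term.
    static long tiiBytes(long numTerms, int indexInterval, long bytesPerEntry) {
        return (numTerms / indexInterval) * bytesPerEntry;
    }

    // Norms cost: one byte per searched field per document.
    static long normBytes(long numDocs, int searchedFields) {
        return numDocs * (long) searchedFields;
    }

    public static void main(String[] args) {
        // Example: 1 billion terms, default interval 128, and an assumed
        // ~100 bytes per cached entry (term text plus file pointers).
        long tii = tiiBytes(1_000_000_000L, 128, 100);
        // Example: 50 million docs, 2 searched fields.
        long norms = normBytes(50_000_000L, 2);
        System.out.println("tii ~ " + tii / (1024 * 1024) + " MB");     // ~745 MB
        System.out.println("norms ~ " + norms / (1024 * 1024) + " MB"); // ~95 MB
    }
}
```

With numbers in that ballpark, hundreds of megabytes on open is what the arithmetic predicts.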

 This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards,
Paul Elschot





Re: Search Chinese in Unicode !!!

2005-01-22 Thread
hi, Eric
 
If you can read Chinese directly, please refer to this blog:
http://blog.csdn.net/accesine960
Or search for weblucene at www.sf.net, a project based upon Lucene by a 
Chinese developer named chedong; his web site is www.chedong.com 
 
good luck

Eric Chow [EMAIL PROTECTED] wrote:
How do I create an index from Chinese (UTF-8 encoded) HTML and search
it with Lucene?




 


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
It would be interesting to know _what_exactly_ uses your memory. 
Running it under a profiler should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.

Otis


--- Kevin A. Burton [EMAIL PROTECTED] wrote:
 We have one large index right now... it's about 60G ... When I open it,
 the Java VM used 940M of memory.  The VM does nothing else besides
 open this index.
 
 Here's the code:
 
 System.out.println( "opening..." );
 
 long before = System.currentTimeMillis();
 Directory dir = FSDirectory.getDirectory( 
 "/var/ksa/index-1078106952160/", false );
 IndexReader ir = IndexReader.open( dir );
 System.out.println( ir.getClass() );
 long after = System.currentTimeMillis();
 System.out.println( "opening...done - duration: " + 
 (after - before) );
 
 System.out.println( "totalMemory: " + 
 Runtime.getRuntime().totalMemory() );
 System.out.println( "freeMemory: " + 
 Runtime.getRuntime().freeMemory() );
 
 Is there any way to reduce this footprint?  The index is fully 
 optimized... I'm willing to take a performance hit if necessary.  Is 
 this documented anywhere?
 
 Kevin
 
 -- 
 
 Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an
 
 invite!  Also see irc.freenode.net #rojo if you want to chat.
 
 Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
 
 If you're interested in RSS, Weblogs, Social Networking, etc... then
 you 
 should work for Rojo!  If you recommend someone and we hire them
 you'll 
 get a free iPod!
 
 Kevin A. Burton, Location - San Francisco, CA
AIM/YIM - sfburtonator,  Web - http://peerfear.org/
 GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 
 



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
There Kevin, that's what I was referring to, the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:

 On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
  Kevin A. Burton wrote:
  
   We have one large index right now... it's about 60G ... When I
 open it 
   the Java VM used 940M of memory.  The VM does nothing else
 besides 
   open this index.
  
  After thinking about it I guess 1.5% of memory per index really
 isn't 
  THAT bad.  What would be nice if there was a way to do this from
 disk 
  and then use a buffer (either via the filesystem or in-vm
 memory) to 
  access these variables.
 
 It's even documented. From:
 http://jakarta.apache.org/lucene/docs/fileformats.html :
 
 The term info index, or .tii file. 
 This contains every IndexIntervalth entry from the .tis file, along
 with its
 location in the tis file. This is designed to be read entirely
 into memory
 and used to provide random access to the tis file. 
 
 My guess is that this is what you see happening.
 To see the actual .tii file, you need the non-default file format.
 
 Once searching starts you'll also see that the field norms are
 loaded,
 these take one byte per searched field per document.
 
  This would be similar to the way the MySQL index cache works...
 
 It would be possible to add another level of indexing to the terms.
 No one has done this yet, so I guess it's preferred to buy RAM
 instead...
 
 Regards,
 Paul Elschot
 
 



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread jian chen
Hi,

If it is really the case that every 128th term is loaded into memory,
could you use a relational database or a b-tree to do the work
of indexing the terms instead?

Even if you create another level of indexing on top of the .tii file,
it is just a hack and would not scale well.

I would think a B/B+ tree based approach is the way to go for better
memory utilization.

Cheers,

Jian


On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 There Kevin, that's what I was referring to, the .tii file.
 
 Otis
 
 --- Paul Elschot [EMAIL PROTECTED] wrote:
 
  On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:
   Kevin A. Burton wrote:
  
We have one large index right now... it's about 60G ... When I
  open it
the Java VM used 940M of memory.  The VM does nothing else
  besides
open this index.
  
   After thinking about it I guess 1.5% of memory per index really
  isn't
   THAT bad.  What would be nice if there was a way to do this from
  disk
   and then use a buffer (either via the filesystem or in-vm
  memory) to
   access these variables.
 
  It's even documented. From:
  http://jakarta.apache.org/lucene/docs/fileformats.html :
 
  The term info index, or .tii file.
  This contains every IndexIntervalth entry from the .tis file, along
  with its
  location in the tis file. This is designed to be read entirely
  into memory
  and used to provide random access to the tis file.
 
  My guess is that this is what you see happening.
  To see the actual .tii file, you need the non-default file format.
 
  Once searching starts you'll also see that the field norms are
  loaded,
  these take one byte per searched field per document.
 
   This would be similar to the way the MySQL index cache works...
 
  It would be possible to add another level of indexing to the terms.
  No one has done this yet, so I guess it's preferred to buy RAM
  instead...
 
  Regards,
  Paul Elschot
 
 



Lucene in Action: Batch indexing by using RAMDirectory

2005-01-22 Thread Oscar Picasso
Hi,

On page 52 of Lucene in Action (Indexing > Controlling the indexing process > 
Batch indexing by using RAMDirectory as a buffer) I read:

A more sophisticated approach would involve keeping track of RAMDirectory's
memory consumption, in order to prevent RAMDirectory from growing too large.

I've taken a look at Runtime.totalMemory() and so on, but I didn't figure out
how to use these methods to prevent an OutOfMemoryError while using
RAMDirectory that way.

Any idea?
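One way to realize the book's suggestion is to keep a running byte count of what has been buffered and spill to disk whenever it crosses a threshold. The sketch below stubs out all the Lucene calls — `BufferedIndexer`, the size proxy, and the threshold are illustrative names, not the book's code; a real implementation could measure the buffer by summing `Directory.fileLength(name)` over `ramDir.list()` instead of guessing:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of "track RAMDirectory's memory consumption": buffer documents,
// keep an approximate running byte count, and flush to the disk index
// when a threshold is crossed. Lucene calls are stubbed out.
public class BufferedIndexer {
    private final long maxBufferedBytes;
    private long bufferedBytes = 0;
    private final List<String> buffered = new ArrayList<>();
    private int flushes = 0; // how many times the buffer was spilled

    BufferedIndexer(long maxBufferedBytes) {
        this.maxBufferedBytes = maxBufferedBytes;
    }

    void addDocument(String doc) {
        buffered.add(doc);
        bufferedBytes += doc.length(); // crude size proxy for the demo
        if (bufferedBytes >= maxBufferedBytes) {
            flush();
        }
    }

    void flush() {
        // Real code: fsWriter.addIndexes(new Directory[] { ramDir });
        // then start over with a fresh RAMDirectory.
        buffered.clear();
        bufferedBytes = 0;
        flushes++;
    }

    int flushCount() {
        return flushes;
    }
}
```

The key design point is that the flush decision is driven by the buffer's own size, not by `Runtime.totalMemory()`, which measures the whole heap and moves with unrelated allocations.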






Re: Document 'Context' Relation to each other

2005-01-22 Thread Paul Smith

You wouldn't even need the sequence number.  You'll certainly be 
adding the documents to the index in the proper sequence already 
(right?).  It is easy to random access documents if you know Lucene's 
document ids.  Here's the pseudo-code

- construct an IndexReader
- open an IndexSearcher using the IndexReader
- search, getting Hits back
- for a hit you want to see the context, get hits.id(hit#)
- subtract context size from the id, grab documents using 
reader.document(id)

You don't search for a document by id, but rather jump right to it 
with IndexReader.

Perfect, that's exactly what I was after! It's going to be easier than I 
thought. 

Thanks,
Paul


Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Paul Elschot wrote:
This would be similar to the way the MySQL index cache works...
   

It would be possible to add another level of indexing to the terms.
No one has done this yet, so I guess it's prefered to buy RAM instead...
 

The problem I think for everyone right now is that 32 bits just doesn't 
cut it in production systems... 2G of memory per process and you 
really start to feel it.

Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Chris Hostetter wrote:
: We have one large index right now... its about 60G ... When I open it
: the Java VM used 940M of memory.  The VM does nothing else besides open
Just out of curiosity, have you tried turning on the verbose gc log, and
putting in some thread sleeps after you open the reader, to see if the
memory footprint settles down after a little while?  You're currently
checking the memory usage immediately after opening the index, and some
of that memory may be used holding transient data that will get freed up
after some GC iterations.
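That settling check could look like the following sketch (pure JDK; the helper name and the timings are arbitrary): trigger GC a few times and re-read the heap numbers until they stop shrinking.

```java
// Sketch of measuring "settled" heap usage rather than the reading taken
// immediately after opening the index. System.gc() is only advisory, so
// this is a heuristic, not a guarantee.
public class SettledMemory {

    static long settledUsedBytes() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(100); // give the collector a moment
            long now = rt.totalMemory() - rt.freeMemory();
            if (now >= used) {
                break; // no longer shrinking; treat as settled
            }
            used = now;
        }
        return used;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("settled used ~ "
                + settledUsedBytes() / (1024 * 1024) + " MB");
    }
}
```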
 

Actually I haven't, but to be honest the numbers seem dead on. The VM 
heap wouldn't grow that large if it didn't need that much memory, and this is 
almost exactly the behavior I'm seeing in production.

Though I guess it wouldn't hurt ;)
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Kevin A. Burton
Otis Gospodnetic wrote:
It would be interesting to know _what_exactly_ uses your memory. 
Running it under a profiler should tell you that.

The only thing that comes to mind is... can't remember the details now,
but when the index is opened, I believe every 128th term is read into
memory.  This, I believe, helps with index seeks at search time.  I
wonder if this is what's using your memory.  The number '128' can't be
modified just like that, but somebody (Julien?) has modified the code
in the past to make this variable.  That's the only thing I can think
of right now and it may or may not be an idea in the right direction.
 

I loaded it into a profiler a long time ago. Most of the memory was due to 
Term objects being loaded into memory.

I might try to get some time to load it into a profiler on Monday...
Kevin



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread petite_abeille
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote:
The problem I think for everyone right now is that 32bits just doesn't 
cut it in production systems...   2G of memory per process and you 
really start to feel it.
Hmmm... no... no pain at all... or perhaps you are implying that your 
entire system is running on one puny JVM instance... in that case, this 
is perhaps more of a design problem than an implementation one... 
YMMV...

Cheers
--
PA
http://alt.textdrive.com/


Re: Lucene in Action: Batch indexing by using RAMDirectory

2005-01-22 Thread markharw00d
I posted a suggested solution to this some time ago:
http://marc.theaimsgroup.com/?l=lucene-userm=108922279803667w=2
The overhead of doing these checks was negligible, but I haven't tried it 
since TermVectors and the compound index format were introduced.


Oscar Picasso wrote:
Hi,
On page 52 of Lucene in Action (Indexing > Controlling the indexing process > 
Batch indexing by using RAMDirectory as a buffer) I read:
A more sophisticated approach would involve keeping track of RAMDirectory's
memory consumption, in order to prevent RAMDirectory from growing too large.
I've taken a look at Runtime.totalMemory() and so on, but I didn't figure out
how to use these methods to prevent an OutOfMemoryError while using
RAMDirectory that way.
Any idea?
		


Re: Search Chinese in Unicode !!!

2005-01-22 Thread ansi
hi, Safarnejad
Would you please send me a copy of your code?
zhousp#gmail.com

thanks:)


On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS)
[EMAIL PROTECTED] wrote:
 I've written a Chinese Analyzer for Lucene that uses a segmenter written by
 Erik Peterson. However, as the author of the segmenter does not want his code
 released under apache open source license (although his code _is_
 opensource), I cannot place my work in the Lucene Sandbox.  This is
 unfortunate, because I believe the analyzer works quite well in indexing and
 searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people
 to test, use, and confirm this.  So anyone who wants it can have it. Just
 shoot me an email.
 BTW, I also have written an arabic analyzer, which is collecting dust for
 similar reasons.
 Good luck,
 
 Ali Safarnejad
 
 
 -Original Message-
 From: Eric Chow [mailto:[EMAIL PROTECTED]
 Sent: 21 January 2005 11:42
 To: Lucene Users List
 Subject: Re: Search Chinese in Unicode !!!
 
 Search not really correct with UTF-8 !!!
 
 The following is the search result that I used the SearchFiles in the lucene
 demo.
 
 d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java
 org.apache.lucene.demo.SearchFiles c:\temp\myindex
 Usage: java SearchFiles index
 Query: 
 Searching for: g  strange ??
 3 total matching documents
 0. ../docs/ChineseDemo.htmlthis files contains
 the 
   -
 1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
 2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
 Query:
 
 From the above result only the ChineseDemo.html includes the character that I
 want to search !
 
 The modified code in SearchFiles.java:
 
 BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
 "UTF-8"));
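Why that charset argument matters can be shown without Lucene at all: the same UTF-8 bytes decoded with the wrong charset yield a different string, so the query no longer matches what was indexed. This demo uses the modern `StandardCharsets` API purely for illustration:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Decoding the same UTF-8 bytes with the right and the wrong charset:
// only the UTF-8 reader recovers the original Chinese characters.
public class Utf8Demo {
    public static void main(String[] args) throws Exception {
        String original = "\u4e2d\u6587"; // "Chinese" written in Chinese
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String right = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)).readLine();
        String wrong = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1)).readLine();

        System.out.println(original.equals(right)); // true
        System.out.println(original.equals(wrong)); // false: 6 mojibake chars
    }
}
```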
 
 
 


-- 
---
This mail is for maillist only.
Any private mail pls send to [EMAIL PROTECTED]
-




Lucene in Action

2005-01-22 Thread ansi
hi, all

Does anyone know how to buy Lucene in Action in China?

Ansi




Re: Lucene in Action

2005-01-22 Thread jian chen
Hi,

I am not sure. However, I see that the book has an electronic version
you can buy online...

Cheers,

Jian


On Sun, 23 Jan 2005 10:30:24 +0800, ansi [EMAIL PROTECTED] wrote:
 hi,all
 
 Does anyone know how to buy Lucene in Action in China?
 
 Ansi
 



Re: Lucene in Action

2005-01-22 Thread Otis Gospodnetic
Hi Ansi,

If you want the print version, I would guess you could order it from
the publisher (http://www.manning.com/hatcher2) or from Amazon and they
will ship it to you in China.  The electronic version (a PDF file) is
also available from the above URL.

I'll ask Manning Publications and see whether they ship outside the
U.S.

Otis


--- ansi [EMAIL PROTECTED] wrote:

 hi,all
 
 Does anyone know how to buy Lucene in Action in China?
 
 Ansi
 



Re: Opening up one large index takes 940M of memory?

2005-01-22 Thread Otis Gospodnetic
Yes, I remember your email about the large number of Terms.  If it can
be avoided and you figure out how to do it, I'd love to patch
something. :)

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:

 Otis Gospodnetic wrote:
 
 It would be interesting to know _what_exactly_ uses your memory. 
 Running it under a profiler should tell you that.
 
 The only thing that comes to mind is... can't remember the details
 now,
 but when the index is opened, I believe every 128th term is read
 into
 memory.  This, I believe, helps with index seeks at search time.  I
 wonder if this is what's using your memory.  The number '128' can't
 be
 modified just like that, but somebody (Julien?) has modified the
 code
 in the past to make this variable.  That's the only thing I can
 think
 of right now and it may or may not be an idea in the right
 direction.
   
 
 I loaded it into a profiler a long time ago. Most of the memory was due to 
 Term objects being loaded into memory.
 
 I might try to get some time to load it into a profiler on monday...
 
 Kevin
 
 
 