Re: Document 'Context' Relation to each other
On Jan 21, 2005, at 10:47 PM, Paul Smith wrote:

> As a log4j developer, I've been toying with the idea of what Lucene could do for me, maybe as an excuse to play around with Lucene.

First off, let me thank you for your work on log4j! I've been using it at lucenebook.com with the SMTPAppender (once I learned that I needed a custom trigger to release e-mails when I wanted, not just on errors) and it's been working great.

> Now, I could provide a Field to the LoggingEvent Document that has a sequence #, and once a user has chosen an appropriate matching event, do another search for the documents with a sequence # between +/- the context size. My question is: is that going to be an efficient way to do this? The sequence # would be treated as text, wouldn't it? Would a range search on an int be the most efficient way to do this? I know from the Hits documentation that one can retrieve the document ID of a matching entry. What is the contract on this document ID? Is each Document added to the index given an increasing number? Can one search an index by document ID? Could one search for document IDs within a range? (Hope you can see where I'm going here.)

You wouldn't even need the sequence number. You'll certainly be adding the documents to the index in the proper sequence already (right?). It is easy to randomly access documents if you know Lucene's document IDs. Here's the pseudo-code:

- construct an IndexReader
- open an IndexSearcher using the IndexReader
- search, getting Hits back
- for a hit whose context you want to see, get hits.id(hit#)
- subtract the context size from the id, then grab documents using reader.document(id)

You don't search for a document by id, but rather jump right to it with IndexReader.

> Many thanks for an excellent API, and kudos to Erik & Otis for a great eBook, btw.

Thanks!

Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
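The pseudo-code above can be fleshed out into a small Java sketch against the Lucene 1.4-era API. ContextFetcher and the "context" parameter are made-up names for this example; the key point is that reader.document(id) is direct random access, so the window just needs to be clamped to the valid id range. Note also a caveat implicit in the thread: document ids stay sequential only as long as no documents are deleted and segments are not merged out of order.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

// Hypothetical helper following the pseudo-code above; "context" is the
// number of neighboring log events to pull on each side of the hit.
public class ContextFetcher {
    public static Document[] context(IndexReader reader, int docId, int context)
            throws java.io.IOException {
        // Clamp the window to the valid document-id range [0, maxDoc).
        int start = Math.max(0, docId - context);
        int end = Math.min(reader.maxDoc() - 1, docId + context);
        Document[] docs = new Document[end - start + 1];
        for (int id = start; id <= end; id++) {
            docs[id - start] = reader.document(id); // direct random access by id
        }
        return docs;
    }
}
```

A caller would first do a normal search, take hits.id(n) for the chosen hit, and pass that id in; no second query is needed.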
Re: Opening up one large index takes 940M of memory?
On Saturday 22 January 2005 01:39, Kevin A. Burton wrote:

> Kevin A. Burton wrote:
>> We have one large index right now... its about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index.
> After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there was a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these variables.

It's even documented. From http://jakarta.apache.org/lucene/docs/fileformats.html :

"The term info index, or .tii file. This contains every IndexIntervalth entry from the .tis file, along with its location in the .tis file. This is designed to be read entirely into memory and used to provide random access to the .tis file."

My guess is that this is what you see happening. To see the actual .tii file, you need the non-default file format. Once searching starts you'll also see that the field norms are loaded; these take one byte per searched field per document.

> This would be similar to the way the MySQL index cache works...

It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...

Regards,
Paul Elschot
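A back-of-the-envelope calculation shows how the .tii file explains the observed footprint. The figures below are purely illustrative assumptions (the thread never states the term count or the per-entry overhead); TiiEstimate is a made-up name:

```java
// Back-of-the-envelope sketch (assumed figures, not measured): estimate the
// heap taken by the in-memory term index for a given total term count.
public class TiiEstimate {
    // The index interval defaults to 128 in this era of Lucene: every
    // 128th term from the .tis file is held in memory.
    static long tiiEntries(long totalTerms, int indexInterval) {
        return totalTerms / indexInterval;
    }

    // bytesPerEntry is an assumption covering the Term, TermInfo, and
    // String object overhead per in-memory entry.
    static long estimatedBytes(long totalTerms, int indexInterval, int bytesPerEntry) {
        return tiiEntries(totalTerms, indexInterval) * (long) bytesPerEntry;
    }

    public static void main(String[] args) {
        long terms = 1500000000L; // hypothetical unique-term count for a 60G index
        System.out.println(estimatedBytes(terms, 128, 80) + " bytes");
    }
}
```

With those assumed numbers, 1.5 billion terms / 128 is about 11.7 million in-memory entries, and at roughly 80 bytes each that is about 940M of heap, the same ballpark as Kevin's observation.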
Re: Search Chinese in Unicode !!!
hi, Eric

If you can read Chinese, please see this blog: http://blog.csdn.net/accesine960. Or search for weblucene at www.sf.net, a project built on Lucene by a Chinese developer named chedong; his web site is www.chedong.com.

Good luck!

Eric Chow [EMAIL PROTECTED] wrote:
> How to create an index from Chinese (UTF-8 encoded) HTML and search it with Lucene?
Re: Opening up one large index takes 940M of memory?
It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with index seeks at search time. I wonder if this is what's using your memory. The number '128' can't be modified just like that, but somebody (Julien?) has modified the code in the past to make this variable. That's the only thing I can think of right now, and it may or may not be an idea in the right direction.

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
> We have one large index right now... its about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open this index. Here's the code:
>
>     System.out.println("opening...");
>     long before = System.currentTimeMillis();
>     Directory dir = FSDirectory.getDirectory("/var/ksa/index-1078106952160/", false);
>     IndexReader ir = IndexReader.open(dir);
>     System.out.println(ir.getClass());
>     long after = System.currentTimeMillis();
>     System.out.println("opening...done - duration: " + (after - before));
>     System.out.println("totalMemory: " + Runtime.getRuntime().totalMemory());
>     System.out.println("freeMemory: " + Runtime.getRuntime().freeMemory());
>
> Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere?
>
> Kevin

--
Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod!

Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
Re: Opening up one large index takes 940M of memory?
There Kevin, that's what I was referring to: the .tii file.

Otis

--- Paul Elschot [EMAIL PROTECTED] wrote:
> [Paul's full reply, quoted verbatim, trimmed here]
Re: Opening up one large index takes 940M of memory?
Hi,

If it is really the case that every 128th term is loaded into memory, could you use a relational database or a B-tree to index the terms instead? Even if you create another level of indexing on top of the .tii file, it is just a hack and would not scale well. I would think a B/B+ tree based approach is the way to go for better memory utilization.

Cheers,
Jian

On Sat, 22 Jan 2005 08:32:50 -0800 (PST), Otis Gospodnetic [EMAIL PROTECTED] wrote:
> There Kevin, that's what I was referring to, the .tii file.
> [rest of the quoted thread trimmed]
Lucene in Action: Batch indexing by using RAMDirectory
Hi,

On page 52 of Lucene in Action (Indexing: Controlling the indexing process; Batch indexing by using RAMDirectory as a buffer) I read:

"A more sophisticated approach would involve keeping track of RAMDirectory's memory consumption, in order to prevent RAMDirectory from growing too large."

I've taken a look at Runtime.totalMemory() and so on, but I didn't figure out how to use these functions to prevent an OutOfMemoryError while using RAMDirectory that way. Any idea?
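One way to track RAMDirectory's consumption without touching Runtime at all is to sum the lengths of the files it holds and flush to disk past a threshold. This is only a sketch against the Lucene 1.4-era API; BufferedIndexer and MAX_RAM_BYTES are made-up names, and the 32 MB budget is an arbitrary assumption:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

// Sketch (assumed Lucene 1.4 API): measure the RAMDirectory by summing its
// file lengths, and merge it into the disk index before it grows too large.
public class BufferedIndexer {
    static final long MAX_RAM_BYTES = 32 * 1024 * 1024; // assumed budget

    static long usedBytes(Directory dir) throws java.io.IOException {
        long total = 0;
        String[] files = dir.list();
        for (int i = 0; i < files.length; i++) {
            total += dir.fileLength(files[i]);
        }
        return total;
    }

    public static void index(Document[] docs, IndexWriter diskWriter)
            throws java.io.IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        for (int i = 0; i < docs.length; i++) {
            ramWriter.addDocument(docs[i]);
            if (usedBytes(ramDir) > MAX_RAM_BYTES) {
                // Flush the in-memory buffer into the on-disk index.
                ramWriter.close();
                diskWriter.addIndexes(new Directory[] { ramDir });
                ramDir = new RAMDirectory();
                ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            }
        }
        ramWriter.close();
        diskWriter.addIndexes(new Directory[] { ramDir });
    }
}
```

This sidesteps the unreliability of Runtime.totalMemory()/freeMemory(), which reflect the whole heap rather than the directory's share of it.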
Re: Document 'Context' Relation to each other
> You don't search for a document by id, but rather jump right to it with IndexReader.
> [rest of Erik's reply trimmed]

Perfect, that's exactly what I was after! It's going to be easier than I thought.

Thanks,
Paul
Re: Opening up one large index takes 940M of memory?
Paul Elschot wrote:
> This would be similar to the way the MySQL index cache works...
> It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead...

The problem, I think, for everyone right now is that 32 bits just doesn't cut it in production systems... 2G of memory per process and you really start to feel it.

Kevin
Re: Opening up one large index takes 940M of memory?
Chris Hostetter wrote:
>> We have one large index right now... its about 60G ... When I open it the Java VM used 940M of memory. The VM does nothing else besides open
> Just out of curiosity, have you tried turning on the verbose gc log and putting in some thread sleeps after you open the reader, to see if the memory footprint settles down after a little while? You're currently checking the memory usage immediately after opening the index, and some of that memory may be used holding transient data that will get freed up after some GC iterations.

Actually I haven't, but to be honest the numbers seem dead on. The VM heap wouldn't reallocate if it didn't need that much memory, and this is almost exactly the behavior I'm seeing in production. Though I guess it wouldn't hurt ;)

Kevin
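Chris's suggestion can be sketched in plain Java: after the reader is opened (that line is elided here), give the collector a few rounds to reclaim transient allocations before reading the heap counters. HeapCheck is a made-up name; run the JVM with -verbose:gc to watch the collections happen:

```java
// Sketch: measure heap use only after letting GC settle, instead of
// immediately after opening the index as in the original snippet.
public class HeapCheck {
    static long settledUsedBytes() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        // System.gc() is only a hint, so ask a few times with short sleeps.
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(100);
            used = rt.totalMemory() - rt.freeMemory();
        }
        return used;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("settled heap use: " + settledUsedBytes() + " bytes");
    }
}
```

If the settled figure stays near 940M, the memory really is held by live data (the .tii entries) rather than by garbage awaiting collection.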
Re: Opening up one large index takes 940M of memory?
Otis Gospodnetic wrote:
> It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that.
> [rest of message trimmed]

I loaded it into a profiler a long time ago. Most of the memory was due to Term instances being loaded into memory. I might try to get some time to load it into a profiler on Monday...

Kevin
Re: Opening up one large index takes 940M of memory?
On Jan 22, 2005, at 23:50, Kevin A. Burton wrote:
> The problem I think for everyone right now is that 32 bits just doesn't cut it in production systems... 2G of memory per process and you really start to feel it.

Hmmm... no... no pain at all... or perhaps you are implying that your entire system is running on one puny JVM instance... in that case, this is perhaps more of a design problem than an implementation one... YMMV...

Cheers

--
PA
http://alt.textdrive.com/
Re: Lucene in Action: Batch indexing by using RAMDirectory
I posted a suggested solution to this some time ago: http://marc.theaimsgroup.com/?l=lucene-user&m=108922279803667&w=2

The overhead of doing these tests was negligible, but I haven't tried it since TermVectors and the compound index format were introduced.

Oscar Picasso wrote:
> On page 52 of Lucene in Action (Indexing: Controlling the indexing process; Batch indexing by using RAMDirectory as a buffer) I read: "A more sophisticated approach would involve keeping track of RAMDirectory's memory consumption, in order to prevent RAMDirectory from growing too large."
> [rest of message trimmed]
Re: Search Chinese in Unicode !!!
hi, Safarnejad

Would you please send me a copy of your code? zhousp#gmail.com Thanks :)

On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email. BTW, I also have written an Arabic analyzer, which is collecting dust for similar reasons.
>
> Good luck,
> Ali Safarnejad
>
> -----Original Message-----
> From: Eric Chow [mailto:[EMAIL PROTECTED]
> Sent: 21 January 2005 11:42
> To: Lucene Users List
> Subject: Re: Search Chinese in Unicode !!!
>
> Search not really correct with UTF-8 !!! The following is the search result when I used SearchFiles from the Lucene demo:
>
>     d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src> java org.apache.lucene.demo.SearchFiles c:\temp\myindex
>     Usage: java SearchFiles index
>     Query: [Chinese characters garbled in the archive]
>     Searching for: [garbled]
>     3 total matching documents
>     0. ../docs/ChineseDemo.html - this file contains the character
>     1. ../docs/luceneplan.html - Jakarta Lucene - Plan for enhancements to Lucene
>     2. ../docs/api/index-all.html - Index (Lucene 1.4.3 API)
>
> From the above result, only ChineseDemo.html includes the character that I want to search! The modified code in SearchFiles.java:
>
>     BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
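The one-line fix quoted in this thread (decoding console input explicitly as UTF-8 rather than with the platform default charset) can be shown in isolation. Utf8Read is a made-up name for this stdlib-only sketch; in the real demo the decoded line would then go to the QueryParser with a CJK-aware analyzer:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Sketch: decode query bytes explicitly as UTF-8, which is the change made
// to SearchFiles.java in the message above.
public class Utf8Read {
    static String readUtf8(byte[] raw) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), "UTF-8"));
        return in.readLine();
    }

    public static void main(String[] args) throws IOException {
        byte[] chinese = "\u4e2d\u6587".getBytes("UTF-8"); // sample CJK input
        System.out.println(readUtf8(chinese));
    }
}
```

Correct decoding alone is not enough, though: the analyzer used at index and search time must also tokenize CJK text consistently, which is why the Chinese Analyzer discussed above matters.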
Lucene in Action
hi, all

Does anyone know how to buy Lucene in Action in China?

Ansi
Re: Lucene in Action
Hi,

I am not sure. However, I see that the book has an electronic version you can buy online...

Cheers,
Jian

On Sun, 23 Jan 2005 10:30:24 +0800, ansi [EMAIL PROTECTED] wrote:
> Does anyone know how to buy Lucene in Action in China?
Re: Lucene in Action
Hi Ansi,

If you want the print version, I would guess you could order it from the publisher (http://www.manning.com/hatcher2) or from Amazon, and they will ship it to you in China. The electronic version (a PDF file) is also available from the above URL. I'll ask Manning Publications and see whether they ship outside the U.S.

Otis

--- ansi [EMAIL PROTECTED] wrote:
> Does anyone know how to buy Lucene in Action in China?
Re: Opening up one large index takes 940M of memory?
Yes, I remember your email about the large number of Terms. If it can be avoided and you figure out how to do it, I'd love to patch something. :)

Otis

--- Kevin A. Burton [EMAIL PROTECTED] wrote:
> I loaded it into a profiler a long time ago. Most of the code was due to Term classes being loaded into memory. I might try to get some time to load it into a profiler on monday...
> [rest of quoted thread and signature trimmed]