Re: Lucene vs. in-DB-full-text-searching
Otis Gospodnetic wrote: The most obvious answer is that the full-text indexing features of RDBMSs are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person from a well-known online bookseller told me recently that Lucene was about 10x faster than MySQL for full-text searching, and I am currently helping someone get away from MySQL and into Lucene for performance reasons. Also... MySQL full-text search isn't perfect. If you're not a Java programmer it would be difficult to hack on. Another downside is that FT in MySQL only works with MyISAM tables, which aren't transaction aware and use global table locks (not fun). I'm sure though that MySQL would do a better job at online index maintenance than Lucene. It falls down a bit in this area... Kevin -- Use Rojo (RSS/Atom aggregator). Visit http://rojo.com. Ask me for an invite! Also see irc.freenode.net #rojo if you want to chat. Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html If you're interested in RSS, Weblogs, Social Networking, etc... then you should work for Rojo! If you recommend someone and we hire them you'll get a free iPod! Kevin A. Burton, Location - San Francisco, CA AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene vs. in-DB-full-text-searching
David Sitsky wrote: On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote: You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports. Or to see Lucene + Derby (a 100% Java embedded database donated by IBM, currently in Apache incubation) integrated together... that would be really nice and powerful. Does anyone know if there are any integration plans? Don't forget BerkeleyDB Java Edition... that would be interesting too... Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. You know... it looks like the problem is that TermInfosReader uses INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the offsets that I need. If this is going to be a practical way of reducing Lucene's memory footprint for HUGE indexes then it's going to need a way to change this value based on the current index that's being opened. Is there any way to determine the INDEX_INTERVAL from the file? According to http://jakarta.apache.org/lucene/docs/fileformats.html the .tis file (and according to the docs the .tii file is very similar to the .tis file) should have this data: TermInfoFile (.tis) -- TIVersion, TermCount, IndexInterval, SkipInterval, TermInfos. The only problem is that the .tii and .tis files I have on disk don't have a constant preamble and it doesn't look like there's an index interval here... Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Doug Cutting wrote: Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. It looks like you're using a pre-1.4 version of Lucene. Since 1.4 this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather TermInfosWriter.indexInterval. Yes... we're trying to be conservative and haven't migrated yet. Though doing so might be required for this move I think... Is this setting incompatible with older indexes burned with the lower value? Prior to 1.4, yes. After 1.4, no. What happens after 1.4? Can I take indexes burned with 256 (a greater value) in 1.3 and open them up correctly with 1.4? Kevin PS. Once I get this working I'm going to create a wiki page documenting this process. Kevin
Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??
Doug Cutting wrote: Not without hacking things. If your 1.3 indexes were generated with 256 then you can modify your version of Lucene 1.4+ to use 256 instead of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today). Prior to 1.4 this was a constant, hardwired into the index format. In 1.4 and later each index segment stores this value as a parameter. So once 1.4 has re-written your index you'll no longer need a modified version. Thanks for the feedback, Doug. This makes more sense now. I didn't understand why the website documented the fact that the .tii file was storing the index interval. I think I'm going to investigate just moving to 1.4... I need to do it anyway. Might as well bite the bullet now. Kevin
1.4.x TermInfosWriter.indexInterval not public static?
What's the desired pattern for using TermInfosWriter.indexInterval? Do I have to compile my own version of Lucene to change this? The old API was public static final, but this is neither public nor static. I'm wondering if we should just make this a value that can be set at runtime. Considering the memory savings for larger installs this can/will be important. Kevin
Re: Opening up one large index takes 940M of memory?
Doug Cutting wrote: Kevin A. Burton wrote: Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? You can increase TermInfosWriter.indexInterval. You'll need to re-write the .tii file for this to take effect. The simplest way to do this is to use IndexWriter.addIndexes(), adding your index to a new, empty directory. This will of course take a while for a 60GB index... (Note... when this works I'll note my findings in a wiki page for future developers.) Two more questions: 1. Do I have to do this with a NEW directory? Our nightly index merger uses an existing target index, which I assume will re-use the same settings as before? I did this last night and it still seems to use the same amount of memory. Above you assert that I should use a new empty directory and I'll try that tonight. 2. This isn't destructive, is it? I mean I'll be able to move BACK to a TermInfosWriter.indexInterval of 128, right? Thanks! Kevin
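A minimal sketch of the rewrite Doug describes, assuming the Lucene 1.4-era API (the directory paths and the choice of StandardAnalyzer are placeholders, not part of the thread):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteIndex {
    public static void main(String[] args) throws Exception {
        Directory oldDir = FSDirectory.getDirectory("/var/index-old", false);
        // Open a writer on a NEW, empty directory (create == true) so the
        // rewritten segments pick up the writer's indexInterval setting.
        IndexWriter writer = new IndexWriter("/var/index-new",
                                             new StandardAnalyzer(), true);
        writer.addIndexes(new Directory[] { oldDir }); // re-writes .tii/.tis
        writer.close();
    }
}
```

Since 1.4 stores the interval per segment, the rewritten index should open correctly with a stock Lucene afterwards, which also answers question 2: rewriting again with the default interval moves you back.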
DbDirectory and Berkeley DB Java Edition...
I'm reading the Lucene in Action book right now, and on page 309 they talk about using the DbDirectory, which uses Berkeley DB for maintaining your index. Anyone ever consider a port to Berkeley DB Java Edition? The only downside would be the license (I think it's GPL) but it could really free up the time it takes to optimize(), I think. You could just rehash the hashtable and then insert rows into the new table. Would be interesting to benchmark, I think. Thoughts? http://www.sleepycat.com/products/je.shtml
Re: Opening up one large index takes 940M of memory?
Paul Elschot wrote: This would be similar to the way the MySQL index cache works... It would be possible to add another level of indexing to the terms. No one has done this yet, so I guess it's preferred to buy RAM instead... The problem I think for everyone right now is that 32 bits just doesn't cut it in production systems... 2G of memory per process and you really start to feel it. Kevin
Re: Opening up one large index takes 940M of memory?
Chris Hostetter wrote: : We have one large index right now... its about 60G ... When I open it : the Java VM used 940M of memory. The VM does nothing else besides open Just out of curiosity, have you tried turning on the verbose gc log, and putting in some thread sleeps after you open the reader, to see if the memory footprint settles down after a little while? You're currently checking the memory usage immediately after opening the index, and some of that memory may be used holding transient data that will get freed up after some GC iterations. Actually I haven't, but to be honest the numbers seem dead on. The VM heap wouldn't reallocate if it didn't need that much memory, and this is almost exactly the behavior I'm seeing in production. Though I guess it wouldn't hurt ;) Kevin
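One way to try Chris's suggestion without a profiler: run with -verbose:gc and re-measure the heap after nudging the collector a few times. This is a generic sketch (not tied to Lucene); the iteration count and sleep are arbitrary:

```java
public class MemorySettle {
    // Report used heap after giving the collector a chance to reclaim
    // transient allocations, e.g. right after IndexReader.open().
    static long settledUsedBytes() throws InterruptedException {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < 5; i++) {
            System.gc();            // request (not force) a collection
            Thread.sleep(200);      // let background GC threads run
            long now = rt.totalMemory() - rt.freeMemory();
            if (now >= used) break; // no longer shrinking: settled
            used = now;
        }
        return used;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("settled used heap: " + settledUsedBytes() + " bytes");
    }
}
```

If the settled number stays near 940M, the footprint really is term dictionary data and not transient allocation, which is what Kevin suspects.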
Re: Opening up one large index takes 940M of memory?
Otis Gospodnetic wrote: It would be interesting to know _what_exactly_ uses your memory. Running under a profiler should tell you that. The only thing that comes to mind is... can't remember the details now, but when the index is opened, I believe every 128th term is read into memory. This, I believe, helps with index seeks at search time. I wonder if this is what's using your memory. The number '128' can't be modified just like that, but somebody (Julien?) has modified the code in the past to make this variable. That's the only thing I can think of right now and it may or may not be an idea in the right direction. I loaded it into a profiler a long time ago. Most of the memory was due to Term classes being loaded into memory. I might try to get some time to load it into a profiler on Monday... Kevin
Opening up one large index takes 940M of memory?
We have one large index right now... it's about 60G ... When I open it the Java VM uses 940M of memory. The VM does nothing else besides open this index. Here's the code:

System.out.println("opening...");
long before = System.currentTimeMillis();
Directory dir = FSDirectory.getDirectory("/var/ksa/index-1078106952160/", false);
IndexReader ir = IndexReader.open(dir);
System.out.println(ir.getClass());
long after = System.currentTimeMillis();
System.out.println("opening...done - duration: " + (after - before));
System.out.println("totalMemory: " + Runtime.getRuntime().totalMemory());
System.out.println("freeMemory: " + Runtime.getRuntime().freeMemory());

Is there any way to reduce this footprint? The index is fully optimized... I'm willing to take a performance hit if necessary. Is this documented anywhere? Kevin
Re: Opening up one large index takes 940M of memory?
Kevin A. Burton wrote: We have one large index right now... it's about 60G ... When I open it the Java VM uses 940M of memory. The VM does nothing else besides open this index. After thinking about it I guess 1.5% of memory per index really isn't THAT bad. What would be nice is if there were a way to do this from disk and then use a buffer (either via the filesystem or in-VM memory) to access these variables. This would be similar to the way the MySQL index cache works... Kevin
Unable to read TLD META-INF/c.tld from JAR file ... standard.jar
What in the world is up with this exception? We've migrated to using pre-compiled JSPs in Tomcat 5.5 for performance reasons, but if I try to start with a FRESH webapp or try to update any of the JSPs in place and recompile, I'll get this error. Any ideas? I thought maybe the .jar files were corrupt, but if I md5sum them they are identical to production and the Tomcat standard dist. Thoughts?

org.apache.jasper.JasperException: /subscriptions/index.jsp(1,1) /init.jsp(2,0) Unable to read TLD META-INF/c.tld from JAR file file:/usr/local/jakarta-tomcat-5.5.4/webapps/rojo/ROOT/WEB-INF/lib/standard.jar: org.apache.jasper.JasperException: Failed to load or instantiate TagLibraryValidator class: org.apache.taglibs.standard.tlv.JstlCoreTLV
org.apache.jasper.compiler.DefaultErrorHandler.jspError(DefaultErrorHandler.java:39)
org.apache.jasper.compiler.ErrorDispatcher.dispatch(ErrorDispatcher.java:405)
org.apache.jasper.compiler.ErrorDispatcher.jspError(ErrorDispatcher.java:86)
org.apache.jasper.compiler.Parser.processIncludeDirective(Parser.java:339)
org.apache.jasper.compiler.Parser.parseIncludeDirective(Parser.java:372)
org.apache.jasper.compiler.Parser.parseDirective(Parser.java:475)
org.apache.jasper.compiler.Parser.parseElements(Parser.java:1539)
org.apache.jasper.compiler.Parser.parse(Parser.java:126)
org.apache.jasper.compiler.ParserController.doParse(ParserController.java:211)
org.apache.jasper.compiler.ParserController.parse(ParserController.java:100)
org.apache.jasper.compiler.Compiler.generateJava(Compiler.java:146)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:286)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:267)
org.apache.jasper.compiler.Compiler.compile(Compiler.java:255)
org.apache.jasper.JspCompilationContext.compile(JspCompilationContext.java:556)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:296)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:295)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:245)
javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
Re: Unable to read TLD META-INF/c.tld from JAR file ... standard.jar
Otis Gospodnetic wrote: Most definitely Jetty. I can't believe you're using Tomcat for Rojo! ;) I never said we were using Tomcat for Rojo ;) Sorry about that btw... wrong list!
Re: How to index Windows' Compiled HTML Help (CHM) Format
Tom wrote: Hi, does anybody know how to index chm-files? A possible solution I know is to convert chm-files to pdf-files (there are converters available for this job) and then use the known tools (e.g. PDFBox) to index the content of the pdf files (which contain the content of the chm-files). Are there any tools which can directly grab the textual content out of the (binary) chm-files? I think chm-file indexing support is really a big missing piece in the currently supported indexable filetype collection (XML, HTML, PDF, MS Word DOC, RTF, plaintext). I believe it's just a Microsoft .cab file with an index.html inside it... am I right? Just uncompress it. The problem is that the HTML within them isn't anywhere NEAR standard and you can't really give it to the user in the UI... Kevin
Re: lucene in action ebook
Erik Hatcher wrote: I have the e-book PDF in my possession. I have been prodding Manning on a daily basis to update the LIA website and get the e-book available. It is ready, and I'm sure that it's just a matter of them pushing it out. There may be some administrative loose ends they are tying up before releasing it to the world. It should be available any minute now, really. :) Send off a link to the list when it's out... We're all holding our breath ;) (seriously) Kevin
Re: JDBCDirectory to prevent optimize()?
Erik Hatcher wrote: Also, there is a DbDirectory in the sandbox to store a Lucene index inside Berkeley DB. I assume this would prevent prefix queries from working... Kevin
Re: Index in RAM - is it realy worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the filesystem do their share of work and caching for me before looking for ways to optimize my code. Yes... I performed the same benchmark and in my situation RAMDirectory for searches was about 2% slower. I'm willing to bet that it has to do with the fact that it uses a Hashtable and not a HashMap (which isn't synchronized). Also, adding a constructor for the term size could make loading a RAMDirectory faster since you could prevent rehashing. If you're on a modern machine your filesystem cache will end up buffering your disk anyway, which I'm sure was happening in my situation. Kevin
Re: Index in RAM - is it realy worthy?
Otis Gospodnetic wrote: For the Lucene book I wrote some test cases that compare FSDirectory and RAMDirectory. What I found was that with certain settings FSDirectory was almost as fast as RAMDirectory. Personally, I would push FSDirectory and hope that the OS and the Filesystem do their share of work and caching for me before looking for ways to optimize my code. Another note: doing an index merge in memory is probably faster if you just use a RAMDirectory and perform addIndexes to it. This would almost certainly be faster than optimizing on disk, but I haven't benchmarked it. Kevin
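A rough sketch of the in-memory merge Kevin describes, assuming the Lucene 1.4-era API (the paths and the analyzer are placeholders; the merged result still has to be copied back to disk at the end, which is the part that hasn't been benchmarked against a plain on-disk optimize):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamMerge {
    public static void main(String[] args) throws Exception {
        Directory[] sources = {
            FSDirectory.getDirectory("/var/index-a", false),
            FSDirectory.getDirectory("/var/index-b", false),
        };
        // Merge entirely in memory...
        RAMDirectory ram = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ram, new StandardAnalyzer(), true);
        ramWriter.addIndexes(sources);
        ramWriter.close();

        // ...then write the merged result back to a fresh on-disk index.
        IndexWriter diskWriter = new IndexWriter("/var/index-merged",
                                                 new StandardAnalyzer(), true);
        diskWriter.addIndexes(new Directory[] { ram });
        diskWriter.close();
    }
}
```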
JDBCDirectory to prevent optimize()?
It seems that, compared to other datastores, Lucene starts to fall down. For example, Lucene doesn't perform online index optimizations, so if you add 10 documents you have to run optimize() again, and this isn't exactly a fast operation. I'm wondering about the potential for a generic JDBCDirectory for keeping the Lucene index within a database. It sounds somewhat unconventional but would allow you to perform live addDirectory updates without running optimize() again. Has anyone looked at this? How practical would it be? Kevin
Mozilla Desktop Search
http://www.peerfear.org/rss/permalink/2004/11/13/MozillaDesktopSearch/ The Mozilla Foundation may be considering a desktop search implementation http://computerworld.com/developmenttopics/websitemgmt/story/0,10801,97396,00.html?f=x10 : Having launched the much-awaited Version 1.0 of the Firefox browser yesterday (see story), The Mozilla Foundation is busy planning enhancements to the open-source product, including the possibility of integrating it with a variety of desktop search tools. The Mozilla Foundation also wants to place Firefox in PCs through reseller deals with PC hardware vendors and continue to sharpen the product's pop-up ad-blocking technology. I'm not sure this is a good idea. Maybe it is though. The technology just isn't there for cross-platform search. I'd have to suggest using Lucene, but with GCJ for a native compile into XPCOM components, though I'm not sure if GCJ is up to the job here. If this approach is possible then I'd be very excited. One advantage to this approach is that an HTTP server wouldn't be necessary since you're already within the browser. Good for everyone involved. No bloated Tomcat causing problems, and blazingly fast access within the browser. Also, since TCP isn't involved you could gracefully fail when the search service isn't running; you could just start it.
Lucene external field storage contribution.
About 3 months ago I developed an external storage engine which ties into Lucene. I'd like to discuss making a contribution so that this is integrated into a future version of Lucene. I'm going to paste my original PROPOSAL in this email. There wasn't a ton of feedback the first time around, but I figure the squeaky wheel gets the grease... I created this proposal because we need this fixed at work. I want to go ahead and work on a vertical fix for our version of Lucene and then submit this back to Jakarta. There seems to be a lot of interest here and I wanted to get feedback from the list before moving forward... Should I put this in the wiki?! Kevin ** OVERVIEW ** Currently Lucene supports 'stored fields', where the content of these fields is kept within the Lucene index for use in the future. While acceptable for small indexes, larger amounts of stored fields prevent: - Fast index merges, since the full content has to be continually merged. - Storing the indexes in memory (since a LOT of memory would be required and this is cost prohibitive). - Fast queries, since block caching can't be used on the index data. For example, in our current setup our index size is 20G. Nearly 90% of this is content. If we could store the content outside of Lucene, our merges and searches would be MUCH faster. If we could store the index in MEMORY this could be orders of magnitude faster. ** PROPOSAL ** Provide an external field storage mechanism which supports legacy indexes without modification. Content is stored in a content segment. The only changes would be a field with 3 (or 4, if checksumming is enabled) values: - CS_SEGMENT Logical ID of the content segment. This is an integer value. There is a global Lucene property named CS_ROOT which stores all the content. The segments are just flat files with pointers. Segments are broken into logical pieces by time and size. Usually 100M of content would be in one segment. - CS_OFFSET The byte offset of the field. - CS_LENGTH The length of the field.
- CS_CHECKSUM: Optional checksum to verify that the content is correct when fetched from the index.

The field value here would be exactly 'N:O:L' where N is the segment number, O is the offset, and L is the length. O and L are 64-bit values. N is a 32-bit value (though 64-bit wouldn't really hurt).

This mechanism allows for the external storage of any named field. CS_OFFSET and CS_LENGTH allow use with RandomAccessFile and the new NIO code for efficient content lookup. (Though filehandle caching should probably be used.) Since content is broken into logical 100M segments, the underlying filesystem can organize the file into contiguous blocks for efficient non-fragmented lookup. File manipulation is easy, and indexes can be merged by simply concatenating the second file to the end of the first. (Though the segment, offset, and length need to be updated.) (FIXME: I think I need to think about this more since I will have 100M per syncs)

Supporting full Unicode is important. Full java.lang.String storage is used with String.getBytes() so we should be able to avoid Unicode issues. If Java has a correct java.lang.String representation it's possible to easily add Unicode support just by serializing the byte representation. (Note that the JDK says that the DEFAULT system char encoding is used, so if this is ever changed it might break the index.)

While Linux and modern versions of Windows (not sure about OSX) support 64-bit filesystems, the 4G storage boundary of 32-bit filesystems (ext2 is an example) is an issue. Using smaller indexes can prevent this, but eventually segment lookup in the filesystem will be slow. This will only happen within terabyte storage systems, so hopefully the developer has migrated to another (modern) filesystem such as XFS.

** FEATURES **

- Must be able to replicate indexes easily to other hosts.
- Adding content to the index must be CHEAP.
- Deletes need to be cheap (these are cheap for older content.
  Just discard older indexes.)
- Filesystem needs to be able to optimize storage.
- Must support UNICODE and binary content (images, blobs, byte arrays, serialized objects, etc).
- Filesystem metadata operations should be fast. Since content is kept in LARGE indexes this is easy to avoid.
- Migration to the new system from legacy indexes should be fast and painless for future developers.
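A minimal sketch of the 'N:O:L' pointer scheme described above: encoding a field value and fetching the raw bytes back from a flat segment file under CS_ROOT via RandomAccessFile. The class and file names here are illustrative only, not part of any Lucene API.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ContentPointer {
    final int segment;   // CS_SEGMENT (32-bit)
    final long offset;   // CS_OFFSET (64-bit)
    final long length;   // CS_LENGTH (64-bit)

    ContentPointer(int segment, long offset, long length) {
        this.segment = segment; this.offset = offset; this.length = length;
    }

    // The stored field value is exactly "N:O:L".
    String encode() { return segment + ":" + offset + ":" + length; }

    static ContentPointer decode(String value) {
        String[] parts = value.split(":");
        return new ContentPointer(Integer.parseInt(parts[0]),
                Long.parseLong(parts[1]), Long.parseLong(parts[2]));
    }

    // Fetch the raw bytes from the segment's flat file under CS_ROOT.
    byte[] fetch(File csRoot) throws IOException {
        RandomAccessFile raf =
            new RandomAccessFile(new File(csRoot, "segment-" + segment), "r");
        try {
            raf.seek(offset);
            byte[] buf = new byte[(int) length];
            raf.readFully(buf);
            return buf;
        } finally {
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File root = new File(System.getProperty("java.io.tmpdir"));
        FileOutputStream out = new FileOutputStream(new File(root, "segment-0"));
        out.write("hello stored content".getBytes());
        out.close();
        ContentPointer p = ContentPointer.decode(new ContentPointer(0, 6, 6).encode());
        System.out.println(new String(p.fetch(root))); // prints: stored
    }
}
```

In a real implementation the filehandle would be cached rather than reopened per lookup, as the proposal notes.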
Lots Of Interest in Lucene Desktop
I've made a few passive mentions of my Lucene (http://jakarta.apache.org/lucene) Desktop prototype here on PeerFear in the last few days and I'm amazed how much feedback I've had. People really want to start work on an Open Source desktop search based on Lucene. http://www.peerfear.org/rss/permalink/2004/10/28/LotsOfInterestInLuceneDesktop/
Ability to apply document age with the score?
Let's say I have an index with two documents. They both have the same score, but one was added 6 months ago and the other was added 2 minutes ago. I want the score adjusted based on the age so that older documents have a lower score. I don't want to sort by document age (date) because if one document is older but has a HIGHER score it would be better to have it rise above newer documents that have a lower score. Is this possible? The only way I could think of doing it would be to have a DateFilter and then apply a dampening after the query. Kevin
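A minimal sketch of the post-query dampening idea: rescore each hit client-side with an exponential half-life decay. The dampen helper and the 30-day half-life are illustrative assumptions, not a Lucene API.

```java
public class RecencyScore {
    // score is the raw Lucene score; ageMillis is (now - document date).
    // Each halfLifeMillis of age halves the effective score.
    static double dampen(double score, long ageMillis, long halfLifeMillis) {
        return score * Math.pow(0.5, (double) ageMillis / halfLifeMillis);
    }

    public static void main(String[] args) {
        long day = 86400000L;
        // A 2-minute-old doc keeps essentially its full score...
        System.out.println(dampen(1.0, 2 * 60 * 1000L, 30 * day));
        // ...while a 6-month-old doc with the same raw score is heavily dampened.
        System.out.println(dampen(1.0, 180 * day, 30 * day));
    }
}
```

A sufficiently higher raw score still lets an old document outrank a new one, which is the behavior asked for, unlike a hard sort by date.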
Poor Lucene Ranking for Short Text
http://www.peerfear.org/rss/permalink/2004/10/26/PoorLuceneRankingForShortText/
Re: Poor Lucene Ranking for Short Text
Daniel Naber wrote: (Kevin complains about shorter documents ranked higher) This is something that can easily be fixed. Just use a Similarity implementation that extends DefaultSimilarity and that overwrites lengthNorm: just return 1.0f there. You need to use that Similarity for indexing and searching, i.e. it requires reindexing. What happens when I do this with an existing index? I don't want to have to rewrite this index as it will take FOREVER. If the current behavior is all that happens this is fine... this way I can just get this behavior for new documents that are added. Also... why isn't this the default? Kevin
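For illustration, a standalone sketch of what the override changes, assuming the Lucene 1.4 default norm of 1/sqrt(numTokens). In real code you would subclass DefaultSimilarity, override lengthNorm to return 1.0f, and set that Similarity on both the IndexWriter and the Searcher, as Daniel describes.

```java
public class FlatNorm {
    // DefaultSimilarity.lengthNorm in Lucene 1.4 computes 1/sqrt(numTokens),
    // so shorter fields get a larger norm and therefore rank higher.
    static float defaultLengthNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // The suggested override: ignore field length entirely.
    static float flatLengthNorm(int numTokens) {
        return 1.0f;
    }

    public static void main(String[] args) {
        // A 4-token field scores twice as high as a 16-token one by default...
        System.out.println(defaultLengthNorm(4));   // prints: 0.5
        System.out.println(defaultLengthNorm(16));  // prints: 0.25
        // ...while the flat norm treats all lengths equally.
        System.out.println(flatLengthNorm(4) == flatLengthNorm(16)); // prints: true
    }
}
```

Since norms are baked into the index at indexing time, existing documents keep their old norms until reindexed, which matches the "only new documents get the behavior" observation above.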
Documents with 1 word are given unfair lengthNorm()
WRT my blog post: It seems the problem is that the distribution for lengthNorm() starts at 1 and moves down from there. 1.0f would work, but HUGE documents would not be normalized and so would distort the results. What would you think of using this implementation for lengthNorm:

public float lengthNorm( String fieldName, int numTokens ) {
    int THRESHOLD = 50;
    int nt = numTokens;
    if ( numTokens <= THRESHOLD )
        ++nt;
    if ( numTokens > THRESHOLD )
        nt -= THRESHOLD;
    float v = (float)(1.0 / Math.sqrt( nt ));
    if ( numTokens <= THRESHOLD )
        v = 1 - v;
    return v;
}

This starts the distribution low... approaches 1.0 when 50 terms are in the document... then asymptotically moves to zero from here on out based on sqrt. For example, values from 1 - 150 would yield (I'd graph this out but I'm too lazy):

1 - 0.29289323  2 - 0.42264974  3 - 0.5  4 - 0.5527864  5 - 0.5917517  6 - 0.6220355  7 - 0.6464466  8 - 0.666
9 - 0.6837722  10 - 0.69848865  11 - 0.7113249  12 - 0.72264993  13 - 0.73273873  14 - 0.74180114  15 - 0.75  16 - 0.7574644
17 - 0.7642977  18 - 0.7705843  19 - 0.7763932  20 - 0.7817821  21 - 0.7867993  22 - 0.7914856  23 - 0.79587585  24 - 0.8
25 - 0.80388385  26 - 0.8075499  27 - 0.81101775  28 - 0.81430465  29 - 0.81742585  30 - 0.8203947  31 - 0.8232233  32 - 0.82592237
33 - 0.8285014  34 - 0.83096915  35 - 0.833  36 - 0.83560103  37 - 0.83777857  38 - 0.8398719  39 - 0.8418861  40 - 0.84382623
41 - 0.8456966  42 - 0.8475014  43 - 0.84924436  44 - 0.8509288  45 - 0.852558  46 - 0.85413504  47 - 0.85566247  48 - 0.85714287
49 - 0.8585786  50 - 0.859972  51 - 1.0  52 - 0.70710677  53 - 0.57735026  54 - 0.5  55 - 0.4472136  56 - 0.4082483
57 - 0.37796447  58 - 0.35355338  59 - 0.3334  60 - 0.31622776  61 - 0.30151135  62 - 0.28867513  63 - 0.2773501  64 - 0.26726124
65 - 0.2581989  66 - 0.25  67 - 0.24253562  68 - 0.23570226  69 - 0.22941573  70 - 0.2236068  71 - 0.2182179  72 - 0.21320072
73 - 0.2085144  74 - 0.20412415  75 - 0.2  76 - 0.19611613  77 - 0.19245009  78 - 0.18898223  79 - 0.18569534  80 - 0.18257418
81 - 0.1796053  82 - 0.17677669
83 - 0.17407766  84 - 0.17149858  85 - 0.16903085  86 - 0.1667  87 - 0.16439898  88 - 0.16222142  89 - 0.16012815  90 - 0.15811388
91 - 0.15617377  92 - 0.15430336  93 - 0.15249857  94 - 0.15075567  95 - 0.1490712  96 - 0.14744195  97 - 0.145865  98 - 0.14433756
99 - 0.14285715  100 - 0.14142136  101 - 0.14002801  102 - 0.13867505  103 - 0.13736056  104 - 0.13608277  105 - 0.13483997  106 - 0.13363062
107 - 0.13245323  108 - 0.13130644  109 - 0.13018891  110 - 0.12909944  111 - 0.12803689  112 - 0.12700012  113 - 0.12598816  114 - 0.125
115 - 0.12403473  116 - 0.12309149  117 - 0.12216944  118 - 0.12126781  119 - 0.120385855  120 - 0.11952286  121 - 0.11867817  122 - 0.11785113
123 - 0.11704115  124 - 0.11624764  125 - 0.11547005  126 - 0.114707865  127 - 0.11396058  128 - 0.1132277  129 - 0.11250879  130 - 0.1118034
131 - 0.11111111  132 - 0.11043153  133 - 0.10976426  134 - 0.10910895  135 - 0.10846523  136 - 0.107832775  137 - 0.107211255  138 - 0.10660036
139 - 0.10599979  140 - 0.10540926  141 - 0.104828484  142 - 0.1042572  143 - 0.10369517  144 - 0.10314213  145 - 0.10259783  146 - 0.10206208
147 - 0.10153462  148 - 0.101015255  149 - 0.10050378
Google Desktop Could be Better
http://www.peerfear.org/rss/permalink/2004/10/15/GoogleDesktopCouldBeBetter/
Re: OptimizeIt -- Re: force gc idiom - Re: OutOfMemory example
David Spencer wrote: Ji Kuhn wrote: This doesn't work either! You're right. I'm running under JDK 1.5 and trying larger values for -Xmx and it still fails. Running under (Borland's) OptimizeIt shows the number of Terms and TermInfos (both in org.apache.lucene.index) increase every time thru the loop, by several hundred instances each. Yes... I'm running into a similar situation on JDK 1.4.2 with Lucene 1.3... I used the JMP debugger and all my memory is taken by Terms and TermInfo... I can trace thru some Term instances on the reference graph of OptimizeIt but it's unclear to me what's right. One *guess* is that maybe the WeakHashMap in either SegmentReader or FieldCacheImpl is the problem. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster
Re: OutOfMemory example
Ji Kuhn wrote: Hi, I think I can reproduce the memory leaking problem while reopening an index. Lucene version tested is 1.4.1; version 1.4 final works OK. My JVM is: $ java -version java version 1.4.2_05 Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2_05-b04) Java HotSpot(TM) Client VM (build 1.4.2_05-b04, mixed mode) The code you can test is below; there are only 3 iterations for me if I use -Xmx5m, the 4th fails. At least this test seems tied to the Sort API... I removed the sort under Lucene 1.3 and it worked fine... Kevin
Re: IRC?!
Harald Tijink wrote: I hope your idea isn't to replace this Users List and pull the discussions into the IRC scene. I (and most of us) can not attend to any IRC chat because of work and other priorities. This list gives me the opportunity to keep informed (involved). Yup... I want to replace the mailing lists, wiki, website, CVS, and Bugzilla with IRC. And if you can't keep up that's just your fault ;) (joke). It's just another tool ;) Kevin
TermInfo using 300M for large index?
I'm trying to do some heap debugging of my application to find a memory leak. I noticed that org.apache.lucene.index.TermInfo had 1.7M instances which consumed 300M... this is of course for a 40G index. Is this normal, and is there any way I can streamline this? We are of course caching the IndexSearchers but I want to reduce the memory footprint... Kevin
IRC?!
There isn't a Lucene IRC room is there (at least there isn't according to Google)? I just joined #lucene on irc.freenode.net if anyone is interested... Kevin
Possible to remove duplicate documents in sort API?
I want to sort a result set but perform a group by as well... i.e. remove duplicate items. Is this possible with the new API? Seems like a huge drawback to Lucene right now. Kevin
Why doesn't Document use a HashSet instead of a LinkedList (DocumentFieldList)
It looks like Document.java uses its own implementation of a LinkedList. Why not use a HashMap to enable O(1) lookup... right now field lookup is O(N), which is certainly no fun. Was this benchmarked? Perhaps there's the assumption that since documents often have few fields, the object overhead and hashcode overhead would have been less this way. Kevin
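A sketch of the hash-based lookup being suggested, with field values simplified to strings. This is a hypothetical wrapper, not Lucene's Document implementation.

```java
import java.util.HashMap;
import java.util.Map;

public class FieldMap {
    // name -> value; HashMap gives O(1) average lookup instead of
    // walking a linked list of fields.
    private final Map fieldsByName = new HashMap();

    void add(String name, String value) {
        fieldsByName.put(name, value);
    }

    String get(String name) {
        return (String) fieldsByName.get(name);
    }

    public static void main(String[] args) {
        FieldMap doc = new FieldMap();
        doc.add("title", "Lucene");
        doc.add("body", "full-text search");
        System.out.println(doc.get("title")); // prints: Lucene
    }
}
```

One wrinkle: a Lucene Document may hold multiple fields with the same name, which a plain name-to-value map cannot represent directly; a map from name to a list of fields would be needed, which is likely part of the overhead trade-off mentioned above.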
Anyone avail for Lucene consulting or employment in the SF area?
Hope no one considers this spam ;) We're hiring: either someone full-time who has strong experience with Java, Lucene, and Jakarta technologies, or someone to act as a consultant working on Lucene for about a month optimizing our search infrastructure. This is for a startup located in downtown SF. Send me an email including your resume (HTML or text only) and I'll respond with full details. Kevin
Patch for IndexWriter.close which prevents NPE...
I just attached a patch which:

1. prevents multiple close() of an IndexWriter
2. prevents an NPE if the writeLock was null.

We have been noticing this from time to time and I haven't been able to come up with a hard test case. This is just a bit of defensive programming to prevent it from happening in the first place. It would happen from time to time without any reliable cause. Anyway... Thanks... Kevin

--- IndexWriter.java.bak.close  2004-09-03 11:27:37.0 -0700
+++ IndexWriter.java            2004-09-03 11:32:02.0 -0700
@@ -107,6 +107,11 @@
    */
   private boolean useCompoundFile = false;
 
+  /**
+   * True when we have closed this IndexWriter
+   */
+  protected boolean isClosed = false;
+
   /** Setting to turn on usage of a compound file. When on, multiple files
    *  for each segment are merged into a single file once the segment creation
    *  is finished. This is done regardless of what directory is in use.
@@ -183,15 +188,27 @@
         }.run();
       }
   }
-
+
   /** Flushes all changes to an index, closes all associated files, and
       closes the directory that the index is stored in. */
   public synchronized void close() throws IOException {
+
+    if ( isClosed ) {
+      return;
+    }
+
     flushRamSegments();
     ramDirectory.close();
-    writeLock.release();    // release write lock
+
+    if ( writeLock != null ) {
+      // release write lock
+      writeLock.release();
+    }
+
     writeLock = null;
     directory.close();
+    isClosed = true;
+
   }
 
   /** Release the write lock, if needed. */
Re: Benchmark of filesystem cache for index vs RAMDirectory...
Daniel Naber wrote: On Sunday 08 August 2004 03:40, Kevin A. Burton wrote: Would a HashMap implementation of RAMDirectory beat out a cached FSDirectory? It's easy to test, so it's worth a try. Please try if the attached patch makes any difference for you compared to the current implementation of RAMDirectory. True... I was just thinking out loud... was being lazy. Now I actually have to do the work to create the benchmark again... damn you ;) Kevin
Benchmark of filesystem cache for index vs RAMDirectory...
I'm not sure how Solaris or Windows perform, but the Linux block cache will essentially use all available memory to buffer the filesystem. If one is using an FSDirectory to perform searches, while the first search would be slow, remaining searches would be fast since Linux will now buffer the index in RAM. The RAMDirectory has the advantage of pre-loading everything and can keep it in memory if the box is performing other operations. In my benchmarks though RAMDirectory is slightly slower. I assume this is because it's Hashtable based, even though it only needs to be synchronized in a few places. i.e. searches should never be synchronized... Would a HashMap implementation of RAMDirectory beat out a cached FSDirectory? Kevin
Performance when computing computing a filter using hundreds of diff terms.
I'm trying to compute a filter to match documents in our index by a set of terms. For example some documents have a given field 'category', so I need to compute a filter with multiple categories. The problem is that our category list is 200 items, so it takes about 80 seconds to compute. We cache it of course, but this seems WAY too slow. Is there anything I could do to speed it up? Maybe run the queries myself and then combine the bitsets? We're using a BooleanQuery with nested TermQueries to build up the filter... Kevin
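A sketch of the "combine the bitsets myself" idea: instead of scoring a 200-clause BooleanQuery, walk the postings for each term and OR the document ids into a single BitSet. The toy in-memory postings map here stands in for iterating IndexReader.termDocs() per term; names are illustrative.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class CategoryFilter {
    // postings: category term -> sorted doc ids containing that term.
    // In Lucene this loop would use IndexReader.termDocs(term) instead.
    static BitSet union(Map postings, String[] terms, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int i = 0; i < terms.length; i++) {
            int[] docs = (int[]) postings.get(terms[i]);
            if (docs == null) continue;  // term absent from the index
            for (int j = 0; j < docs.length; j++) {
                bits.set(docs[j]);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        Map postings = new HashMap();
        postings.put("sports", new int[] {1, 4});
        postings.put("tech",   new int[] {2, 4});
        BitSet bits = union(postings, new String[] {"sports", "tech"}, 8);
        System.out.println(bits); // prints: {1, 2, 4}
    }
}
```

This skips all scoring work, which is why it tends to be much cheaper than building the same filter through a BooleanQuery of TermQueries.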
Split an existing index into smaller segments without a re-index?
Is it possible to take an existing index (say 1G) and break it up into a number of smaller indexes (say 10 100M indexes)? I don't think there's currently an API for this, but it's certainly possible (I think). Kevin
GROUP BY functionality.
In 1.4 we now have arbitrary sort support... Is it possible to use GROUP BY without having to do this on the client (which would be inefficient)? I have a field I want to make sure is unique in my search results. Kevin
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Doug Cutting wrote: Aviran wrote: I changed the Lucene 1.4 final source code and yes this is the source version I changed. Note that this patch won't produce a speedup on earlier releases, since there was another multi-thread bottleneck higher up the stack that was only recently removed, revealing this lower-level bottleneck. The other patch was: http://www.mail-archive.com/[EMAIL PROTECTED]/msg07873.html Both are required to see the speedup. Thanks... Also, is there any reason folks cannot use 1.4 final now? No... just that I'm trying to be conservative... I'm probably going to look at just migrating to 1.4 ASAP but we're close to a milestone... Kevin
Re: Field.java - STORED, NOT_STORED, etc...
Doug Cutting wrote: It would be best to get the compiler to check the order. If we change this, why not use type-safe enumerations: http://www.javapractices.com/Topic1.cjp The calls would look like:

new Field(name, value, Stored.YES, Indexed.NO, Tokenized.YES);

Stored could be implemented as the nested class:

public final class Stored {
  private Stored() {}
  public static final Stored YES = new Stored();
  public static final Stored NO = new Stored();
}

+1... I'm not in love with this pattern, but since Java 1.4 doesn't support enum it's better than nothing. I also didn't want to submit a recommendation that would break APIs. I assume the old API would be deprecated? Kevin
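Doug's nested-class pattern, fleshed out into a compilable sketch (Java 1.4-style, pre-enum). The describe method is a hypothetical stand-in for a constructor that takes the flag, just to show the compile-time check.

```java
public class FieldFlags {
    // Type-safe enumeration: the private constructor guarantees YES and NO
    // are the only instances, so the compiler rejects a misplaced boolean
    // and identity comparison (==) is safe.
    public static final class Stored {
        private Stored() {}
        public static final Stored YES = new Stored();
        public static final Stored NO  = new Stored();
    }

    // Hypothetical API taking the flag instead of a bare boolean.
    static String describe(Stored stored) {
        return stored == Stored.YES ? "stored" : "not stored";
    }

    public static void main(String[] args) {
        System.out.println(describe(Stored.YES)); // prints: stored
        // describe(true);  // would not compile: that's the point.
    }
}
```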
Re: Why is Field.java final?
Doug Cutting wrote: Kevin A. Burton wrote: I was going to create a new IDField class which just calls super( name, value, false, true, false ) but noticed I was prevented because Field.java is final? You don't need to subclass to do this, just a static method somewhere. Why is this? I can't see any harm in making it non-final... Field and Document are not designed to be extensible. They are persisted in such a way that added methods are not available when the field is restored. In other words, when a field is read, it always constructs an instance of Field, not a subclass. That's fine... I think that's acceptable behavior. I don't think anyone would assume that inner vars are restored or that the field is serialized. Not a big deal but it would be nice...
Re: Lucene Search has poor cpu utilization on a 4-CPU machine
Aviran wrote: Bug 30058 posted Which of course is here: http://issues.apache.org/bugzilla/show_bug.cgi?id=30058 Is this the source of the revision you modified? http://www.mail-archive.com/[EMAIL PROTECTED]/msg06116.html Also what version of Lucene? Kevin
Field.java - STORED, NOT_STORED, etc...
I've been working with the Field class doing index conversions between an old index format and my new external content store proposal (thus the email about the 14M convert). Anyway... I find the whole Field.Keyword, Field.Text thing confusing. The main problem is that the constructor to Field just takes booleans, and if you forget the ordering of the booleans it's very confusing:

new Field( name, value, true, false, true );

Looking at that you have NO idea what it's doing without fetching the javadoc. So I added a few constants to my class:

new Field( name, value, NOT_STORED, INDEXED, NOT_TOKENIZED );

which IMO is a lot easier to maintain. Why not add these constants to Field.java:

public static final boolean STORED = true;
public static final boolean NOT_STORED = false;
public static final boolean INDEXED = true;
public static final boolean NOT_INDEXED = false;
public static final boolean TOKENIZED = true;
public static final boolean NOT_TOKENIZED = false;

Of course you still have to remember the order, but this becomes a lot easier to maintain. Kevin
Why is Field.java final?
I was going to create a new IDField class which just calls super( name, value, false, true, false) but noticed I was prevented because Field.java is final? Why is this? I can't see any harm in making it non-final... Kevin
Increasing Linux kernel open file limits.
Don't know if anyone knew this: http://www.hp-eloquence.com/sdb/html/linux_limits.html The kernel allocates filehandles dynamically up to a limit specified by file-max. The value in file-max denotes the maximum number of file-handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. The three values in file-nr denote the number of allocated file handles, the number of used file handles and the maximum number of file handles. When the allocated filehandles come close to the maximum, but the number of actually used ones is far behind, you've encountered a peak in your filehandle usage and you don't need to increase the maximum. So while as root you can allocate as many file handles without any limits enforced by glibc, you still have to fight against the kernel. Just doing a

echo 100 > /proc/sys/fs/file-max

works fine. Then I can keep track of my file limit by doing a

cat /proc/sys/fs/file-nr

At least this works on 2.6.x... Think this is going to save me a lot of headache! Kevin
Re: Way to repair an index broking during 1/2 optimize?
Peter M Cipollone wrote: You might try merging the existing index into a new index located on a ram disk. Once it is done, you can move the directory from ram disk back to your hard disk. I think this will work as long as the old index did not finish merging. You might do a strings command on the segments file to make sure the new (merged) segment is not in there, and if there's a deletable file, make sure there are no segments from the old index listed therein. Its a HUGE index. It won't fit in memory ;) Right now its at 8G... Thanks though! :) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: Also... what can I do to speed up this optimize? Ideally it wouldn't take 6 hours. Was this the index with the mergeFactor of 5000? If so, that's why it's so slow: you've delayed all of the work until the end. Indexing on a ramfs will make things faster in general, however, if you have enough RAM... No... I changed the mergeFactor back to 10 as you suggested. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: So is it possible to fix this index now? Can I just delete the most recent segment that was created? I can find this by ls -alt Sorry, I forgot to answer your question: this should work fine. I don't think you should even have to delete that segment. I'm worried about duplicate or missing content from the original index. I'd rather rebuild the index and waste another 6 hours (I've probably blown 100 hours of CPU time on this already) and have a correct index :) During an optimize I assume Lucene starts writing to a new segment and leaves all others in place until everything is done and THEN deletes them? Also, to elaborate on my previous comment, a mergeFactor of 5000 not only delays the work until the end, but it also makes the disk workload more seek-dominated, which is not optimal. The only settings I uses are: targetIndex.mergeFactor=10; targetIndex.minMergeDocs=1000; the resulting index has 230k files in it :-/ I assume this is contributing to all the disk seeks. So I suspect a smaller merge factor, together with a larger minMergeDocs, will be much faster overall, including the final optimize(). Please tell us how it goes. This is what I did for this last round but then I ended up with the highly fragmented index. hm... Thanks for all the help btw! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Understanding TooManyClauses-Exception and Query-RAM-size
[EMAIL PROTECTED] wrote: Hi, a couple of weeks ago we migrated from Lucene 1.2 to 1.4rc3. Everything went smoothly, but we are experiencing some problems with that new constant limit maxClauseCount=1024 which leeds to Exceptions of type org.apache.lucene.search.BooleanQuery$TooManyClauses when certain RangeQueries are executed (in fact, we get this Excpetion when we execute certain Wildcard queries, too). Although we are working with a fairly small index with about 35.000 documents, we encounter this Exception when we search for the property modificationDate. For example modificationDate:[00 TO 0dwc970kw] We talked about this the other day. http://wiki.apache.org/jakarta-lucene/IndexingDateFields Find out what type of precision you need and use that. If you only need days or hours or minutes then use that. Millis is just too small. We're only using days and have queries for just the last 7 days as max so this really works out well... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Otis Gospodnetic wrote: Hey Kevin, Not sure if you're aware of it, but you can specify the lock dir, so in your example, both JVMs could use the exact same lock dir, as long as you invoke the VMs with the same params. Most people won't do this or won't even understand WHY they need to do this :-/. You shouldn't be writing the same index with more than 1 IndexWriter though (not sure if this was just a bad example or a real scenario). Yes... I realize that you shouldn't use more than one IndexWriter. That was the point. The locks are to prevent this from happening. If one were to accidentally do this the locks would be in different directories and our IndexWriter would corrupt the index. This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Kevin A. Burton wrote: No... I changed the mergeFactor back to 10 as you suggested. Then I am confused about why it should take so long. Did you by chance set the IndexWriter.infoStream to something, so that it logs merges? If so, it would be interesting to see that output, especially the last entry. No I didn't actually... If I run it again I'll be sure to do this. -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene shouldn't use java.io.tmpdir
Doug Cutting wrote: Kevin A. Burton wrote: This is why I think it makes more sense to use our own java.io.tmpdir to be on the safe side. I think the bug is that Tomcat changes java.io.tmpdir. I thought that the point of the system property java.io.tmpdir was to have a portable name for /tmp on unix, c:\windows\tmp on Windows, etc. Tomcat breaks that. So must Lucene have its own way of finding the platform-specific temporary directory that everyone can write to? Perhaps, but it seems a shame, since Java already has a standard mechanism for this, which Tomcat abuses... I've seen this done in other places as well. I think Weblogic did/does it. I'm wondering what some of these big EJB containsers use which is why I brought this up. I'm not sure the problem is just with Tomcat. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Way to repair an index broking during 1/2 optimize?
Doug Cutting wrote: Something sounds very wrong for there to be that many files. The maximum number of files should be around: (7 + numIndexedFields) * (mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs)) With 14M documents, log_10(14M/1000) is 4, which gives, for you: (7 + numIndexedFields) * 36 = 230k 7*36 + numIndexedFields*36 = 230k numIndexedFields = (230k - 7*36) / 36 =~ 6k So you'd have to have around 6k unique field names to get 230k files. Or something else must be wrong. Are you running on win32, where file deletion can be difficult? With the typical handful of fields, one should never see more than hundreds of files. We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize that it would run out of file handles. This is very strange... I'm going to increase minMergeDocs to 1 and then run the full converstion on one box and then try to do an optimize (of the corrupt) another box. See which one finishes first. I assume the speed of optimize() can be increased the same way that indexing is increased... Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Lucene shouldn't use java.io.tmpdir
As per 1.3 (or was it 1.4) Lucene migrated to using java.iot.tmpdir to store the locks for the index. While under most situations this is save a lot of application servers change java.io.tmpdir at runtime. Tomcat is a good example. Within Tomcat this property is set to TOMCAT_HOME/temp.. Under this situation if I were to create two IndexWriters within two VMs and try to write to the same index the index would get corrupted if one Lucene instance was within Tomcat and the other was within a standard VM. I think we should consider either: 1. Using out own tmpdir property based on the given OS. 2. Go back to the old mechanism of storing the locks within the index basedir (if it's not readonly). Thoughts? -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Most efficient way to index 14M documents (out of memory/file handles)
I'm trying to burn an index of 14M documents. I have two problems. 1. I have to run optimize() every 50k documents or I run out of file handles. this takes TIME and of course is linear to the size of the index so it just gets slower by the time I complete. It starts to crawl at about 3M documents. 2. I eventually will run out of memory in this configuration. I KNOW this has been covered before but for the life of me I can't find it in the archives, the FAQ or the wiki. I'm using an IndexWriter with a mergeFactor of 5k and then optimizing every 50k documents. Does it make sense to just create a new IndexWriter for every 50k docs and then do one big optimize() at the end? Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Preventing duplicate document insertion during optimize
Let's say you have two indexes each with the same document literal. All the fields hash the same and the document is a binary duplicate of a different document in the second index. What happens when you do a merge to create a 3rd index from the first two? I assume you now have two documents that are identical in one index. Is there any way to prevent this? It would be nice to figure out if there's a way to flag a field as a primary key so that if it has already added it to just skip. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
'Lock obtain timed out' even though NO locks exist...
I've noticed this really strange problem on one of our boxes. It's happened twice already. We have indexes where when Lucnes starts it says 'Lock obtain timed out' ... however NO locks exist for the directory. There are no other processes present and no locks in the index dir or /tmp. Is there anyway to figure out what's going on here? Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) that lucene would give me a better error than Lock obtain timed out Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
[EMAIL PROTECTED] wrote: It is possible that a previous operation on the index left the lock open. Leaving the IndexWriter or Reader open without closing them ( in a finally block ) could cause this. Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this so maybe it is and it's just taking longer to open the lock in some situations. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
Kevin A. Burton wrote: Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this so maybe it is and it's just taking longer to open the lock in some situations. This is strange... after removing all the webapps (besides 1) Tomcat still refuses to allow Lucene to open this index with Lock obtain timed out. If I open it up from the console it works just fine. I'm only doing it with one index and a ulimit -n so it's not a files issue. Memory is 1G for Tomcat. If I figure this out will be sure to send a message to the list. This is a strange one Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
James Dunn wrote: Which version of lucene are you using? In 1.2, I believe the lock file was located in the index directory itself. In 1.3, it's in your system's tmp folder. Yes... 1.3 and I have a script that removes the locks from both dirs... This is only one process so it's just fine to remove them. Perhaps it's a permission problem on either one of those folders. Maybe your process doesn't have write access to the correct folder and is thus unable to create the lock file? I thought about that too... I have plenty of disk space so that's not an issue. Also did a chmod -R so that should work too. You can also pass lucene a system property to increase the lock timeout interval, like so: -Dorg.apache.lucene.commitLockTimeout=6 or -Dorg.apache.lucene.writeLockTimeout=6 I'll give that a try... good idea. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: 'Lock obtain timed out' even though NO locks exist...
Gus Kormeier wrote: Not sure if our installation is the same or not, but we are also using Tomcat. I had a similiar problem last week, it occurred after Tomcat went through a hard restart and some software errors had the website hammered. I found the lock file in /usr/local/tomcat/temp/ using locate. According to the README.txt this is a directory created for the JVM within Tomcat. So it is a system temp directory, just inside Tomcat. Man... you ROCK! I didn't even THINK of that... Hm... I wonder if we should include the name of the lock file in the Exception within Tomcat. That would probably have saved me a lot of time :) Either that or we can put this in the wiki Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Created LockObtainTimedOut wiki page
I just created a LockObtainTimedOut wiki entry... feel free to add. I just entered the Tomcat issue with java.io.tmpdir as well. http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut Peace! -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: Does a RAMDirectory ever need to merge segments... (performanceissue)
Gerard Sychay wrote: I've always wondered about this too. To put it another way, how does mergeFactor affect an IndexWriter backed by a RAMDirectory? Can I set mergeFactor to the highest possible value (given the machine's RAM) in order to avoid merging segments? Yes... actually I was thinking of increasing these vars on the RAMDirectory in the hope to avoid this CPU overhead.. Also I think the var you want is minMergeDocs not mergeFactor. the only problem is that the source to maybeMergeSegments says: private final void maybeMergeSegments() throws IOException { long targetMergeDocs = minMergeDocs; while (targetMergeDocs = maxMergeDocs) { So I guess to prevent this we would have to set minMergeDocs to maxMergeDocs+1 ... which makes not sense. Also by default maxMergeDocs is Integer.MAX_VALUE so that will have to be changed. Anyway... I'm still playing with this myself. It might be easier to just use an ArrayList of N documents if you know for sure how big your RAM dir will grow to. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Does a RAMDirectory ever need to merge segments... (performance issue)
I've been benchmarking our indexer to find out if I can squeeze any more performance out of it. I noticed one problem with RAMDirectory... I'm storing documents in memory and then writing them to disk every once in a while. ... IndexWriter.maybeMergeSegments is taking up 5% of total runtime. DocumentWriter.addDocument is taking up another 17% of total runtime. Notice that this doesn't == 100% becuase there are other tasks taking up CPU before and after Lucene is called. Anyway... I don't see why RAMDirectory is trying to merge segments. Is there anyway to prevent this? I could just store them in a big ArrayList until I'm ready to write them to a disk index but I'm not sure how efficient this will be. Anyone run into this before? -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: index update (was Re: Large InputStream.BUFFER_SIZE causes OutOfMemoryError.. FYI)
petite_abeille wrote: On Apr 13, 2004, at 02:45, Kevin A. Burton wrote: He mentioned that I might be able to squeeze 5-10% out of index merges this way. Talking of which... what strategy(ies) do people use to minimize downtime when updating an index? This should probably be a wiki page. Anyway... two thoughts I had on the subject a while back: You maintain two disk (not RAID ... you get reliability through software). Searches are load balanced between disks for performance reasons. If one fails you just stop using it. When you want to do an index merge you read from disk0 and write to disk1. Then you take disk0 out of search rotation and add disk1 and copy the contents of disk1 to disk two. Users shouldn't notice much of a performance issue during the merge because it will be VERY fast and it's just reads from disk0. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: verifying index integrity
Doug Cutting wrote: If you use this method, it is possible to corrupt things. In particular, if you unlock an index that another process is modifying, then modify it, then these two processes might step on one another. So this method should only be called when you are certain that no one else is modifying the index. We're handling this by using .pid files. We use a standard initializer and use your own lock files with process IDs. If you're on UNIX I can give you the source to the JNI getpid that I created. I've been meaning on Open Sourcing this anyway... putting it into commons probably. This way you can prevent multiple initialization if a java process is currently running that might be working with your index. Otherwise there's no real way to be sure the lock isn't stale (unless time is a factor but that slows things down) Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance of hit highlighting and finding term positions for
[EMAIL PROTECTED] wrote: 730 msecs is the correct number for 10 * 16k docs with StandardTokenizer! The 11ms per doc figure in my post was for highlighlighting using a \ lower-case-filter-only analyzer. 5ms of this figure was the cost of the \ lower-case-filter-only analyzer. 73 msecs is the cost of JUST StandardTokenizer (no highlighting) StandardAnalyzer uses StandardTokenizer so is probably used in a lot of apps. It \ tries to keep certain text eg email addresses as one term. I can live without it and \ I suspect most apps can too. I haven't looked into why its slow but I notice it does \ make use of Vectors. I think a lot of people's highlighter performance issues may \ extend from this. Looking at StandardTokenizer I can't see anything that would slow it down much... can we get the source to your lower case fitler?! Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
[patch] MultiSearcher should support getSearchables()
Seems to only make sense to allow a caller to find the searchables a MultiSearcher was created with: 'diff' -uN MultiSearcher.java.bak MultiSearcher.java --- MultiSearcher.java.bak 2004-03-30 14:57:41.660109642 -0800 +++ MultiSearcher.java 2004-03-30 14:57:46.530330183 -0800 @@ -208,4 +208,8 @@ return searchables[i].explain(query,doc-starts[i]); // dispatch to searcher } + public Searchable[] getSearchables() { +return searchables; + } + } -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Performance of hit highlighting and finding term positions for a specific document
I'm playing with this package: http://home.clara.net/markharwood/lucene/highlight.htm Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems that it's very inefficient since lucene already knows the frequency and position of given terms in the index. My question is whether it's hard to find a TermPosition for a given term in a given document rather than the whole index. IndexReader.termPositions( Term term ) is term specific not term and document specific. Also it seems that after all this time that Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions? -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: Performance of hit highlighting and finding term positions for a specific document
Erik Hatcher wrote: On Mar 30, 2004, at 7:56 PM, Kevin A. Burton wrote: Trying to do hit highlighting. This implementation uses another Analyzer to find the positions for the result terms. This seems that it's very inefficient since lucene already knows the frequency and position of given terms in the index. What if the original analyzer removed stopped words, stemmed, and injected synonyms? Just use the same analyzer :)... I agree it's not the best approach for this reason and the CPU reason. Also it seems that after all this time that Lucene should have efficient hit highlighting as a standard package. Is there any interest in seeing a contribution in the sandbox for this if it uses the index positions? Big +1, regardless of the implementation details. Hit hilighting is so commonly requested that having it available at least in the sandbox, or perhaps even in the core, makes a lot of sense. Well if we could make it efficient by using the frequency and positions of terms we're all set :)... I just need to figure out how to do this efficiently per document. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: [patch] MultiSearcher should support getSearchables()
Erik Hatcher wrote: On Mar 30, 2004, at 5:59 PM, Kevin A. Burton wrote: Seems to only make sense to allow a caller to find the searchables a MultiSearcher was created with: Could you elaborate on why it makes sense? What if the caller changed a Searchable in the array? Would anything bad happen? (I don't know, haven't looked at the code). Yes... something bad could happen... but that would be amazingly stupid ... we should probably recommend that it be readonly. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Is RangeQuery more efficient than DateFilter?
I have a 7G index. A query for a random term comes back fast (300ms) when I'm not using a DateFilter but when I add the DateFilter it takes 2.6 seconds. Way too long. I assume this is because the filter API does a post process so it has to read fields off disk. Is it possible to do to this with a RangeQuery. For example you could create a days since January 1, 1970 fields and do a range query from between 5 and 10... and then add the original field as well. I have to make some app changes so I figured I would ask here before moving forward. Kevin -- Please reply using PGP. http://peerfear.org/pubkey.asc NewsMonster - http://www.newsmonster.org/ Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965 AIM/YIM - sfburtonator, Web - http://peerfear.org/ GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster signature.asc Description: OpenPGP digital signature
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: One way to force larger read-aheads might be to pump up Lucene's input buffer size. As an experiment, try increasing InputStream.BUFFER_SIZE to 1024*1024 or larger. You'll want to do this just for the merge process and not for searching and indexing. That should help you spend more time doing transfers, with less wasted on seeks. If that helps, then perhaps we ought to make this settable via a system property or somesuch. Good suggestion... it seems about 10%-15% faster in a few strawman benchmarks I ran. Note that right now this var is final and not public... so that will probably need to change. Does it make sense to also increase OutputStream.BUFFER_SIZE? That would seem sensible, since an optimize is a large number of reads and writes. I'm obviously willing to throw memory at the problem.
Re: Tracking/Monitoring Search Terms in Lucene
Katie Lord wrote: I am trying to figure out how to track the search terms that visitors are using on our site on a monthly basis. Do you all have any suggestions? Don't use Lucene for this... just have your form record the search terms. Kevin
Re: Is RangeQuery more efficient than DateFilter?
Erik Hatcher wrote: One more point... caching is done by the IndexReader used for the search, so you will need to keep that instance (i.e. the IndexSearcher) around to benefit from the caching. Great... I looked at the source of CachingWrapperFilter and it makes sense. Thanks for the pointer. The results were pretty amazing. Here are the numbers before and after, in millis. Before caching the field, searching for Jakarta: 2238, 1910, 1899, 1901, 1904, 1906. After caching the field: 2253, 10, 6, 8, 6, 6. That's a HUGE difference :) I'm very happy :)
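The speedup comes from computing the filter's bit set once per reader and reusing it on later searches; the pattern is plain memoization. A sketch of the idea (this is an illustration of the caching pattern, not the actual CachingWrapperFilter source):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class CachingFilter {
    // One cached bit set per reader; the reader object is the cache key.
    private final Map<Object, BitSet> cache = new HashMap<>();
    private final Function<Object, BitSet> compute;

    public CachingFilter(Function<Object, BitSet> compute) {
        this.compute = compute;
    }

    // First call per reader pays the full cost; later calls are map lookups.
    public BitSet bits(Object reader) {
        return cache.computeIfAbsent(reader, compute);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        CachingFilter f = new CachingFilter(r -> {
            calls[0]++;                // simulate the expensive scan
            return new BitSet(8);
        });
        Object reader = new Object();
        f.bits(reader);
        f.bits(reader);
        System.out.println(calls[0]); // the expensive path ran only once
    }
}
```

This also explains Erik's caveat: the cache is keyed on the reader, so opening a new IndexSearcher throws the cached bits away.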
Re: Lucene optimization with one large index and numerous small indexes.
Doug Cutting wrote: How long is it taking to merge your 5GB index? Do you have any stats about disk utilization during the merge (seeks/second, bytes transferred/second)? Did you try buffer sizes even larger than 1MB? Are you writing to a different disk, as suggested? I'll do some more testing tonight and get back to you. Note that right now this var is final and not public... so that will probably need to change. Perhaps. I'm reticent to make it too easy to change this. People tend to randomly tweak every available knob and then report bugs, or, if it doesn't crash, start recommending that everyone else tweak the knob as they do. There are lots of tradeoffs with buffer size, cases that folks might not think of (like that a wildcard query creates a buffer for every term that matches), etc. Or you can do what I do and recompile ;) Does it make sense to also increase OutputStream.BUFFER_SIZE? That would seem sensible, since an optimize is a large number of reads and writes. It might help a little if you're merging to the same disk as you're reading from, but probably not a lot. If you're merging to a different disk then it shouldn't make much difference at all. Right now we are merging to the same disk... I'll perform some real benchmarks with this var too. Long term we're going to migrate to two SCSI disks per machine and then do parallel queries across them with optimized indexes. Also, with modern disk controllers and filesystems I'm not sure how much difference this should make. Both Reiser and XFS do a lot of internal buffering, as does our disk controller. I guess I'll find out... Kevin
Re: BooleanQuery$TooManyClauses
hui wrote: Hi, I have a range query for the date like [20011201 TO 20040201]. It works fine with Lucene 1.3 RC1. When I upgraded to 1.3 final, I sometimes get a BooleanQuery$TooManyClauses exception, no matter whether the index was created by 1.3 RC1 or 1.3 final. Checking the email archive, it seems related to maxClauseCount. Is increasing maxClauseCount the only way to avoid this issue in 1.3 final? The dev mailing list has some discussion of the future plans here. I've noticed the same problem. The strange thing is that it only happens on some queries. For example, the query blog results in this exception, but the query linux in my index works just fine. This is the stack trace if anyone's interested:

org.apache.lucene.search.BooleanQuery$TooManyClauses
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:109)
        at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:101)
        at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:99)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
        at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:240)
        at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:188)
        at org.apache.lucene.search.Query.weight(Query.java:120)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:128)
        at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:150)
        at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:93)
        at org.apache.lucene.search.Hits.init(Hits.java:80)
        at org.apache.lucene.search.Searcher.search(Searcher.java:71)

For the record, I'm also using a DateRange, but when I disabled it I still noticed the same behavior. Kevin
Lucene optimization with one large index and numerous small indexes.
We're using Lucene with one large target index which right now is 5G. Every night we take sub-indexes, which are about 500M each, and merge them into this main index. This merge (done via IndexWriter.addIndexes(Directory[])) is taking way too much time. Looking at the stats for the box, we're essentially blocked on reads: the disk is blocked on read IO and CPU is at 5%. If I'm right, I think this could be minimized by continually picking the two smallest indexes, merging them, then picking the next two smallest, merging them, and so on until we're down to one index. Does this sound about right?
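Merging the two smallest indexes first does minimize the total bytes read and written; it's the same greedy rule as Huffman tree construction, since every byte in an index is re-copied each time that index takes part in a merge. A sketch of the cost calculation (sizes in megabytes, a simplified model that ignores compression and deletes):

```java
import java.util.PriorityQueue;

public class MergePlan {
    // Total I/O cost of repeatedly merging the two smallest indexes
    // until one remains; each merge costs the size of its output.
    public static long smallestFirstCost(long[] sizes) {
        PriorityQueue<Long> pq = new PriorityQueue<>();
        for (long s : sizes) pq.add(s);
        long total = 0;
        while (pq.size() > 1) {
            long merged = pq.poll() + pq.poll();
            total += merged;   // bytes written (and roughly read) by this merge
            pq.add(merged);
        }
        return total;
    }

    public static void main(String[] args) {
        // Four 500M sub-indexes: smallest-first costs 1000+1000+2000 = 4000M,
        // versus 1000+1500+2000 = 4500M for merging them into one index
        // one at a time.
        System.out.println(smallestFirstCost(new long[]{500, 500, 500, 500}));
    }
}
```

The gap versus one-at-a-time accumulation grows much larger when a small index is merged directly into the 5G main index every night, since the 5G is re-copied each time.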
Re: too many files open error
Charlie Smith wrote: /opt/famhistdev/fhstage/jbin/.docSearcher/indexes/fhstage_update/_3ff.f6 (Too many open files) Just a suggestion... why not put a URL string in the "Too many open files" exception? Tons of people keep running into this problem and we keep wasting both our time and their time. We could just link to the FAQ entry. Kevin
Re: too many files open error
Chad Small wrote: Is this :) serious? Because we have a need/interest in the new field sorting capabilities (is there a URL to documentation for field sorting?) and the QueryParser handling of keywords with dashes (-) that would be in 1.4, I believe. It's so much easier to explain that we'll use a final release of Lucene instead of a dev build of Lucene.
Re: code works with 1.3-rc1 but not with 1.3-final??
Dan wrote: I have some code that creates a Lucene index. It has been working fine with lucene-1.3-rc1.jar, but I wanted to upgrade to lucene-1.3-final.jar. I did this and the indexer breaks. I get the following error when running the index with 1.3-final: Optimizing the index IOException: /home/danl001/index-Mar-22-14_31_30/_ni.f43 (Too many open files) Indexed 884 files in 8 directories Index creation took 242 seconds % No... it's you... ;) Read the FAQ and then raise your file-handle limit with something like ulimit -n 10000. Chances are you never noticed this before, but the problem was still present. If you're on a Linux box you would be amazed to find out that you're only about 200 file handles away from running out of your per-user file-handle quota. You might have to su to root to change this. RedHat is more strict because it uses the glibc resource-restriction mechanism (whose name slips my mind at the moment). Debian is configured better here by default. Also, a Google query would have solved this for you very quickly ;) Kevin
Lock timeout should show the index it failed on...
Just an RFE... if a lock times out we should probably include the name of the FSDirectory (or note that it's a RAMDirectory) in the exception... I'm lazy, so this is a reminder either for myself to do this or to wait until one of you guys takes care of it :) Kevin
Real time indexing and distribution to lucene on separate boxes (long)
I'm curious to find out what others are doing in this situation. I have two boxes... the indexer and the searcher. The indexer is taking documents, indexing them, and creating indexes in a RAMDirectory (for efficiency), then writing these indexes to disk as we begin to run out of memory. Usually these aren't very big... 15-100M or so. Obviously I'm dividing the indexing and searching onto dedicated boxes to improve efficiency. The real issue is that the searchers need to be live all the time as indexes are being added at runtime. So, if that wasn't clear: I actually have to push out fresh indexes WHILE users are searching them. Not a very easy thing to do. Here's my question: what are the optimum ways to distribute these index segments to the secondary searcher boxes? I don't want to use the MultiSearcher because it's slow once we have too many indexes (see my PS). Here's what I'm currently thinking:

1. Sync the indexes to the searcher as shards directly. This doesn't scale, as I would have to use the MultiSearcher, which is slow when it has too many indexes. (And ideally we would want an optimized index.)

2. Merge everything into one index on the indexer. Lock the searcher, then copy over the new index via rsync. The problem here is that the searcher would need to lock up while the sync is happening to prevent reads on the index. If I do this enough and the system is optimized, I think I would only have to block for 5 seconds or so, but that's STILL very long.

3. Have two directories on the searcher. The indexer would sync to a tmp directory and then at runtime swap them via a rename once the sync is over. The downside here is that this takes up 2x disk space on the searcher. The upside is that the box will only slow down while the rsync is happening.

4. Do a LIVE index merge on the production box. This might be an interesting approach. The major question I have is whether you can do an optimize/merge on an index that's currently being used. I *think* it might be possible but I'm not sure. This isn't as fast as performing the merge on the indexer beforehand, but it does have the benefits of both worlds.

If anyone has any other ideas I would be all ears... PS. Random question: is the performance of the MultiSearcher M log(N), where N is the number of documents in the index and M is the number of indexes? Is this about right?
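Option 3's swap can be done with two renames, which are near-atomic on a single POSIX filesystem; readers that already have the old files open keep their handles, so in-flight searches are undisturbed. A minimal sketch (directory names and the lone `segments` marker file are illustrative, not from the thread):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class IndexSwap {
    // Promote a freshly synced index directory into place.
    // Assumes all three paths live on the same filesystem so that
    // renameTo() is a cheap metadata operation rather than a copy.
    public static void swap(File live, File staged, File retired) throws IOException {
        if (live.exists() && !live.renameTo(retired))
            throw new IOException("could not retire " + live);
        if (!staged.renameTo(live))
            throw new IOException("could not promote " + staged);
    }

    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("swap").toFile();
        File live = new File(base, "index");
        File staged = new File(base, "index.tmp");
        File retired = new File(base, "index.old");
        staged.mkdirs();
        new File(staged, "segments").createNewFile();
        swap(live, staged, retired);
        System.out.println(new File(live, "segments").exists());
    }
}
```

The retired directory can be deleted once the last reader against it is closed, which is where the 2x disk cost comes from.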
int vs long and document ids on 64bit machines.
A discussion I had a while back had someone (Doug?) note that the decision to go with 32-bit ints for document IDs was that on 32-bit machines 64-bit writes weren't thread-safe. Does anyone know how JDK 1.4.2 works on Itanium and Opteron (AMD64)? How hard would it be to build a lucene64 that used 64-bit document handles (longs) for 64-bit processors? Is it just a recompile? Will the file format break and need updating? Also... what are the symptoms of a Lucene build using 64-bit ints on 32-bit processors? Right now we're personally stuck on 32-bit machines, but I would like to see us migrate to 64-bit boxes over the next 6 months... Anyway... thinking out loud. Kevin
Re: int vs long and document ids on 64bit machines.
Doug Cutting wrote: Someone, not me, perhaps provided that rationalization, which isn't a bad one. In fact, the situation was more that, in 1997, when I started Lucene, 2 billion documents seemed like a lot for a Java-based search engine which was designed to scale to perhaps millions of documents, but probably not to the world. Java was slow then, remember? Yes... agreed. Does anyone know how JDK 1.4.2 works on Itanium and Opteron (AMD64)? How hard would it be to build a lucene64 that used 64-bit document handles (longs) for 64-bit processors? Is it just a recompile? Will the file format break and need updating? I think the file format is 64-bit safe. But the code changes would be quite numerous. No doubt we should make this change someday. Do you anticipate more than 2 billion documents in your Lucene index sometime soon, e.g., this year? Also, with Java, it's not just a recompile, it's a lot of code changes. Well... the refactor should at LEAST be pretty easy... just start changing int to long and follow up until the code compiles. Not sure if it's that easy. Also... what are the symptoms of a Lucene build using 64-bit ints on 32-bit processors? Right now we're personally stuck on 32-bit machines but I would like to see us migrate to 64-bit boxes over the next 6 months... Java's int datatype is defined as 32-bit, so there are no 64-bit ints; there are longs. I doubt longs are much slower than ints to deal with on most JVMs today. However, a long[] is twice as big as an int[], and an array may only be indexed by an int. Currently Lucene uses a byte[] indexed by document number to store normalization factors. This would not work if document numbers were longs. Filters index bit vectors with document numbers, and that also would not work if document numbers were longs. Working around these will not only take some code, it may also impact performance a bit. I suspect that Java will soon evolve to better embrace 64-bit machines. Someday assignment of longs will be atomic. (This is hinted at in the language spec.) Someday arrays will probably be indexable by longs. I'd prefer to wait until these changes happen before changing Lucene's document numbers to longs. At some point I might take a look at the code and see how hard it would be... Thanks for your notes... I'll probably use these in the future. The main problem is that with indexes that have lots of SMALL documents you could see yourself running out of ints. Kevin
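Doug's array argument can be made concrete: the document-number ceiling is Integer.MAX_VALUE, and the per-document byte[] of norms is one byte per doc, which only works because a Java array takes an int index. A small worked illustration (the 100M-document figure is just an example):

```java
public class DocIdLimits {
    public static void main(String[] args) {
        // Document numbers are Java ints, so one index tops out here:
        System.out.println(Integer.MAX_VALUE);

        // The norms table is a byte[] indexed by doc number: for 100M
        // documents that is 100MB. With long doc ids the same structure
        // could not even be allocated past 2^31-1 entries, because Java
        // arrays are indexable only by int.
        long normsBytes = 100_000_000L * Byte.BYTES;
        System.out.println(normsBytes);
    }
}
```

The same int-index limit applies to the BitSet-style vectors that filters build per document.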
Re: Real time indexing and distribution to lucene on separate boxes (long)
Otis Gospodnetic wrote: I like option 3. I've done it before, and it worked well. I dealt with very small indices, though, and if your indices are several tens or hundreds of gigs, this may be hard for you. Option 4: search can be performed on an index that is being modified (update, delete, insert, optimize). You'd just have to make sure not to recreate a new IndexSearcher too frequently if your index is being modified often. Just change it every X index modifications or every X minutes, and you'll be fine. Right now I'm thinking about #4... Disk may be cheap, but a fast RAID 10 array with 100G twice isn't THAT cheap... That's the worst-case scenario of course, and most modern search clusters use cheap hardware anyway... Also... since the new indexes are SO small (~100M), the merges would probably be easier on the machine than doing a whole new write. Of course it's hard to make that argument with a 100G RAID array, but we're using rsync to minimize network IO, so the CPU computation and network reads would slow things down. The only way around this is to re-upload the whole 100G index, but even over gigabit ethernet that will take 15 minutes. This doesn't scale as we add more searchers. Thanks for the feedback! Now that I know that optimize is safe as long as I don't create a new reader... I'll be fine. I do have to think about how I'm going to handle search result navigation. Kevin
Re: update performance
Chris Kimm wrote: Unfortunately, I'm not able to batch the updates. The application needs to make some decisions based on what each document looks like before and after the update, so I have to do it one at a time. I guess this is not a common usage scenario for Lucene; otherwise, an update() might already be built in somewhere. Is there anything in the locking/sync framework which precludes saving the cost of closing the Directory object and deleting the temp lock file each time an update is made? Use a RAMDirectory... then when you're pretty sure you're done, call IndexWriter.addIndexes() on the disk index. Will that work for you? You can also do this every N documents, or minutes, or by memory usage, and have the commit work with a synchronized thread. Kevin
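The buffer-in-RAM-then-commit idea generalizes to a plain batching pattern: accumulate in memory and flush every N documents. A sketch of the shape (documents are plain strings here and the flush just counts commits; a real indexer would write a RAMDirectory to disk via IndexWriter.addIndexes at that point):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingIndexer {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private int flushes = 0;

    public BatchingIndexer(int batchSize) { this.batchSize = batchSize; }

    // Buffer the document; commit to the on-disk index once the batch fills.
    public synchronized void add(String doc) {
        buffer.add(doc);
        if (buffer.size() >= batchSize) flush();
    }

    public synchronized void flush() {
        if (buffer.isEmpty()) return;
        flushes++;        // stand-in for the real addIndexes() commit
        buffer.clear();
    }

    public static void main(String[] args) {
        BatchingIndexer ix = new BatchingIndexer(3);
        for (int i = 0; i < 7; i++) ix.add("doc" + i);
        ix.flush();       // commit the remainder
        System.out.println(ix.flushes);
    }
}
```

The synchronized methods are what lets the commit run from a background thread, as suggested above.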
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Scott Ganyo wrote: I don't buy it. HashSet is but one implementation of a Set. By choosing the HashSet implementation you are not only tying the class to a hash-based implementation, you are tying the interface to *that specific* hash-based implementation or its subclasses. In the end, either you buy the concept of the interface and its abstraction or you don't. I firmly believe in using interfaces as they were intended to be used. An interface isn't just the concept of a Java interface but ALSO the implied and required semantics. TreeSet etc. are too slow to be used with the StopFilter, thus we should prevent their use. We require HashSet/HashMap. Scott P.S. In fact, HashSet isn't always going to be the most efficient anyway. Just for one example: consider possible implementations if I have only 1 or 2 entries. HashSet is not always the most efficient... if you need to do runtime inserts and bulk removal, TreeSet/TreeMap might be more efficient. Also, if you need a sorted map then you're stuck with a tree. Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote: I will refactor again using Set with no copying this time (except for the String[] and Hashtable constructors). This was my original preference, but I got caught up in the arguments by Kevin and lost my ideals temporarily :) I expect to do this later tonight or tomorrow. How about this as a compromise... no copy in the constructor... use a Set, but in the documentation summarize this conversation and point out that the user should use a HashSet and NOT any other type of Set, or it will result in a copy. I think Doug's comment about a potentially faster impl in the future was a good point... Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote: Also... while you're at it... the private variable name is 'table', which this HashSet certainly is *not* ;) Well, that depends on your definition of 'table', I suppose :) I changed it to a type-agnostic stopWords. Did you know that internally HashSet uses a HashMap? I sure didn't! hashset.contains() maps to hashmap.containsKey(). It uses a key-value mapping to a generic PRESENT object... hm. It probably makes sense to just call this variable 'hashset' and then force the type to be HashSet, since it's necessary for this to be a HashSet to maintain any decent performance. You'll need to update your second constructor to require a HashSet too... it would be very bad to let callers use another Set impl... TreeSet and SortedSet would still be too slow... I refuse to expose HashSet... sorry! :) But I did wrap what is passed in, like above, in a HashSet in my latest commit. Hm... you're doing this EVEN if the caller passes a HashSet directly?! Why do you have a problem exposing a HashSet? It SHOULD be a hash-based implementation; doing anything else is just wrong and would seriously slow down Lucene indexing. Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this could be removed by forcing the caller to use a HashSet (which they should). :) Kevin
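The performance argument here is that stop-word lookup happens once per token during analysis, so it has to be a constant-time hash lookup rather than a tree walk. A minimal illustration of HashSet-based stop-word filtering (the token list is made up for the example):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopWords {
    public static void main(String[] args) {
        // HashSet.contains() is an O(1) hash lookup; internally it
        // delegates to HashMap.containsKey() with a shared dummy value.
        Set<String> stop = new HashSet<>(Arrays.asList("a", "an", "the"));
        String[] tokens = {"the", "quick", "brown", "fox"};
        StringBuilder kept = new StringBuilder();
        for (String t : tokens)
            if (!stop.contains(t)) kept.append(t).append(' ');
        System.out.println(kept.toString().trim());
    }
}
```

With a TreeSet the same contains() call would be O(log n) with string comparisons at every tree node, which is the slowdown the thread is arguing about.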
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote: Erik Hatcher wrote: Also... your HashSet constructor has to copy values from the original HashSet into the new HashSet... not very clean, and this could be removed by forcing the caller to use a HashSet (which they should). I've caved in and gone HashSet all the way. Did you not see my message suggesting a way to both not expose HashSet publicly and also not copy values? If not, I attached it. For the record, I didn't see it... but it echoes my points... Thanks! Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Otis Gospodnetic wrote: I really don't think this will make any noticeable difference, but why not. Could you send a diff -uN patch, please? I made the same changes locally about a year ago, but have since thrown away my local changes (for no good reason that I can recall). Just diff it locally... it's just a search-and-replace of Hashtable with HashMap... pretty trivial. Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Erik Hatcher wrote: I don't see any reason for this to be a Hashtable. It seems an acceptable alternative to not share analyzer/filter instances across threads - they don't really take up much space, so is there a reason to share them? Or I'm guessing you're sharing it implicitly through an IndexWriter, huh? I'll await further feedback before committing this change, but it seems reasonable to me. Yeah... I'm using a RAMDirectory and adding documents to it across multiple threads... some of them index at the same time. The patch is super small... the only difference is that it's using a HashMap, which isn't synchronized... it can't hurt anything... but feedback is a good thing :) Kevin
Re: DocumentWriter, StopFilter should use HashMap... (patch)
Doug Cutting wrote: Erik Hatcher wrote: Well, one issue you didn't consider is changing a public method signature. I will make this change, but leave the Hashtable signature method there. I suppose we could change the signature to use a Map instead, but I believe there are some issues with doing something like this if you do not recompile your own source code against a new Lucene JAR, so I will simply provide another signature too. This is also a problem for folks who are implementing analyzers which use StopFilter. For example:

public class MyAnalyzer extends Analyzer {
  private static Hashtable stopTable = StopFilter.makeStopTable(stopWords);
  public TokenStream tokenStream(String field, Reader reader) {
    ... new StopFilter(stopTable) ...
  }
}

This would no longer compile with the change Kevin proposes. To make things back-compatible we must: 1. Keep but deprecate the StopFilter(Hashtable) constructor; 2. Keep but deprecate StopFilter.makeStopTable(String[]); 3. Add a new constructor: StopFilter(HashMap); 4. Add a new method: StopFilter.makeStopMap(String[]). Does that make sense? Ah... ok... good point. If no one else does this, I'll take care of it... Kevin
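Doug's four-step recipe is the standard deprecate-and-delegate pattern: the old signature survives for binary and source compatibility but just forwards to the new one. A simplified sketch (the class and field here are stand-ins following the thread's names, not the real StopFilter source):

```java
import java.util.HashMap;
import java.util.Hashtable;

public class StopFilterCompat {
    private final HashMap<String, String> stopMap;

    /** @deprecated kept for back-compatibility; delegates to the HashMap form. */
    @Deprecated
    public StopFilterCompat(Hashtable<String, String> table) {
        this(new HashMap<>(table)); // old callers pay one copy, then use the new path
    }

    public StopFilterCompat(HashMap<String, String> map) {
        this.stopMap = map;
    }

    // New-style factory corresponding to the deprecated makeStopTable().
    public static HashMap<String, String> makeStopMap(String[] words) {
        HashMap<String, String> m = new HashMap<>();
        for (String w : words) m.put(w, w);
        return m;
    }

    public static void main(String[] args) {
        StopFilterCompat f = new StopFilterCompat(makeStopMap(new String[]{"the", "a"}));
        System.out.println(f.stopMap.containsKey("the"));
    }
}
```

Existing analyzers compiled against the Hashtable constructor keep working; only a deprecation warning nudges them toward the new signature.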
Re: DocumentWriter, StopFilter should use HashMap... (patch)
David Spencer wrote: Maybe I missed something, but I always thought the stop list should be a Set, not a Map (or Hashtable/Dictionary). After all, all you need to know is existence, and that's what a Set does. It stores the word as both the key and the value... I don't care either way... there was no HashSet back when this was written. I was just going to leave it as a HashMap so that in the future, if we ever wanted to change the value, we could... Either way.