Compass Framework
Hi all, I recently came across the Compass Framework, which is built on top of lucene. I am interested in it because it stores the lucene index in an RDBMS and provides transaction support for index updates (it also has several other features but this is the part I'm mostly interested in). I wanted to know if any people here have had any experience with compass and what they think about it. Is the database implementation of the index fast enough and does it introduce any additional issues/problems? Thanks in advance, Marios Msg sent via eXis webmail - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Compass Framework
Database implementation of the index is always bound to be slow compared to storing it on the filesystem. Probably the group which stores indexes into Berkeley DB should be able to give you a performance measure of what will happen when you store indexes in databases. Rgds Prabhu On 4/8/06, Marios Skounakis [EMAIL PROTECTED] wrote: [...]
Re: I just don't get wildcards at all.
Eric, Wildcard queries are tricky business. WildcardQuery by itself, without leveraging any analysis tricks, is what you've got, but you may want to consider injecting rotated tokens. For example, the word cat would be indexed as cat$, at$c, t$ca, and $cat (all in the same position, increment 0). That's half the equation. The other half is to adjust the queries so that if someone searches for c*t it becomes a WildcardQuery (or PrefixQuery in this case) for t$c*, making the search space much smaller. CSRQ definitely isn't what you want for wildcard queries. Another alternative, if it's reasonable to extract wildcarded clauses from a query expression, is to create a custom Filter that can enumerate terms as efficiently as possible (like WildcardTermEnum does) and lights up only the documents that contain matching terms - this would eliminate the TooManyClauses headache. There really isn't anything pre-built that does what you're after any better than the suggestions above, I don't think. Erik On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote: OK, I know I'm asking you to write my code for me (or at least point me to an example), but I'm at my wit's end, so please rescue me. This is a reprise of TooManyClauses. We have a large amount of text, and a requirement to do a wildcard query. Of course, it's way too big to use Wildcard or the other expanding queries. They frighten me anyway. Y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually using it is not making sense to me. I just don't get how, for instance, CSRQ helps me that much. Say I want to search for big*er. I can use a CSRQ to get all the docs that include this term, just by using biga and bigz as my min/max terms. But then I'm stuck. I could iterate through all the docs returned, but that seems inefficient. Not to mention that the HitCollector (?) class warns against this due to an order of magnitude decrease in response time.
What I *want* is a way, for each doc in the CSRQ, to get an answer to whether it's a match. Really, on the order of a callback with the value that worked for the CSRQ and the ability to return a yes/no or a ranking. Again, I can iterate over all the docs matched, but this seems expensive. Using filters doesn't really seem to do the trick for me either. If I understand them properly, they allow me to set up a bitset for all the documents that should be searched. All 1,000,000 of them? Or am I thinking about this completely backwards? I have LIA, but I'm also wondering if there's something in 1.9 that I haven't found yet. Now, given how easy the rest of Lucene is to use, I assume that I'm approaching this poorly, but I sure am stumped. All that said, I'm quite Java-naive, so please bear with me if this question demonstrates my ignorance painfully. Thanks, Erick
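For the archives, the rotation trick Erik describes can be sketched in plain Java, independent of Lucene (the `$` terminator and the rewrite of c*t into a prefix search on t$c are exactly as described above; class and method names here are made up for illustration, and the query-time half assumes a single `*` in the pattern):

```java
import java.util.ArrayList;
import java.util.List;

public class Rotations {
    // Index-time half: emit every rotation of word + "$" terminator,
    // e.g. "cat" -> cat$, at$c, t$ca, $cat (all at the same position).
    static List<String> rotations(String word) {
        String t = word + "$";
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < t.length(); i++) {
            out.add(t.substring(i) + t.substring(0, i));
        }
        return out;
    }

    // Query-time half: turn a single-wildcard pattern like "c*t" into the
    // prefix "t$c", so a PrefixQuery can enumerate far fewer terms.
    static String toPrefix(String pattern) {
        int star = pattern.indexOf('*');
        String before = pattern.substring(0, star);
        String after = pattern.substring(star + 1);
        return after + "$" + before;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat")); // [cat$, at$c, t$ca, $cat]
        System.out.println(toPrefix("c*t"));  // t$c
    }
}
```

In a real index the rotated tokens would be emitted by an analyzer at position increment 0, so they all share the position of the original word.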
Re: I just don't get wildcards at all.
Erik: Thanks, that helps a lot. I won't waste any more time chasing CSRQ, which is definitely a plus. I have to admit that I was hoping for a RTFM, page ### (Read The FIELD Manual <g>) response. Although since I completely missed WildcardTermEnum, maybe I *did* get the response I hoped for. I have to go buy frogs (it's a long story), so I won't be able to look at this until later. If I understand this right, I could build my own BooleanQuery in chunks of, say, 1,000 terms each by just adding words given me by the WildcardTermEnum, right? Or I could iterate through the list, recording the most similar terms and only search on those, etc., etc., etc. And I assume that TermDocs will get me lists of documents associated with any of the terms I come up with, which will also help... I'll run some tests later today to see what kind of performance I get. You mean I actually have to *think* about this? Argh. Thanks again, Erick P.S. A big thanks for this response, since I have a self-imposed deadline of Monday for solving this. We're trying to decide whether to use Lucene or a horrible old C interface to a commercial search engine. The frightening thing is that I have the skills to go ahead and use the old C interface, but really, really, really would like to use something a little (well, a lot) more friendly.
Re: Exception in WildCardQuery
On 8 Apr 2006, at 13:06, Erik Hatcher wrote: Feel free to log this as a bug report in our JIRA issue tracker. It seems like a reasonable change to make, such that a WildcardQuery without a wildcard character would behave like TermQuery. -1 Even though very few, it is a waste of clockticks. I believe that any lib should always try to force the developer to write optimized code. If you for some reason need to auto-detect wildcard/term queries, the developer should write a facade. Another error message could be good, though.
Re: Compass Framework
As far as I know, the Compass Framework does not store the index in a database. It just indexes data as it passes through Hibernate, iBATIS, or another layer. So if you use these layers in your code, you can use Compass. Chris Lu -- Full-Text Lucene Search on Any Databases http://www.dbsight.net Faster to setup than reading marketing materials! On 4/8/06, Raghavendra Prabhu [EMAIL PROTECTED] wrote: [...]
Re: Exception in WildCardQuery
I have to disagree. Nowhere in the javadoc is this condition noted. There is no way for a user to know that this is a restriction, and forcing developers to find this by having a program fail, even with a better error message, is...er...unfortunate. Even if this were in the javadoc, I'd still have to remember it. And my memory isn't what it used to be. Optimization where it really doesn't count is, in my experience, bad. Period. I'm making the assumption that setting up the query is a tiny fraction of the time spent in a search. I'm far more willing to lose those very few clockticks in code that accounts for a tiny, tiny fraction of my search time than I am to be surprised by behavior that I have no way of anticipating - and to spend developer/customer/company time chasing such a problem down. So the notion of a library forcing me to optimize where *the library writers* think I should raises a red flag right away. Of course it's a balancing act. I'd also not like the library to get so concerned with being idiot-proof that it gets noticeably slower. Given all the time and energy that I expect Lucene to save me, I'm content to let the Lucene folks make that determination. They are in a far better place to judge whether this would be worth it or not. So, I'll put in the bug report and be happy with whatever decision is made by the Lucene folks. As you can probably tell, I've spent far too much of my professional life looking at code that was efficient, complicated, and wrong in some subtle or not-so-subtle way, caused failures of one sort or another, and improved execution time by, say, .0001%. I don't accept the efficiency argument unless it can be shown to matter. The eXtreme Programming folks have it right: "Make it work, make it right, make it fast." I'd change it a bit to "make it fast if it matters." Those times it has mattered, my guesses as to where the time was being wasted have been wrong most of the time. OK, now you know where one of my buttons is.
I'll get off my soap-box now... Erick P.S. I'll be glad to exchange a few e-mails with you if you want to try to persuade me. We probably shouldn't turn this into a philosophical debate over optimization, since it *is* a Lucene forum...
Re: Exception in WildCardQuery
On 8 Apr 2006, at 19:04, Erick Erickson wrote: I have to disagree. Optimization where it really doesn't count is, in my experience, bad. Period. My intent was not to emphasize optimization. The waste of clockticks is just a side effect of what I consider bad design. WildcardQuery and TermQuery do different things. If I want to encapsulate the functions of both in one class, I write a factory:

class TermOrWildcardQuery {
    private static Query factory(Term t) {
        if (t.text().endsWith("*"))
            return new WildcardQuery(t);
        else
            return new TermQuery(t);
    }
}

I have never used such a factory, but my guess is that programmatically I would always know if I wanted to use a wildcard or not. So only when writing a primitive query parser for human-entered text could I see the use for such a thing. Perhaps it is then better to write a real parser/lexer using ANTLR or JavaCC?
Re: Exception in WildCardQuery
I don't mean literally that WildcardQuery would morph into a TermQuery, but rather behave like it by simply doing what it currently does, but without the string index exception that is currently thrown. It wouldn't take any additional clockticks, per se, I don't think - it'd just behave as most would expect. Erik On Apr 8, 2006, at 11:57 AM, karl wettin wrote: [...]
Re: I just don't get wildcards at all.
: If I understand this right, I could build my own BooleanQuery in chunks of, : say, 1,000 terms each by just adding words given me by the WildCardTermEnum, : right? If you took that approach, you would avoid getting a TooManyClauses exception, but you could far more easily avoid it by increasing the max allowed clause count. The key to the whole issue of query expansion is to understand (1) why some queries expand, (2) what happens when they expand, and (3) why BooleanQuery.maxClauseCount exists. Let's answer those slightly out of order... (2) Queries like PrefixQuery and WildcardQuery expand to a BooleanQuery containing TermQueries for each of the individual terms in the index that match the prefix or the wildcard pattern. Each of these TermQueries has its own TermWeight and TermScorer -- which means that the resulting score of a document that contains some terms which match the original Prefix/Wildcard pattern is determined by the TF and IDF of those terms (relative to the document). (1) Why this happens arguably has two answers: a) because that's just the way it was implemented originally; b) because it usually makes sense to work that way. (a) doesn't really merit much elaboration, but (b) might make more sense if you consider what happens when you do a search for the prefix ca* ... if document X contains the text "the cat was in the car", it makes sense that you want it to score higher than document Y, which just contains "the cat was on the roof". If the terms cat and car appear in almost all of your documents, but some document Z is the only document to contain the terms cap and can, then it might also make sense that Z should score high, since it not only matches the prefix but matches it with unique terms (you may disagree with this sentiment, but I'm just explaining the rationale). (3) So what's the deal with maxClauseCount?
If you have a big index with lots of terms, then a sufficiently general prefix/wildcard can be rewritten into a really honking big BooleanQuery, which can take up a lot of RAM (for all of those TermQueries and TermWeights and TermScorers) and can take a lot of time to execute. If you've got gobs and gobs of RAM, and don't care how long your queries take, then set the maxClauseCount to MAX_INT and forget about it. maxClauseCount is just there as a safety valve to protect you. Which brings us back to your question: : If I understand this right, I could build my own BooleanQuery in chunks of, : say, 1,000 terms each by just adding words given me by the WildCardTermEnum, : right? If you did that, then the resulting query would take up just as much RAM (if not more), and it would take just as long to execute (if not more) as if you called setMaxClauseCount(MAX_INT) and used a regular WildcardQuery. Erik suggested two independent ways of addressing your problem, which can actually be combined to make things even better -- the first is the character rotation idea, which has been discussed in more detail on the list in the past (try googling "lucene wildcard rotate"). The second was to build a *Filter* that uses WildcardTermEnum -- not a Query. This would benefit you in the same way RangeFilter benefits people who get TooManyClauses using RangeQuery ... because it's a filter, the scoring aspects of each document are taken out of the equation -- a complete set of TermQueries/TermScorers doesn't need to be built in memory; you can just iterate over the applicable Terms at query time. Take a look at RangeFilter and (Solr's) PrefixFilter for an example of what's involved in writing a Filter that uses term enumerators, and then re-think Erik's suggestion.
Once you have a WildcardFilter, wrapping it in a ConstantScoreQuery would give you a drop-in replacement for WildcardQuery that would sacrifice the TF/IDF scoring factors for speed and guaranteed execution on any pattern in any index, regardless of size. Personally, I think a generic WildcardFilter would make a great contribution to the Lucene core. http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/RangeFilter.java?view=markup http://svn.apache.org/viewcvs.cgi/incubator/solr/trunk/src/java/org/apache/solr/search/PrefixFilter.java?view=markup -Hoss
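For readers following along without the Lucene source handy, the filter idea above can be simulated in plain Java. This is NOT the Lucene Filter API - just a toy stand-in where a sorted map plays the role of the term dictionary - but it shows the core moves: seek to the constant prefix (as WildcardTermEnum does), enumerate only candidate terms, and set bits for matching documents, with no per-term scorers anywhere:

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

public class WildcardFilterSketch {
    // Toy "index": sorted term -> document ids. A real Filter would walk
    // IndexReader term enumerators instead of a TreeMap.
    static BitSet bits(TreeMap<String, int[]> index, String prefix,
                       String regex, int numDocs) {
        BitSet bits = new BitSet(numDocs);
        // Seek to the first term >= prefix, stop once terms no longer
        // start with it, and test each candidate against the pattern.
        for (Map.Entry<String, int[]> e : index.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            if (e.getKey().matches(regex)) {
                for (int doc : e.getValue()) bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> index = new TreeMap<String, int[]>();
        index.put("bigger", new int[]{0, 2});
        index.put("bigot",  new int[]{1});
        index.put("bitter", new int[]{3});
        // big*er -> constant prefix "big", full pattern big.*er
        System.out.println(bits(index, "big", "big.*er", 4)); // {0, 2}
    }
}
```

The memory story matches Hoss's point: the result is one bitset over the documents, regardless of how many terms matched the pattern.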
Re: Exception in WildCardQuery
OK, I get it now. Obviously I didn't read your response as you meant it and was focusing on the optimization part. I'd also agree that using the most specific tool for the task is better design. There's also an argument that I'd rather have a library fail than do something non-obvious. I'll leave it to Erik whether this would be an 'obvious' behavior <g>... Thanks for clarifying. Best, Erick
Re: I just don't get wildcards at all.
Chris: Thanks for that exposition, it helps me greatly. I didn't mention that I tried increasing the max allowed clause count and ran out of memory. And that I don't trust those kinds of tweaks anyway. They'll blow up sometime, somewhere, and I'll get a phone call because our product is offline and customers are screaming. Been there, done that, don't want to do it again <g>. I'm reluctant to do the wildcard rotation thing, because I assume it'll increase my index size, but that's just an uninformed assumption. I'll look in the places you indicated and re-think that. My index is already 3G, most all of it in the field I have to search via wildcards. And I wasn't really proposing my own chunked boolean query. In fact I hadn't thought much about what I was *really* going to do; I had to go buy frogs. Mostly, I was seeing if I understood what a WildcardTermEnum did. But given that it seems to have prompted you to write some of my code for me, or at least point me at a place where I can steal some, I'm glad I wrote a half-baked response. But right now I have to go deal with the pond and the fish. Which is entirely unrelated to the frogs... Thanks again for taking the time to explain this to me (and others out there). It's a great help. Erick
Lucene and top words query
I noticed when using the Luke tool that it provides a set of top words from an index. What is a programmatic way of doing this? -- Berlin Brown (ramaza3 on freenode) http://www.newspiritcompany.com also check out the alpha version of botverse: http://www.newspiritcompany.com:8086/universe_home
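In Lucene the usual approach is to walk IndexReader.terms() and rank each term by TermEnum.docFreq(). Here is a library-free sketch of just the ranking step, where a plain map stands in for the term enumeration (the map contents are made-up example data):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopWords {
    // Rank terms by document frequency and keep the top n. In Lucene you
    // would fill the map by walking IndexReader.terms() and calling
    // TermEnum.docFreq() on each term; the map here stands in for that.
    static List<String> top(Map<String, Integer> docFreqs, int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(docFreqs.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // highest frequency first
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < n && i < entries.size(); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new LinkedHashMap<String, Integer>();
        freqs.put("cat", 40);
        freqs.put("car", 55);
        freqs.put("can", 12);
        System.out.println(top(freqs, 2)); // [car, cat]
    }
}
```

For a huge term dictionary you would keep a bounded priority queue instead of sorting everything, but the sort keeps the sketch short.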
Fwd: HOT SPOT VIRTUAL MACHINE aleatory crash while index documents
-- Forwarded message -- From: pepone pepone [EMAIL PROTECTED] Date: Apr 9, 2006 1:42 AM Subject: Re: HOT SPOT VIRTUAL MACHINE aleatory crash while index documents To: Daniel Naber [EMAIL PROTECTED] I changed to the Sun JVM but the crash persists. The crash is quite random and occurs after indexing and searching thousands of objects. The last crash was while searching, using this code, on the call to is.search():

synchronized public ResultSet search(String q, int page, Current current) {
    PerFieldAnalyzerWrapper analyzerWrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzerWrapper.addAnalyzer("identity", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("type", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("name", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("path", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("parent-id", new KeywordAnalyzer());
    System.out.println("Query: " + q);
    int resultsPerPage = 15;
    ResultSet resultSet = new ResultSet();
    resultSet.query = q;
    resultSet.page = page;
    resultSet.results = new ArrayList();
    System.out.println("resultset build OK");
    Directory fsDir = null;
    IndexSearcher is = null;
    try {
        fsDir = FSDirectory.getDirectory(indexDir, false);
        System.out.println("FSDirectory build OK");
        is = new IndexSearcher(fsDir);
        System.out.println("IndexSearcher build OK");
        QueryParser parser = new QueryParser("contents", analyzerWrapper);
        System.out.println("Query build OK");
        Query query = parser.parse(q);
        System.out.println("Query parse OK");
        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();
        System.out.println("Found " + hits.length() + " document(s) in "
            + (end - start) + " milliseconds");
        resultSet.pages = hits.length() / resultsPerPage;
        if ((hits.length() % resultsPerPage) > 0) {
            resultSet.pages++;
        }
        resultSet.size = hits.length();
        int firstResult = (page * resultsPerPage);
        for (int i = firstResult;
             (i < hits.length()) && (i < firstResult + resultsPerPage); i++) {
            ObjectMetadata metadata = new ObjectMetadataI();
            Document doc = hits.doc(i);
            metadata.objectId = doc.get("identity");
            resultSet.results.add(metadata);
        }
        is.close();
        fsDir.close();
    } catch (Exception e) {
        try {
            e.printStackTrace();
            if (is != null) is.close();
            if (fsDir != null) fsDir.close();
        } catch (java.io.IOException ex) {
            ex.printStackTrace();
        }
    }
    return resultSet;
}

#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
# SIGSEGV (0xb) at pc=0xb7c78c59, pid=12714, tid=2695793584
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_10-b03 compiled mode)
# Problematic frame:
# V [libjvm.so+0x285c59]
#
--- T H R E A D ---
Current thread (0x08090c88): VMThread [id=12714]
siginfo: si_signo=11, si_errno=0, si_code=1, si_addr=0x887d647b
Registers:
EAX=0x887d641b, EBX=0xb7e1bef0, ECX=0x0001, EDX=0xad6664f8
ESP=0xa0ae7fc0, EBP=0xa0ae7fd8, ESI=0xa529ba28, EDI=0xa1badb80
EIP=0xb7c78c59, CR2=0x887d647b, EFLAGS=0x00010246
Top of Stack: (sp=0xa0ae7fc0)
0xa0ae7fc0: ad6664f8 a529ba28 b7e1bef0 0001
0xa0ae7fd0: a0ae8078 a5f56e00 a0ae7fe8 b7cceb60
0xa0ae7fe0: 0806b450 b7e1bef0 a0ae7ffc b7b7a581
0xa0ae7ff0: a0ae8014 0806b450 b7e1bef0 a0ae801c
0xa0ae8000: b7b7a9c2 0806b330 a0ae8014 0001
0xa0ae8010: b7e1bef0 a0ae8034 b7e19908 a0ae802c
0xa0ae8020: b7cce4a2 0806b330 b7e1bef0 a0ae8050
0xa0ae8030: b7b726bf a0ae8078 0806b330 b7e1bef0
Instructions: (pc=0xb7c78c59)
0xb7c78c49: 83 f8 03 75 18 8b 46 04 8d 50 08 8b 40 08 56 52
0xb7c78c59: 8b 40 60 ff d0 83 c4 08 8d 34 86 eb 07 8b 06 89
Stack: [0xa0a76000,0xa0ae9000), sp=0xa0ae7fc0, free space=455k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x285c59]
V [libjvm.so+0x2dbb60]
V [libjvm.so+0x187581]
V [libjvm.so+0x1879c2]
V [libjvm.so+0x2db4a2]
V [libjvm.so+0x17f6bf]
V [libjvm.so+0x181008]
V [libjvm.so+0x180b5d]
V [libjvm.so+0x187908]
V [libjvm.so+0x2a4ab4]
V [libjvm.so+0x17e5bd]
V [libjvm.so+0x1548dd]
V [libjvm.so+0x1805e3]
V [libjvm.so+0x2cb695]
V [libjvm.so+0x2cb5cd]
V [libjvm.so+0x2ca867]
V [libjvm.so+0x2cab01]
V [libjvm.so+0x2ca72a]
V [libjvm.so+0x260113]
C [libpthread.so.0+0x5aba]
Exact date search doesn't work with 1.9.1?
Hi all, I have a document with a date in it and I put it into a field like so: DateTools.dateToString(theDate, Resolution.DAY), Field.Index.UN_TOKENIZED. What I find is that a range query works: [20060131 TO 20060601] and wildcard works e.g. 2006* but exact matches do not work e.g. 20060130 Any ideas on how I am misusing the API? This is 1.9.1. tia, -arturo - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
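For reference, DateTools.dateToString with Resolution.DAY produces a plain yyyyMMdd string, rendered in GMT. The sketch below reproduces that rendering without Lucene; if the token actually stored for your date differs from the literal you query for (for example because the GMT day differs from your local day, or because the query term is run through an analyzer that alters it), range and prefix queries can still hit while the exact term misses - that would be one thing to check:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

public class DayResolution {
    // Reproduces the DAY-resolution rendering of DateTools.dateToString:
    // a yyyyMMdd string in GMT.
    static String toDayString(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMdd");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("GMT"));
        cal.clear();
        cal.set(2006, Calendar.JANUARY, 30);
        System.out.println(toDayString(cal.getTime())); // 20060130
    }
}
```

Dumping the field's terms (e.g. in Luke) and comparing them byte-for-byte against the failing query term would confirm or rule this out.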