Compass Framework

2006-04-08 Thread Marios Skounakis



Hi all,

I recently came across the Compass Framework, which is built on top of Lucene. I am interested in it because it stores the Lucene index in an RDBMS and provides transaction support for index updates (it also has several other features, but this is the part I'm mostly interested in).

I wanted to know whether anyone here has had any experience with Compass and what they think about it. Is the database implementation of the index fast enough, and does it introduce any additional issues/problems?

Thanks in advance,
Marios
 Msg sent via eXis webmail

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Compass Framework

2006-04-08 Thread Raghavendra Prabhu
A database implementation of the index is always bound to be slow compared to storing it on the filesystem.

The group that stores indexes in Berkeley DB should be able to give you a performance measure of what happens when you store indexes in a database.

Rgds
Prabhu




Re: I just don't get wildcards at all.

2006-04-08 Thread Erik Hatcher

Eric,

Wildcard queries are tricky business.  WildcardQuery by itself, without leveraging any analysis tricks, is what you've got, but you may want to consider injecting rotated tokens.  For example, the word "cat" would be indexed as "cat$", "at$c", "t$ca", and "$cat" (all in the same position, increment 0).  That's half the equation.  The other half is to adjust the queries so that if someone searches for "c*t" it becomes a WildcardQuery (or PrefixQuery in this case) for "t$c*", making the search space much smaller.
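To illustrate, the rotation scheme described above can be sketched in a few lines. The class and method names here are mine (this is not a Lucene API), and the query rewrite handles only a single '*' in the pattern:

```java
import java.util.ArrayList;
import java.util.List;

class RotatedTokens {
    // Generate all rotations of a term with an end marker, e.g.
    // "cat" -> ["cat$", "at$c", "t$ca", "$cat"].
    static List<String> rotations(String term) {
        String marked = term + "$";
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rewrite a single-wildcard pattern into a rotated prefix, e.g.
    // "c*t" -> "t$c", so the search can run as a prefix match.
    static String toPrefix(String pattern) {
        int star = pattern.indexOf('*');
        String before = pattern.substring(0, star);
        String after = pattern.substring(star + 1);
        return after + "$" + before;
    }
}
```

In a real setup the rotations would be injected at index time (same position, increment 0) by a custom TokenFilter, and the rewritten prefix would run as a PrefixQuery.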


CSRQ definitely isn't what you want for wildcard queries.  Another alternative is to create a custom Filter, if it's reasonable to extract wildcarded clauses from a query expression, that can enumerate terms as efficiently as possible (like WildcardTermEnum does) and lights up only the documents that contain matching terms -- this would eliminate the TooManyClauses headache.


There really isn't anything pre-built that does what you're after any  
better than the suggestions above, I don't think.


Erik


On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote:

OK, I know I'm asking you to write my code for me (or at least point me to an example), but I'm at my wits' end, so please rescue me...

This is a reprise of TooManyClauses. We have a large amount of text, and a requirement to do a wildcard query. Of course, it's way too big to use Wildcard or the other expanding queries. They frighten me anyway.


Y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually using it is not making sense to me.

I just don't get how, for instance, CSRQ helps me that much. Say I want to search for "big*er". I can use a CSRQ to get all the docs that include this term, just by using "biga" and "bigz" as my min/max terms. But then I'm stuck. I could iterate through all the docs returned, but that seems inefficient. Not to mention that the HitCollector (?) class warns against this due to an order of magnitude decrease in response time.

What I *want* is a way to answer, for each doc in the CSRQ, whether it's a match. Really, on the order of a callback with the value that worked for the CSRQ and the ability to return a yes/no or a ranking. Again, I can iterate all the docs matched, but this seems expensive.

Using filters doesn't really seem to do the trick for me either. If I understand them properly, they allow me to set up a bitset for all the documents that should be searched. All 1,000,000 of them? Or am I thinking about this completely backwards? I have LIA, but I'm also wondering if there's something in 1.9 that I haven't found yet.

Now, given how easy the rest of Lucene is to use, I assume that I'm
approaching this poorly, but I sure am stumped.

All that said, I'm quite Java-naive, so please bear with me if this question demonstrates my ignorance painfully.

Thanks
Erick






Re: I just don't get wildcards at all.

2006-04-08 Thread Erick Erickson
Erik:

Thanks, that helps a lot. I won't waste any more time chasing CSRQ, which is
definitely a plus.

I have to admit that I was hoping for a RTFM, page ### (Read The FIELD Manual <G>) response. Although since I completely missed WildcardTermEnum, maybe I *did* get the response I hoped for. I have to go buy frogs (it's a long story), so I won't be able to look at this till later.

If I understand this right, I could build my own BooleanQuery in chunks of, say, 1,000 terms each by just adding words given me by the WildcardTermEnum, right?

Or I could iterate through the list, recording the most similar terms and
only search on those, etc, etc, etc

And I assume that TermDocs will get me lists of documents associated with
any of the terms I come up with, which will also help...

I'll run some tests later today to see what kind of performance I get.

You mean I actually have to *think* about this? A.

Thanks again
Erick


P.S. A big thanks for the response, since I have a self-imposed deadline of
Monday for solving this. We're trying to decide whether to use Lucene or a
horrible old C interface to a commercial search engine. The frightening
thing is that I have the skills to go ahead and use the old C interface,
but really, really, really would like to use something a little (well, a
lot) more friendly.


Re: Exception in WildCardQuery

2006-04-08 Thread karl wettin


8 apr 2006 kl. 13.06 skrev Erik Hatcher:

Feel free to log this as a bug report in our JIRA issue tracker.   
It seems like a reasonable change to make, such that a  
WildcardQuery without a wildcard character would behave like  
TermQuery.


-1

Even if it's only a few, it is a waste of clock ticks. I believe that any lib should always try to force the developer to write optimized code. If you for some reason need to auto-detect wildcard/term query, the developer should write a facade.


Another error message could be good though.




Re: Compass Framework

2006-04-08 Thread Chris Lu
As far as I know, the Compass Framework does not store the index in a database. It just indexes data as it passes through Hibernate, iBatis, or another layer.

So if you use these layers in your code, you can use Compass.

Chris Lu
--
Full-Text Lucene Search on Any Databases
http://www.dbsight.net
Faster to setup than reading marketing materials!






Re: Exception in WildCardQuery

2006-04-08 Thread Erick Erickson
I have to disagree. Nowhere in the javadoc is this condition noted. There is
no way for a user to know that this is a restriction, and forcing developers
to find this by having a program fail, even with a better error message,
is...er...unfortunate. Even if this were in the javadoc, I still have to
remember it. And my memory isn't what it used to be

Optimization where it really doesn't count is, in my experience, bad. Period. I'm making the assumption that setting up the query is a tiny fraction of the time spent in a search. I'm far more willing to lose those very few clock ticks in code that accounts for a tiny, tiny fraction of my search time than to be surprised by behavior that I have no way of anticipating, and to spend developer/customer/company time chasing such a problem down. So the notion of a library forcing me to optimize where *the library writers* think I should raises a red flag right away.

Of course it's a balancing act. I'd also not like the library to get so
concerned with being idiot-proof that it gets noticeably slower. Given all
the time and energy that I expect Lucene to save me, I'm content to let the
Lucene folks make that determination. They are in a far better place to
judge whether this would be worth it or not.

So, I'll put in the bug report and be happy with whatever decision is made
by the Lucene folks.

As you can probably tell, I've spent far too much of my professional life looking at code that was efficient, complicated, and wrong in some subtle or not-so-subtle way and caused failures of one sort or another. And improved execution time by, say, .0001%. I don't accept the efficiency argument unless it can be shown to matter. The eXtreme Programming folks have it right: "Make it work, make it right, make it fast." I'd change it a bit to "make it fast if it matters." Those times it has mattered, my guesses as to where the time was being wasted have been wrong most of the time.

Ok, now you know where one of my buttons is. I'll get off my soap-box now...

Erick

P.S. I'll be glad to exchange a few e-mails with you if you want to try to persuade me. We probably shouldn't turn this into a philosophical debate over optimization, since it *is* a Lucene forum...


Re: Exception in WildCardQuery

2006-04-08 Thread karl wettin


8 apr 2006 kl. 19.04 skrev Erick Erickson:


I have to disagree.



Optimization where it really doesn't count is, in my experience, bad.
Period.


My intent was not to emphasize optimization. The waste of clock ticks is just a side effect of what I consider bad design. WildcardQuery and TermQuery do different things. If I want to encapsulate the functions of both in one class, I write a factory.


class TermOrWildcardQuery {
    static Query create(Term t) {
        if (t.text().endsWith("*")) return new WildcardQuery(t);
        return new TermQuery(t);
    }
}


I have never used such a factory, but my guess is that programmatically I would always know if I wanted to use a wildcard or not. So only when writing a primitive query parser for human-entered text could I see the use for such a thing. Perhaps it is then better to write a real parser/lexer using ANTLR or JavaCC?





Re: Exception in WildCardQuery

2006-04-08 Thread Erik Hatcher
I don't mean literally that WildcardQuery would morph into a  
TermQuery, but rather behave like it by simply doing what it  
currently does but without the string index exception that currently  
is thrown.  It wouldn't take any additional clockticks, per se, I  
don't think - it'd just behave as most would expect.


Erik








Re: I just don't get wildcards at all.

2006-04-08 Thread Chris Hostetter

: If I understand this right, I could build my own BooleanQuery in chunks of,
: say, 1,000 terms each by just adding words given me by the WildCardTermEnum,
: right?

if you took that approach, you would avoid getting a TooManyClauses exception, but you could far more easily avoid it by increasing the max allowed clause count.

The key to the whole issue of query expansion is to understand (1) why some queries expand, (2) what happens when they expand, and (3) why BooleanQuery.maxClauseCount exists.

let's answer those slightly out of order...

(2) Queries like PrefixQuery and WildcardQuery expand to a BooleanQuery containing TermQueries for each of the individual terms in the index that match the prefix or the wildcard pattern.  Each of these TermQueries has its own TermWeight and TermScorer -- which means that the resulting score of a document containing some terms which match the original prefix/wildcard pattern is determined by the TF and IDF of those terms (relative to the document).

(1) why this happens arguably has two answers:
  a) because that's just the way it was implemented originally
  b) because it usually makes sense to work that way.
(a) doesn't really merit much elaboration, but (b) might make more sense if you consider what happens when you do a search for the prefix "ca*" ... if document X contains the text "the cat was in the car", it makes sense that you want it to score higher than document Y, which just contains "the cat was on the roof".  If the terms "cat" and "car" appear in almost all of your documents, but some document Z is the only document to contain the terms "cap" and "can", then it might also make sense that Z should score high, since it not only matches the prefix but matches it with unique terms (you may disagree with this sentiment, but I'm just explaining the rationale).

(3) so what's the deal with maxClauseCount?  If you have a big index with lots of terms, then a sufficiently general prefix/wildcard can be rewritten into a really honking big BooleanQuery, which can take up a lot of RAM (for all of those TermQueries and TermWeights and TermScorers) and can take a lot of time to execute.  If you've got gobs and gobs of RAM, and don't care how long your queries take, then set the maxClauseCount to MAX_INT and forget about it.  maxClauseCount is just there as a safety valve to protect you.

Which brings us back to your question

: If I understand this right, I could build my own BooleanQuery in chunks of,
: say, 1,000 terms each by just adding words given me by the WildCardTermEnum,
: right?

if you did that, then the resulting query would take up just as much RAM (if not more), and it would take just as long to execute (if not longer) as if you called setMaxClauseCount(MAX_INT) and used a regular WildcardQuery.


Erik suggested two independent ways of addressing your problem, which can actually be combined to make things even better -- the first is the character rotation idea, which has been discussed in more detail on the list in the past (try googling "lucene wildcard rotate").

The second was to build a *Filter* that uses WildcardTermEnum -- not a Query.  This would benefit you in the same way RangeFilter benefits people who get TooManyClauses using RangeQuery ... because it's a filter, the scoring aspects of each document are taken out of the equation -- a complete set of TermQueries/TermScorers doesn't need to be built in memory; you can just iterate over the applicable Terms at query time.

Take a look at RangeFilter and (Solr's) PrefixFilter for an example of what's involved in writing a Filter that uses term enumerators, and then re-think Erik's suggestion.  Once you have a WildcardFilter, wrapping it in a ConstantScoreQuery would give you a drop-in replacement for WildcardQuery that would sacrifice the TF/IDF scoring factors for speed and guaranteed execution on any pattern in any index, regardless of size.

Personally, I think a generic WildcardFilter would make a great contribution to the Lucene core.

http://svn.apache.org/viewcvs.cgi/lucene/java/trunk/src/java/org/apache/lucene/search/RangeFilter.java?view=markup
http://svn.apache.org/viewcvs.cgi/incubator/solr/trunk/src/java/org/apache/solr/search/PrefixFilter.java?view=markup
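To make the filter idea concrete, here is its core loop sketched over a plain in-memory postings map; the map stands in for what WildcardTermEnum plus TermDocs would enumerate against a real index, and the class and method names are mine, not Lucene's:

```java
import java.util.BitSet;
import java.util.Map;

class WildcardFilterSketch {
    // Enumerate only the terms matching the pattern and set a bit for
    // each document containing one of them. No per-term TermQuery or
    // TermScorer is built, so a huge term set cannot trigger
    // TooManyClauses.
    static BitSet bits(Map<String, int[]> postings, String pattern, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        String regex = pattern.replace("*", ".*"); // handle '*' only
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            if (entry.getKey().matches(regex)) {
                for (int doc : entry.getValue()) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }
}
```

In a real Filter subclass (Lucene 1.9 API), the same loop would live in bits(IndexReader), seeking a TermDocs to each term the WildcardTermEnum yields.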



-Hoss





Re: Exception in WildCardQuery

2006-04-08 Thread Erick Erickson
OK, I get it now. Obviously I didn't read your response as you meant it and was focusing on the optimization part.

I'd also agree that using the most specific tool for the task is better design. There's also an argument that I'd rather have a library fail than do something non-obvious. I'll leave it to Erik whether this would be an 'obvious' behavior <G>...

Thanks for clarifying.

Best
Erick


Re: I just don't get wildcards at all.

2006-04-08 Thread Erick Erickson
Chris:

Thanks for that exposition, that helps me greatly. I didn't mention that I tried increasing the max allowed clause count and ran out of memory. And that I don't trust those kinds of tweaks anyway. They'll blow up sometime, somewhere, and I'll get a phone call because our product is offline and customers are screaming. Been there, done that, don't want to do it again <G>.

I'm reluctant to do the wildcard rotation thing, b/c I assume it'll increase my index size, but that's just an uninformed assumption. I'll look in the places you indicated and re-think that. My index is already 3 GB, most all of it in the field I have to search via wildcards...

And I wasn't really proposing my own chunked boolean query. In fact I hadn't
thought much about what I was *really* going to do, had to go buy frogs.
Mostly, I was seeing if I understood what a WildcardTermEnum did. But given
that it seems to have prompted you to write some of my code for me, or at
least point me at a place where I can steal some, I'm glad I wrote a
half-baked response.

But right now I have to go deal with the pond and the fish. Which is
entirely unrelated to the frogs..

Thanks again for taking the time to explain this to me (and others out
there). It's a great help.

Erick


Lucene and top words query

2006-04-08 Thread Berlin Brown
I noticed that the Luke tool provides a set of top words from an index.  What is a programmatic way of doing this?
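As far as I know there is no single built-in call for this; the usual approach (and, I believe, roughly what Luke does) is to walk the index's TermEnum, look at the docFreq() of each term, and keep the best candidates in a bounded priority queue. A sketch of just that selection logic, over a plain term-to-docFreq map so it stands alone (all names here are mine):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopTerms {
    // Keep the k terms with the highest document frequency using a
    // bounded min-heap: offer every term, evict the current minimum
    // whenever the heap grows past k.
    static List<String> topK(Map<String, Integer> docFreqs, int k) {
        Comparator<Map.Entry<String, Integer>> byFreq =
            Comparator.comparingInt(Map.Entry::getValue);
        PriorityQueue<Map.Entry<String, Integer>> heap = new PriorityQueue<>(byFreq);
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // evict the lowest-frequency entry
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            out.add(heap.poll().getKey());
        }
        Collections.reverse(out); // highest docFreq first
        return out;
    }
}
```

Against a real index you would feed this loop from IndexReader.terms() and docFreq(); I believe Lucene's contrib area also has a HighFreqTerms tool along these lines.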


--
Berlin Brown
(ramaza3 on freenode)
http://www.newspiritcompany.com
also checkout alpha version of botverse: 
http://www.newspiritcompany.com:8086/universe_home






Fwd: HOT SPOT VIRTUAL MACHINE aleatory crash while index documents

2006-04-08 Thread pepone pepone
-- Forwarded message --
From: pepone pepone [EMAIL PROTECTED]
Date: Apr 9, 2006 1:42 AM
Subject: Re: HOT SPOT VIRTUAL MACHINE aleatory crash while index documents
To: Daniel Naber [EMAIL PROTECTED]


I changed to the Sun JVM, but the crash persists. The crash is quite random and occurs after indexing and searching thousands of objects.

The last crash was while searching, using the code below, on the call to is.search():

synchronized public ResultSet search(String q, int page, Current current)
{
    PerFieldAnalyzerWrapper analyzerWrapper =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer());
    analyzerWrapper.addAnalyzer("identity", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("type", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("name", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("path", new KeywordAnalyzer());
    analyzerWrapper.addAnalyzer("parent-id", new KeywordAnalyzer());
    System.out.println("Query: " + q);
    int resultsPerPage = 15;
    ResultSet resultSet = new ResultSet();
    resultSet.query = q;
    resultSet.page = page;
    resultSet.results = new ArrayList();
    System.out.println("resultset build OK");
    Directory fsDir = null;
    IndexSearcher is = null;
    try
    {
        fsDir = FSDirectory.getDirectory(indexDir, false);
        System.out.println("FSDirectory build OK");
        is = new IndexSearcher(fsDir);
        System.out.println("IndexSearcher build OK");
        QueryParser parser = new QueryParser("contents", analyzerWrapper);
        System.out.println("Query build OK");
        Query query = parser.parse(q);
        System.out.println("Query parse OK");
        long start = new Date().getTime();
        Hits hits = is.search(query);
        long end = new Date().getTime();

        System.out.println("Found " + hits.length() + " document(s) in "
            + (end - start) + " milliseconds");

        resultSet.pages = hits.length() / resultsPerPage;
        if ((hits.length() % resultsPerPage) > 0)
        {
            resultSet.pages++;
        }
        resultSet.size = hits.length();
        int firstResult = (page * resultsPerPage);
        for (int i = firstResult;
             (i < hits.length()) && (i < firstResult + resultsPerPage);
             i++)
        {
            ObjectMetadata metadata = new ObjectMetadataI();
            Document doc = hits.doc(i);
            metadata.objectId = doc.get("identity");
            resultSet.results.add(metadata);
        }
        is.close();
        fsDir.close();
    }
    catch (Exception e)
    {
        try
        {
            e.printStackTrace();
            if (is != null)
                is.close();
            if (fsDir != null)
                fsDir.close();
        }
        catch (java.io.IOException ex)
        {
            ex.printStackTrace();
        }
    }
    return resultSet;
}




#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0xb7c78c59, pid=12714, tid=2695793584
#
# Java VM: Java HotSpot(TM) Client VM (1.4.2_10-b03 compiled mode)
# Problematic frame:
# V  [libjvm.so+0x285c59]

#

---  T H R E A D  ---

Current thread (0x08090c88):  VMThread [id=12714]

siginfo:si_signo=11, si_errno=0, si_code=1, si_addr=0x887d647b

Registers:
EAX=0x887d641b, EBX=0xb7e1bef0, ECX=0x0001, EDX=0xad6664f8
ESP=0xa0ae7fc0, EBP=0xa0ae7fd8, ESI=0xa529ba28, EDI=0xa1badb80
EIP=0xb7c78c59, CR2=0x887d647b, EFLAGS=0x00010246

Top of Stack: (sp=0xa0ae7fc0)
0xa0ae7fc0:   ad6664f8 a529ba28 b7e1bef0 0001
0xa0ae7fd0:   a0ae8078 a5f56e00 a0ae7fe8 b7cceb60
0xa0ae7fe0:   0806b450 b7e1bef0 a0ae7ffc b7b7a581
0xa0ae7ff0:   a0ae8014 0806b450 b7e1bef0 a0ae801c
0xa0ae8000:   b7b7a9c2 0806b330 a0ae8014 0001
0xa0ae8010:   b7e1bef0 a0ae8034 b7e19908 a0ae802c
0xa0ae8020:   b7cce4a2 0806b330 b7e1bef0 a0ae8050
0xa0ae8030:   b7b726bf a0ae8078 0806b330 b7e1bef0

Instructions: (pc=0xb7c78c59)
0xb7c78c49:   83 f8 03 75 18 8b 46 04 8d 50 08 8b 40 08 56 52
0xb7c78c59:   8b 40 60 ff d0 83 c4 08 8d 34 86 eb 07 8b 06 89

Stack: [0xa0a76000,0xa0ae9000),  sp=0xa0ae7fc0,  free space=455k

Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
code)
V  [libjvm.so+0x285c59]
V  [libjvm.so+0x2dbb60]
V  [libjvm.so+0x187581]
V  [libjvm.so+0x1879c2]
V  [libjvm.so+0x2db4a2]
V  [libjvm.so+0x17f6bf]
V  [libjvm.so+0x181008]
V  [libjvm.so+0x180b5d]
V  [libjvm.so+0x187908]
V  [libjvm.so+0x2a4ab4]
V  [libjvm.so+0x17e5bd]
V  [libjvm.so+0x1548dd]
V  [libjvm.so+0x1805e3]
V  [libjvm.so+0x2cb695 ]
V  [libjvm.so+0x2cb5cd]
V  [libjvm.so+0x2ca867]
V  [libjvm.so+0x2cab01]
V  [libjvm.so+0x2ca72a]
V  [libjvm.so+0x260113]

C  [libpthread.so.0+0x5aba]


Exact date search doesn't work with 1.9.1?

2006-04-08 Thread Perez
Hi all,

I have a document with a date in it, and I put it into a field like so:
DateTools.dateToString(theDate, Resolution.DAY), with Field.Index.UN_TOKENIZED.

What I find is that a range query works:
[20060131 TO 20060601], and a wildcard works, e.g.
2006*
but exact matches do not work, e.g.
20060130

Any ideas on how I am misusing the API?

This is 1.9.1.

tia,
-arturo

