What about the index writing efficiency of large index ?

2008-03-06 Thread Eric Th
Hi All,
Has anyone done a benchmark to verify the index writing efficiency of Lucene?
When the index size is larger than 10 GB, will writing be much slower than with
smaller indexes?

Actually, I did some work on this issue,
and I found that if I build small indexes first and then merge them all, the
time taken declines significantly.

Hi Yonik,
How does Solr deal with this issue? Does it just leave the problem to
Lucene?


Thanks,

Eric
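
For illustration, the build-small-then-merge approach Eric describes might look roughly like this sketch (Lucene 2.3-era API; the directory paths are made up):

    // Merge several independently built small indexes into one big index.
    Directory merged = FSDirectory.getDirectory("/index/merged");
    IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(), true);
    writer.addIndexes(new Directory[] {
        FSDirectory.getDirectory("/index/part1"),
        FSDirectory.getDirectory("/index/part2"),
    });
    writer.optimize();  // collapse the merged segments at the end
    writer.close();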


Re: storing position - keyword

2008-03-06 Thread John Byrne
To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term
per original term or phrase, and the algorithm determines the controlled
meaning from the context.


I'm not sure I fully understand this: am I right in thinking that you
will be searching using these controlled-vocabulary words, and that the
search must then find any of the ordinary words which map to the
controlled-vocabulary words, and highlight them?


Because if that's the case, I think it's relatively simple: You create a
separate index, which only maps the controlled vocabulary to the
ordinary words. That's your synonyms index. Then, you index your
target document as normal. When you search, you first look up your
search term against the synonyms index. So, following your example, if
you looked up "dog" in the synonyms index, you'd get back "chien", "canis"
and "cane". (Achieving this part is easy: you just keep adding
synonyms to the field at the same position.) Whether or not the
returned list also contains the original "dog" is up to you when you
create your synonyms index. (In a typical synonym ring, the original
word would have to be in there, because you don't know which word will
be used to search.)


Now all you have to do is combine those returned terms as Boolean OR
clauses in a single BooleanQuery, and search on the main index. You'll
find all documents containing any of those three words, and you can use the
highlighting code from the Lucene contrib projects to highlight them.
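
For what it's worth, a rough sketch of that lookup-then-OR flow (Lucene 2.3-era API; the index paths and the field names "concept", "word" and "text" are made up for illustration):

    // Look up the controlled term in the synonyms index, then OR the
    // ordinary words it maps to into one BooleanQuery on the main index.
    IndexSearcher synonyms = new IndexSearcher("/index/synonyms");
    IndexSearcher main = new IndexSearcher("/index/main");

    Hits lookup = synonyms.search(new TermQuery(new Term("concept", "dog")));
    BooleanQuery or = new BooleanQuery();
    for (int i = 0; i < lookup.length(); i++) {
        String[] words = lookup.doc(i).getValues("word"); // e.g. chien, canis, cane
        for (int j = 0; j < words.length; j++) {
            or.add(new TermQuery(new Term("text", words[j])), BooleanClause.Occur.SHOULD);
        }
    }
    Hits results = main.search(or);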


Does this help? Forgive me if I've misunderstood or underestimated the
problem!


Regards,
-John


1world1love wrote:

First off Karl, thanks for your reply and your time.



karl wettin-3 wrote:
  

One could also say you are classifying your data based on keywords in
the text?




I probably didn't explain myself very well or more specifically provide a
good example. In my case, there really isn't any relationship between the
mapped terms per document. That is to say that an individual term or phrase
in the document is mapped to a concrete concept in a controlled vocabulary.
The concept doesn't represent a class of anything and no relationship exists
between the concepts. They would never be grouped by any means. It is more a
matter of replacing some arbitrary word or phrase with an adjudicated
version.

The example I gave did in fact use classifications for the terms, but that
is not exactly the point that I was trying to convey. I suppose a better
example would be one where each term or phrase in the sentence mapped to an
equivalent in another language:

dog - canis
dog - cane
dog - chien

So that if you searched for "canis", then any document with "dog" would be
returned (unless the context inferred that "dog" meant something else). By the
same token, if the text was "here we go" or "let's go", then it may map to
"vamos" or "vamonos".

To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term
per original term or phrase and the algorithm determines the controlled
meaning from the context.


karl wettin-3 wrote:
  

You can always store values in a field, but the term and the stored
value are not coupled. Thus you would need to store the positions per
document in each field in a machine-readable format that you then parse:

doc.add(new Field(f, "keyword:12,32;54,32", Field.Store.YES, ...));

But that is a very expensive solution.




Indeed, though doesn't an analyzed field have some other information attached
to it?

Forgive me if this is a naive question. I am fairly new to Lucene.


karl wettin-3 wrote:
  
 


This is known as faceted classification.

http://en.wikipedia.org/wiki/Faceted_classification
http://www.nabble.com/forum/Search.jtp?query=facets&local=y&forum=44




Again, I am not overly familiar with these disciplines, but I always thought
of facets as an organizational strategy. As I said, my example betrayed me a
bit, as I am not that interested in organizing these documents, but rather in
providing a controlled vocabulary to search from, as opposed to any
random text.



karl wettin-3 wrote:
  

Are you aware of the highlighter contrib module?

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

The simplest solution is to add a new facet Term per classification in the
text and use the start and end positions in the text field, and have the
highlighter load the text and highlight this text field.




This is actually not a web based application and the highlighting would
really only be used for analyzing performance of the mapping algorithms. The
main issue is that we do need to be able to provide the location of the
original term for each mapped keyword.



karl wettin-3 wrote:
  

Matching a document in which the same terms occur multiple times will
produce a greater score than if they occur only once. This is probably
problematic for you.




It may not be that big of an 

Re: Boolean Query search performance

2008-03-06 Thread Eric Th
2008/3/6, Chris Hostetter [EMAIL PROTECTED]:


 : If I do a query.toString(), both queries give different results, which

 : is probably a clue (additional parens with the BooleanQuery)
 :
 : Query.toString the old way using queryParser:
 : +(id:1^2.0 id:2 ... ) +type:CORE
 :
 : Query.toString the new way using BooleanQuery:
 : +((id:1^2.0) (id:2) ... ) +type:CORE


 i didn't look too closely at the pseudo code you posted, but the
 additional parens normally indicate that you are actually creating an
 extra layer of BooleanQueries (ie: a BooleanQuery with only one clause for
 each term) ... but the rewrite method should optimize those away (even
 back in Lucene 2.2) ... if you look at query.rewrite(reader).toString()
 then the queries *really* should be the same; if they aren't, then that
 may be your culprit.


Look here:
parens will also be added if the query has a boost value other than 1.0.

public String toString(String field) {
    StringBuffer buffer = new StringBuffer();
    boolean needParens = (getBoost() != 1.0) ||
        (getMinimumNumberShouldMatch() > 0);
    if (needParens) {
      buffer.append("(");
    }

    for (int i = 0; i < clauses.size(); i++) {
      BooleanClause c = (BooleanClause) clauses.get(i);
      if (c.isProhibited())
        buffer.append("-");
      else if (c.isRequired())
        buffer.append("+");

      Query subQuery = c.getQuery();
      if (subQuery instanceof BooleanQuery) {  // wrap sub-bools in parens
        buffer.append("(");
        buffer.append(c.getQuery().toString(field));
        buffer.append(")");
      } else
        buffer.append(c.getQuery().toString(field));

      if (i != clauses.size() - 1)
        buffer.append(" ");
    }

    if (needParens) {
      buffer.append(")");
    }

    if (getMinimumNumberShouldMatch() > 0) {
      buffer.append('~');
      buffer.append(getMinimumNumberShouldMatch());
    }

    if (getBoost() != 1.0f) {
      buffer.append(ToStringUtils.boost(getBoost()));
    }

    return buffer.toString();
  }




-Hoss








combine wildcard and phrase query

2008-03-06 Thread JensBurkhardt

hey everybody,

I'm wondering if it's possible to combine wildcards and phrase query. 

For example: "term1 term*"

I know that the documentation says "Lucene supports single and multiple
character wildcard searches within single terms (not within phrase queries)",
but maybe someone has had the same problem and found a solution.

Thanks for your help

Jens Burkhardt
-- 
View this message in context: 
http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15870647.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Swapping between indexes

2008-03-06 Thread Sridhar Raman
This is my situation.  I have an index, which has a lot of search requests
coming into it.  I use just a single instance of IndexSearcher to process
these requests.  At the same time, this index is also getting updated by an
IndexWriter.  And I want these new changes to be reflected _only_ at certain
intervals.  I have thought of a few ways of doing this.  Each has its share
of problems and pluses.  I would be glad if someone can help me in figuring
out the right approach, especially from the performance point of view, as
the number of documents that will get indexed are pretty large.

Approach 1:
Have just one copy of the index for both Search & Index.  At time T, when I
need to see the new changes reflected, I close the Searcher, and open it
again.
- The re-open of the Searcher might be a bit slow (which I could probably
solve by using some warm-up threads).
- Update and Search on the index at the same time - will this affect the
performance?
- If server crashes before time T, the new Searcher would reflect the
changes, which is not acceptable.  I want the changes to be reflected only
at time T.  If server crashes, the index should be the previous T-1 index.
- Possible problems while optimising the index (as Search is also
happening).
+ Just one copy of the index being stored.

Approach 2:
Keep 2 copies of the index - 1 for Search, 1 for Index.  At time T, I just
switch the Searcher to a copy of index that is being updated.
- Before I do the switch to the new index, I need to make a copy of it so
that the updates continue to happen on the other index.  Is there a
convenient way to make this copy?  Is it efficient?
- Time taken to create a new Searcher will still be a problem (but this is a
problem in the previous approach as well, and we can live with it).
+ Optimise can happen on an index that is not being read, as a result, its
resource requirements would be lesser.  And probably even the speed of
optimisation.
+ Faster search as the index update is happening on a different index.

So, these are the 2 approaches I am contemplating.  Any pointers on which
would be the better approach?

Thanks,
Sridhar
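
(A bare-bones sketch of the swap in Approach 1, assuming a shared searcher reference and ignoring in-flight searches, which a real implementation would have to reference-count; all names here are hypothetical:)

    // Reopen at time T and swap the reference the request threads use.
    IndexSearcher fresh = new IndexSearcher("/index/main");
    IndexSearcher old;
    synchronized (lock) {
        old = searcher;      // the instance currently serving requests
        searcher = fresh;
    }
    old.close();             // safe only once no search still uses it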


Re: combine wildcard and phrase query

2008-03-06 Thread JensBurkhardt

Okay, another problem occurred. I have different fields with the same name. I
can't separate them by naming them field1, field2, etc., because while
indexing I don't know how many fields I will need.
A book has several signature numbers; I want to save them in a field
"signature", and when I search for such a number I want the search to hit
every single field and not all fields together.
Right now I separate the string using a unique separator (in this case just
"$$$") so I can split the string into the numbers, but I think this is about
the worst way of doing it.




JensBurkhardt wrote:
 
 hey everybody,
 
 I'm wondering if it's possible to combine wildcards and phrase query. 
 
 For example: "term1 term*"
 
 I know that the documentation says "Lucene supports single and multiple
 character wildcard searches within single terms (not within phrase
 queries)", but maybe someone has had the same problem and found a solution.
 
 Thanks for your help
 
 Jens Burkhardt
 

-- 
View this message in context: 
http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Swapping between indexes

2008-03-06 Thread Michael McCandless


A simple variant on Approach 1 would be to open your writer with  
autoCommit=false.


This way no reader will ever see the changes until you successfully  
close the writer.  If the machine crashes the index is still in the  
starting state as of when the writer was first opened.
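
A minimal sketch of that, assuming the Lucene 2.3 constructor that takes an explicit autoCommit flag:

    // No reader sees anything until close(); a crash before close() leaves
    // the index exactly as it was when the writer was opened.
    IndexWriter writer = new IndexWriter(dir, false /* autoCommit */, analyzer);
    // ... addDocument() / deleteDocuments() calls ...
    writer.close();  // the single commit point readers will see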


Also, the re-open of Approach 1 should be a bit (not a lot, though there
is work to make it a lot) faster than the wholly new open required in
Approach 2.


There should not be problems optimizing while searching.  Yes, you  
use more disk space, but no more (in fact, less) than approach 2  
requires.


I think approach 2 is only possibly better if the indexing would be  
done on a different computer / IO system.


Mike

Sridhar Raman wrote:

This is my situation.  I have an index, which has a lot of search  
requests
coming into it.  I use just a single instance of IndexSearcher to  
process
these requests.  At the same time, this index is also getting  
updated by an
IndexWriter.  And I want these new changes to be reflected _only_  
at certain
intervals.  I have thought of a few ways of doing this.  Each has  
its share
of problems and pluses.  I would be glad if someone can help me in  
figuring
out the right approach, especially from the performance point of  
view, as

the number of documents that will get indexed are pretty large.

Approach 1:
Have just one copy of the index for both Search & Index.  At time
T, when I
need to see the new changes reflected, I close the Searcher, and  
open it

again.
- The re-open of the Searcher might be a bit slow (which I could  
probably

solve by using some warm-up threads).
- Update and Search on the index at the same time - will this affect the
performance?
- If server crashes before time T, the new Searcher would reflect the
changes, which is not acceptable.  I want the changes to be  
reflected only
at time T.  If server crashes, the index should be the previous T-1  
index.

- Possible problems while optimising the index (as Search is also
happening).
+ Just one copy of the index being stored.

Approach 2:
Keep 2 copies of the index - 1 for Search, 1 for Index.  At time T,  
I just

switch the Searcher to a copy of index that is being updated.
- Before I do the switch to the new index, I need to make a copy of  
it so

that the updates continue to happen on the other index.  Is there a
convenient way to make this copy?  Is it efficient?
- Time taken to create a new Searcher will still be a problem (but  
this is a

problem in the previous approach as well, and we can live with it).
+ Optimise can happen on an index that is not being read, as a  
result, its

resource requirements would be lesser.  And probably even the speed of
optimisation.
+ Faster search as the index update is happening on a different index.

So, these are the 2 approaches I am contemplating.  Any
pointers on which

would be the better approach?

Thanks,
Sridhar






Re: Swapping between indexes

2008-03-06 Thread Michael McCandless


Sridhar Raman wrote:


This way no reader will ever see the changes until you successfully
close the writer.  If the machine crashes the index is still in the
starting state as of when the writer was first opened.
Ok, I have a slight doubt in this.  Say I have gone ahead with  
Approach 1
If I have opened the writer with autoCommit=false, and the system  
crashes,
does it mean that the changes made to IdxSrch are lost?  If that is  
the

case, that might be a problem.  What I actually want is something like
this.  When the system crashes in between, the search continues to  
happen on
the index at T0.  But the updates that were done since T0 also  
needs to be

preserved.  Would that happen if I set autoCommit to false?

I realise that I want to have my cake and eat it too.  But that's the
problem we

face if we keep just a single copy of the index.


Alas, you are right: all changes not committed are lost.  I.e., on coming
back up after the crash, you would have to re-index everything again.

Lucene is actually not that far from doing what you're asking for here.
I think the only thing missing is the ability to open a reader on a prior
commit, rather than the latest one.  If we added that then you could make
a custom deletion policy that'd keep your T0 commit, as well as commits
being done by your writer, and only remove them when you decide to switch
your readers to the current commit.

But, realize that even with such a change to Lucene, you would still
lose everything since the last commit, when the machine crashes.
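
A hedged sketch of such a deletion policy, assuming the Lucene 2.3 IndexDeletionPolicy/IndexCommitPoint API (the class name is made up; an instance would be passed to an IndexWriter constructor that accepts a deletion policy):

    import java.util.List;
    import org.apache.lucene.index.IndexCommitPoint;
    import org.apache.lucene.index.IndexDeletionPolicy;

    // Keep the oldest (T0) and newest commits; delete those in between.
    public class KeepT0AndLatestPolicy implements IndexDeletionPolicy {
        public void onInit(List commits) {
            onCommit(commits);
        }
        public void onCommit(List commits) {
            // commits are ordered oldest first
            for (int i = 1; i < commits.size() - 1; i++) {
                ((IndexCommitPoint) commits.get(i)).delete();
            }
        }
    }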

Mike




Re: Swapping between indexes

2008-03-06 Thread Sridhar Raman
 This way no reader will ever see the changes until you successfully
 close the writer.  If the machine crashes the index is still in the
 starting state as of when the writer was first opened.
Ok, I have a slight doubt in this.  Say I have gone ahead with Approach 1.
If I have opened the writer with autoCommit=false, and the system crashes,
does it mean that the changes made to IdxSrch are lost?  If that is the
case, that might be a problem.  What I actually want is something like
this.  When the system crashes in between, the search continues to happen on
the index at T0.  But the updates that were done since T0 also needs to be
preserved.  Would that happen if I set autoCommit to false?

I realise that I want to have my cake and eat it too.  But that's the problem we
face if we keep just a single copy of the index.

On Thu, Mar 6, 2008 at 4:58 PM, Michael McCandless 
[EMAIL PROTECTED] wrote:


 A simple variant on Approach 1 would be to open your writer with
 autoCommit=false.

 This way no reader will ever see the changes until you successfully
 close the writer.  If the machine crashes the index is still in the
 starting state as of when the writer was first opened.

 Also, re-open of Approach 1 should be a bit (not a lot, though there
 is work to make it a lot) faster than wholly new open required in
 approach 2.

 There should not be problems optimizing while searching.  Yes, you
 use more disk space, but no more (in fact, less) than approach 2
 requires.

 I think approach 2 is only possibly better if the indexing would be
 done on a different computer / IO system.

 Mike

 Sridhar Raman wrote:

  This is my situation.  I have an index, which has a lot of search
  requests
  coming into it.  I use just a single instance of IndexSearcher to
  process
  these requests.  At the same time, this index is also getting
  updated by an
  IndexWriter.  And I want these new changes to be reflected _only_
  at certain
  intervals.  I have thought of a few ways of doing this.  Each has
  its share
  of problems and pluses.  I would be glad if someone can help me in
  figuring
  out the right approach, especially from the performance point of
  view, as
  the number of documents that will get indexed are pretty large.
 
  Approach 1:
  Have just one copy of the index for both Search & Index.  At time
  T, when I
  need to see the new changes reflected, I close the Searcher, and
  open it
  again.
  - The re-open of the Searcher might be a bit slow (which I could
  probably
  solve by using some warm-up threads).
  - Update and Search on the index at the same time - will this affect the
  performance?
  - If server crashes before time T, the new Searcher would reflect the
  changes, which is not acceptable.  I want the changes to be
  reflected only
  at time T.  If server crashes, the index should be the previous T-1
  index.
  - Possible problems while optimising the index (as Search is also
  happening).
  + Just one copy of the index being stored.
 
  Approach 2:
  Keep 2 copies of the index - 1 for Search, 1 for Index.  At time T,
  I just
  switch the Searcher to a copy of index that is being updated.
  - Before I do the switch to the new index, I need to make a copy of
  it so
  that the updates continue to happen on the other index.  Is there a
  convenient way to make this copy?  Is it efficient?
  - Time taken to create a new Searcher will still be a problem (but
  this is a
  problem in the previous approach as well, and we can live with it).
  + Optimise can happen on an index that is not being read, as a
  result, its
  resource requirements would be lesser.  And probably even the speed of
  optimisation.
  + Faster search as the index update is happening on a different index.
 
  So, these are the 2 approaches I am contemplating.  Any
  pointers on which
  would be the better approach?
 
  Thanks,
  Sridhar






Re: combine wildcard and phrase query

2008-03-06 Thread Erick Erickson
No, as far as I know you can't combine wildcards in phrases. This would
get extraordinarily ugly extraordinarily quickly. The way Lucene handles
wildcards (conceptually) is to expand all the possible terms into a large OR
clause. Say my index contains term1, term2, and term3. The search for term*
really expands into term1 OR term2 OR term3. Now imagine the
complexity of a phrase like "dog* cat* hors*". Say your index contained
10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000
ORed phrase queries. And this is a tiny example.

You can try various approximations, and depending upon your index size they
may or may not work. For instance, you could index all the successively
shorter forms with increments of 0 (see the synonym analyzer), i.e. index
horse, hors$, hor$, ho$, h$ all in the same position. Then searching for
hor* becomes searching for hor$ and it all just works. Of course this makes
your index bigger.
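
A sketch of that prefix trick as a TokenFilter, assuming the Lucene 2.3 TokenStream API (the class name is made up):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // After each token, also emit every shorter prefix with a '$' marker at
    // position increment 0, so the wildcard hor* can be rewritten to the
    // single term hor$.
    public class PrefixExpansionFilter extends TokenFilter {
        private Token current;   // token whose prefixes are being emitted
        private int prefixLen;   // length of the next prefix to emit

        protected PrefixExpansionFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (current != null && prefixLen >= 1) {
                // Emit "hors$", "hor$", ... at the original token's position.
                String prefix = current.termText().substring(0, prefixLen--) + "$";
                Token t = new Token(prefix, current.startOffset(), current.endOffset());
                t.setPositionIncrement(0);
                return t;
            }
            current = input.next();
            if (current == null) return null;
            prefixLen = current.termText().length() - 1;
            return current;
        }
    }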

About your second issue: I'm not clear on what you're trying to accomplish.
It's no problem to add the same field multiple times for a document. That is,
you can

doc.add(new Field("field1", ...));
doc.add(new Field("field1", ...));
doc.add(new Field("field1", ...));
doc.add(new Field("field1", ...));

as many times as you want before you add the document to the index. For
retrieval you can call getFields("field1") and get back an array of Fields,
one for each call to add above. You can also set the PositionIncrementGap
while indexing to separate the term position of the first term of successive
add() calls by, say, 100 (or whatever) if you need to worry about SpanNear or
some such.
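
For instance, a sketch of widening that gap for one field (the field name is hypothetical):

    // Put a large position gap between successive values of "signature" so a
    // PhraseQuery or SpanNearQuery cannot match across two different values.
    Analyzer analyzer = new WhitespaceAnalyzer() {
        public int getPositionIncrementGap(String fieldName) {
            return "signature".equals(fieldName) ? 100 : 0;
        }
    };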

This may be way off base. If so, could you give a concrete example of
what
your inputs are and how you want to search them?

Best
Erick

On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt [EMAIL PROTECTED] wrote:


 Okay, another problem occurred. I have different fields with the same name.
 I can't separate them by naming them field1, field2, etc., because while
 indexing I don't know how many fields I will need.
 A book has several signature numbers; I want to save them in a field
 "signature", and when I search for such a number I want the search to hit
 every single field and not all fields together.
 Right now I separate the string using a unique separator (in this case just
 "$$$") so I can split the string into the numbers, but I think this is about
 the worst way of doing it.




 JensBurkhardt wrote:
 
  hey everybody,
 
  I'm wondering if it's possible to combine wildcards and phrase query.
 
  For example: "term1 term*"
 
  I know that the documentation says "Lucene supports single and multiple
  character wildcard searches within single terms (not within phrase
  queries)", but maybe someone has had the same problem and found a
  solution.
 
  Thanks for your help
 
  Jens Burkhardt
 

 --
 View this message in context:
 http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.






Re: What about the index writing efficiency of large index ?

2008-03-06 Thread Yonik Seeley
On Thu, Mar 6, 2008 at 3:57 AM, Eric Th [EMAIL PROTECTED] wrote:
 Hi All,
  Has anyone done a benchmark to verify the index writing efficiency of Lucene?
  When the index size is larger than 10 GB, will writing be much slower than
  with smaller indexes?

  Actually, I did some work on this issue,
  and I found that if I build small indexes first and then merge them all,
  the time taken declines significantly.

You can increase the merge factor to get fewer merges,
or you can set maxMergeDocs or setMaxMergeMB to prevent merging of any
segments above a certain size, and then call optimize at the end.
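
A rough sketch of those knobs (Lucene 2.3-era API; the values are arbitrary):

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    writer.setMergeFactor(30);        // fewer, larger merges while indexing
    writer.setMaxMergeDocs(1000000);  // don't merge segments past this doc count
    // ... addDocument() calls ...
    writer.optimize();                // one big merge at the very end
    writer.close();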

  Hi Yonik,
  How does Solr deal with this issue? Does it just leave the problem to
  Lucene?

Solr leaves it to Lucene.

-Yonik




Re: Swapping between indexes

2008-03-06 Thread Yonik Seeley
On Thu, Mar 6, 2008 at 8:02 AM, Sridhar Raman [EMAIL PROTECTED] wrote:
  This way no reader will ever see the changes until you successfully
   close the writer.  If the machine crashes the index is still in the
   starting state as of when the writer was first opened.
  Ok, I have a slight doubt in this.  Say I have gone ahead with Approach 1
  If I have opened the writer with autoCommit=false, and the system crashes,
  does it mean that the changes made to IdxSrch are lost?

Since Lucene buffers in memory, you will always have the risk of
losing recently added documents that haven't been flushed yet.
Committing on every document would be too slow to be practical.

-Yonik




MultiSearcher to overcome the Integer.MAX_VALUE limit

2008-03-06 Thread Ray
Hey Guys,

just a quick question to confirm an assumption I have.

Is it correct that I can have around 100 indexes, each at its
Integer.MAX_VALUE limit of documents, but can happily
search them all with a MultiSearcher as long as the combined returned
hits don't add up to Integer.MAX_VALUE themselves?

Kind regards,

Ray.
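
For reference, a minimal sketch of the setup Ray describes (Lucene 2.3-era API; the paths and field name are hypothetical):

    // Search several index shards through one MultiSearcher.
    Searchable[] shards = new Searchable[] {
        new IndexSearcher("/index/part1"),
        new IndexSearcher("/index/part2"),
    };
    MultiSearcher searcher = new MultiSearcher(shards);
    Hits hits = searcher.search(new TermQuery(new Term("text", "lucene")));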




Re: combine wildcard and phrase query

2008-03-06 Thread JensBurkhardt

Okay, thanks. The first answer was what I expected :-) . Well, about my
second issue, I was totally wrong. Just forget what I said! I had in mind
that if I have several fields with the same name, these fields are
concatenated into one big string.
Now as I read your message, I remember that this behavior applies to boost
values ;-) and not to field values.
Thanks for your answers :-). You saved me a lot of time ;-)

best regards
Jens


Erick Erickson wrote:
 
 No, as far as I know you can't combine wildcards in phrases. This would
 get extraordinarily ugly extraordinarily quickly. The way Lucene handles
 wildcards (conceptually) is to expand all the possible terms into a large
 OR
 clause. Say my index contains term1, term2, and term3. The search for
 term*
 really expands into term1 OR term2 OR term3. Now imagine the
 complexity of a phrase like "dog* cat* hors*". Now say your index
 contained
 10 terms starting with dog, 10 with cat and 10 with hors. You'd have 1,000
 ORed phrase queries. And this is a tiny example
 
 You can try various approximations, and depending upon your index size
 they
 may or may not work. For instance, you could index all the successively
 shorter forms with increments of 0 (see the synonym analyzer), i.e. index
 horse, hors$, hor$, ho$, h$ all in the same position. Then searching for
 hor* becomes searching for hor$ and it all just works. Of course this makes
 your index bigger.
 
 About your second issue: I'm not clear on what you're trying to accomplish.
 It's
 no
 problem to add the same field multiple times for a document. That is, you
 can
 doc.add(new Field("field1", ...));
 doc.add(new Field("field1", ...));
 doc.add(new Field("field1", ...));
 doc.add(new Field("field1", ...));
 as many times as you want before you add the document to the index. For
 retrieval you can call getFields (field1) and get an array of Fields
 back,
 one
 for each call to add above. You can also set the PositionIncrementGap
 while
 indexing to separate the termposition of the first term of successive
 add()
 calls
 by, say, 100 (or whatever) if you need to worry about SpanNear or some
 such.
 
 This may be way off base. If so, could you give a concrete example of
 what
 your inputs are and how you want to search them?
 
 Best
 Erick
 
 On Thu, Mar 6, 2008 at 7:28 AM, JensBurkhardt [EMAIL PROTECTED]
 wrote:
 

 Okay, another problem occurred. I have different fields with the same name.
 I can't separate them by naming them field1, field2, etc., because while
 indexing I don't know how many fields I will need.
 A book has several signature numbers; I want to save them in a field
 "signature", and when I search for such a number I want the search to hit
 every single field and not all fields together.
 Right now I separate the string using a unique separator (in this case just
 "$$$") so I can split the string into the numbers, but I think this is about
 the worst way of doing it.




 JensBurkhardt wrote:
 
  hey everybody,
 
  I'm wondering if it's possible to combine wildcards and phrase query.
 
  For example: "term1 term*"
 
  I know that the documentation says "Lucene supports single and multiple
  character wildcard searches within single terms (not within phrase
  queries)", but maybe someone has had the same problem and found a
  solution.
 
  Thanks for your help
 
  Jens Burkhardt
 

 --
 View this message in context:
 http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15872169.html
 Sent from the Lucene - Java Users mailing list archive at Nabble.com.




 
 

-- 
View this message in context: 
http://www.nabble.com/combine-wildcard-and-phrase-query-tp15870647p15874560.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





RE: Swapping between indexes

2008-03-06 Thread spring
 Since Lucene buffers in memory, you will always have the risk of
 losing recently added documents that haven't been flushed yet.
 Committing on every document would be too slow to be practical.

Well, it is not sooo slow...

I have indexed 10,000 docs, resulting in a 14 MB index. The index has 2 stored
fields and the tokenized content field.

With a commit after every add: 30 min.
With a commit after every 100 adds: 23 min.
Only one commit: 20 min.

(including time to get the document from the archive)

I use Lucene 2.3, so a commit is a combination of closing and re-creating the
writer.
2.4/3.0 has a commit method which may be faster.

Before this test I thought it would be much slower than 30 min...

So one has to decide whether correctness is more important than performance.

I use a batch size of 100, first committing Lucene, then committing the
database which holds the status of whether a document has already been indexed
or not.
If the db commit fails it is no problem, because my app does not care about
multiply indexed documents. But until now neither the Lucene nor the db
commit has ever failed...
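
A rough sketch of this batching scheme (Lucene 2.3, where a commit means closing and reopening the writer; fetchNextDoc() and markBatchIndexed() are hypothetical stand-ins for the archive reader and the database commit):

    int batch = 0;
    IndexWriter writer = new IndexWriter(dir, analyzer, false);
    Document doc;
    while ((doc = fetchNextDoc()) != null) {
        writer.addDocument(doc);
        if (++batch == 100) {
            writer.close();          // the Lucene "commit"
            markBatchIndexed();      // then the db commit
            writer = new IndexWriter(dir, analyzer, false);
            batch = 0;
        }
    }
    writer.close();
    markBatchIndexed();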






RE: Swapping between indexes

2008-03-06 Thread spring
   With a commit after every add: 30 min.
   With a commit after 100 add: 23 min.
   Only one commit: 20 min.
 
 All of these times look pretty slow... perhaps lucene is not the
 bottleneck here?

Therefore I wrote:

(including time to get the document from the archive)

It is not the absolute times that are important; the differences are. They
occur only due to the different batch sizes.

I think it is a real-world scenario, because one always has to read the docs
from somewhere and often has to store the index state somewhere else.

A test with docs created in memory and no state in a database would of course
have completely different results.






RE: Boolean Query search performance

2008-03-06 Thread Beard, Brian
Thanks for all replies.

Today when I printed out the query that's generated it does not have the
extra paren's. And query.rewrite(reader).toString() now gives the same
result as query.toString(). All I can figure is I must have changed
something between starting the email and sending it out. The other
oddity is the performance degradation is not as apparent. I'm wondering
if part of the problem is generating consistent data for comparing
search performance. I do a warmup before actually running the test, but
maybe it's not a good enough way to test.

One additional thing, from an earlier suggestion: is it possible to
add multiple terms per BooleanClause? I tried using TermQuery.combine()
to combine an array of them into one query and make a clause from that,
but there was no difference in performance.

Brian






Help with Fuzzy Queries

2008-03-06 Thread Eloi Rocha Neto
Hi,

  I am new to Lucene.

  I don't understand how Lucene works in some cases. For example:

  If I have an index with the following three entries:
   - ATUAÇÃO FALHA DE DISJUNTOR
   - RESET DE FALHA DE DISJUNTOR
   - FALHA DE COMANDO

  When I try to look for something similar to "FALHA DE DISJUNTOR", I get
the following results:
 Result | score
 FALHA DE COMANDO | 0.9277342
 ATUAÇÃO FALHA DE DISJUNTOR | 0.8880876
 RESET DE FALHA DE DISJUNTOR | 0.5709133

   If you pay attention, "FALHA DE COMANDO" is coming before "ATUAÇÃO FALHA
DE DISJUNTOR", but that is not what I would like. For my client, the first
result should be "ATUAÇÃO FALHA DE DISJUNTOR".

   Another example happens when I try to look for "FALHA DISJUNTOR". To my
surprise, the only result is "FALHA DE COMANDO". Like the other example, I
was expecting "ATUAÇÃO FALHA DE DISJUNTOR" as the first or only result.

   What should I do in order to have "ATUAÇÃO FALHA DE DISJUNTOR" as the
first (or only) result in the explained cases?

   I am attaching the code at the end of this mail.

Thanks in advance,

Eloi


import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.Hit;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class TestLucene {

    private static final String FILENAME = "test.idx";
    private static final String FIELD = "FIELD";

    private static List<Document> getDocs() {
        List<Document> docs = new ArrayList<Document>();
        docs.add(toDoc("Atuação Falha de Disjuntor"));
        docs.add(toDoc("Reset de Falha de Disjuntor"));
        docs.add(toDoc("Falha de Comando"));
        return docs;
    }

    private static Document toDoc(String text) {
        Document doc = new Document();
        // UN_TOKENIZED: the whole string is indexed as a single term
        doc.add(new Field(FIELD, text.toUpperCase(), Field.Store.YES,
                Field.Index.UN_TOKENIZED));
        return doc;
    }

    private static void createIndex(Collection docs) throws IOException {
        IndexWriter writer = new IndexWriter(FILENAME, new StandardAnalyzer(),
                true);
        Iterator itDocs = docs.iterator();
        while (itDocs.hasNext()) {
            Document doc = (Document) itDocs.next();
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }

    public static void main(String[] args) throws IOException {
        createIndex(getDocs());
        IndexSearcher indexSearcher = new IndexSearcher(FILENAME);
        Hits hits = indexSearcher.search(new FuzzyQuery(
                new Term(FIELD, "Falha de Disjuntor".toUpperCase()), 0.4f));
        System.out.println(hits.length());
        Iterator it = hits.iterator();
        while (it.hasNext()) {
            Hit hit = (Hit) it.next();
            System.out.println(hit.get(FIELD) + " " + hit.getScore());
        }
    }
}


Re: Swapping between indexes

2008-03-06 Thread Peter Keegan
Sridhar,

We have been using approach 2 in our production system with good results. We
have separate processes for indexing and searching. The main issue that came
up was in deleting old indexes (see: http://tinyurl.com/32q8c4). Most of
our production problems occur during indexing, and we are able to fix these
without having to interrupt searching at all. This has been a real benefit.

Peter


On Thu, Mar 6, 2008 at 5:30 AM, Sridhar Raman [EMAIL PROTECTED]
wrote:

 This is my situation.  I have an index, which has a lot of search requests
 coming into it.  I use just a single instance of IndexSearcher to process
 these requests.  At the same time, this index is also getting updated by
 an
 IndexWriter.  And I want these new changes to be reflected _only_ at
 certain
 intervals.  I have thought of a few ways of doing this.  Each has its
 share
 of problems and pluses.  I would be glad if someone can help me in
 figuring
 out the right approach, especially from the performance point of view, as
 the number of documents that will get indexed are pretty large.

 Approach 1:
 Have just one copy of the index for both Search & Index.  At time T, when
 I
 need to see the new changes reflected, I close the Searcher, and open it
 again.
 - The re-open of the Searcher might be a bit slow (which I could probably
 solve by using some warm-up threads).
 - Update and Search on the index at the same time - will this affect the
 performance?
 - If server crashes before time T, the new Searcher would reflect the
 changes, which is not acceptable.  I want the changes to be reflected only
 at time T.  If server crashes, the index should be the previous T-1 index.
 - Possible problems while optimising the index (as Search is also
 happening).
 + Just one copy of the index being stored.

 Approach 2:
 Keep 2 copies of the index - 1 for Search, 1 for Index.  At time T, I just
 switch the Searcher to a copy of index that is being updated.
 - Before I do the switch to the new index, I need to make a copy of it so
 that the updates continue to happen on the other index.  Is there a
 convenient way to make this copy?  Is it efficient?
 - Time taken to create a new Searcher will still be a problem (but this is
 a
 problem in the previous approach as well, and we can live with it).
 + Optimise can happen on an index that is not being read, as a result, its
 resource requirements would be lesser.  And probably even the speed of
 optimisation.
 + Faster search as the index update is happening on a different index.

 So, these are the 2 approaches I am contemplating.  Any pointers
 on which
 would be the better approach?

 Thanks,
 Sridhar



Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

2008-03-06 Thread Erick Erickson
Well, I'm not sure. But any index, even one split amongst many nodes,
is going to have some interesting performance characteristics if you
have over 2 billion documents. So I'm not sure it matters <G>...

What problem are you really trying to solve? You'll probably get
more meaningful answers if you tell us what that is.

Best
Erick

On Thu, Mar 6, 2008 at 10:23 AM, Ray [EMAIL PROTECTED] wrote:

 Hey Guys,

 just a quick question to confirm an assumption I have.

 Is it correct that I can have around 100 Indexes each at its
 Integer.MAX_VALUE limit of documents, but can happily
 search  them all with a MultiSearcher if all combined returned
 hits don't add up to the Integer.MAX_VALUE themselves ?

 Kind regards,

 Ray.





Re: MultiSearcher to overcome the Integer.MAX_VALUE limit

2008-03-06 Thread Ray


Thanks for your answer.

Well, I want to search around 6 billion documents.
Most of them are very small, but I am confident I will be hitting
that number in the long run.

I am currently running a small random-text indexer at 400 docs/second.
It will reach 2 billion in around 45 days.

I really hope all of you who are saying 2 billion docs
will bring Lucene to its knees are wrong...


Ray.

- Original Message - 
From: Erick Erickson [EMAIL PROTECTED]

To: java-user@lucene.apache.org
Sent: Thursday, March 06, 2008 10:40 PM
Subject: Re: MultiSearcher to overcome the Integer.MAX_VALUE limit



Well, I'm not sure. But any index, even one split amongst many nodes,
is going to have some interesting performance characteristics if you
have over 2 billion documents. So I'm not sure it matters <G>...

What problem are you really trying to solve? You'll probably get
more meaningful answers if you tell us what that is.

Best
Erick

On Thu, Mar 6, 2008 at 10:23 AM, Ray [EMAIL PROTECTED] wrote:


Hey Guys,

just a quick question to confirm an assumption I have.

Is it correct that I can have around 100 Indexes each at its
Integer.MAX_VALUE limit of documents, but can happily
search  them all with a MultiSearcher if all combined returned
hits don't add up to the Integer.MAX_VALUE themselves ?

Kind regards,

Ray.











How to create segments files?

2008-03-06 Thread Hasan Diwan

Ladies and Gentlemen:
Below is an exception and the source code that generates it:
ERROR opening the Index - contact sysadmin!

Error message: no segments* file found in 
org.apache.lucene.store.FSDirectory@/home/hdiwan/public_html/Q4D: files:


Stack Trace follows...
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:587)
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
org.apache.jsp.results_jsp._jspService(results_jsp.java:130)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:374)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:337)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
java.lang.Thread.run(Thread.java:619)

-- Source code follows --
import java.io.File;
import java.io.FilenameFilter;
import java.io.IOException;
import java.net.URLDecoder;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.HashSet;
import java.util.Vector;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.FSDirectory;
public class Parser implements Runnable, Comparator {
    String path;

    public Parser(String string) {
        path = string;
    }

    public void run() {
        IndexWriter writer = null;
        Directory directory = null;
        try {
            directory = FSDirectory.getDirectory(this.path);
        } catch (IOException e) {
            System.err.println(e.getStackTrace());
        }
        try {
            writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Document doc = new Document();
        File image = null;
        File file = null;
        for (File f : this.listFiles(new File(path))) {
            if (f.getAbsolutePath().endsWith(".xml")
                    || f.getAbsolutePath().endsWith(".q4d")) {
                System.err.println("Q4D file found!");
                file = f;
            } else {
                image = f;
            }
            // note: the original tested (f != null), which is always true here
            if ((file != null) && (image != null)) break;
        }
        Date lastModified = new Date(file.lastModified());
        System.err.println("Found a file and its corresponding image!");

        String imageName = image.getName();
        String filename = file.getName();
        String lastModifiedDownToSecond =
                DateTools.dateToString(lastModified, DateTools.Resolution.SECOND);
        System.err.println("the time the file was last modified was "
                + lastModifiedDownToSecond);

        String author = System.getProperty("author");
        String source = System.getProperty("source");