Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@path of index file

2007-03-01 Thread Jerome Chauvin
 
All,
 
We are encountering issues while updating the Lucene index; here is the stack trace:
 
Caused by: java.io.IOException: Lock obtain timed out:
SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:69)
 at org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
 at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
 at org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:578)
 at com.bi.commerce.service.catalog.spring.lucene.LuceneIndex.deleteFromIndex(LuceneIndex.java:692)
 ... 25 more
 
 
Here is the source code of the Lucene API invocation where the error occurs:
 
class com.bi.commerce.service.catalog.spring.lucene.LuceneIndex:
 
import org.apache.lucene.index.IndexReader;
...
 
public synchronized void deleteFromIndex(ICatalogEntity entity) {
    if (!indexExists()) return;
    try {
        IndexReader reader = IndexReader.open(store);
        String uid = getUID(entity);
        try {
            reader.deleteDocuments(new Term(uid, uid));   // line 692
        } catch (ArrayIndexOutOfBoundsException e) {
            // CHECK ignore this. Can happen if index has not been built yet (??)
        }
        reader.close();
    } catch (IOException e) {
        throw new SearchEngineException(e);
    } catch (RuntimeException e) {
        throw new SearchEngineException(e);
    }
}
 
 
 
Am I doing something wrong? If somebody has already encountered this error, or
knows a fix, I'm really interested!
 
Thanks in advance,
 
Best regards.
 
-Jerome Chauvin-
 
 




Re: [ANN] ParallelSearcher in multi-node environment

2007-03-01 Thread dmitri

E.g. I've changed the original ParallelSearcher to use a thread pool
(java.util.concurrent.ThreadPoolExecutor from JDK 1.5).
But implementing a multi-host installation still requires a lot of changes,
since ParallelSearcher calls the underlying Searchables too many times (e.g.
a separate network call for every document).
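A minimal sketch of the pooled fan-out (not the actual ParallelSearcher patch; assumes JDK 1.5 java.util.concurrent and the Lucene 2.x Searcher.search(Query, Filter, int) call, with hypothetical names such as nodes and query):

  ExecutorService pool = Executors.newFixedThreadPool(4);
  Future[] futures = new Future[nodes.length];      // nodes: one Searcher per host
  for (int i = 0; i < nodes.length; i++) {
      final Searcher node = nodes[i];
      futures[i] = pool.submit(new Callable() {
          public Object call() throws Exception {
              return node.search(query, null, 10);  // one remote call per node
          }
      });
  }
  for (int i = 0; i < nodes.length; i++) {
      TopDocs topDocs = (TopDocs) futures[i].get(); // wait for each node's top hits
      // ... merge the per-node results ...
  }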

Dmitri 





Re: indexing performance

2007-03-01 Thread Nadav Har'El
On Tue, Feb 27, 2007, Saravana wrote about "indexing performance":
 Hi,
 
 Is it possible to scale lucene indexing like 2000/3000 documents per
 second?

I don't know about the actual numbers, but one trick I've used in the past
to get really fast indexing was to create several independent indexes in
parallel. Simply, if you have, say, 4 CPUs and perhaps even several physical
disks, run 4 indexing processes each indexing a 1/4 of the files and creating
a separate index (on separate disks on separate IO channels, if possible).

At the end, you have 4 indexes which you can actually search together without
any real need to merge them, unless query performance is very important to
you as well.
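A rough sketch of searching the 4 indexes together (assuming the Lucene 2.x MultiSearcher API; the index paths are hypothetical):

  Searchable[] parts = new Searchable[] {
      new IndexSearcher("/disk1/index"), new IndexSearcher("/disk2/index"),
      new IndexSearcher("/disk3/index"), new IndexSearcher("/disk4/index")
  };
  Searcher searcher = new MultiSearcher(parts);  // merges results at search time
  Hits hits = searcher.search(query);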

 I need to index 10 fields each with 20 bytes long.  I should be
 able to search by just giving any of the field values as criteria. I need to
 get the count that has same field values.

You need just the counts? And you want to do just whole-field matching, not
word matching? In that case, Lucene might be overkill for you. Or, if you
do use Lucene, make sure to use "keyword" (untokenized) fields, not
tokenized fields.
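For example (Lucene 2.x Field API; the names are illustrative):

  // the whole value is indexed as a single token, so matching is exact
  doc.add(new Field("user", "john", Field.Store.YES, Field.Index.UN_TOKENIZED));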

-- 
Nadav Har'El                 |  Thursday, Mar 1 2007, 11 Adar 5767
IBM Haifa Research Lab       |
http://nadav.harel.org.il    |  Open your arms to change, but don't let
                             |  go of your values.




Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
Hi,

While updating my index I get the following error:

[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R java.io.IOException:
Lock obtain timed out:
[EMAIL PROTECTED]:\TEMP\lucene-b56f455aea0a705baecaa4411d590aa2-write.lock
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at
org.apache.lucene.store.Lock.obtain(Lock.java:56)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at
org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:489)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at
org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:514)
[3/1/07 9:44:19:214 CET] 76414c82 SystemErr R   at
org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:541)


I am using Lucene 2.0.

When I execute the code below I find an entry with the specified Term
(it displays "One Entry Found").
Then when I try to delete the document, I get the error I pasted above.

What I do is:
open a second index reader,
delete the document,
close the second index reader,
close the main index reader,
open a new index reader.

Can anyone help?

Thank you very much.


  // Open a second IndexReader
  IndexReader mIndexReaderClone = null;
  try {
    mIndexReaderClone = IndexReader.open(mWorkingIndexDir);
  }
  catch (IOException exc) {
    exc.printStackTrace();
    throw new RegainException("Creating index reader failed", exc);
  }

  Term urlTerm = new Term("url", url1);
  Query query2 = new TermQuery(urlTerm);
  Document doc2;

  Hits hits2 = search(query2);
  if (hits2.length() > 0) {
    if (hits2.length() > 1) {
      System.out.println("Duplicate Entries");
    }
    System.out.println("One Entry Found");
  }
  else {
    System.out.println("No Entries");
  }

  try {
    mIndexReaderClone.deleteDocuments(urlTerm);
  } catch (IOException e) {
    e.printStackTrace();
    throw new RegainException("Deleting old entry failed", e);
  }

  // Close the clone IndexReader
  try {
    mIndexReaderClone.close();
  }
  catch (IOException exc) {
    throw new RegainException("Closing index reader failed", exc);
  }

__

   Matt







RE: Update - IOException

2007-03-01 Thread DECAFFMEYER MATHIEU
I deleted the lock file, and now it seems to work...
 
When can such an error happen?
 
__ 

   Matt

 




Re: Best way to returning hits after search?

2007-03-01 Thread Antony Bowesman

If you decide to cache stored field values in memory, FieldCache may be
useful for this - so you don't have to implement your own cache - you can
access the field values with something like:

    FieldCache fieldCache = FieldCache.DEFAULT;
    String db_id_field[] = fieldCache.getStrings(indexReader, DB_ID_FIELD_NAME);

Those values are valid for the lifetime of the index reader. Once a new
index reader is opened, when GC collects the unused old index reader
object, it would also be able to collect (from the cache) unused field
values.


Thanks for the pointers Doron.  I'll take a look at that.
Antony






Re: Lucene 2.1: java.io.IOException: Lock obtain timed out: SimpleFSLock@path of index file

2007-03-01 Thread Michael McCandless
Jerome Chauvin [EMAIL PROTECTED] wrote:

 We encounter issues while updating the lucene index, here is the stack
 trace:
  
 Caused by: java.io.IOException: Lock obtain timed out:
 SimpleFSLock@/data/www/orcanta/lucene/store1/write.lock
  at org.apache.lucene.store.Lock.obtain(Lock.java:69)
  at
  org.apache.lucene.index.IndexReader.aquireWriteLock(IndexReader.java:526)
  at
  org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:551)
  at
  org.apache.lucene.index.IndexReader.deleteDocuments(IndexReader.java:578)
  at com.bi.commerce.service.catalog.spring.lucene.LuceneIndex.deleteFromIndex(LuceneIndex.java:692)
  ... 25 more

First off, you have to ensure only one writer (either an IndexWriter or,
as in this case, an IndexReader doing deletes) is trying to update the
index at the same time.  Lucene allows only one writer on an index, and if
a second writer tries to obtain the lock, it will receive exactly this exception.

(Note that as of 2.1 you can now do deletes with IndexWriter which
simplifies things because you can use a single IndexWriter for
adds/updates/deletes.)

If you are already doing that (single writer) correctly, the other
common cause is that this is a leftover lock file (for example if the
JVM crashed or was killed or even if you didn't close a previous
writer before the JVM exited).  There is a better locking
implementation (NativeFSLockFactory) that correctly frees the lock
when the JVM crashes so you may want to use that one instead if you
hit this often (but first explain the root cause of your crashes!).
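A rough sketch of both points together (assuming the 2.1 LockFactory API; the path is taken from the stack trace above, and uid/doc are illustrative):

  String path = "/data/www/orcanta/lucene/store1";
  // native OS locks are released automatically if the JVM crashes
  Directory dir = FSDirectory.getDirectory(path,
      new NativeFSLockFactory(new File(path)));
  IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
  writer.deleteDocuments(new Term("uid", uid));  // 2.1: no separate IndexReader needed
  writer.addDocument(doc);
  writer.close();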

Mike




RE: Update - IOException

2007-03-01 Thread Michael McCandless

DECAFFMEYER MATHIEU [EMAIL PROTECTED] wrote:
 I deleted the lock file, now it seems to work ...
  
 When can such an error happen ?

See my response I just sent to java-user on this same error.  Even though
you are running Lucene 2.0, the same causes can lead to that "Lock obtain
timed out" exception.

Mike




Spanned indexes

2007-03-01 Thread Kainth, Sachin
Hi all,

Is it possible in Lucene for an index to span multiple files?  If so
what is the recommendation in this case?  Is it better to span after the
index reaches a particular size?  Furthermore, does Lucene ever span a
single record between two or more index files in this case or does it
ensure that a single record will only appear in one spanned file?

Many thanks for your advice

Sachin




Re: Sorting by Score

2007-03-01 Thread Erick Erickson

Peter:

About a custom ScoreComparator: the problem I couldn't get past was that I
needed to know the max score of all the docs in order to divide the raw
scores into quintiles, since I was dealing with raw scores. I didn't see how
to make that work with ScoreComparator, but I confess that I didn't look
very hard after someone on the list turned me on to FieldSortedHitQueue.
Erick

On 2/28/07, Erick Erickson [EMAIL PROTECTED] wrote:


It may well be, but as I said this is efficient enough for my needs
so I didn't pursue it. One of my pet peeves is spending time making
things more efficient when there's no need, and my index isn't
going to grow enough larger to worry about that now G...

Erick

On 2/28/07, Peter Keegan  [EMAIL PROTECTED] wrote:

 Erick,

 Yes, this seems to be the simplest way to implement score
 'bucketization',
 but wouldn't it be more efficient to do this with a custom
 ScoreComparator?
 That way, you'd do the bucketizing and sorting in one 'step'
 (compare()).
 Maybe the savings isn't measurable, though. A comparator might also
 allow
 one to do a more sophisticated rounding or bucketizing since you'd be
 getting 2 scores at a time.

 Peter


 On 2/28/07, Erick Erickson [EMAIL PROTECTED]  wrote:
 
   Empirically, when I insert the elements in the FieldSortedHitQueue
   they get sorted according to the Sort object. The original query
   that gives me a TopDocs applied
   no secondary sorting, only relevancy. Since I normalized
   all the scores into one of only 5 discrete values, secondary
   sorting was applied to all docs with the same score when I inserted
   them in the FieldSortedHitQueue.
  
   Now popping things off the FieldSortedHitQueue is ordered the
   way I want.
  way I want.
 
  You could just operate on the FieldSortedHitQueue at this point, but
  I decided the rest of my code would be simpler if I stuffed them back
  into the TopDocs, so there's some explanation below that you can
  just skip if I've cleared things up already.
 
  *
   The step I left out is moving the documents from the
   FieldSortedHitQueue back to topDocs.scoreDocs.
   So the steps are as follows:
  
   1) bucketize the scores. That is, go through the
   TopDocs.scoreDocs and adjust each raw score into
   one of my buckets. This is made easy by the
   existence of topDocs.getMaxScore. TopDocs has
   had no sorting other than relevancy applied so far.
  
   2) assemble the FieldSortedHitQueue by inserting
   each element from scoreDocs into it, with a suitable
   Sort object; relevance is the first field (SortField.FIELD_SCORE).
  
   3) pop the entries off the FieldSortedHitQueue, overwriting
   the elements in topDocs.scoreDocs.
 
  I left out step 3, although I suppose you could
  operate directly on the FieldSortedHitQueue.
 
  NOTE: in my case, I just put everything back in the
  scoreDocs without attempting any efficiencies. If I
  needed more performance, I'd only put as many items
  back as I needed to display. But as I wrote yesterday,
  performance isn't an issue so there's no point. Although
  I know one place to look if we need to squeeze more QPS.
 
  How efficient this is is an open question. But it's fast enough
  and relatively simple so I stopped looking for more
  efficiencies
 
  Erick
 
  On 2/28/07, Chris Hostetter [EMAIL PROTECTED]  wrote:
  
  
    : The first part was just to iterate through the TopDocs that's
    : available to me and normalize the scores right in the ScoreDocs.
    : Like this...
   
    Won't that be done after Lucene does the hit collecting/sorting? ... he
    wants the bucketing to happen as part of the scoring so that the
    secondary sort will determine the ordering within the bucket.
  
   (or am i missing something about your description?)
  
  
  
  
   -Hoss
  
  
  
  
  
 





Re: [ANN] ParallelSearcher in multi-node environment

2007-03-01 Thread Sharad Agarwal
Yeah, I too am looking forward to this feature: using a thread pool and
minimizing the remote calls in ParallelSearcher.




[EMAIL PROTECTED] wrote:


E.g. I've changed the original ParallelSearcher to use a thread pool
(java.util.concurrent.ThreadPoolExecutor from JDK 1.5).
But implementing a multi-host installation still requires a lot of changes,
since ParallelSearcher calls the underlying Searchables too many times (e.g.
a separate network call for every document).

Dmitri 
 







Re: Sorting by Score

2007-03-01 Thread Peter Keegan

Erick,

I think you're right, because you wouldn't know the max score before the
comparisons. I'm just thinking about a rounding algorithm that involves
comparing the raw scores to the theoretical maximum score, which I think
could be computed from the Similarity class and knowing the max boost value
used during indexing.

Peter


Re: Spanned indexes

2007-03-01 Thread Otis Gospodnetic
Sachin,
A lot of the questions you are asking are covered either in the FAQ or on the
Lucene site somewhere, or in various Lucene articles or in LIA.  You should
check those places first (the traffic on java-user is already high!); you'll
save yourself a lot of time.  For this particular question, have a look at the
File Formats page on Lucene's site.

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share




retrieve term positions in query

2007-03-01 Thread matpil

Hi!
My problem is to retrieve the term positions in a general query with more
than one terms.
It seems that with the phrase query it's possible (with SpanQuery) but with
AND and OR query I can't get the position for each document I search.
I'm looking for a high level implementation because I don't want to use low
level lucene API (I'm a lucene newbie...) 
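(For reference, a minimal sketch of the span API mentioned above, assuming Lucene 2.x SpanTermQuery; the field and term are illustrative. For an AND/OR query you would run one of these per term and correlate the hits by doc():)

  SpanTermQuery stq = new SpanTermQuery(new Term("contents", "lucene"));
  Spans spans = stq.getSpans(indexReader);
  while (spans.next()) {
      System.out.println("doc=" + spans.doc()
          + " start=" + spans.start() + " end=" + spans.end());
  }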

Thanx in advance,
Mat





RE: Performance in having Multiple Index files

2007-03-01 Thread Mordo, Aviran (EXP N-NANNATEK)
Yes, it will affect the search performance, because you need to merge the
results from the different indexes. The best performance comes from a
single index. The more indexes you have, the more time it takes to
search.

Aviran
http://www.aviransplace.com 

-Original Message-
From: Raaj [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 01, 2007 2:50 AM
To: java-user@lucene.apache.org
Subject: Performance in having Multiple Index files

Hi all,

I have a requirement wherein I create an index file for each XML file. I
have over 100/150 XML files, which are all related.

If I create 100/150 index files and query using these indices, will this
affect the performance of the search operation?

Bye,
Raaj



 



question about ScoreDocComparator

2007-03-01 Thread Ulf Dittmer

Hello-

One of the fields in my index is an ID, which maps to a full text  
description behind the scenes. Now I want to sort the search results  
alphabetically according to the description, not the ID. This can be  
done via SortComparatorSource and a ScoreDocComparator without  
problems. But the code needed to do this is quite complicated - it  
involves retrieving the document ID from the ScoreDoc, then looking  
up the Document through an IndexReader, and then retrieving the ID  
field from the document. It seems that there should be an easier way  
to get at the ID field, since that is the one being used for the  
sort. There is a related class FieldDoc, through which it seems  
possible to get at the field values, but that doesn't seem applicable  
here.
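(A rough sketch of pushing the lookup out of compare(), assuming the Lucene 2.x SortComparatorSource API; lookupDescription is a hypothetical ID-to-description mapping:)

  SortComparatorSource byDescription = new SortComparatorSource() {
      public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
              throws IOException {
          // resolve every ID to its description once per reader
          String[] ids = FieldCache.DEFAULT.getStrings(reader, fieldname);
          final String[] descs = new String[ids.length];
          for (int i = 0; i < ids.length; i++) {
              descs[i] = lookupDescription(ids[i]);  // hypothetical ID -> text
          }
          return new ScoreDocComparator() {
              public int compare(ScoreDoc a, ScoreDoc b) {
                  return descs[a.doc].compareTo(descs[b.doc]);
              }
              public Comparable sortValue(ScoreDoc d) { return descs[d.doc]; }
              public int sortType() { return SortField.STRING; }
          };
      }
  };
  Sort sort = new Sort(new SortField("id", byDescription));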


I went through the custom sorting example of Lucene In Action, but  
that doesn't deal with this case. Am I missing something obvious?


Thanks in advance,
Ulf





Re: [Fwd: Re: indexing performance]

2007-03-01 Thread Saravana

Hi,

You need just the counts? And you want to do just whole-field matching, not
word matching? In that case, Lucene might be overkill for you. Or, if you
do use Lucene, make sure to use "keyword" (untokenized) fields, not
tokenized fields.

Sorry for not elaborating my requirement more. Actually I have some fields
that need word matching, and for some fields I do not need word matching. I
have used NO_NORMS for the whole fields and TOKENIZED for the fields that need
normalization. I need the count, and I also need to show the fields that are
indexed. For example, the following criteria can be given by the user:

USER:john AND MSG:ftp

Here USER is a NO_NORMS field and MSG is a tokenized field. The original log
message will be as follows:

2007 Jan 27 10:10:01 User John accessed ftp url images.html

So I cannot compute the count in memory, as the criteria are selected by the
user and not predefined. Moreover, I have read the following thread, dated 2002:

my experiences are that the writing to the index takes the most time except
any parsing done by the user. i have been working on xml indexes and here the
collection of data takes just as much time as the write. to increase *speed* i
have done three things that reduced my index time from 11 hours to 2.5 hours
for the same dataset (1.3gb of xml documents).

1: i index 50 documents into a ramdir, then when the limit is reached i merge
this ramdir into a fsdir and flush the ramdir. this speeds up things
as i then don't have to use the fsdir as much and the ramdir is much faster.

2: merging a large index into a large index takes nearly as much time as
merging a small index into a large index, so i have 4 (any number will do)
fsdirs that i write ramdirs to and then i merge these fsdirs into one large
fsdir at the end of a large indexrun.

3: multithreaded my application, created workerthreads that each index into
their own separate ramdir, then flush these ramdirs into each separate fsdir
(hence i have a fsdir for each workerthread), this because you can only write
to a dir with one thread.

in the end this improved my *indexing* time a lot...

hope some of this can help you!

mvh karl øie



Does this still hold good now? Thanks for your reply.

regards,
MSK






Re: TextMining.org Word extractor

2007-03-01 Thread Bill Taylor


On Feb 23, 2007, at 2:00 PM, [EMAIL PROTECTED] wrote:

 Re: TextMining.org Word extractor

Someone noted that textmining.org gets hacked.  There is
test-mining.org, which appears to be a commercial site.  Can someone
tell me where to get the download of the original GPL textmining.org
software?

Thanks.






RE: TextMining.org Word extractor

2007-03-01 Thread Bruce Ritchie
I can't speak to where you can get a copy of the original code, but the
modified code I have is not GPL licensed - the license header in at
least one file is as follows:

/*  Copyright 2004 Ryan Ackley
 *
 *  Licensed under the Apache License, Version 2.0 (the "License");
 *  you may not use this file except in compliance with the License.
 *  You may obtain a copy of the License at
 *
 *  http://www.apache.org/licenses/LICENSE-2.0
 *
 *  Unless required by applicable law or agreed to in writing, software
 *  distributed under the License is distributed on an "AS IS" BASIS,
 *  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 *  See the License for the specific language governing permissions and
 *  limitations under the License.
 */


Regards,

Bruce Ritchie  

 -Original Message-
 From: Bill Taylor [mailto:[EMAIL PROTECTED] 
 Sent: Thursday, March 01, 2007 11:00 AM
 To: java-user@lucene.apache.org
 Subject: Re: TextMining.org Word extractor
 
 
 On Feb 23, 2007, at 2:00 PM, [EMAIL PROTECTED]
 wrote:
 
  Re: TextMining.org Word extractor
 
 Someone noted that textmining.org gets hacked.  There is 
 test-mining.org, which appears to be a commercial site.  Can 
 someone tell me where to get the download of the original GPL 
 textmining.org software?
 
 Thanks.
 
 
 
 
 




Re: document field updates

2007-03-01 Thread Erik Hatcher


On Feb 28, 2007, at 8:59 AM, Steven Parkes wrote:


Are unindexed fields stored seperately from the main inverted
index?
If so then, one could implement the field value change as a
delete and
re-add of just that value?

The short answer is that won't work. Field values are stored in a
different data structure than the postings lists, but docids are
consistent across all contents of a segment. Deleting something and
re-adding it is going to put it into a different segment, which is going
to keep this from working. (Not to mention that you want the postings
lists updated if you want it to be searchable ...)

Are you aware of some implementation of Lucene that solves this
need
well with a second index for 'tags' complete with multi-index
boolean
queries?

I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text?


Nutch does do this sort of thing, but I'm not quite sure how.  It  
isn't doing any operations to the Lucene index beyond what plain ol'  
Lucene does.



I don't know if Solr has anything like this but
if I remember correctly, Collex has tags but as far as I can tell,  
it's

not been open sourced (yet?)


Collex is quite open source, it's just ugly source :)  We're the  
'patacriticism' project at SourceForge, under the collex directory  
in Subversion.


Collex implements tagging by implementing JOIN cross-references  
between user/tag documents and regular object documents.  Its  
scalability is not going to be good at bigger numbers in its current  
architecture, but it works quite well for our 60k or so objects at  
the moment.


Erik





Re: document field updates

2007-03-01 Thread Andrzej Bialecki

Erik Hatcher wrote:


I'm pretty sure this has been done, I'm just not 100% sure where. Does
Nutch index link text?


Nutch does do this sort of thing, but I'm not quite sure how.  It 
isn't doing any operations to the Lucene index beyond what plain ol' 
Lucene does.




Nutch maintains a set of separate DBs (using Hadoop 
MapFile/SequenceFile), where inlinks are stored (together with their 
anchor text). During indexing this data is pulled in from the DBs piece 
by piece using the URLs as primary keys.


Nutch doesn't update _any_ data structures in-place - all update 
operations involve creating new data files and optionally deleting old 
data files. This includes also indexes - new indexes are being created 
from newly updated pages, and then only individual Lucene documents are 
deleted from older indexes to get rid of duplicates. After a while, 
really old indexes are removed completely, because their content is 
likely to be present in one of the newer indexes.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: document field updates

2007-03-01 Thread Neal Richter

Collex is quite open source, it's just ugly source :)  We're the
'patacriticism' project at SourceForge, under the collex directory
in Subversion.

Collex implements tagging by implementing JOIN cross-references
between user/tag documents and regular object documents.  Its
scalability is not going to be good at bigger numbers in its current
architecture, but it works quite well for our 60k or so objects at
the moment.


Have you implemented any code that enforces a Boolean query across
these two indexes?
Has anyone implemented a BooleanQuery class that operates across a set
of Fields that may live in different Indexes?




Re: [Fwd: Re: indexing performance]

2007-03-01 Thread Mike Klaas

On 3/1/07, Saravana [EMAIL PROTECTED] wrote:


Is this still hold good now ? Thanks for your reply.


Probably most of that still applies to some extent.  However, it is
unclear whether it will speed up your application.

The first thing is to find out what your bottleneck is.  Looking at the
stats on your machine during indexing, is it io-bound? cpu-bound? mixed?

There are various possible strategies, but they will come from
fine-tuning your procedure to address the bottlenecks you are
experiencing.  If you are cpu-bound, then perhaps you can use less
intensive analyzers, or purchase a multi-cpu machine and index
threadedly.  If you are i/o bound, you could 1) buy faster disks, 2)
use a faster i/o backend (e.g. RAID-0), 3) create indexes on multiple
independent disks and merge later.
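A minimal sketch of the batch-into-RAM, merge-into-FS step behind option 3 and the quoted 2002 thread (Lucene 2.x API; the path and analyzer are illustrative):

  RAMDirectory ramDir = new RAMDirectory();
  IndexWriter ramWriter = new IndexWriter(ramDir, analyzer, true);
  // ... add a batch of documents to ramWriter ...
  ramWriter.close();

  IndexWriter fsWriter = new IndexWriter(
      FSDirectory.getDirectory("/disk1/index"), analyzer, false);
  fsWriter.addIndexes(new Directory[] { ramDir });  // merge the RAM batch to disk
  fsWriter.close();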

regards,
-Mike




Re: document field updates

2007-03-01 Thread Erik Hatcher


On Mar 1, 2007, at 1:35 PM, Neal Richter wrote:

Collex is quite open source, it's just ugly source :)  We're the
'patacriticism' project at SourceForge, under the collex directory
in Subversion.

Collex implements tagging by implementing JOIN cross-references
between user/tag documents and regular object documents.  Its
scalability is not going to be good at bigger numbers in its current
architecture, but it works quite well for our 60k or so objects at
the moment.


Have you implemented and code that enforces a Boolean query across
these two indexes?


Actually it's a single index, with a "type" field that separates the  
two different types of documents (archive objects, or collectable  
objects).


A pointer to this code is here:
http://patacriticism.svn.sourceforge.net/viewvc/patacriticism/collex/trunk/src/solr/org/nines/CollectableCache.java?view=markup
It's a hack that leverages some of Solr's facilities (but not near enough!).


Erik





More long running queries

2007-03-01 Thread Tim Johnson
I'm still having issues with long running queries. 

I'm using a custom HitCollector to bring back ALL docs that match a search,
as suggested in a previous post/reply (e.g. Nutch's LuceneQueryOptimizer).

This solution works most of the time; however, in testing a very complex
query using several range queries and term queries, we're seeing times in
the 40 sec range with NO HITS returned.

The index contains approx. one million docs, and the number of Boolean
expressions created is well over 100,000.

Tim





RE: Soliciting Design Thoughts on Date Searching

2007-03-01 Thread Steven Parkes
If all you want to do is find docs containing dates within a range, it
probably doesn't make much difference whether you give dates their own
field or put them into your content field. It'll probably be easier to
just add them into the token stream since that's the way the analyzer
architecture wants to work (analyzers generally don't know anything
about fields.) You can make the position increment work if you want, and
it'll make phrase/span queries work better, if you need those to work.

What is going to matter in either case is how you format dates.
Everything in Lucene is text, so if you want to do date ranges (which
you mentioned in your first e-mail), you need to be careful how you
format the dates and what kinds of queries you use. See, for example,
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/DateTools.html
(tinyurl: http://tinyurl.com/ejlvx)
and
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
(tinyurl: http://tinyurl.com/2pubaq)
There are also date filters (as opposed to date queries) that have
different tradeoffs.

Dates are kinda tricky in Lucene.
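A small sketch of the indexing and querying halves (assuming Lucene 2.x DateTools and RangeQuery; the field name and resolution are illustrative):

  // index: one lexicographically sortable token per date, e.g. "20070301"
  String value = DateTools.dateToString(date, DateTools.Resolution.DAY);
  doc.add(new Field("date", value, Field.Store.YES, Field.Index.UN_TOKENIZED));

  // search: an inclusive range over the same format
  Query q = new RangeQuery(new Term("date", "19680227"),
                           new Term("date", "20070301"), true);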

-Original Message-
From: Walt Stoneburner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 01, 2007 7:54 AM
To: java-user@lucene.apache.org
Subject: Re: Soliciting Design Thoughts on Date Searching

Thank you all for the suggestions steering me down the right path.

As an aside, the easy part, at least for me, is extracting the dates
-- Peter was dead on about how to do that: heuristics, multiple
regular expressions, and data structures.  As Steve pointed out, this
isn't as trivial as it sounds - there are a lot of formats, some
ambiguous.

I love writing parsers (guess I'm sick in the head, eh?), so getting
the data isn't the problem, it's knowing what format to convert it
into and how to hand it to Lucene in a way that it'll find meaningful
for searching.

I had pondered making a single field with a value like:
document.add( Field.Text( "dates", "27-Feb-1968,04-Jul-1776,01-Mar-2007" ));
...but I wasn't convinced that the Lucene date Range was going to work
on anything other than a Date type, rather than a string of text that
just coincidentally happened to contain dates.

Drawing back on my title example, I was under the incorrect impression
that if I had a field and provided another value that it replaced the
prior value.  Hoss is indicating this is not so, and that I'm safe
adding additional values.
document.add( Field.Text( "title", "Thanks Thomas" ));
document.add( Field.Text( "title", "Thanks Hoss" ) );  // Does not stomp on Thomas. Yay!
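With the current (Lucene 2.x) Field API, a sketch of the same multiple-add idea, one untokenized value per add so each date stays a single term (values illustrative):

  doc.add(new Field("dates", "19680227", Field.Store.YES, Field.Index.UN_TOKENIZED));
  doc.add(new Field("dates", "17760704", Field.Store.YES, Field.Index.UN_TOKENIZED));
  doc.add(new Field("dates", "20070301", Field.Store.YES, Field.Index.UN_TOKENIZED));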

If I can use this technique to pile in a ton of dates, then I'm
totally happy, you guys have pointed me in the right direction;
celebrations all around.

The brain scratcher, for me, was Peter's treating the dates like a
synonym -- a clever way of looking at the problem.  Unfortunately,
that'd be giving me too much credit, as I haven't played with that
feature set of Lucene.  So, without trying to, Peter's sent me
scrambling back to the API for something I wasn't aware was there.

Steve adds to the mystery by suggesting a delimited field list, much
like the example at the top of this message, and likewise doing some
trickery with the token stream and a position increment of zero --
again, a clever solution, and likewise beyond my limited Lucene
experience.

While I know, intellectually, that Lucene is digesting positioned
tokens, it is so well designed that fools like me can legitimately use
Lucene for long periods of time without actually being exposed to
what's happening under the hood.

The ponderance I now contemplate as a newbie (I've downgraded my self
assessment after this discussion) is knowing whether the token-stream
solution or the multiple-add solution is the pedantic one.  Are there
performance advantages to one way over the other?  I'll be totally
stunned if someone offers up that they're logically the same thing.

I swear, conversing with you guys is giving me a very deep sense of
appreciation for your skills and Lucene's capabilities.

-wls




Field Selector in Searcher interface

2007-03-01 Thread Mark Miller
What are the odds of (or reasons against) bubbling up doc(int,
FieldSelector) to Searcher? I would love to take advantage of the
selective field loading, but I am working with MultiSearchers and
Searchers, so I cannot count on getReader (in IndexSearcher) for access.
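For comparison, a sketch of what selective loading looks like today when the reader is reachable (assuming the 2.1 FieldSelector API; the field name is illustrative):

  FieldSelector onlyId = new MapFieldSelector(new String[] { "id" });
  // possible on IndexSearcher via its reader, but not through Searcher/MultiSearcher:
  Document doc = indexSearcher.getIndexReader().document(docId, onlyId);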


- Mark




Re: Field Selector in Searcher interface

2007-03-01 Thread Grant Ingersoll
The odds increase significantly in correlation to patches  
submitted!  :-)  The odds increase slightly by at least filing an  
enhancement issue in JIRA.  They increase a tiny bit by bringing it  
up here!  I may have some time in the not too distant future for  
this, but we always appreciate the help.


Looking briefly at the Searchable interface, it does seem to make  
sense, but that is just my quick glance take on it.


-Grant

On Mar 1, 2007, at 8:15 PM, Mark Miller wrote:

What are the odds of (or reasons against) bubbling up doc(int,
FieldSelector) to Searcher? I would love to take advantage of the
selective field loading, but I am working with MultiSearchers and
Searchers, so I cannot count on getReader (in IndexSearcher) for
access.


- Mark









Re: updating index

2007-03-01 Thread Daniel Noll

Doron Cohen wrote:

Once indexing the database_id field this way, also the newly added
API IndexWriter.updateDocument() may be useful.


Whoa, nice convenience method.

I don't suppose the new document happens to be given the same ID as the 
old one.  That would make many people's lives much easier. :-)


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, AustraliaPh: +61 2 9280 0699
Web: http://nuix.com/   Fax: +61 2 9212 6902




Re: updating index

2007-03-01 Thread Doron Cohen

Daniel Noll [EMAIL PROTECTED] wrote on 01/03/2007 22:10:15:

  API IndexWriter.updateDocument() may be useful.

 Whoa, nice convenience method.

 I don't suppose the new document happens to be given the same ID as the
 old one.  That would make many people's lives much easier. :-)

Oh no, this aspect is as it was - the document(s) are deleted and re-added.
However, due to the buffering of deletes in IndexWriter, the application no
longer needs to take care of batching the deletes for performance
considerations - this is taken care of by IndexWriter.
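A one-line sketch of the convenience, assuming the 2.1 API (the term identifies the document(s) to replace; names are illustrative):

  writer.updateDocument(new Term("database_id", id), newDoc);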

