Increasing Linux kernel open file limits.

2004-07-09 Thread Kevin A. Burton
Don't know if anyone knew this:
http://www.hp-eloquence.com/sdb/html/linux_limits.html
The kernel allocates filehandles dynamically up to a limit specified 
by file-max.

The value in file-max denotes the maximum number of file handles that 
the Linux kernel will allocate. When you get lots of error messages 
about running out of file handles, you might want to increase this limit.

The three values in file-nr denote the number of allocated file 
handles, the number of used file handles and the maximum number of 
file handles. When the allocated filehandles come close to the 
maximum, but the number of actually used ones is far behind, you've 
encountered a peak in your filehandle usage and you don't need to 
increase the maximum.

So while as root you can allocate as many file handles as you like without any limits 
enforced by glibc, you still have to fight against the kernel.

Just doing an echo 100 > /proc/sys/fs/file-max works fine.
Then I can keep track of my file limit by doing a
cat /proc/sys/fs/file-nr
At least this works on 2.6.x...
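If you want to watch this from inside a Java process (for example the one holding a Lucene index open), a minimal sketch like the following would do; the class name is made up and it simply reads the three values described above:

import java.io.BufferedReader;
import java.io.FileReader;

public class FileHandleStats {
  public static void main(String[] args) throws Exception {
    // /proc/sys/fs/file-nr holds three whitespace-separated counters
    BufferedReader in = new BufferedReader(new FileReader("/proc/sys/fs/file-nr"));
    String[] v = in.readLine().trim().split("\\s+");
    in.close();
    System.out.println("allocated=" + v[0] + ", used=" + v[1] + ", max=" + v[2]);
  }
}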
Think this is going to save me a lot of headache!
Kevin
--
Please reply using PGP.
   http://peerfear.org/pubkey.asc
   
   NewsMonster - http://www.newsmonster.org/
   
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
  AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
 IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster



Re: boolean operators and score

2004-07-09 Thread Brisbart Franck
There's no need to sort the words here.
You just have to ensure that the lucene query built is the same for the 
requests that you consider as equivalent.
I mean that if a request 'word1 word2' gives different results from 
'word2 word1', the problem is in your query parser or in the way you 
give the requests to it.
I keep on saying that with the Lucene query parser, the request 'word1 
and word2' and the request 'word2 and word1' are different because of the 
'required' flag.

Franck
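A quick way to check this (an illustrative sketch, not from Franck's mail; the field name and analyzer are assumptions) is to parse both forms and print the resulting queries:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class QueryOrderCheck {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    Query q1 = QueryParser.parse("word1 AND word2", "contents", analyzer);
    Query q2 = QueryParser.parse("word2 AND word1", "contents", analyzer);
    // Printing the parsed queries shows whether the two requests really
    // build the same Lucene query (clause order and required flags included).
    System.out.println(q1.toString("contents"));
    System.out.println(q2.toString("contents"));
  }
}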
Niraj Alok wrote:
Hi Don,
After months of struggling with Lucene and finally achieving the complex
relevancy desired, the client would kill me if I now lost all that
relevancy.
I am trying to do it the way Franck suggested, by sorting the words the
user has entered, but otherwise, isn't this a bug in Lucene?
Regards,
Niraj
- Original Message -
From: Don Vaillancourt [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 7:15 PM
Subject: Re: boolean operators and score

What could actually be done is perhaps sort the search result by document
id.  Of course your relevancy will be all shot, but at least you would
have
control over the sorting order.




Re: Browse by Letter within a Category

2004-07-09 Thread Daniel Naber
On Friday 09 July 2004 04:27, O'Hare, Thomas wrote:

 Searcher.search("category:\"Products\" AND title:\"A*\"", new
 Sort("title"));

You can only sort on fields which are not tokenized I think. So add an extra 
field with the title, but untokenized, just for sorting. Also, A* might 
slow down the query execution so you might want to add another field which 
just contains the first letter so there's no need for the asterisk.
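A rough sketch of that setup (illustrative only; the field names titleSort and titleLetter, the sample value and the index path are made up, Lucene 1.4-style API):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

// Indexing: keep the tokenized title for searching, plus two untokenized helper fields.
Document doc = new Document();
doc.add(Field.Text("title", "Acme Widget"));         // tokenized, searchable
doc.add(Field.Keyword("titleSort", "Acme Widget"));  // untokenized, used only for sorting
doc.add(Field.Keyword("titleLetter", "a"));          // first letter, avoids the A* wildcard

// Searching: category AND first letter, sorted on the untokenized field.
IndexSearcher searcher = new IndexSearcher("/path/to/index");
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("category", "Products")), true, false);
query.add(new TermQuery(new Term("titleLetter", "a")), true, false);
Hits hits = searcher.search(query, new Sort("titleSort"));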

Regards
 Daniel





How to access information from a part of the index

2004-07-09 Thread clibois
Hello,
for my thesis I have to use a Lucene index for a text categorization program.
For that I need to split the index in two, so I have a learning set and a 
validation set. The problem is that I don't know how to ask Lucene to give 
me, for example, the number of documents IN ONLY ONE of these subsets 
containing a specific term.
For example, I would like to get the number of documents containing the term hello in a 
subset of documents. This subset is a set of document numbers ({5,3}, while the 
complete index would contain documents {0,1,2,3,4,5}).
How can I do this in an efficient way?
I tried to get all documents containing the term and then verify which documents 
belong to my subset. However, it appears that this is very slow.
Thanks in advance
Claude Libois





Re: How to access information from a part of the index

2004-07-09 Thread Karsten Konrad

Hi,

Why don't you just use two indexes? You probably do not have to index the test set at 
all.

If you have two or more subsets, just use filters that only match the subsets you 
are interested in. Counting documents that contain a certain term in one 
of the subsets then becomes a search over the filtered index plus counting the 
number of results. Filters are quite
efficient.
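For illustration only (not Karsten's code; the subset {3, 5}, the field name and the index path are made-up values), a Lucene 1.4-style filter over a fixed set of document numbers could look roughly like this:

import java.util.BitSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// A filter that only lets the given document numbers through.
class SubsetFilter extends Filter {
  private final int[] docIds;
  SubsetFilter(int[] docIds) { this.docIds = docIds; }
  public BitSet bits(IndexReader reader) {
    BitSet bits = new BitSet(reader.maxDoc());
    for (int i = 0; i < docIds.length; i++) {
      bits.set(docIds[i]);
    }
    return bits;
  }
}

// Count the documents in the subset {3, 5} that contain the term "hello".
IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new TermQuery(new Term("contents", "hello"));
Hits hits = searcher.search(query, new SubsetFilter(new int[] {3, 5}));
System.out.println("matching documents in subset: " + hits.length());

The hits.length() value is then the per-subset document frequency of the term.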

Hope this helps,

Karsten


--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

Xtramind Technologies GmbH 
Stuhlsatzenhausweg 3 
D-66123 Saarbrücken

Phone +49 (681) 3 02-51 13 
Fax +49 (681) 3 02-51 09
[EMAIL PROTECTED] 
www.xtramind.com







why same query different execution time?

2004-07-09 Thread iouli . golovatyi
Can somebody explain the following to me:

I execute a search with the same query, on the same index, using the same PC, 
and always get different execution times, for example:

1st run:
LuceneItems: Search for : +contents:vasella, Documents found : 121, 
Documents age : [15.09.04 -10.08.02]
LuceneItems: Last document retrieved - 39, Search time(ms) - 1362
2d run:
LuceneItems: Search for : +contents:vasella, Documents found : 121, 
Documents age : [15.09.04 -10.08.02]
LuceneItems: Last document retrieved - 39, Search time(ms) - 584

Regards
Joel

Re: how to ensure that AND occurs, pl. help

2004-07-09 Thread jitender ahuja
On Friday, July 09, 2004 1:57, Daniel Naber wrote

For fields title, body and query aaa bbb this will lead to
 +(title:aaa title:bbb) +(body:aaa body:bbb)
 
 So the clauses are required, but not the individual terms in a clause. I don't 
 know a (simple) clean solution, but you could parse the query twice, first to 
 get the AND right (queryParser.setOperator()), then again to get the fields 
 right.

 Thanks for your reply. I think you mean to say that for the case of 2 fields title, 
body and query aaa bbb,
what the ANDed query should look like is:
+(+title:aaa +title:bbb) +(+body:aaa +body:bbb)
and not
+(title:aaa title:bbb) +(body:aaa body:bbb)

But will this apply if one of the fields is null? My original query had missed one 
issue, i.e. if one field is null 
for a given Hit object even though it (the given Hit object) is there.
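One way to get every term required inside every field clause (an illustrative sketch, not something proposed in this thread; the field and term values are just the example ones) is to build the query by hand instead of going through QueryParser:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Builds +(+title:aaa +title:bbb) +(+body:aaa +body:bbb)
String[] terms = {"aaa", "bbb"};
String[] fields = {"title", "body"};
BooleanQuery query = new BooleanQuery();
for (int f = 0; f < fields.length; f++) {
  BooleanQuery perField = new BooleanQuery();
  for (int t = 0; t < terms.length; t++) {
    perField.add(new TermQuery(new Term(fields[f], terms[t])), true, false);  // term required
  }
  query.add(perField, true, false);  // whole field clause required
}

Note this is the strict reading (both terms must appear in both fields); whether that is the semantics you want depends on how you handle the null-field case.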

Regards,
Jitender

RE: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Armbrust, Daniel C.
The problem I ran into the other day with the new lock location is that Person A had 
started an index, ran into problems, erased the index and asked me to look at it.  I 
tried to rebuild the index (in the same place on a Solaris machine) and found out that 
A) - her locks still existed, B) - I didn't have a clue where it put the locks on the 
Solaris machine (since no full path was given with the error - has this been fixed?) 
and C) - I didn't have permission to remove her locks.

I think the locks should go back in the index, and we should fall back or give an 
option to put them elsewhere for the case of the read-only index.

Dan 





Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Daniel Naber
On Friday 09 July 2004 16:15, Armbrust, Daniel C. wrote:

 (since no full path was given with the error - has this been fixed?) and C)

That's fixed in Lucene 1.4.

 I think the locks should go back in the index, and we should fall back or
 give an option to put them elsewhere for the case of the read-only index.

There's already a Java system property that lets you specify the lock 
directory.
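For example (hedged: the exact property name should be double-checked against FSDirectory in your Lucene release; in the 1.4 line it appears to be org.apache.lucene.lockDir, and the path here is made up):

// Must be set before the first index directory is opened.
System.setProperty("org.apache.lucene.lockDir", "/path/to/lock/dir");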

Regards
 Daniel





NullAnalyzer still tokenizes fields

2004-07-09 Thread Polina Litvak
I tried to create my own analyzer so that it returns fields as they are
(without any tokenizing done), using code posted on lucene-user a short
while ago:
 
private static class NullAnalyzer extends Analyzer
{
  public TokenStream tokenStream(String fieldName, Reader reader)
  {
    return new CharTokenizer(reader)
    {
      protected boolean isTokenChar(char c)
      {
        return true;
      }
    };
  }
}
 
 
After testing this analyzer I found out that fields I pass to it still
get tokenized. 
E.g. I have a field with the value ABCD-EF. When passing it through the
analyzer, the only characters that end up in isTokenChar() are A, B, C,
D, E, F. So it looks like '-' gets filtered out before it even reaches
isTokenChar().
 
Did anyone encounter this problem ? 
Any help will be greatly appreciated!
 
Thanks,
Polina 
 
 


Underscore tokenization

2004-07-09 Thread Jim Downing
Hi,

I'm trying to put together an Analyzer that doesn't separate tokens on
the underscore character. What's the best / easiest way to achieve this?

I've tried removing the references to char code 95 in
StandardTokenizerTokenManager, but it doesn't seem to cut the mustard.
Should I be looking at modifying StandardTokenizer.jj and having javacc
generate my own tokenizer classes?
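One lower-tech alternative (a sketch only, not a modified StandardTokenizer, so it loses StandardTokenizer's handling of e-mail addresses, acronyms and the like) is a CharTokenizer-based analyzer that simply treats '_' as a token character:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;

public class UnderscoreAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream stream = new CharTokenizer(reader) {
      // Keep letters, digits and the underscore together in one token;
      // everything else acts as a token separator.
      protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
      }
    };
    return new LowerCaseFilter(stream);
  }
}

Whether to keep the LowerCaseFilter depends on how the rest of your fields are analyzed.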

thanks,
jim




Re: Way to repair an index broken during 1/2 optimize?

2004-07-09 Thread Doug Cutting
Kevin A. Burton wrote:
With the typical handful of fields, one should never see more than 
hundreds of files.

We only have 13 fields... Though to be honest I'm worried that even if I 
COULD do the optimize that it would run out of file handles.
Optimization doesn't open all files at once.  The most files that are 
ever opened by an IndexWriter is just:

4 + (5 + numIndexedFields) * (mergeFactor-1)
This includes during optimization.
However, when searching, an IndexReader must keep most files open.  In 
particular, the maximum number of files an unoptimized, non-compound 
IndexReader can have open is:

(5 + numIndexedFields) * (mergeFactor-1) * 
(log_base_mergeFactor(numDocs/minMergeDocs))

A compound IndexReader, on the other hand, should open at most, just:
(mergeFactor-1) * (log_base_mergeFactor(numDocs/minMergeDocs))
An optimized, non-compound IndexReader will open just (5 + 
numIndexedFields) files.

And an optimized, compound IndexReader should only keep one file open.
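To put rough numbers on this (an illustrative calculation only, assuming mergeFactor 10 and minMergeDocs 10, with the 13 indexed fields mentioned above and a hypothetical one million documents): an IndexWriter would open at most 4 + (5 + 13) * 9 = 166 files, while an unoptimized, non-compound IndexReader could keep up to (5 + 13) * 9 * log_10(1,000,000 / 10) = 162 * 5 = 810 files open.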
Doug


Re: Lucene shouldn't use java.io.tmpdir

2004-07-09 Thread Doug Cutting
Armbrust, Daniel C. wrote:
The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it.  I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) - her locks still existed, B) - I didn't have a clue where it put the locks on the Solaris machine (since no full path was given with the error - has this been fixed?) and C) - I didn't have permission to remove her locks.
I think these problems have been fixed.  When an index is created, all 
old locks are first removed.  And when a lock cannot be obtained, its 
full pathname is printed.  Can you replicate this with 1.4-final?

I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.
Changing the lock location is risky.  Code which writes an index would 
not be required to alter the lock location, but code which reads it 
would be.  This can easily lead to uncoordinated access.

So it is best if the default lock location works well in most cases.  We 
try to use a temporary directory writable by all users, and attempt to 
handle situations like those you describe above.  Please tell me if you 
continue to have problems with locking.

Thanks,
Doug


Re: Role of Operator in QueryParser

2004-07-09 Thread Otis Gospodnetic
Moving to lucene-user list.

Correct.
If I remember correctly, setting it to AND will turn a query like "foo
bar" into "foo AND bar" (or "+foo +bar").

Otis
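As a small sketch of that (illustrative only; the field name, analyzer and query string are assumptions, using the 1.4-era QueryParser API):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class DefaultAndExample {
  public static void main(String[] args) throws Exception {
    QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
    parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
    // "foo bar" should now parse to +contents:foo +contents:bar
    Query query = parser.parse("foo bar");
    System.out.println(query.toString("contents"));
  }
}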

--- jitender ahuja [EMAIL PROTECTED] wrote:
 Hi All,
 
  Can anyone, particularly those more enlightened in the inner
 details of Lucene, tell me the specific role of Operator in the
 QueryParser class?
 Can it be used to set an AND operator for a query that has multiple
 terms?
 
 
 Regards,
 Jitender





RE: Problem with match on a non tokenized field.

2004-07-09 Thread Polina Litvak
Thanks a lot for your help. I've done what you suggested and it works
great except in this particular case:

I am trying to search for something like abc-ef* - i.e. I want to find
all fields that start with abc-ef.
I use PerFieldAnalyzerWrapper together with NullAnalyzer to make sure
this field doesn't get tokenized on the '-', but at the same time I need
the analyzer to realize that '*' is the wildcard character, not part of the
field value itself.

Would you know how to work around this ?

Thank you,
Polina

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: July 8, 2004 1:10 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

The PerFieldAnalyzerWrapper is constructed with your default analyzer;
suppose this is the analyzer you use to tokenize.  You then call the
addAnalyzer method for each non-tokenized/keyword field.

In the case below, "url" is a keyword, all other fields are tokenized:

PerFieldAnalyzerWrapper analyzer = new
org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
query = QueryParser.parse(searchQuery, "contents", analyzer);



-Original Message-
From: Polina Litvak [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 08, 2004 10:19 AM
To: 'Lucene Users List'
Subject: RE: Problem with match on a non tokenized field.


Thanks a lot for your help.
I have one more question:

How would you handle a query consisting of two fields combined with a
Boolean operator, where one field is only indexed and stored (a Keyword)
and the other is tokenized, indexed and stored?
Is it possible to have parts of the same query analyzed with different
analyzers?


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: July 7, 2004 4:38 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper

Here is how I use it:

PerFieldAnalyzerWrapper analyzer = new
org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
try 
{
query = QueryParser.parse(searchQuery,
"contents",
analyzer);

-Original Message-
From: Polina Litvak [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 07, 2004 4:20 PM
To: [EMAIL PROTECTED]
Subject: Problem with match on a non tokenized field.


I have a Lucene Document with a field named "Code" which is stored 
and indexed but not tokenized. The value of the field is "ABC5-LB".
The only way I can match the field when searching is by entering 
Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried
using breaks my
query into Code:ABC5 -Code:LB.
 
I need to be able to match this field by doing something like
Code:ABC5-L*, therefore always using quotes is not an option.
 
How would I go about writing my own analyzer that will not tokenize the
query ?
 
Thanks,
Polina
 




RE: Problem with match on a non tokenized field.

2004-07-09 Thread wallen
I do not know how to work around that.

It is indeed an interesting situation that would require more understanding
of how the analyzer (in this case the NullAnalyzer) interacts with
special characters such as '*' and '~'.

You could try using the WhitespaceAnalyzer instead of the NullAnalyzer!

-Will
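Another option (an illustrative sketch, not from this thread; the index path is made up and the field/value come from the earlier messages) is to skip QueryParser for this field and build the wildcard query programmatically, so no analyzer ever touches the term:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

// The term text is used verbatim, so the '-' is preserved and the
// trailing '*' is interpreted as the wildcard.
IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new WildcardQuery(new Term("Code", "ABC5-L*"));
Hits hits = searcher.search(query);
System.out.println("matches: " + hits.length());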
