Increasing Linux kernel open file limits.
Don't know if anyone knew this: http://www.hp-eloquence.com/sdb/html/linux_limits.html

The kernel allocates file handles dynamically up to a limit specified by file-max. The value in file-max is the maximum number of file handles that the Linux kernel will allocate. When you get lots of error messages about running out of file handles, you might want to increase this limit. The three values in file-nr denote the number of allocated file handles, the number of used file handles, and the maximum number of file handles. When the allocated file handles come close to the maximum but the number actually in use is far behind, you've hit a peak in your file-handle usage and don't need to increase the maximum.

So while root can allocate as many file handles as it likes without any limits enforced by glibc, you still have to fight the kernel. Just doing an

echo 100 > /proc/sys/fs/file-max

works fine. Then I can keep track of my file-handle usage by doing a

cat /proc/sys/fs/file-nr

At least this works on 2.6.x... Think this is going to save me a lot of headache!

Kevin

--
Please reply using PGP. http://peerfear.org/pubkey.asc
NewsMonster - http://www.newsmonster.org/
Kevin A. Burton, Location - San Francisco, CA, Cell - 415.595.9965
AIM/YIM - sfburtonator, Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
IRC - freenode.net #infoanarchy | #p2p-hackers | #newsmonster

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
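Putting the commands above together as a quick sketch (the 100000 limit and the sysctl persistence step are illustrative additions, not from the post; raising the limit requires root, so those lines are shown commented out):

```shell
# The three fields of file-nr are: allocated, in use, maximum.
cat /proc/sys/fs/file-nr

# Raise the kernel-wide limit as root (100000 is only an example value):
#   echo 100000 > /proc/sys/fs/file-max

# To make the change survive a reboot, the sysctl interface can be used:
#   sysctl -w fs.file-max=100000
```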
Re: boolean operators and score
There's no need to sort the words here. You just have to ensure that the Lucene query built is the same for the requests you consider equivalent. I mean that if the request 'word1 word2' gives different results than 'word2 word1', the problem is in your query parser or in the way you give it the requests. I keep saying that with the Lucene query parser, the requests 'word1 and word2' and 'word2 and word1' are different because of the 'required' flag.

Franck

Niraj Alok wrote:
> Hi Don, after months of struggling with Lucene and finally achieving the complex relevancy desired, the client would kill me if I now made all that relevancy lost. I am trying to do it the way Franck suggested, by sorting the words the user has entered, but otherwise, isn't this a bug in Lucene?
> Regards, Niraj
>
> ----- Original Message -----
> From: Don Vaillancourt [EMAIL PROTECTED]
> To: Lucene Users List [EMAIL PROTECTED]
> Sent: Thursday, July 08, 2004 7:15 PM
> Subject: Re: boolean operators and score
>
> What could actually be done is perhaps to sort the search results by document id. Of course your relevancy will be all shot, but at least you would have control over the sorting order.
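One way to make equivalent requests produce identical parsed queries, as Franck suggests, is to canonicalize the user's input before it reaches the query parser. A minimal sketch in plain Java (the class and method names are only illustrative):

```java
import java.util.Arrays;

public class QueryNormalizer {
    // Sort whitespace-separated terms so that "word2 word1" and
    // "word1 word2" yield the same string, and therefore the same
    // parsed Lucene query, with the same 'required' flags.
    public static String normalize(String userQuery) {
        String[] terms = userQuery.trim().split("\\s+");
        Arrays.sort(terms);
        return String.join(" ", terms);
    }
}
```

Note this only helps for plain term lists; queries containing operators or phrases would need a real parse before reordering.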
Re: Browse by Letter within a Category
On Friday 09 July 2004 04:27, O'Hare, Thomas wrote:
> Searcher.search("category:\"Products\" AND title:\"A*\"", new Sort("title"));

You can only sort on fields which are not tokenized, I think. So add an extra field with the title, but untokenized, just for sorting. Also, A* might slow down the query execution, so you might want to add another field which just contains the first letter, so there's no need for the asterisk.

Regards
Daniel
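Deriving those two extra field values at indexing time can be sketched in plain Java (the helper names are illustrative; how you then wrap them as untokenized Lucene fields depends on your Lucene version):

```java
public class TitleFields {
    // Untokenized, case-normalized copy of the title, suitable as a sort key.
    public static String sortKey(String title) {
        return title.toLowerCase();
    }

    // Single-letter field so "browse by letter" needs no A* wildcard:
    // search firstLetter:a instead of title:A*.
    public static String firstLetter(String title) {
        String t = title.trim();
        return t.isEmpty() ? "" : t.substring(0, 1).toLowerCase();
    }
}
```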
How to access information from a part of the index
Hello, for my thesis I have to use a Lucene index for a text-categorization program. For that I need to split the index in two, so I have a learning set and a validation set. The problem is that I don't know how to ask Lucene to give me, for example, the number of documents IN ONLY ONE of these subsets containing a specific term. For example, I would like to get the number of documents containing the term hello in a subset of documents. This subset is a set of document numbers ({5,3}, while the complete index contains documents {0,1,2,3,4,5}). How can I do this in an efficient way? I tried to get all documents containing the term and then verify which documents belong to my subset. However, it appears that this is very slow. Thanks in advance

Claude Libois
Re: How to access information from a part of the index
Hi, why don't you just use two indexes? You probably do not have to index the test set at all. If you have two or more subsets, just use filters that only match the subsets you are interested in. Counting the documents in one subset that contain a certain term then becomes a search over the filtered index and counting the number of results. Filters are quite efficient. Hope this helps,

Karsten

--
Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab
Xtramind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone +49 (681) 3 02-51 13
Fax +49 (681) 3 02-51 09
[EMAIL PROTECTED]
www.xtramind.com

[quoted original message snipped]
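The filter-and-count idea can be sketched with plain Java BitSets, which is essentially what a Lucene Filter hands to the searcher (the doc-id sets here are hypothetical stand-ins for a term's posting list and the subset filter):

```java
import java.util.BitSet;

public class SubsetCount {
    // Count documents that both contain the term (postings) and belong
    // to the subset (filter): an AND of two bit sets, followed by a
    // population count. This avoids checking each hit individually.
    public static int count(BitSet postings, BitSet filter) {
        BitSet hits = (BitSet) postings.clone();
        hits.and(filter);
        return hits.cardinality();
    }
}
```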
why same query different execution time?
Can somebody explain the following to me: I execute a search with the same query, on the same index, using the same PC, and always get a different execution time. For example:

1st run:
LuceneItems: Search for : +contents:vasella, Documents found : 121, Documents age : [15.09.04 - 10.08.02]
LuceneItems: Last document retrieved - 39, Search time(ms) - 1362

2nd run:
LuceneItems: Search for : +contents:vasella, Documents found : 121, Documents age : [15.09.04 - 10.08.02]
LuceneItems: Last document retrieved - 39, Search time(ms) - 584

Regards
Joel
Re: how to ensure that AND occurs, please help
On Friday, July 09, 2004 1:57, Daniel Naber wrote:
> For fields title, body and query aaa bbb this will lead to +(title:aaa title:bbb) +(body:aaa body:bbb). So the clauses are required, but not the individual terms in a clause. I don't know a (simple) clean solution, but you could parse the query twice, first to get the AND right (queryParser.setOperator()), then again to get the fields right.

Thanks for your reply. I think you mean to say that for the case of two fields title, body and the query aaa bbb, what the AND should look like in the query is:

+(title:aaa +title:bbb) +(body:aaa +body:bbb)

and not

+(title:aaa title:bbb) +(body:aaa body:bbb)

But will this apply if one of the fields is null? My original query had missed one issue, i.e. if one field is null for a given Hit object even though it (the given Hit object) is there.

Regards, Jitender
RE: Lucene shouldn't use java.io.tmpdir
The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index, and asked me to look at it. I tried to rebuild the index (in the same place on a Solaris machine) and found out that:

A) her locks still existed,
B) I didn't have a clue where it put the locks on the Solaris machine (since no full path was given with the error - has this been fixed?), and
C) I didn't have permission to remove her locks.

I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.

Dan
Re: Lucene shouldn't use java.io.tmpdir
On Friday 09 July 2004 16:15, Armbrust, Daniel C. wrote:
> (since no full path was given with the error - has this been fixed?)

That's fixed in Lucene 1.4.

> I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.

There's already a Java system property that lets you specify the lock directory.

Regards
Daniel
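For reference, the system property Daniel mentions can be set on the JVM command line. This is a sketch, assuming the Lucene 1.4 property name `org.apache.lucene.lockDir` read by FSDirectory; the path and the `MyIndexerApp` class are illustrative placeholders:

```shell
# Point Lucene's lock files at a directory of your choosing instead of
# the default java.io.tmpdir location:
java -Dorg.apache.lucene.lockDir=/var/tmp/lucene-locks MyIndexerApp
```

Note the caveat from the rest of this thread: every process that opens the index, readers included, must agree on the same lock directory, or locking is not coordinated.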
NullAnalyzer still tokenizes fields
I tried to create my own analyzer so it returns fields as they are (without any tokenizing done), using code posted on lucene-user a short while ago:

private static class NullAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            protected boolean isTokenChar(char c) {
                return true;
            }
        };
    }
}

After testing this analyzer I found out that fields I pass to it still get tokenized. E.g. I have a field with the value ABCD-EF. When passing it through the analyzer, the only characters that end up in isTokenChar() are A, B, C, D, E, F. So it looks like - gets filtered out before it even gets to isTokenChar(). Did anyone encounter this problem? Any help will be greatly appreciated!

Thanks, Polina
Underscore tokenization
Hi, I'm trying to put together an Analyzer that doesn't separate tokens on the underscore character. What's the best / easiest way to achieve this? I've tried removing the references to char code 95 in StandardTokenizerTokenManager, but it doesn't seem to cut the mustard. Should I be looking at modifying StandardTokenizer.jj and having javacc generate my own tokenizer classes?

thanks, jim
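An alternative to regenerating the javacc grammar is a tokenizer that simply treats '_' as a token character (in Lucene this would be a CharTokenizer subclass whose isTokenChar accepts '_'). The splitting rule itself can be sketched in plain Java with a regex, since `\w` already includes the underscore; the class name is illustrative:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class UnderscoreTokenizer {
    // Split on runs of non-word characters; because \w matches
    // [A-Za-z0-9_], a term like my_field survives as one token.
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }
}
```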
Re: Way to repair an index broken during 1/2 optimize?
Kevin A. Burton wrote:
> With the typical handful of fields, one should never see more than hundreds of files.
>
> We only have 13 fields... Though to be honest I'm worried that even if I COULD do the optimize that it would run out of file handles.

Optimization doesn't open all files at once. The most files that are ever opened by an IndexWriter is just:

4 + (5 + numIndexedFields) * (mergeFactor - 1)

This includes during optimization. However, when searching, an IndexReader must keep most files open. In particular, the maximum number of files an unoptimized, non-compound IndexReader can have open is:

(5 + numIndexedFields) * (mergeFactor - 1) * (log_base_mergeFactor(numDocs / minMergeDocs))

A compound IndexReader, on the other hand, should open at most just:

(mergeFactor - 1) * (log_base_mergeFactor(numDocs / minMergeDocs))

An optimized, non-compound IndexReader will open just (5 + numIndexedFields) files. And an optimized, compound IndexReader should only keep one file open.

Doug
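Doug's formulas are easy to evaluate for concrete settings. A sketch that plugs in the 13 fields from this thread (the mergeFactor, minMergeDocs, and document counts in the test are illustrative defaults, not figures from the thread):

```java
public class OpenFileEstimate {
    // Maximum files an IndexWriter has open, including during optimize:
    //   4 + (5 + numIndexedFields) * (mergeFactor - 1)
    public static int writerMax(int numIndexedFields, int mergeFactor) {
        return 4 + (5 + numIndexedFields) * (mergeFactor - 1);
    }

    // Maximum files an unoptimized, non-compound IndexReader keeps open:
    //   (5 + numIndexedFields) * (mergeFactor - 1)
    //     * log_mergeFactor(numDocs / minMergeDocs)
    public static int readerMax(int numIndexedFields, int mergeFactor,
                                int numDocs, int minMergeDocs) {
        double levels = Math.log((double) numDocs / minMergeDocs)
                      / Math.log(mergeFactor);
        return (5 + numIndexedFields) * (mergeFactor - 1)
             * (int) Math.ceil(levels);
    }
}
```

For 13 fields and the default mergeFactor of 10, the writer tops out at 4 + 18 * 9 = 166 files, well under typical per-process limits; it's the unoptimized reader that multiplies this by the number of merge levels.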
Re: Lucene shouldn't use java.io.tmpdir
Armbrust, Daniel C. wrote:
> The problem I ran into the other day with the new lock location is that Person A had started an index, ran into problems, erased the index and asked me to look at it. I tried to rebuild the index (in the same place on a Solaris machine) and found out that A) her locks still existed, B) I didn't have a clue where it put the locks on the Solaris machine (since no full path was given with the error - has this been fixed?) and C) I didn't have permission to remove her locks.

I think these problems have been fixed. When an index is created, all old locks are first removed. And when a lock cannot be obtained, its full pathname is printed. Can you replicate this with 1.4-final?

> I think the locks should go back in the index, and we should fall back or give an option to put them elsewhere for the case of the read-only index.

Changing the lock location is risky. Code which writes an index would not be required to alter the lock location, but code which reads it would be. This can easily lead to uncoordinated access. So it is best if the default lock location works well in most cases. We try to use a temporary directory writable by all users, and attempt to handle situations like those you describe above. Please tell me if you continue to have problems with locking.

Thanks, Doug
Re: Role of Operator in QueryParser
Moving to the lucene-user list. Correct. If I remember correctly, setting it to AND will turn a query like foo bar into foo AND bar (or +foo +bar).

Otis

--- jitender ahuja [EMAIL PROTECTED] wrote:
> Hi All, can anyone, particularly those more enlightened in the inner details of Lucene, tell me the specific role of Operator in the QueryParser class? Can it be used to set an AND operator for queries with multiple terms?
> Regards, Jitender
RE: Problem with match on a non tokenized field.
Thanks a lot for your help. I've done what you suggested and it works great except in this particular case: I am trying to search for something like abc-ef* - i.e. I want to find all fields that start with abc-ef. I use PerFieldAnalyzerWrapper together with NullAnalyzer to make sure this field doesn't get tokenized on the -, but at the same time I need the analyzer to realize that '*' is the wildcard, not part of the field value itself. Would you know how to work around this?

Thank you, Polina

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: July 8, 2004 1:10 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

The PerFieldAnalyzerWrapper is constructed with your default analyzer, which we suppose is the analyzer you use to tokenize. You then call the addAnalyzer method for each non-tokenized/keyword field. In the case below, url is a keyword and all other fields are tokenized:

PerFieldAnalyzerWrapper analyzer =
    new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
query = QueryParser.parse(searchQuery, "contents", analyzer);

-----Original Message-----
From: Polina Litvak [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 08, 2004 10:19 AM
To: 'Lucene Users List'
Subject: RE: Problem with match on a non tokenized field.

Thanks a lot for your help. I have one more question: how would you handle a query consisting of two fields combined with a Boolean operator, where one field is only indexed and stored (a Keyword) and the other is tokenized, indexed and stored? Is it possible to have parts of the same query analyzed with different analyzers?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: July 7, 2004 4:38 PM
To: [EMAIL PROTECTED]
Subject: RE: Problem with match on a non tokenized field.

Use org.apache.lucene.analysis.PerFieldAnalyzerWrapper. Here is how I use it:

PerFieldAnalyzerWrapper analyzer =
    new org.apache.lucene.analysis.PerFieldAnalyzerWrapper(new MyAnalyzer());
analyzer.addAnalyzer("url", new NullAnalyzer());
try {
    query = QueryParser.parse(searchQuery, "contents", analyzer);

-----Original Message-----
From: Polina Litvak [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, July 07, 2004 4:20 PM
To: [EMAIL PROTECTED]
Subject: Problem with match on a non tokenized field.

I have a Lucene Document with a field named Code which is stored and indexed but not tokenized. The value of the field is ABC5-LB. The only way I can match the field when searching is by entering Code:"ABC5-LB", because when I drop the quotes, every Analyzer I've tried using breaks my query into Code:ABC5 -Code:LB. I need to be able to match this field by doing something like Code:ABC5-L*, therefore always using quotes is not an option. How would I go about writing my own analyzer that will not tokenize the query?

Thanks, Polina
RE: Problem with match on a non tokenized field.
I do not know how to work around that. It is indeed an interesting situation that would require more understanding of how the analyzer (in this case NullAnalyzer) interacts with special characters such as * and ~. You could try using the WhitespaceAnalyzer instead of the NullAnalyzer!

-Will

-----Original Message-----
From: Polina Litvak [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 09, 2004 4:45 PM
To: 'Lucene Users List'
Subject: RE: Problem with match on a non tokenized field.

[quoted thread snipped; it repeats the messages above verbatim]
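The semantics Polina is after - the '-' is literal but a trailing '*' is a wildcard - can be captured by handling the keyword field outside the analyzer entirely (in Lucene terms, building a prefix query on the untokenized field directly rather than letting QueryParser analyze it). A plain-Java sketch of just the matching rule, with illustrative names:

```java
public class KeywordWildcard {
    // Treat a trailing '*' as "prefix match"; everything else,
    // including '-', is part of the literal keyword value.
    public static boolean matches(String fieldValue, String userInput) {
        if (userInput.endsWith("*")) {
            String prefix = userInput.substring(0, userInput.length() - 1);
            return fieldValue.startsWith(prefix);
        }
        return fieldValue.equals(userInput);
    }
}
```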