RE: Score
Thank you Chris for your support.

__ Matt

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 12:54 AM
To: java-user@lucene.apache.org
Subject: RE: Score

* This message comes from the Internet Network *

: Have you looked at the constructor for BooleanQuery and
: tried passing true to disable the Coord factor?
:
: Thanks Chris, this is exactly what I want,
: but I am working with Lucene 1.4.3 because I have to for some reasons.
:
: Is there any equivalent?

if you look at the source for it, it's fairly trivial ... you should be able to put the same logic into a simple little helper function you use when making BooleanQueries.

-Hoss

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Internet communications are not secure and therefore Fortis Banque Luxembourg S.A. does not accept legal responsibility for the contents of this message. The information contained in this e-mail is confidential and may be legally privileged. It is intended solely for the addressee. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited and may be unlawful. Nothing in the message is capable or intended to create any legally binding obligations on either party and it is not intended to provide legal advice.
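[For readers following along on 1.4.3: the coord factor being discussed is the multiplier (matching optional clauses / total optional clauses) that BooleanQuery applies to each document's score. The sketch below is plain Java illustrating only the arithmetic, with hypothetical names; it is not Lucene code.]

```java
// Sketch of BooleanQuery's coord factor: a document's score is multiplied
// by (number of matching clauses) / (total clauses). "Disabling coord"
// replaces that multiplier with 1.0f.
public class CoordSketch {
    static float coord(int overlap, int maxOverlap, boolean disableCoord) {
        return disableCoord ? 1.0f : (float) overlap / (float) maxOverlap;
    }

    static float score(float rawScore, int overlap, int maxOverlap, boolean disableCoord) {
        return rawScore * coord(overlap, maxOverlap, disableCoord);
    }

    public static void main(String[] args) {
        // A doc matching 1 of 4 clauses with raw score 2.0:
        System.out.println(score(2.0f, 1, 4, false)); // coord applied -> 0.5
        System.out.println(score(2.0f, 1, 4, true));  // coord disabled -> 2.0
    }
}
```

A 1.4.3 helper along the lines Hoss suggests would apply the same substitution when it assembles the BooleanQuery.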
Re: Boost/Scoring question
Hi Chris,

: If I search for a document where the field boost is 0.0 then the document is not
: found if I just search that field. Is this expected???

you mean you search on: A^0 and get no results even though documents contain A, and if you search on: +A^0 B^1 you see those documents?

It's the index-time boost, rather than the query-time boost. This short example shows the behaviour of searches for +A, +A +B, and +B where A was indexed with boost 0.0 and B with 1.0:

IndexWriter writer = new IndexWriter(TestTools.getRoot(), new StandardAnalyzer(), true);
Field f1 = new Field("subject", "subject - boost factor 0.0F", Field.Store.YES, Field.Index.TOKENIZED);
f1.setBoost(0.0F);
Field f2 = new Field("body", "body - boost factor 1.0F", Field.Store.YES, Field.Index.TOKENIZED);
f2.setBoost(1.0F);
Document doc = new Document();
doc.add(f1);
doc.add(f2);
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(TestTools.getRoot());
QueryParser qp;
Query query;
Hits hits;
Explanation explanation = null;

// Match on a single zero-boost field
qp = new QueryParser("subject", new StandardAnalyzer());
query = qp.parse("+subject");
hits = searcher.search(query);
System.out.println("Search just subject field, no hit found with boost 0");

// Match on both fields
qp = new QueryParser("subject", new StandardAnalyzer());
query = qp.parse("+subject:subject +body:body");
hits = searcher.search(query);
System.out.println("Search +subject +body, match found, score on hit=" + hits.score(0));
explanation = searcher.explain(query, 0);
System.out.println(explanation);

// Match on a single non-zero-boost field
qp = new QueryParser("body", new StandardAnalyzer());
query = qp.parse("+body:body");
hits = searcher.search(query);
System.out.println("Search just on body field=" + hits.score(0));
explanation = searcher.explain(query, 0);
System.out.println(explanation);

if you plan on using Hits, i would suggest requiring that boosts be > 0 .. if you want to start dealing with raw scores, then boosts can definitely be 0. Hits is sufficient for now, but that may change.

Thanks
Antony
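[Why the zero-boost field yields no hits at all: the index-time boost is a multiplicative factor in the score, so every score for that field comes out 0.0, and Hits-style collection keeps only strictly positive scores. The class below is a simplified stand-in written for this explanation, not Lucene's actual collector.]

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Hits-style collection: a zero index-time boost
// multiplies every score for that field down to 0.0, and only strictly
// positive scores are kept, so the document never appears in Hits.
public class ZeroBoostSketch {
    static float score(float tfIdf, float fieldBoost) {
        return tfIdf * fieldBoost; // boost is multiplicative
    }

    static List<Integer> collect(float[] scores) {
        List<Integer> kept = new ArrayList<Integer>();
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] > 0.0f) { // zero-score docs are dropped
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        float subjectScore = score(1.4f, 0.0f); // boost 0.0 -> score 0.0
        float bodyScore = score(1.4f, 1.0f);    // boost 1.0 -> score 1.4
        System.out.println(collect(new float[] { subjectScore, bodyScore }));
    }
}
```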
Deleting document by file name
Hi,

I have a list of filenames like Corporate.htm, Logistics.htm, Merchant.htm that need to be deleted. For now I give this list to my Search application that reads the index and gives the results, and if the path contains one of the filenames, I don't display this hit ... Not really proper programming ...

Is there a way to delete the document in the index instead with this information?

Thank you.

__ Matt
problem with field.setboost(5.0f) on lucene 2.00
I am building up a search engine using Lucene 2.0, and I am having a problem using the term boost setBoost. A part of my code is:

doc.add(new Field("title", httpd.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
doc.getField("title").setBoost(5.0f); // <=== the boost won't update to 5.0, it remains 1.0
writer.addDocument(doc);
writer.optimize();
writer.close();

but when I look up in the index created, the field title is still 1.0. Can someone help me? thx

--
View this message in context: http://www.nabble.com/problem-with-field.setboost%285.0f%29-on-lucene-2.00-tf3154250.html#a8746530
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Deleting document by file name
Believe it or not, you delete items with IndexReader <G>. You can either delete by document ID or by Term. Be aware that currently open searchers will still find these documents (even after they have been deleted) until the *searcher* is closed and reopened.

Erick

On 2/1/07, DECAFFMEYER MATHIEU [EMAIL PROTECTED] wrote:
> I have a list of filenames like Corporate.htm, Logistics.htm, Merchant.htm
> that need to be deleted. [...] Is there a way to delete the document in the
> index instead with this information?
Re: problem with field.setboost(5.0f) on lucene 2.00
I haven't played with boosts, but I suspect your ordering is wrong. You've already added the field to the document before you set the boost. Try:

Field f = new Field(...);
f.setBoost(...);
doc.add(f);
writer.addDocument(doc);

Best
Erick

On 2/1/07, liquideshark [EMAIL PROTECTED] wrote:
> doc.add(new Field("title", httpd.getTitle(), Field.Store.YES, Field.Index.TOKENIZED));
> doc.getField("title").setBoost(5.0f); // the boost won't update to 5.0, it remains 1.0
> [...] but when I look up in the index created, the field title is still 1.0.
Re: problem with field.setboost(5.0f) on lucene 2.00
Yes you are right, but I have changed it to:

Field tiTle = new Field("title", httpd.getTitle(), Field.Store.YES, Field.Index.TOKENIZED);
tiTle.setBoost(6.1f);
doc.add(tiTle);

--- it still doesn't make any change on the boost value. For information, I use luke.jar to see if the value has changed.

nice reading you again
Tandina

Erick Erickson wrote:
> I haven't played with boosts, but I suspect your ordering is wrong. You've
> already added the field to the document before you set the boost. [...]

--
View this message in context: http://www.nabble.com/problem-with-field.setboost%285.0f%29-on-lucene-2.00-tf3154250.html#a8748508
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
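[A likely explanation for what Luke shows — my own note, not from the thread: Lucene does not store the field boost as such. At index time the boost is folded into the field's norm together with length normalization (with the default Similarity, lengthNorm(n) = 1/sqrt(n)), and that product is then encoded into a single lossy byte, so the raw 6.1f is never directly visible in the index. A plain-Java sketch of the arithmetic, not Lucene code:]

```java
// Sketch (not Lucene source): the per-field norm written to the index is
// boost * lengthNorm(numTerms), with DefaultSimilarity's
// lengthNorm(n) = 1 / sqrt(n). The boost is therefore not recoverable
// on its own from the stored value.
public class NormSketch {
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static float norm(float fieldBoost, int numTerms) {
        return fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // A 4-term title with boost 6.1 contributes a norm of ~3.05, not 6.1:
        System.out.println(norm(6.1f, 4));
    }
}
```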
RE: Deleting document by file name
If I have only the path of the document, how can I find the ID?

__ Matt

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 2:09 PM
To: java-user@lucene.apache.org
Subject: Re: Deleting document by file name

> Believe it or not, you delete items with IndexReader <G>. You can either
> delete by document ID or by Term. Be aware that currently open searchers
> will still find these documents (even after they have been deleted) until
> the *searcher* is closed and reopened. [...]
Building lucene index using 100 Gb Mobile HardDisk
Dear All,

I was indexing 660,000 XML documents. The unoptimized index was successfully built in about 17 hrs... This index resides in my D drive, which has 38 Gb free. This space is insufficient for optimizing the index -- I read that the Lucene documentation mentions its requirement of substantial temporary free disk space.

Hence, I now put my index on the Mobile HardDisk, which has a capacity of 100 Gb, and restarted indexing. However... there is an exception saying:

java.io.IOException: There is not enough space on the disk

when I index the 120,000th document. It is a pretty weird thing!! ... I keep thinking about what is happening... Is there anything to do with the file system used? My D drive uses the NTFS file system whilst this Mobile HDD uses FAT32...

Any comments/suggestions pls?

Thanks and best regards,
Maureen
RE: Deleting document by file name
do something like this:

public class Index extends IndexModifier {
    ...
    public int deleteDocuments(String field, String value) throws IOException {
        return super.deleteDocuments(new Term(field, value));
    }
}

use it like this:

index.deleteDocuments("field name", "field value");

_
From: DECAFFMEYER MATHIEU [mailto:[EMAIL PROTECTED]]
Sent: 01 February 2007 09:53
To: java-user@lucene.apache.org
Subject: Deleting document by file name

> I have a list of filenames like Corporate.htm, Logistics.htm, Merchant.htm
> that need to be deleted. [...] Is there a way to delete the document in the
> index instead with this information?
RE: Building lucene index using 100 Gb Mobile HardDisk
FAT32 imposes a lower file size limitation than NTFS. Attempts to create files greater than 4 GB on FAT32 will throw the error you are seeing.

-----Original Message-----
From: maureen tanuwidjaja [mailto:[EMAIL PROTECTED]]
Sent: 01 February 2007 14:22
To: java-user@lucene.apache.org
Subject: Building lucene index using 100 Gb Mobile HardDisk

> Is there anything to do with the file system used? My D drive uses the NTFS
> file system whilst this Mobile HDD uses FAT32 ... [...]
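[To put numbers on this — the document sizes below are my own hypothetical figures, not Maureen's: FAT32 caps any single file at 2^32 − 1 bytes (one byte under 4 GiB), and optimizing merges the whole index into one segment, so an individual file can blow past that limit even with 100 GB free on the disk.]

```java
// Sketch: why optimize() can fail on FAT32 even with plenty of free space.
// FAT32 limits any single file to 2^32 - 1 bytes; optimize merges the whole
// index into one segment, whose files can easily exceed that.
public class Fat32Limit {
    static final long FAT32_MAX_FILE_BYTES = (1L << 32) - 1; // 4,294,967,295

    static boolean fitsOnFat32(long fileBytes) {
        return fileBytes <= FAT32_MAX_FILE_BYTES;
    }

    public static void main(String[] args) {
        long avgDocBytes = 20000L;               // hypothetical average per doc
        long docs = 660000L;
        long mergedSegment = avgDocBytes * docs; // 13,200,000,000 bytes in one file
        System.out.println(fitsOnFat32(mergedSegment)); // false
    }
}
```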
RE: Deleting document by file name
I see now :)

Thank you all for your support

__ Matt

-----Original Message-----
From: WATHELET Thomas [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 01, 2007 3:28 PM
To: java-user@lucene.apache.org
Subject: RE: Deleting document by file name

> do something like this:
>
> public class Index extends IndexModifier {
>     ...
>     public int deleteDocuments(String field, String value) throws IOException {
>         return super.deleteDocuments(new Term(field, value));
>     }
> } [...]
Re: Please Help me on Lucene
I am also only a novice, but that should work for you. One row in your table == one doc in Lucene. I would index it like that, for one row/document:

Document doc = new Document();
doc.add(new Field("prod_Id", ...
doc.add(new Field("prod_name", ...
...
writer.addDocument(doc);

Now check your index using Luke: http://www.getopt.org/luke/ Just start the web version quickly and look at your index. (Luke is simple and nice to use, mostly self-explanatory.) You can search for your value and look at the docs. So use it to check your query...

If you search for a value, in your class you get the Hits object:

Hits hits = searcher.search(query);
Document doc = hits.doc(i); // gives you the document (rank i), which is one row
doc.get("prod_Id"); // to retrieve the field values

Cheers,
Christoph
RE: Building lucene index using 100 Gb Mobile HardDisk
Oh is it? I didn't know about that... So does it mean I can't use this Mobile HDD...

Damien McCarthy [EMAIL PROTECTED] wrote:
> FAT32 imposes a lower file size limitation than NTFS. Attempts to create
> files greater than 4 GB on FAT32 will throw the error you are seeing. [...]
Locking in Lucene 2.0
Hi,

I am starting to work with Lucene 2.0 and I noticed that we can no longer create an FSDirectory using a LockFactory. Could someone point me to some discussion or documentation related to locking and what has changed in terms of best practices? It appears that the only way to build custom locking is to write my own Directory implementation.

Thanks
-hareesh
Re: Locking in Lucene 2.0
Kadlabalu, Hareesh wrote:
> I am starting to work with Lucene 2.0 and I noticed that we can no longer
> create an FSDirectory using a LockFactory. [...] It appears that the only
> way to build custom locking is to write my own Directory implementation.

I think you mean Lucene trunk (not 2.0)? LockFactory hasn't been released yet. With the trunk, you can still instantiate an FSDirectory with your own LockFactory using FSDirectory.getDirectory(...). Maybe you are referring to the boolean create argument? That was deprecated/removed from FSDirectory.getDirectory as part of: http://issues.apache.org/jira/browse/LUCENE-773

Mike
Advices on a replacement of Lucene gap encoding scheme?
Dear all,

I am happy to send my first email to the Lucene community, as after subscribing to the mailing list I haven't actually joined the community, just standing aside and following many interesting threads. As part of my school project, I am intending to make some improvements in the Lucene source code, and I need some advice on how significant my modification work would be.

What I am interested in so far is the gap encoding scheme in Lucene, which is used in DocumentWriter.writePostings() to record the gap positions of a term within a document. writePostings(), in turn, calls the writeVInt() method to record the gap, which is the byte-aligned coding scheme.

I'm thinking of replacing the byte-aligned scheme with the fixed binary coding scheme described in the paper "Index compression using Fixed Binary Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found here: http://www.cs.mu.oz.au/~vo/abstracts/am04:adc.html). This scheme basically breaks the list of gaps into segments whose gaps (in one segment) will be coded in a fixed data width w (bits). The number of gaps in each segment is recorded in a span variable s, and the pair (w,s) forms a selector assigned to that segment. By effectively decomposing the list, reducing the number of selectors to 16 combinations of relative data size (vs. the previous segment) and span, and using a greedy algorithm to find suboptimal solutions, the authors claim that they could achieve better compression effectiveness (measured in bits per pointer averaged across the whole index) and retrieval time compared to the Golomb, interpolative, byte-aligned, and word-aligned coding schemes.

What I wonder at this time is, in the case of Lucene, how feasible it is to implement the fixed binary scheme in a way that could enhance performance, and whether there are other parts where I could also consider replacing the gap-encoding scheme.

As I've started playing around with Lucene recently, I hope to have your help to understand Lucene better ^_^

Best regards,
Luong Minh Thang
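[For readers unfamiliar with the scheme being replaced: Lucene's writeVInt() stores an integer in 7-bit groups, low-order group first, with the high bit of each byte acting as a continuation flag. The following is a self-contained re-implementation of that encoding for illustration, not the Lucene source:]

```java
import java.io.ByteArrayOutputStream;

// Re-implementation of Lucene's variable-byte (VInt) encoding for
// illustration: 7 data bits per byte, low-order bits first, high bit
// set on every byte except the last.
public class VIntSketch {
    static byte[] writeVInt(int i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((i & ~0x7F) != 0) {         // more than 7 bits remain
            out.write((i & 0x7F) | 0x80);  // emit 7 bits plus continuation flag
            i >>>= 7;
        }
        out.write(i);                      // final byte, high bit clear
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // A gap of 127 fits in one byte; 128 spills into two.
        System.out.println(writeVInt(127).length); // 1
        System.out.println(writeVInt(128).length); // 2
    }
}
```

Small gaps (the common case for frequent terms) thus cost one byte each, which is the baseline any fixed-binary replacement has to beat.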
Re: Please Help me on Lucene
Please, do not ever, under any circumstances at all, cross-post a message to all of these lists -- there is absolutely no reason for it, and doing so will most likely only make people mad and uncooperative.

If you are trying to use Java Lucene, then post your message to the java-user list. If you are trying to use Solr, then post your message to the solr-user list, etc.

If you know you want to do something with Lucene, but you aren't sure which list to mail to, email [EMAIL PROTECTED] and ask what the appropriate list is for your question.

-Hoss
Advices on a replacement of Lucene gap encoding scheme?
Dear all,

I am resending my earlier question about replacing the byte-aligned gap encoding in DocumentWriter.writePostings() with the fixed binary coding scheme of Anh and Moffat (see my previous message above), with one addition:

PS: for this type of discussion, which mailing list is most appropriate for my email?

Best regards,
Luong Minh Thang
bad queryparser bug
I have discovered a serious bug in QueryParser. The following query:

contents:sales contents:marketing || contents:industrial contents:sales

is parsed as:

+contents:sales +contents:marketing +contents:industrial +contents:sales

The same parsed query occurs even with parentheses:

(contents:sales contents:marketing) || (contents:industrial contents:sales)

Is there any way around this bug?

Thanks,
Peter
Looking for crawler recommendations.
Has anyone integrated a crawler with Lucene that they had success with? I cannot use Nutch, since 60% of our searchable content is contained in a database. I need to do a hybrid between database indexing and website crawling. I would be crawling just one domain with a given set of directories.

I found this list of crawlers, but nothing that quite seems to fit my needs. One problem with a couple of the libraries that might work is that they use a GNU license.

http://www.manageability.org/blog/stuff/open-source-web-crawlers-java/view

Thanks.
trouble with permissions?
i seem to be having a problem analogous to this one (no answer that i see): http://www.gossamer-threads.com/lists/lucene/java-user/32268?search_string=cannot%20overwrite;#32268

trouble is, i just put lucene on my new macbook pro and am having the problem that if i build a large index, i get an I/O error due to something like:

java.io.IOException: Cannot overwrite: /data/reuters/indexes/reuters/deleteable.new

same code worked fine on my previous machine (still running on a G4 powerbook and a linux machine). sometimes it has trouble writing the segments file instead...

has anyone seen and solved this problem? thoughts on what might be behind it?

thanks,
-Miles
Re: trouble with permissions?
Miles Efron wrote:
> i just put lucene on my new macbook pro and am having the problem that if
> i build a large index, i get an I/O error due to something like:
> java.io.IOException: Cannot overwrite: /data/reuters/indexes/reuters/deleteable.new
> [...] has anyone seen and solved this problem?

Are you running Windows on your macbook pro? There are known issues like this, but only on Windows, eg: http://issues.apache.org/jira/browse/LUCENE-665

We believe such cases are now fixed by lockless commits, on the trunk of Lucene (which is not yet released). If you could try the trunk (but beware that the API and file formats can change) and see if this still happens, that'd be great!

Mike
Re: trouble with permissions?
Mike,

You rule. Swapping out the nightly build seems to have fixed the problem... tried it on two problematic cases and both worked.

For the record, I'm running mac os 10.4.8.

Do you know if the lockless commits will be included in the next stable release?

Thanks so much!
-Miles

On Feb 1, 2007, at 3:33 PM, Michael McCandless wrote:
> We believe such cases are now fixed by lockless commits, on the trunk of
> Lucene (which is not yet released). If you could try the trunk (but beware
> that the API and file formats can change) and see if this still happens,
> that'd be great! [...]
Re: trouble with permissions?
Miles Efron wrote:
> You rule. Swapping out the nightly build seems to have fixed the
> problem... tried it on two problematic cases and both worked.

Phew!

> For the record, I'm running mac os 10.4.8.

Uh-oh, I can't explain why you would hit these errors on OS X 10.4.8; we have only seen these on Windows. Are you sure switching to trunk has fixed it? Lockless commits makes Lucene "write once", so this works around a number of file system quirks. Still, it'd be good to get to your root cause. Is the index stored on a remote (Windows CIFS) mount? Or is it stored on a local (Mac OS HFS+) drive?

> Do you know if the lockless commits will be included in the next stable
> release?

Yes, this will be included in 2.1 -- I think 2.1 will be released soon (there have been discussions on the dev list to get the release process started soon).

Mike
Re: bad queryparser bug
There is a ton of discussion of this if you search the lucene user list (QueryParser and precedence and the 'binary' operators). I have seen many mentions of the precedence parser still having open issues, but no mention of what those issues are.

Peter Keegan wrote: OK, I see that I'm not the first to discover this behavior of QueryParser. Can anyone vouch for the integrity of the PrecedenceQueryParser here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/ Thanks, Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: trouble with permissions?
I really don't know why os x could have induced those kinds of filesystem issues. i assumed that since i had switched over to the intel architecture that perhaps something was going on with the JVM... everything involved in the process was mac; local filesystem, etc. but i'm fairly sure that the trunk code has fixed the problem. i ran two 'offending' bits of code and checked their results. not only did they finish (quite a feat today), but they did so correctly. -Miles

On Feb 1, 2007, at 4:19 PM, Michael McCandless wrote:

Miles Efron wrote: You rule. Swapping out the nightly build seems to have fixed the problem... tried it on two problematic cases and both worked.

Phew!

For the record, I'm running mac os 10.4.8.

Uh-oh, I can't explain why you would hit these errors on OS X 10.4.8; we have only seen these on Windows. Are you sure switching to trunk has fixed it? Lockless commits makes Lucene write-once, which works around a number of file system quirks. Still, it'd be good to get to your root cause. Is the index stored on a remote (Windows CIFS) mount? Or is it stored on a local (Mac OS HFS+) drive?

Do you know if the lockless commits will be included in the next stable release?

Yes, this will be included in 2.1 -- I think 2.1 will be released soon (there's been discussion on the dev list to get the release process started soon). Mike
Re: trouble with permissions?
Miles Efron wrote: I really don't know why os x could have induced those kinds of filesystem issues. i assumed that since i had switched over to the intel architecture that perhaps something was going on with the JVM... everything involved in the process was mac; local filesystem, etc. but i'm fairly sure that the trunk code has fixed the problem. i ran two 'offending' bits of code and checked their results. not only did they finish (quite a feat today), but they did so correctly.

OK, I will keep my fingers crossed that there isn't another issue lurking :) Mike
searching by field's TF vector (not MoreLikeThis)
I'm looking for a way to search by a field's internal TF vector representation. MoreLikeThis does not seem to be what I want -- it constructs a text query based on the top-scoring TF-IDF terms. I want to query by TF vector directly, bypassing the tokens. Lucene presumably computes the cosine distance between these vectors internally -- does it expose that in such a way that I can query the top-N results given a field's TF vector?
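There is no query-by-raw-TF-vector call in Lucene, but if the field was indexed with term vectors you can retrieve them (in Lucene of this era, via IndexReader.getTermFreqVector(docId, field)) and score candidate documents yourself. The cosine step of such a scorer might look like the sketch below, which works over plain term-to-frequency maps; the Lucene retrieval call itself is omitted, and the class and method names here are illustrative only.

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSketch {
    // Cosine similarity between two raw term-frequency vectors,
    // represented as term -> frequency maps (e.g. built from the
    // terms/frequencies of a Lucene TermFreqVector).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bf = b.get(e.getKey());
            if (bf != null) dot += e.getValue() * bf;
        }
        double normA = 0.0, normB = 0.0;
        for (int f : a.values()) normA += (double) f * f;
        for (int f : b.values()) normB += (double) f * f;
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> q = new HashMap<String, Integer>();
        q.put("sales", 2); q.put("marketing", 1);
        Map<String, Integer> d = new HashMap<String, Integer>();
        d.put("sales", 2); d.put("marketing", 1);
        System.out.println(cosine(q, d)); // same direction -> 1.0
        Map<String, Integer> e = new HashMap<String, Integer>();
        e.put("industrial", 3);
        System.out.println(cosine(q, e)); // no overlap -> 0.0
    }
}
```

For top-N you would compute this score for each candidate document's vector and keep the N largest; note this ignores IDF, which is part of why MoreLikeThis takes its term-selection approach instead.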
Lucene Javadoc Exception - cause?
Hi, I was implementing some calls to Lucene and was curious whether there is some documentation I'm missing that indicates why a method throws an exception. For example, IndexReader.deleteDocuments() -- what is the root cause for it throwing IOException? I'm trying to use this info to determine my exception-handling strategy for all my Lucene API calls (should I fail, retry, ignore, etc.). Thanks, Josh
Re: bad queryparser bug
Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: bad queryparser bug
OK, I see that I'm not the first to discover this behavior of QueryParser. Can anyone vouch for the integrity of the PrecedenceQueryParser here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/miscellaneous/src/java/org/apache/lucene/queryParser/precedence/ Thanks, Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: Correction: The query parser produces the correct query with the parentheses. But, I'm still looking for a fix for this. I could use some advice on where to look in QueryParser to fix this. Thanks, Peter

On 2/1/07, Peter Keegan [EMAIL PROTECTED] wrote: I have discovered a serious bug in QueryParser. The following query: contents:sales contents:marketing || contents:industrial contents:sales is parsed as: +contents:sales +contents:marketing +contents:industrial +contents:sales The same parsed query occurs even with parentheses: (contents:sales contents:marketing) || (contents:industrial contents:sales) Is there any way around this bug? Thanks, Peter
Re: Lucene Javadoc Exception - cause?
Well, in the normal course of events, things like deleteDocuments(Term) shouldn't throw an exception unless I've screwed up. In my experience, Lucene usually gracefully handles normal error cases. In this case, there not being any underlying documents that match the Term is, I believe, handled by just returning 0 for the number of documents deleted. If the underlying Directory is closed, say, or the index is corrupted, you might get an error thrown. Neither would be safe to ignore.

Except for a single case in my experience, exceptions are thrown by Lucene because of a failure that retrying won't solve and ignoring would be a bad idea. Good programming practice precludes throwing exceptions for *recoverable* problems. The only case I can remember where Lucene threw what I thought was an inappropriate exception was calling a WildcardEnum with a term that had no wildcard, and that's since been fixed in the trunk.

Catching and ignoring errors is not something I'd recommend unless there is *good* reason to believe that the underlying cause will fix itself. That's much more common in, say, communications programs where the network can be flaky than it is in programs like Lucene. Best, Erick

On 2/1/07, Josh Joy [EMAIL PROTECTED] wrote: Hi, I was implementing some calls to Lucene, though was curious if there was some documentation I was missing that indicated why a method throws an exception. Example, IndexReader - deleteDocuments() - what is the root cause as to why it throws IOException? I'm trying to utilize this info to determine my exception handling strategy for all my Lucene API calls (should I fail, retry, ignore, etc) Thanks, Josh
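The rule of thumb above (retry only when the underlying cause can plausibly clear itself, otherwise fail loudly) can be sketched as a small wrapper. Everything here is illustrative rather than a Lucene API: the IndexOperation callback, the retry count, and the fake flaky operation are all made up for the example.

```java
import java.io.IOException;

public class RetrySketch {
    // Hypothetical callback wrapping a single index call,
    // e.g. reader.deleteDocuments(term).
    interface IndexOperation {
        int run() throws IOException;
    }

    // Retry a few times, then rethrow. Per the advice above, most
    // Lucene IOExceptions are NOT transient, so a retry policy like
    // this should be reserved for causes known to clear themselves
    // (e.g. a flaky network mount) -- never used to silently ignore.
    static int runWithRetry(IndexOperation op, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.run();
            } catch (IOException e) {
                last = e; // possibly transient: try again
            }
        }
        throw last; // exhausted: fail loudly rather than ignore
    }

    public static void main(String[] args) throws IOException {
        // Fake operation that fails twice, then succeeds.
        final int[] calls = {0};
        IndexOperation flaky = new IndexOperation() {
            public int run() throws IOException {
                if (++calls[0] < 3) throw new IOException("transient");
                return 7; // e.g. number of docs deleted
            }
        };
        System.out.println(runWithRetry(flaky, 5)); // prints 7
        System.out.println(calls[0]);               // prints 3 (two failures + one success)
    }
}
```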
Re: Use of only a prohibit search
Adding a MatchAllDocsQuery instance to your BooleanQuery if all clauses are prohibited is in fact still the best way to do a purely negative query. The trunk makes this easier by adding MatchAllDocsQuery syntax to the query parser... *:* -description:plot -Hoss
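The check described above can be wrapped in a small helper: before running a hand-built query, test whether every clause is prohibited, and if so prepend a match-all clause. The sketch below models clauses as plain objects rather than real BooleanClause instances, so the names (Clause, prohibited, the "*:*" string) are illustrative only; in real Lucene code the added clause would be an actual MatchAllDocsQuery.

```java
import java.util.ArrayList;
import java.util.List;

public class NegativeQuerySketch {
    // Stand-in for org.apache.lucene.search.BooleanClause.
    static class Clause {
        final String query;
        final boolean prohibited;
        Clause(String query, boolean prohibited) {
            this.query = query;
            this.prohibited = prohibited;
        }
    }

    // If every clause is prohibited, prepend a match-all clause so the
    // prohibitions have something to subtract from (the "*:*" trick).
    static List<Clause> fixPurelyNegative(List<Clause> clauses) {
        for (Clause c : clauses) {
            if (!c.prohibited) return clauses; // has a positive clause: fine as-is
        }
        List<Clause> fixed = new ArrayList<Clause>();
        fixed.add(new Clause("*:*", false)); // match-all placeholder
        fixed.addAll(clauses);
        return fixed;
    }

    public static void main(String[] args) {
        List<Clause> q = new ArrayList<Clause>();
        q.add(new Clause("-description:plot", true));
        // purely negative: one clause in, two out (match-all prepended)
        System.out.println(fixPurelyNegative(q).size());
        q.add(new Clause("+title:foo", false));
        // has a positive clause: two in, two out (unchanged)
        System.out.println(fixPurelyNegative(q).size());
    }
}
```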
Re: bad queryparser bug
please do not cross post questions about using the Lucene API to both the user and dev mailing lists -- the user list is the correct place to ask questions about behavior you are seeing that you think may be a bug. -Hoss
Re: bad queryparser bug
: The query parser produces the correct query with the parentheses.
: But, I'm still looking for a fix for this. I could use some advice on where
: to look in QueryParser to fix this.

the best advice i can give you: don't use the binary operators.

* Lucene is not a boolean logic system
* BooleanQuery does not implement boolean logic
* QueryParser is not a boolean language parser

(If i could go back in time and stop the AND/OR/NOT/&&/|| aliases from being added to the QueryParser -- i would) -Hoss
Re: Boost/Scoring question
: It's the index time boost, rather than query time boost. This short example
: shows the behaviour of searches for A

... index boosts! ... totally didn't occur to me that was what you were talking about.

Yes: it makes sense that if you give a field an index-time boost of 0.0f you won't be able to query with Hits on that field for that doc: with a field boost of 0 your fieldNorm is going to be 0, which is going to make any score from that field a 0. -Hoss
Re: problem with field.setboost(5.0f) on lucene 2.00
: it still dont make any change on the boost value, for information i use
: luke.jar to see if the value had change

i'm not sure what you mean about using luke to see if the value has changed ... boosts aren't stored in the index (they are used to compute a fieldNorm) so there's nothing for luke to show you (but i haven't played with luke extensively so i'm not sure what you might be looking at that relates to boosts).

please note the javadocs for setBoost and getBoost...

public void setBoost(float boost)
Sets the boost factor for hits on this field. This value will be multiplied into the score of all hits on this field of this document. The boost is multiplied by Document.getBoost() of the document containing this field. If a document has multiple fields with the same name, all such values are multiplied together. This product is then multiplied by the value Similarity.lengthNorm(String,int), and rounded by Similarity.encodeNorm(float) before it is stored in the index. One should attempt to ensure that this product does not overflow the range of that encoding.

public float getBoost()
Returns the boost factor for hits for this field. The default value is 1.0. Note: this value is not stored directly with the document in the index. Documents returned from IndexReader.document(int) and Hits.doc(int) may thus not have the same value present as when this field was indexed.

-Hoss
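The product described in that javadoc can be seen numerically. DefaultSimilarity's lengthNorm is 1/sqrt(numTerms), and the stored fieldNorm is (roughly) fieldBoost x docBoost x lengthNorm, so any zero boost zeroes the norm and hence every score from that field. The sketch below deliberately omits the lossy one-byte Similarity.encodeNorm step, which is why luke has no raw boost value to show.

```java
public class NormSketch {
    // DefaultSimilarity.lengthNorm in Lucene of this era: 1/sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // The product that gets (lossily) encoded into the index as the
    // fieldNorm; the one-byte encodeNorm rounding is omitted here.
    static float fieldNorm(float fieldBoost, float docBoost, int numTerms) {
        return fieldBoost * docBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // 4-term field, default boosts: norm is 1/sqrt(4) = 0.5
        System.out.println(fieldNorm(1.0f, 1.0f, 4));
        // field boost 0.0f: the whole product collapses to 0,
        // so every hit scored against this field scores 0
        System.out.println(fieldNorm(0.0f, 1.0f, 4));
    }
}
```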