indexing pdfs
Hi, can someone help me by giving sample programs for indexing PDFs and .doc files? Thanks and regards, Ashwin
RE: indexing pdfs
Hi Ashwin,

You can try PDFBox to convert the PDF documents to text and then use Lucene to index the text. The code for turning a PDF into text is very simple (note the document should be closed when you are done with it):

    private static String parseUsingPDFBox(String filename) throws IOException {
        // document reader
        PDDocument doc = PDDocument.load(filename);
        try {
            // create stripper (wish I had the power to do that - wouldn't leave the house)
            PDFTextStripper stripper = new PDFTextStripper();
            // get text from doc using stripper
            return stripper.getText(doc);
        } finally {
            doc.close();
        }
    }

Sachin

-Original Message- From: ashwin kumar Sent: 08 March 2007 09:37 To: java-user@lucene.apache.org Subject: indexing pdfs

This message has been scanned for viruses by MailControl - (see http://bluepages.wsatkins.co.uk/?6875772) This email and any attached files are confidential and copyright protected. If you are not the addressee, any dissemination of this communication is strictly prohibited. Unless otherwise expressly agreed in writing, nothing stated in this communication shall be legally binding. The ultimate parent company of the Atkins Group is WS Atkins plc. Registered in England No. 1885586. Registered Office Woodcote Grove, Ashley Road, Epsom, Surrey KT18 5BW. Consider the environment. Please don't print this e-mail unless you really need to.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: indexing pdfs
For DOC files you can use the Jakarta POI library. Text extraction is outlined here: http://jakarta.apache.org/poi/hwpf/quick-guide.html

Ulf
Re: indexing pdfs
Is the only way to index PDFs to convert them to text first and then index the text?
RE: indexing pdfs
Well, you don't need to actually save the text to disk and then index the saved file; you can index that text directly in memory. The only other way I have heard of is to use IFilters. I believe SeekAFile does indexing of PDFs.

Sachin
Re: indexing pdfs
Hi again. Do we have to download any jar files to run this program? If so, can you give me the link please?

Ashwin
Re: Lucene Ranking/scoring
I'm looking at how ReciprocalFloatFunction and ReverseOrdFieldSource can be used to rank documents by score and date (solr.search.function contains great stuff!). The values in the date field that are used for the ValueSource are not actually used as 'floats', but rather their ordinal term values from the FieldCache string index. This means that if the 'date' field has 3000 unique string 'values' in the index, the values for 'x' in ReciprocalFloatFunction could be 0-2999. So if I want the most recent 'date' to return a score of 1.0, one could set 'a' and 'b' in the function to 2999. Do I have this right? I got a bit confused at first because I assumed that the actual field values were being used in the computation, but you really need to know the unique term count in order to get the score 'right'.

By the way, as I try to get my head around the Score, Weight, and Boolean* classes (and next(), skipTo()), I nominate these for discussion in Lucene in Action II.

Peter

On 3/9/06, Yonik Seeley wrote:

On 3/9/06, Yang Sun wrote: Hi Yonik, thanks very much for your suggestion. The query boost works great for keyword matching, but in my case I need to rank the results by date and title. For example, title:foo^2 abstract:foo^1.5 date:2004^3 will only boost documents with date=2004. What I need is boosting by the distance from the specified date,

If all you need to do is boost more recent documents (and a single fixed boost will always work), then you can do that boosting at index time.

which means 2003 will have a better ranking than 2002, 2002 better than 2001, etc. I implemented a customized ScoreDocComparator class which works fine for one field, but I ran into trouble when trying to combine other fields. I'm still looking at FunctionQuery; I don't know if I can figure out something.
FunctionQuery support is integrated into Solr (or currently hacked in, as the case may be), and can be useful for debugging and trying out query types even if you don't use it at runtime. ReciprocalFloatFunction might meet your needs for increasing the score of more recent documents: http://incubator.apache.org/solr/docs/api/org/apache/solr/search/function/ReciprocalFloatFunction.html

The SolrQueryParser can make ReciprocalFloatFunction(new ReverseOrdFieldSource("my_date"), 1, 1000, 1000) out of _val_:recip(rord(my_date),1,1000,1000)

-Yonik
http://incubator.apache.org/solr -- Solr, the Open Source Lucene Search Server
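To check Peter's arithmetic: ReciprocalFloatFunction is documented to compute a/(m*x+b). The standalone sketch below (plain Java, not Solr code) shows why, if the newest date maps to ordinal 0 and there are 3000 unique date terms, setting a and b to 2999 yields a score of 1.0 for the newest document and 0.5 for the oldest:

```java
public class ReciprocalDemo {
    // The formula ReciprocalFloatFunction documents: a / (m*x + b)
    static float recip(float x, float m, float a, float b) {
        return a / (m * x + b);
    }

    public static void main(String[] args) {
        // 3000 unique date terms -> ordinals 0..2999; choose a = b = 2999
        System.out.println(recip(0f, 1f, 2999f, 2999f));    // newest doc -> 1.0
        System.out.println(recip(2999f, 1f, 2999f, 2999f)); // oldest doc -> 0.5
    }
}
```

Note the scores decay smoothly in between, which is exactly the "recent documents score higher" behavior being discussed.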
RE: indexing pdfs
Hi, here it is: http://www.seekafile.org/
Index a source, but not store it... can it be done?
I have an interesting scenario I'd like to get your take on with respect to Lucene: A data provider (e.g. someone with a private website or corporately shared directory of proprietary documents) has requested their content be indexed with Lucene so employees can be redirected to it, but provisionally -- under no circumstance should that content be stored in or recreated from the index. Is that even possible? The data owner's request makes sense in the context of them wanting to retain full access control via logins as well as collecting access metrics. If the token 'CAT' points to C:\Corporate\animals.doc and the token 'DOG' also points there, then great, CAT AND DOG will give that document a higher rating, though it is not possible to reconstruct (with any great accuracy) what the actual document content is. However, if for the sake of using the NEAR operator with Lucene the tokens are stored as LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7 THIS:8 DECEMBER:9 ... then someone could pull all tokens for animals.doc and reconstitute the token stream. Does Lucene have any kind of trade-off for working with secure (and I use this term loosely) data?

-wls
Lucene 2.1, inconsistent phrase query results with slop
In a nutshell, reversing the order of the terms in a phrase query can result in different hit counts. That is, "person place"~3 may return different results from "place person"~3, depending on the number of intervening terms. There's a self-contained program below that illustrates what I'm seeing, along with its output. SpanNear does not exhibit this behavior, so I can make things work. I didn't find anything in my (admittedly brief) search of the archives or the open issues that directly spoke to this. Several questions:

1. Is this a bug or not?
2. Is anyone working on it, or should I dig into it? It looks like it may be related to LUCENE-736.
3. Does the phrase from LIA (pg 208), "Given enough slop, PhraseQuery will match terms out of order in the original text", apply here?
4. Do you want me to post this on the developers list? (I can hear it now... not unless you also post a patch too <G>)

Thanks,
Erick

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class PhraseProblem {
        public static void main(String[] args) {
            try {
                PhraseProblem pp = new PhraseProblem();
                pp.tryIt();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        private void tryIt() throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer());
            Document doc = new Document();
            doc.add(new Field("field", "person space space space place",
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();
            IndexSearcher searcher = new IndexSearcher(dir);

            System.out.println("trying phrase queries");
            this.trySlop(searcher, 2);
            this.trySlop(searcher, 3); // FAILS
            this.trySlop(searcher, 4); // FAILS
            this.trySlop(searcher, 5);
            this.trySlop(searcher, 6);
            this.trySlop(searcher, 7);

            System.out.println("trying SpanNear queries");
            this.trySpan(searcher, 2);
            this.trySpan(searcher, 3);
            this.trySpan(searcher, 4);
            this.trySpan(searcher, 5);
            this.trySpan(searcher, 6);
            this.trySpan(searcher, 7);
        }

        private void trySpan(IndexSearcher searcher, int slop) throws Exception {
            SpanQuery sq1 = new SpanTermQuery(new Term("field", "person"));
            SpanQuery sq2 = new SpanTermQuery(new Term("field", "place"));
            SpanNearQuery sqn1 = new SpanNearQuery(
                    new SpanQuery[] {sq1, sq2}, slop, false);
            SpanNearQuery sqn2 = new SpanNearQuery(
                    new SpanQuery[] {sq2, sq1}, slop, false);
            Hits hits1 = searcher.search(sqn1);
            Hits hits2 = searcher.search(sqn2);
            this.printResults(hits1, hits2, slop);
        }

        private void trySlop(IndexSearcher searcher, int slop) throws Exception {
            QueryParser qp = new QueryParser("field", new WhitespaceAnalyzer());
            Query query1 = qp.parse(String.format("\"person place\"~%d", slop));
            Query query2 = qp.parse(String.format("\"place person\"~%d", slop));
            Hits hits1 = searcher.search(query1);
            Hits hits2 = searcher.search(query2);
            this.printResults(hits1, hits2, slop);
        }

        private void printResults(Hits hits1, Hits hits2, int slop) {
            if (hits1.length() != hits2.length()) {
                System.out.println(String.format(
                        "Unequal hit counts. hits1.length %d, hits2.length %d, slop: %d",
                        hits1.length(), hits2.length(), slop));
            } else {
                System.out.println(String.format(
                        "Found identical hit counts of %d, slop: %d",
                        hits1.length(), slop));
            }
        }
    }

Output:

    trying phrase queries
    Found identical hit counts of 0, slop: 2
    Unequal hit counts. hits1.length 1, hits2.length 0, slop: 3
    Unequal hit counts. hits1.length 1, hits2.length 0, slop: 4
    Found identical hit counts of 1, slop: 5
    Found identical hit counts of 1, slop: 6
    Found identical hit counts of 1, slop: 7
    trying SpanNear queries
    Found identical hit counts of 0, slop: 2
    Found identical hit counts of 1,
Re: Lucene 2.1, inconsistent phrase query results with slop
On 3/8/07, Erick Erickson wrote: In a nutshell, reversing the order of the terms in a phrase query can result in different hit counts. That is, "person place"~3 may return different results from "place person"~3, depending on the number of intervening terms.

I think that's working as designed, although I could understand someone wanting it to work differently. The slop is sort of like the edit distance from the given phrase, hence the order of terms in the phrase matters.

-Yonik
Re: Lucene 2.1, inconsistent phrase query results with slop
: I think that's working as designed. Although I could understand
: someone wanting it to work differently. The slop is sort of like the
: edit distance from the current given phrase, hence the order of terms
: in the phrase matters.

Correct ... LIA has a great diagram explaining this ... the slop refers to how many positions you have to move the terms in the PhraseQuery to match.

-Hoss
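To see the asymmetry in Erick's numbers concretely: for a simple two-term phrase, the slop required can be estimated as the total number of positions the query terms must be moved from where the phrase expects them to where they actually occur. This is a standalone sketch of that arithmetic (not Lucene's actual matcher; the helper name is illustrative), using the positions from Erick's test document "person space space space place":

```java
public class SlopDemo {
    // Slop needed for a two-term phrase query: the sum of the distances each
    // query term must move from its expected position (0 and 1) to the
    // position where it actually occurs in the document.
    static int requiredSlop(int firstTermActualPos, int secondTermActualPos) {
        return Math.abs(firstTermActualPos - 0) + Math.abs(secondTermActualPos - 1);
    }

    public static void main(String[] args) {
        // Document: person(0) space(1) space(2) space(3) place(4)
        System.out.println(requiredSlop(0, 4)); // "person place": needs slop 3
        System.out.println(requiredSlop(4, 0)); // "place person": needs slop 5
    }
}
```

This matches the program output: the reversed phrase fails at slops 3 and 4 and first matches at slop 5, because "place" must move 4 positions and "person" 1 more.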
Multiple segments
Hi all, I have been performing some tests on index segments and have run into a problem. I have read the file formats document on the official website, and from what I can see it should be possible to create as many segments for an index as there are documents (though of course this is not a great idea). Having searched around, it occurred to me that the way to do this is to set maxMergeDocs to 1. Having tried this, I found that it doesn't work: all documents still get put into one segment. Any idea what I should do? Thanks
Plural word search
All, I'm evaluating Lucene as a full-text search engine for a project. One of the requirements is the following: "4) Plural Literal Search: If you use the plural of a term such as 'bears', the results will include matches to the plural term 'bears' as well as the singular term 'bear'." It seems to me we would need to build a dictionary to support this. Does Lucene support it? Appreciate your help.

Tony
Re: Index a source, but not store it... can it be done?
Token positions are also used for phrase search. You could probably compromise on this by setting all token positions to 0 -- this would make a document appear as a *set* of words (rather than a *list*). An adversary would be able to know/guess what words are in each document (and, with API access to the index itself, how many times each word appears in each document), but would not be able to reconstruct a good approximation of that document, because term positions are all 0. If this is sufficient, I think you can do it by writing your own Analyzer with a TokenFilter that takes care of the position -- see Token.setPositionIncrement().

Hope this helps,
Doron
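To illustrate why Doron's suggestion (zeroing position increments via a custom TokenFilter) defeats reconstruction: once positions are discarded, the index effectively keeps only each term and its frequency, so differently ordered texts become indistinguishable. The sketch below is plain Java simulating that bag-of-words view, not the Lucene filter itself:

```java
import java.util.Map;
import java.util.TreeMap;

public class PositionlessIndexDemo {
    // With all position increments forced to 0, the index effectively retains
    // only term -> frequency: a bag of words, not a sequence.
    static Map<String, Integer> indexWithoutPositions(String text) {
        Map<String, Integer> bag = new TreeMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            bag.merge(term, 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        Map<String, Integer> a = indexWithoutPositions("let's sell cat and dog robot toys");
        Map<String, Integer> b = indexWithoutPositions("sell dog and cat let's toys robot");
        // Both orderings index identically, so the token stream from wls's
        // example cannot be reconstituted.
        System.out.println(a.equals(b));
    }
}
```

The trade-off, as Doron notes, is that phrase and NEAR queries stop working, since they depend on the very positions being discarded.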
RE: Plural word search
Hi Tony, Lucene certainly does support it. It just requires you to use a tokeniser that performs stemming, such as any analyzer that uses PorterStemFilter.

Sachin
RE: Plural word search
Sachin, thanks for the quick response. Is there any code example I can look at? I'm not familiar with the technique you mentioned. My question is how the analyzer knows "buss" is not a plural and "bears" is a plural. Lucene supports wildcards; however, we cannot use a wildcard at the beginning of a search term, such as *bear. Is there a way to match *bear* (bear, bears, forbearance, etc.) with the search term bear? Thanks
Re: Multiple segments
maxMergeDocs only limits the merging of already saved segments as a result of calling addDocument(). If there are added documents not yet saved but still buffered in memory (by IndexWriter), once their number exceeds maxBufferedDocs they are saved, but as a single merged segment. So you could set maxBufferedDocs to 2 (that's the minimal value) and maxMergeDocs to 1 and add N documents to the index -- that would likely result in N/2 segments. You could probably force N segments by closing the index after each add and reopening it before the next add. Note that while such settings might be interesting for learning purposes, they would have an unpleasant performance impact... Last, on calling optimize(), no matter what the above settings are, a single segment is created.

Regards,
Doron
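The arithmetic Doron describes can be sketched as a simple flush model (this is a simulation of the described behavior, not IndexWriter itself, and the method name is illustrative): every maxBufferedDocs added documents are written out together as one new segment, and with merging effectively disabled those segments accumulate.

```java
public class SegmentCountDemo {
    // Rough model: each full buffer of maxBufferedDocs documents flushes as
    // one segment; a final partial buffer is assumed to flush on close().
    static int segmentsAfterAdding(int numDocs, int maxBufferedDocs) {
        return (numDocs + maxBufferedDocs - 1) / maxBufferedDocs; // integer ceiling
    }

    public static void main(String[] args) {
        // maxBufferedDocs = 2, merging disabled: N docs -> about N/2 segments
        System.out.println(segmentsAfterAdding(10, 2)); // 5
        System.out.println(segmentsAfterAdding(9, 2));  // 5 (last segment holds 1 doc)
    }
}
```

Closing and reopening the writer after every addDocument() is the degenerate case of this model with an effective buffer of 1, which is why it yields one segment per document.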
Re: Term Frequency within Hits
Term frequency in Lucene parlance = the number of occurrences of the term within a single document. If you're looking for how many documents have term x, where x is unknown, see SimpleFacets in Solr: http://lucene.apache.org/solr/api/org/apache/solr/request/SimpleFacets.html

- Original Message - From: Erick Erickson Sent: Wednesday, March 7, 2007 2:29:14 PM Subject: Re: Term Frequency within Hits

See TermFreqVector, HitCollector, perhaps TopDocs, perhaps TermEnum. Make sure you create your index such that frequencies are stored (see the FAQ).

Erick

On 3/7/07, teramera wrote: After I execute a search I end up with a Hits object. The number of hits is on the order of a million. What I want to do, from these hits, is extract term frequencies for a few known fields. I don't have a global list of terms for any of the fields, but want to generate the term frequencies based on terms from the hits. Iterating over the hits and doing this afterwards is, of course, turning out to be very expensive. Is there a known Lucene way of solving such a problem so that this calculation happens as the hits are being accumulated? Appreciate any pointers. -- View this message in context: http://www.nabble.com/Term-Frequency-within-Hits-tf3364987.html#a9362169
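Whatever mechanism feeds you the per-document terms (a HitCollector plus TermFreqVector in Lucene, per Erick's pointers), the counting itself is a single accumulation pass. Here is a standalone sketch of that pattern in plain Java; the lists of strings stand in for whatever terms your collector extracts per hit document, and the names are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FacetCountDemo {
    // For each term, count how many hit documents contain it -- the
    // document-frequency-within-results number the poster is after.
    static Map<String, Integer> countTermsInHits(List<List<String>> hitFieldTerms) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> docTerms : hitFieldTerms) {
            // distinct() so a term repeated within one document counts once
            docTerms.stream().distinct()
                    .forEach(t -> counts.merge(t, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> hits = List.of(
                List.of("cat", "dog", "cat"),
                List.of("dog"),
                List.of("cat", "bird"));
        // cat=2, dog=2, bird=1 (map iteration order may vary)
        System.out.println(countTermsInHits(hits));
    }
}
```

Doing this inside the collector, as the hits are accumulated, avoids the expensive second iteration over a million-hit result set.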
RE: Plural word search
: Thanks for quick response. Is there any code example i can take look? I'm
: not familiar with the technique you mentioned. My question is how the
: analyzer knows buss is not a plural and bears is a plural.

Stemming is a vast topic of text analysis ... some stemmers work using dictionaries, some are based on algorithmic approaches ... almost any stemmer you can imagine can be implemented as a TokenFilter in Lucene -- and a few already are, out of the box. You might want to read up a little on the different stemming approaches out there (google: stemming) and then take a look at some of the Lucene analysis classes that provide implementations.

-Hoss
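To make the dictionary-vs-algorithm distinction concrete, here is a deliberately naive suffix-stripping sketch -- this is NOT the Porter algorithm (which PorterStemFilter implements), just an illustration of why Tony's "buss" worry is real: blind "-s" stripping needs either rules or an exception dictionary.

```java
import java.util.Set;

public class NaiveStemDemo {
    // A tiny exception dictionary: words ending in 's' that are not plurals.
    // Real stemmers use algorithmic rules (Porter) and/or full dictionaries.
    static final Set<String> NOT_PLURAL = Set.of("buss", "lens", "news");

    static String stem(String word) {
        String w = word.toLowerCase();
        if (NOT_PLURAL.contains(w)) return w;
        if (w.endsWith("ies") && w.length() > 3)
            return w.substring(0, w.length() - 3) + "y"; // ponies -> pony
        if (w.endsWith("s") && !w.endsWith("ss"))
            return w.substring(0, w.length() - 1);       // bears -> bear
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("bears"));  // bear
        System.out.println(stem("buss"));   // buss (the "ss" rule and the
                                            // exception list both protect it)
        System.out.println(stem("ponies")); // pony
    }
}
```

Because both "bears" in the document and "bears" in the query pass through the same filter at index and search time, they meet at the common stem "bear" -- which is how Lucene satisfies the "plural literal search" requirement without wildcards.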
Re: Lucene 2.1, inconsistent phrase query results with slop
Sorry about that. I think I found the diagram you're talking about on page 89. It even addresses the exact problem I'm talking about. It's not the first time I've looked like a fool; you'd think I'd be getting used to it by now <G>. So it seems the most reasonable solution to this issue would be for me to rewrite the phrase queries as SpanNear queries, no?

Erick
Re: Plural word search
as of 2.1, as I remember, you can use leading wildcards but ONLY if you set a flag (see setAllowLeadingWildcard in QueryParser). Be aware of the TooManyClauses issue though (search the mail archive and you'll find many discussions of this issue). Erick On 3/8/07, Tony Qian [EMAIL PROTECTED] wrote: Sachin, Thanks for the quick response. Is there any code example i can take a look at? I'm not familiar with the technique you mentioned. My question is how the analyzer knows buss is not a plural and bears is a plural. Lucene supports wildcards. However, we can not use a wildcard at the beginning of a search term such as *bear. is there a way to match *bear* (bear, bears, forbearance etc.) by search term bear? thanks From: Kainth, Sachin [EMAIL PROTECTED] Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: RE: Plural word search Date: Thu, 8 Mar 2007 17:14:02 - Hi Tony, Lucene certainly does support it. It just requires you to use a tokeniser that performs stemming such as any analyzer that uses PorterStemFilter. Sachin -Original Message- From: Tony Qian [mailto:[EMAIL PROTECTED] Sent: 08 March 2007 16:52 To: java-user@lucene.apache.org Subject: Plural word search All, I'm evaluating Lucene as a full-text search engine for a project. I got one of the requirements as follows: 4) Plural Literal Search If you use the plural of a term such as bears the results will include matches to the plural term bears as well as the singular term bear. it seems to me we need to build a dictionary to support it. Does Lucene support it? appreciate your help. Tony
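Why setAllowLeadingWildcard is off by default, in miniature: a trailing wildcard like bear* can seek straight to the "bear" prefix in the sorted term dictionary, but *bear or *bear* leaves no usable prefix and forces a scan of every term -- and each surviving term becomes one more clause, which is how TooManyClauses gets triggered. A toy term-dictionary scan (plain Java, names my own, not the Lucene implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of what an inner/leading wildcard costs: with no prefix to
// seek to, every term in the dictionary must be examined. Each term
// that survives would be rewritten into one BooleanQuery clause.
public class WildcardScan {
    // matches the pattern *needle*
    public static List<String> contains(List<String> termDict, String needle) {
        List<String> hits = new ArrayList<String>();
        for (String term : termDict) {       // full scan: O(|dictionary|)
            if (term.contains(needle)) hits.add(term);
        }
        return hits;
    }
}
```

On a real index with millions of distinct terms, that full scan (plus the clause explosion) is the cost Erick is warning about.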
Re: Plural word search
Erick, thanks for the information. Tony From: Erick Erickson [EMAIL PROTECTED] Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Plural word search Date: Thu, 8 Mar 2007 13:42:00 -0500 [quoted thread trimmed; same content as Erick's message above]
Re: Index a source, but not store it... can it be done?
If you store a hash code of the word rather than the actual word you should be able to search for stuff but not be able to actually retrieve it; you can trade precision for security based on the number of bits in the hash code (e.g. 32 or 64 bits). I'd think a 64 bit hash would be a reasonable midpoint. hash64(dog) = 4312311231123121; body:4312311231123121 returns the document with dog, but also any other document with a word that hashes to the same value. Walt Stoneburner wrote: Have an interesting scenario I'd like to get your take on with respect to Lucene: A data provider (e.g. someone with a private website or corporately shared directory of proprietary documents) has requested their content be indexed with Lucene so employees can be redirected to it, but provisionally -- under no circumstance should that content be stored or recreated from the index. Is that even possible? The data owner's request makes sense in the context of them wanting to retain full access control via logins as well as collecting access metrics. If the token 'CAT' points to C:\Corporate\animals.doc and the token 'DOG' also points there, then great, CAT AND DOG will give that document a higher rating, though it is not possible to reconstruct (with any great accuracy) what the actual document content is. However, if for the sake of using the NEAR operator with Lucene the tokens are stored as LET'S:1 SELL:2 CAT:3 AND:4 DOG:5 ROBOT:6 TOYS:7 THIS:8 DECEMBER:9 ... then someone could pull all tokens for animal.doc and reconstitute the token stream. Does Lucene have any kind of trade off for working with secure (and I use this term loosely) data? -wls
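A 64-bit token hash of the kind described -- FNV-1a here, which is my choice; any stable 64-bit hash would do -- as it might run inside the token-rewriting step (the TokenFilter wiring itself is left out):

```java
// 64-bit FNV-1a hash of a token. Indexing hash64(word) instead of word
// keeps the posting lists usable for search (hash the query term the
// same way before looking it up), but the term dictionary no longer
// contains readable words -- only values like "body:4312311231123121".
// Distinct words can collide, which is the precision-for-security trade.
public class TokenHash {
    private static final long FNV_OFFSET = 0xcbf29ce484222325L;
    private static final long FNV_PRIME  = 0x100000001b3L;

    public static String hash64(String token) {
        long h = FNV_OFFSET;
        for (int i = 0; i < token.length(); i++) {
            h ^= token.charAt(i);
            h *= FNV_PRIME;
        }
        return Long.toUnsignedString(h);
    }
}
```

As the rest of this thread points out, this is discouragement rather than security: word-frequency statistics survive hashing intact.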
Re: A solution to HitCollector-based searches problems
Hello, I have just added some search implementation samples based on this collector solution, to ease the use and understanding of it: - KeywordSearch: Extract the terms (and frequencies) found in a list of fields from the results of a query/filter search - GoogleSearch: Return an ordered search result grouped a la Google, based on the terms found in a list of fields - GetFieldNamesOp: Operation to mimic the getFieldNames method of IndexReader but using a searcher. With it, it is possible to explore the fields of remote indexes. See http://sourceforge.net/projects/lucollector/ for the source code (lu-collector-src-sampleop-0.8.zip). Regards, José L. Oramas On 2/26/07, oramas martín [EMAIL PROTECTED] wrote: Hello, As you probably know, the HitCollector-based search API is not meant to work remotely, because it will generate an RPC callback for every non-zero score. There is another problem with MultiSearcher, whose HitCollector-based search knows nothing about how to mix HitCollector-based subsearches (not to say it hardcodes the way TopDocs are mixed for score and for Sort searches). Also, ParallelMultiSearcher inherits these problems and is unable to parallelize HitCollector-based searches. A final problem with the HitCollector-based search is the loss of the limit on results that the Hits class implements through its getMoreDocs() function, and of the lazy loading and caching of documents it does. To solve those problems it is necessary to have a factory (HitCollectorSource) able to generate collectors for single (SingleHitCollector) and multi (MultiHitCollector) searches, and a new search method in the Searchable interface for it. To avoid modifications to the Lucene core, the latter requirement is NOT IMPLEMENTED in the library I have just created. Instead, an uglier solution is provided: a wrapper for those searchers (SearcherHCSourceWrapper) and a Filter wrapper (FilterHitCollectorSource) to carry the factory-based searches.
Each collector is based on a two-step sequence: one step for collecting hits or subsearcher results, and another for generating the final result. Also, just in case you don't want to add a wrapper to each searcher of your project, there are adapted versions of IndexSearcher, MultiSearcher and ParallelMultiSearcher (only for version 2.1) modified exactly the same way the wrapper class SearcherHCSourceWrapper does. Just put them in your classpath (before the Lucene core jar) and you will be using the new collector interfaces without modifying your code. There are some unit tests (copied and adapted from the Lucene 2.1 distribution). See http://sourceforge.net/projects/lucollector/ for the jar files and the code. If you find it interesting as a complement to the Lucene project, tell me how to put it in the contribution area. Regards, José L. Oramas
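The two-step collector idea -- accumulate raw results first, produce the final ranked result second -- can be sketched independently of Lucene. Class and method names below are mine, not the lucollector API:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Two-phase collector: phase 1 accumulates (docId, score) pairs,
// possibly fed by several subsearchers; phase 2 merges/sorts them into
// the final result. Keeping the phases separate is what lets a
// multi-searcher combine partial results before any ranking is fixed,
// instead of hardcoding how TopDocs are mixed.
public class TwoPhaseCollector {
    public static class Hit implements Comparable<Hit> {
        public final int doc;
        public final float score;
        public Hit(int doc, float score) { this.doc = doc; this.score = score; }
        public int compareTo(Hit o) { return Float.compare(o.score, score); } // descending by score
    }

    private final List<Hit> collected = new ArrayList<Hit>();

    public void collect(int doc, float score) {   // phase 1: cheap accumulation
        collected.add(new Hit(doc, score));
    }

    public List<Hit> topDocs(int n) {             // phase 2: final result
        Collections.sort(collected);
        return collected.subList(0, Math.min(n, collected.size()));
    }
}
```

Phase 2 is also the natural place to reintroduce the result limit that Hits provides via getMoreDocs().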
Re: Negative Filtering (such as for profanity)
I _think_ Lucene 2.1 (or is it trunk? I lose track) has the ability to delete all documents containing a term. So, every time you update your profanity list, you could iterate over it and remove all documents that contain the terms. If a user can never get these documents via a query, then I don't see any reason to allow them in the index to begin with. Also, I don't use QueryFilters much, but I'm curious as to how they perform on that many docs. On Mar 7, 2007, at 5:38 PM, Greg Gershman wrote: I thought about this, as I think overall the resources required would be less than creating a filter. Ultimately I decided against it for a few reasons: 1) I'm working with an existing index of ~50 million documents, I don't want to reindex the whole thing, or even just the documents that contain profanity, if I can avoid it. 2) Filtering at indexing time means I can't effectively add new words to the profanity list without reindexing. Good suggestion, though, I appreciate it. Greg - Original Message From: Grant Ingersoll [EMAIL PROTECTED] To: java-user@lucene.apache.org Sent: Wednesday, March 7, 2007 2:07:38 PM Subject: Re: Negative Filtering (such as for profanity) Not sure if this is helpful given your proposed solution, but could you do something on the indexing side, such as: 1. Remove the profanity from the token stream, much like a stopword. This would also mean stripping it from the display text 2. If your TokenFilter comes across a profanity, somehow mark the document as containing a profanity via a profanity Field (not sure if there is a way, in Lucene, to add another Field while you are in the analysis phase, but you could also have it update a table in a db or something.) Then on search, you could just say (regular query) +profanity:false HTH, Grant On Mar 7, 2007, at 10:07 AM, Greg Gershman wrote: I'm attempting to create a profanity filter. I thought to use a QueryFilter created with a Query of (-$#!+ AND [EMAIL PROTECTED] AND etc).
The problem I have run into is that, as a pure negative query is not supported (a query for (-term) DOES NOT return the inverse of a query for (term)), I believe the bit set returned by a purely negative QueryFilter is empty, so no matter how many results are returned by the initial query, the result after filtering is always zero documents. I was wondering if anyone had suggestions as to how else to do this. I've considered simply amending the query string submitted by the user to include a pre-generated String that would exclude the query terms, but I consider this a non-elegant solution. I had also thought about creating a new sub-class of QueryFilter, NegativeQueryFilter. Basically, it would work just like a QueryFilter, taking a positive query (so, I would pass it an OR'ed list of profane words), then the resulting bits are simply flipped. I think this would work, unless I'm missing something. I'm going to experiment with it, I'd appreciate anyone's thoughts on this. Thanks, Greg -- Grant Ingersoll Center for Natural Language Processing http://www.cnlp.org Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ -- Grant Ingersoll http://www.grantingersoll.com/ http://lucene.grantingersoll.com http://www.paperoftheweek.com/
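Greg's NegativeQueryFilter idea in miniature, using java.util.BitSet (which is what QueryFilter.bits(IndexReader) returned in Lucene 2.x): build the bits from a positive query over the profane terms, flip them across the full doc range, and intersect with the main query's bits. The Lucene wiring around it is omitted; only the bit manipulation is shown.

```java
import java.util.BitSet;

// Sketch of the proposed NegativeQueryFilter: take the bit set a
// positive "profanity" query would produce (one bit per document id),
// flip it over [0, maxDoc), and the result marks exactly the documents
// that contain NO profane term. AND-ing it with the user query's bits
// then drops the profane documents from the results.
public class NegativeFilterSketch {
    public static BitSet allowed(BitSet profaneDocs, int maxDoc) {
        BitSet ok = (BitSet) profaneDocs.clone();
        ok.flip(0, maxDoc);   // invert: set bit = "does not contain profanity"
        return ok;
    }
}
```

This also shows why a purely negative filter starts from an empty bit set: without the flip over maxDoc, there is nothing to intersect with.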
one Field in many documents
Hi, I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... Thank you
Re: one Field in many documents
In general I would say this is not safe, because it seems to assume too much about the implementation - and while it might in most cases currently work, the implementation could change and the program assuming this would stop working. It would most probably not work correctly right from the start for fields constructed with a Reader. Regards, Doron [EMAIL PROTECTED] wrote on 08/03/2007 12:56:33: Hi, I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... Thank you
Re: one Field in many documents
[EMAIL PROTECTED] wrote on 08/03/2007 12:56:33: I have to index many documents with the same fields (only one or two fields are different). Can I add a field (Field instance) to many documents? It seems to work but I'm not sure if this is the right way... What does many mean in this context? If it means most, or all, perhaps it would be better not to index those fields at all -- they would be adding little or nothing, in terms of information content. --MDC
Re: Index a source, but not store it... can it be done?
: If you store a hash code of the word rather than the actual word you : should be able to search for stuff but not be able to actually retrieve that's a really great solution ... it could even be implemented as a TokenFilter so none of your client code would ever even need to know that it was being used (just make sure it comes last, after any stemming or what not) -Hoss
Re: Index a source, but not store it... can it be done?
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote: : If you store a hash code of the word rather than the actual word you : should be able to search for stuff but not be able to actually retrieve that's a really great solution ... it could even be implemented as a TokenFilter so none of your client code would ever even need to know that it was being used (just make sure it comes last, after any stemming or what not) I don't know... hashing individual words is an extremely weak form of security that should be breakable without even using a computer... all the statistical information is still there (somewhat like 'encrypting' a message as a cryptoquote). Doron's suggestion is preferable: eliminate token position information from the index entirely. -Mike
Re: indexing pdfs
hi sachin the link you gave me has only a zip file and an exe file for download, and this zip file also contains no class files. but wouldn't we be requiring a jar file or class file ??? On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Hi, Here it is: http://www.seekafile.org/ -Original Message- From: ashwin kumar [mailto:[EMAIL PROTECTED] Sent: 08 March 2007 13:07 To: java-user@lucene.apache.org Subject: Re: indexing pdfs hi again do we have to download any jar files to run this program if so can u give me the link pls ashwin On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Well you don't need to actually save the text to disk and then index the saved text file, you can directly index that text in-memory. The only other way I have heard of is to use IFilters. I believe SeekAFile does indexing of pdfs. Sachin -Original Message- From: ashwin kumar [mailto:[EMAIL PROTECTED] Sent: 08 March 2007 11:35 To: java-user@lucene.apache.org Subject: Re: indexing pdfs Is the only way to index pdfs to convert them into text and only then index them ??? On 3/8/07, Kainth, Sachin [EMAIL PROTECTED] wrote: Hi Aswin, You can try pdfbox to convert the pdf documents to text and then use Lucene to index the text.
The code for turning a pdf to text is very simple: private static String parseUsingPDFBox(String filename) throws IOException { // document reader PDDocument doc = PDDocument.load(filename); try { // create stripper (wish I had the power to do that - wouldn't leave the house) PDFTextStripper stripper = new PDFTextStripper(); // get text from doc using stripper return stripper.getText(doc); } finally { doc.close(); // don't leak the document } } Sachin -Original Message- From: ashwin kumar [mailto:[EMAIL PROTECTED] Sent: 08 March 2007 09:37 To: java-user@lucene.apache.org Subject: indexing pdfs hi can some one help me by giving any sample programs for indexing pdfs and .doc files thanks regards ashwin
Re: Index a source, but not store it... can it be done?
: I don't know... hashing individual words is an extremely weak form of : security that should be breakable without even using a computer... all : the statistical information is still there (somewhat like 'encrypting' : a message as a cryptoquote). : : Doron's suggestion is preferable: eliminate token position information : from the index entirely. i guess i wasn't thinking about this as a security issue, more a discouragement issue ... reconstructing a doc from term vectors is easy, reconstructing it from just term positions is harder but not impossible, reconstructing from hashed tokens requires a lot of hard work. if the issue is that you want to be able to ship an index that people can manipulate as much as they want and you want to guarantee they can never reconstruct the original docs, you're pretty much screwed ... even if you eliminate all of the position info, statistical info about language structure can help you glean a lot about the source data. i'm no crypto expert, but i imagine it would probably take the same amount of statistical guesswork to reconstruct meaningful info from either approach (hashing the individual words compared to eliminating the positions) so i would think the trade off of supporting phrase queries would make the hashing approach more worthwhile. i mean after all: you still want the index to be useful for searching, right? ... if you are really paranoid don't just strip the positions, strip all duplicate terms as well to prevent any attempt at statistical sampling ... but now all you really have is a lookup table of word to docid with no tf/idf or position info to improve scoring, so why bother with Lucene, just use a BerkeleyDB file to do your lookups. -Hoss
Re: Lucene Ranking/scoring
: Do I have this right? I got a bit confused at first because I assumed that the : actual field values were being used in the computation, but you really need : to know the unique term count in order to get the score 'right'. you can use the actual values in FunctionQueries, except that: 1) dates aren't numeric values that lend themselves well to functions 2) the ReverseOrdinalValueSource comes in handy when you want the docs with the highest value (ie: most recent date) to be special (ie: to plug into your reciprocal function and get the max value). i suppose you could write a ValueSource that finds the max value of a field and then a ValueSource that normalizes all the values of one value source against the value(s) of another value source ... but no one has done that yet (and it still wouldn't have a lot of meaning for dates) -Hoss
Re: Index a source, but not store it... can it be done?
On 3/8/07, Chris Hostetter [EMAIL PROTECTED] wrote: if the issue is that you want to be able to ship an index that people can manipulate as much as they want and you want to guarantee they can never reconstruct the original docs, you're pretty much screwed ... even if you eliminate all of the position info, statistical info about language structure can help you glean a lot about the source data. True. i'm no crypto expert, but i imagine it would probably take the same amount of statistical guesswork to reconstruct meaningful info from either approach (hashing the individual words compared to eliminating the positions) so i would think the trade off of supporting phrase queries would make the hashing approach more worthwhile. I suppose it also depends on how much access the user has to the index. If they have access to the physical index and a means of querying it, then they have access to the hashing algo (and/or key) and so it is worthless. If they don't, and their access is strictly through queries, then I don't see what help hashing will provide, as the result of any given query should be the same, hashing or not. i mean after all: you still want the index to be useful for searching, right? ... if you are really paranoid don't just strip the positions, strip all duplicate terms as well to prevent any attempt at statistical sampling ... but now all you really have is a lookup table of word to docid with no tf/idf or position info to improve scoring, so why bother with Lucene, just use a BerkeleyDB file to do your lookups. You could also do both. Another thing that might help is relatively aggressive stop word removal. All these measures will raise the discouragement bar slightly. -Mike
FieldCache: flush cache explicitly
I think the api should allow for explicitly flushing the FieldCache. I have a setup where new readers are being loaded every so often. I don't want to rely on Java's WeakHashMap to free the cache, I want to be able to do it in a deterministic way. It would be great if this can be added to Lucene, I can create a bug if the Lucene gods agree to it :) Thanks -John
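The deterministic behavior John is asking for, sketched outside Lucene (class and method names are mine, not a FieldCache API): a cache keyed by reader that supports explicit eviction instead of waiting for weak-reference reclamation.

```java
import java.util.HashMap;
import java.util.Map;

// A field-value cache with deterministic, explicit eviction -- the
// behavior requested above, as opposed to relying on a WeakHashMap
// keyed by IndexReader to free entries whenever the GC gets around to
// it. Keys stand in for readers; values for the per-reader cached
// field arrays.
public class EvictableCache<K, V> {
    private final Map<K, V> cache = new HashMap<K, V>();

    public synchronized V get(K reader) { return cache.get(reader); }

    public synchronized void put(K reader, V values) { cache.put(reader, values); }

    // Called when a reader is closed: frees its entry immediately,
    // making memory behavior deterministic across reader reloads.
    public synchronized void evict(K reader) { cache.remove(reader); }

    public synchronized int size() { return cache.size(); }
}
```

The design trade-off: explicit eviction adds a call site the application must remember (evict on reader close), which is exactly what the WeakHashMap approach avoids at the cost of nondeterminism.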