Re: Lucene 2.9

2009-03-17 Thread Luis Alves
Mark Miller wrote: Hmmm - you can probably get qsol to do it: http://myhardshadow.com/qsol. I think you can set up any token to expand to anything with a regex matcher and use group capturing in the replacement (I don't fully remember, though - it's been a while since I've used it). So you could do

Re: Scores between words. Boosting?

2009-03-17 Thread liat oren
Thanks for all the answers. I am new to Lucene, and in these emails it's the first time I've heard of bigrams, so I read about them a bit. Question - if I query for cat animal - or use boosting - cat^2 animal^0.5 - will the results return ONLY documents that contain both? From what I saw until

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Thank you. I suppose the solution for this is not to create an index but to store co-occurrence frequencies at the Analyzer level. Adrian. On Mon, Mar 16, 2009 at 11:37 AM, Michael McCandless luc...@mikemccandless.com wrote: Be careful: docFreq does not take deletions into account.

Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
Hi, I edited Luke's code so it also uses my classes (I added the jar to the classpath and put it in the lib folder). When I run it from Java it works well. Now I try to build it and invoke Luke's jar outside Java, and I get the following error: Exception in thread main java.lang.NoClassDefFoundError:

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread Ian Lea
Well, assuming that when you say invoke Luke's jar outside java you mean that you are trying to run Luke from the command line e.g. $ java -jar lukexxx.jar, it simply sounds like your classes are not on the classpath. Add them. -- Ian. On Tue, Mar 17, 2009 at 10:20 AM, liat oren

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
Hi Ian, Thanks for the answer. Yes, I meant running it from the command line. They are already in the classpath - I added this part: <classpathentry kind="lib" path="lib/myJar.jar"/> 2009/3/17 Ian Lea ian@gmail.com Well, assuming that when you say invoke Luke's jar outside java you mean that you

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Adrian Dimulescu wrote: Thank you. I suppose the solution for this is not to create an index but to store co-occurrence frequencies at the Analyzer level. I don't understand how this would address the "docFreq does not reflect deletions" issue. You can use the shingles analyzer (under
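For reference, the shingle approach Michael mentions turns adjacent word pairs into single index terms, so a two-word co-occurrence lookup becomes one docFreq call. The bigram output can be sketched without Lucene (the real ShingleFilter operates on a TokenStream; this toy shows only the resulting tokens):

```java
import java.util.*;

// Plain-Java sketch of what a bigram shingle analyzer emits for a token
// list. Lucene's contrib ShingleFilter does this inside the analysis chain.
public class ShingleSketch {
    static List<String> bigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }
    public static void main(String[] args) {
        System.out.println(bigrams(Arrays.asList("please", "divide", "this", "sentence")));
    }
}
```

Indexing each bigram as a single term means docFreq("cat animal") directly gives the number of documents where the pair occurs adjacently.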

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
This is all getting very complicated! Adrian - have you looked any further into why your original two term query was too slow? My experience is that simple queries are usually extremely fast. Standard questions: have you warmed up the searcher? How large is the index? How many occurrences of

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: I don't understand how this would address the "docFreq does not reflect deletions" issue. Bad mail-quoting, sorry. I am not interested in document deletions; I just index Wikipedia once, and want to get a co-occurrence-based similarity distance between words called NGD
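NGD here presumably refers to the Normalized Google Distance of Cilibrasi and Vitányi, which needs only document frequencies and a co-occurrence count; a sketch (the counts below are invented for illustration):

```java
// Normalized Google Distance from document frequencies. fx and fy are the
// docFreqs of the two terms, fxy the number of documents containing both,
// n the total number of indexed documents. All values here are made up.
public class Ngd {
    static double ngd(double fx, double fy, double fxy, double n) {
        double lx = Math.log(fx), ly = Math.log(fy);
        return (Math.max(lx, ly) - Math.log(fxy))
             / (Math.log(n) - Math.min(lx, ly));
    }
    public static void main(String[] args) {
        // A term compared with itself co-occurs everywhere: distance 0.
        System.out.println(ngd(1000, 1000, 1000, 1_000_000));
    }
}
```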

Re: Different analyzer per field ?

2009-03-17 Thread Ian Lea
org.apache.lucene.analysis.PerFieldAnalyzerWrapper There's plenty of info about it on the web, even some recent discussion on this list which will be in the archives. -- Ian. On Tue, Mar 17, 2009 at 11:17 AM, Raymond Balmès raymond.bal...@gmail.com wrote: I was looking for calling a different
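The real PerFieldAnalyzerWrapper wraps a default Lucene Analyzer and per-field overrides; the dispatch idea can be shown in plain Java (everything below is a toy - the names and tokenizers are invented, not Lucene API):

```java
import java.util.*;
import java.util.function.Function;

// Toy model of per-field analysis: route each field name to its own
// tokenizer, falling back to a default when no override is registered.
public class PerFieldDemo {
    static Function<String, List<String>> lowerCaseTokenizer =
        s -> Arrays.asList(s.toLowerCase().split("\\s+"));
    static Function<String, List<String>> keywordTokenizer =
        s -> Collections.singletonList(s); // whole value as one token

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("id", keywordTokenizer); // ids must not be split

        Function<String, List<String>> dflt = lowerCaseTokenizer;
        System.out.println(perField.getOrDefault("id", dflt).apply("DOC-42"));
        System.out.println(perField.getOrDefault("body", dflt).apply("Hello World"));
    }
}
```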

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Ian Lea wrote: Adrian - have you looked any further into why your original two-term query was too slow? My experience is that simple queries are usually extremely fast. Let me first point out that it is not too slow in absolute terms; it is only so for my particular needs of attempting the

Re: Different analyzer per field ?

2009-03-17 Thread Raymond Balmès
OK, thanks a lot, I must be very poor at searching ;-)... I kind of missed this information. Thx again. -Ray- On Tue, Mar 17, 2009 at 12:25 PM, Uwe Schindler u...@thetaphi.de wrote: It is possible in two ways: 1. Use the analyzer class and generate a TokenStream/Tokenizer from it. Then

Different analyzer per field ?

2009-03-17 Thread Raymond Balmès
I was looking for a way to call a different analyzer for each field of a document... it looks like it is not possible. Do I have it right? -Ray-

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread Ian Lea
Added that classpathentry to what? That means nothing to me. I'd run it from the command line as $ java -cp whatever -jar whatever.jar or $ export CLASSPATH=whatever $ java -jar whatever.jar Those examples are unix based. If you're on Windows I imagine there are equivalents. Or maybe your

Re: Luke - Build a jar that uses my own jar

2009-03-17 Thread liat oren
I work on Windows. I copied my jar to the lib directory - so it is now together with the other jars Luke uses (Lucene, etc.) - and added the text below to the .classpath file (in the luke-src-0.9.1 directory). 2009/3/17 Ian Lea ian@gmail.com Added that classpathentry to what? That
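One detail worth noting for this thread: when a jar is launched with `java -jar`, both `-cp` and the CLASSPATH environment variable are ignored; only the `Class-Path` attribute in the jar's own MANIFEST.MF is consulted, and Eclipse's `.classpath` file affects compilation only, not runtime. A hedged manifest fragment (the jar names are assumptions; paths are relative to the launched jar):

```
Class-Path: lib/myJar.jar lib/lucene-core-2.4.0.jar
```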

Re: number of hits of pages containing two terms

2009-03-17 Thread Ian Lea
OK - thanks for the explanation. So this is not just a simple search ... I'll go away and leave you and Michael and the other experts to talk about clever solutions. -- Ian. On Tue, Mar 17, 2009 at 11:35 AM, Adrian Dimulescu adrian.dimule...@gmail.com wrote: Ian Lea wrote: Adrian - have

Re: number of hits of pages containing two terms

2009-03-17 Thread Michael McCandless
Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? With the simple approach (doing 100 million 2-term AND queries), how long do you estimate it'd take? I think you could do this with your own analyzer (as you suggested)... it would run
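Michael's time question can be roughed out. At an assumed 10 ms per two-term AND query (a guess for illustration, not a measurement), 100 million queries come to:

```java
// Back-of-envelope estimate for running the pairwise queries directly.
// The per-query latency is an assumption, not a benchmark.
public class Estimate {
    public static void main(String[] args) {
        long queries = 100_000_000L;
        double msPerQuery = 10.0;
        double hours = queries * msPerQuery / 1000.0 / 3600.0;
        System.out.printf("%.0f hours (~%.0f days)%n", hours, hours / 24.0);
    }
}
```

At that rate the brute-force run is on the order of days, which is why the batch/shingle alternatives in this thread are attractive.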

get terms of a field and its frequences during indexing the document

2009-03-17 Thread Ильдар Аширбаев
Hello. Can I get access to the terms of a field and their frequencies while indexing a document? Thanks.

Lucene-contrib maven artifact id?

2009-03-17 Thread Paul Libbrecht
Hello Luceners, what is the official pom.xml fragment to be used for the contribs package of Lucene? It seems to be only of type pom inside the Maven repository... does that mean I have to fetch sub-contribs? paul

Re: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Glen Newton
You might try looking in a list that talks about recommender systems. Google hits: - http://en.wikipedia.org/wiki/Recommendation_system - ACM Recommender Systems 2009 http://recsys.acm.org/ - A Guide to Recommender Systems http://www.readwriteweb.com/archives/recommender_systems.php 2009/3/17

RE: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Max Metral
I'm not sure this would fall primarily under recommenders... I would assume Facebook is doing look-ahead on connections, i.e. A-B, B-C, so suggest A-C. Then they weight the suggestions by the number of indirect links between A and C, and probably other factors (which is where the generic
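Max's A-B, B-C => suggest A-C heuristic, weighted by mutual-friend count, fits in a few lines (the graph and names below are invented for illustration):

```java
import java.util.*;

// Friend-of-friend suggestion: for user "A", count how many of A's direct
// friends each non-friend shares with A, and rank candidates by that count.
public class FoafSuggest {
    public static void main(String[] args) {
        Map<String, Set<String>> friends = new HashMap<>();
        friends.put("A", new HashSet<>(Arrays.asList("B", "D")));
        friends.put("B", new HashSet<>(Arrays.asList("A", "C")));
        friends.put("D", new HashSet<>(Arrays.asList("A", "C")));
        friends.put("C", new HashSet<>(Arrays.asList("B", "D")));

        String me = "A";
        Map<String, Integer> score = new TreeMap<>();
        for (String friend : friends.get(me)) {
            for (String fof : friends.get(friend)) {
                if (!fof.equals(me) && !friends.get(me).contains(fof)) {
                    score.merge(fof, 1, Integer::sum); // one shared friend
                }
            }
        }
        System.out.println(score); // C shares two mutual friends with A
    }
}
```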

Re: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Grant Ingersoll
Have a look at the Lucene sister project Mahout: http://lucene.apache.org/mahout. In there is the Taste collaborative filtering project, which is all about recommendations. On Mar 17, 2009, at 9:32 AM, Aaron Schon wrote: Hi all, Apologies if this question is off-topic, but I was

RE: Lucene-contrib maven artifact id?

2009-03-17 Thread Steven A Rowe
Hi Paul, On 3/17/2009 at 9:18 AM, Paul Libbrecht wrote: what is the official pom.xml fragment to be used for the contribs package of lucene? It seems to be only of type pom inside the maven repository... does it mean that I have to fetch sub-contribs ? Your POM should include dependencies
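Since the aggregate contrib artifact is of packaging `pom`, the usual approach is to depend on the individual contrib jars instead. A hedged fragment for one of them (artifactId and version assumed from the 2.4-era repository layout; check the repository before relying on it):

```xml
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analyzers</artifactId>
  <version>2.4.0</version>
</dependency>
```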

What is the best way to modify a Document object

2009-03-17 Thread Paul Taylor
I am using Lucene to index rows in a spreadsheet; each row is a Document, and the document indexes 10 fields from the row plus the row number, used to relate the Document to the row. So when someone modifies one of the 10 fields I am interested in on a row, I have to update the document

Re: What is the best way to modify a Document object

2009-03-17 Thread Simon Willnauer
Hi Paul, If you do not store all the data inside Lucene, you have to get your updated data from your spreadsheet again. Even if you did store all the data, you would have to update the document by creating a new one and adding it to the index using updateDocument(). You cannot update just one
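The replace-by-term call Simon describes, as an untested sketch against the Lucene 2.x API (requires lucene-core on the classpath; the field name "rownum" is an assumption based on the thread):

```java
// Untested sketch - assumes org.apache.lucene.document.{Document,Field},
// index.Term and index.IndexWriter are available.
Document doc = new Document();
doc.add(new Field("rownum", rowNum, Field.Store.YES, Field.Index.NOT_ANALYZED));
// ... re-add the other ten fields from the spreadsheet row ...
writer.updateDocument(new Term("rownum", rowNum), doc); // atomic delete-then-add
```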

Error using DuplicateFilter in contrib/queries

2009-03-17 Thread Densel Santhmayor
Hello, I was trying to use the DuplicateFilter API in contrib/queries for Lucene in an application, but it doesn't seem to be accepted as a valid argument to the searcher.search function. I'm using Apache Lucene 2.4.0. Here's what I did: DuplicateFilter df = new DuplicateFilter(NAME);

Re: number of hits of pages containing two terms

2009-03-17 Thread Adrian Dimulescu
Michael McCandless wrote: Is this a one-time computation? If so, couldn't you wait a long time for the machine to simply finish it? The final production computation is one-time, still, I have to recurrently come back and correct some errors, then retry... With the simple approach (doing 100

Re: number of hits of pages containing two terms

2009-03-17 Thread Paul Elschot
You may want to try Filters (starting from TermFilter) for this, especially those based on the default OpenBitSet (see the intersection count method) because of your interest in stop words. 10k OpenBitSets for 39 M docs will probably not fit in memory in one go, but that can be worked around by
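Paul's intersection-count idea, illustrated with java.util.BitSet (Lucene's OpenBitSet provides a static intersectionCount(a, b) that computes this without the clone; the doc IDs below are made up):

```java
import java.util.BitSet;

// One bit set per term, one bit per document: the cardinality of the AND
// is the number of documents containing both terms.
public class IntersectionCount {
    public static void main(String[] args) {
        BitSet term1 = new BitSet(); // docs containing term 1
        BitSet term2 = new BitSet(); // docs containing term 2
        term1.set(1); term1.set(3); term1.set(7);
        term2.set(3); term2.set(7); term2.set(9);

        BitSet both = (BitSet) term1.clone();
        both.and(term2);
        System.out.println(both.cardinality());
    }
}
```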

Re: People you might know ( a la Facebook) - *slightly offtopic*

2009-03-17 Thread Petite Abeille
On Mar 17, 2009, at 2:32 PM, Aaron Schon wrote: how would I go about recommending Jane Doe connecting to Frank Jones? Hope you can help a newbie by pointing out where I should be looking? You might as well read something about it to get you started: Programming Collective Intelligence

NPE in MultiSegmentReader$MultiTermDocs.doc

2009-03-17 Thread Comron Sattari
I've recently upgraded to Solr 1.3 using Lucene 2.4. One of the reasons I upgraded was because of the nicer SearchComponent architecture that let me add a needed feature to the default request handler. Simply put, I needed to filter a query based on some additional parameters. So I subclassed

Re: sloppyFreq question

2009-03-17 Thread Chris Hostetter
: I suppose SpanTermQuery could override the weight/scorer methods so that : it behaved more like a TermQuery if it was executed directly ... but : that's really not what it's intended for. : : This is currently the only way to boost a term via payloads. : BoostingTermQuery extends

Re: number of hits of pages containing two terms

2009-03-17 Thread Chris Hostetter
: The final production computation is one-time, still, I have to recurrently : come back and correct some errors, then retry... this doesn't really seem like a problem ideally suited for Lucene ... this seems like the type of problem sequential batch crunching could solve better... first

Re: Scores between words. Boosting?

2009-03-17 Thread Grant Ingersoll
On Mar 17, 2009, at 5:44 AM, liat oren wrote: Thanks for all the answers. I am new to Lucene, and in these emails it's the first time I've heard of bigrams, so I read about them a bit. Question - if I query for cat animal - or use boosting - cat^2 animal^0.5 - will the results return ONLY