Re: Boosting results

2008-11-11 Thread Erik Hatcher
On Nov 11, 2008, at 8:32 AM, Stefan Trcek wrote: On Tuesday 11 November 2008 02:18:39 Erik Hatcher wrote: The integration won't be too painful... the main thing is that Solr requires* some configuration files, literally on the filesystem, in order to fire up and be happy. And you'll need to

Re: Boosting results

2008-11-11 Thread Stefan Trcek
On Monday 10 November 2008 14:58:15 Mark Miller wrote: But: it's slow to load a field for the first time.  LUCENE-1231 (column-stride fields) aims to greatly speed up the load time. Test it out though. In some recent testing I was doing it was *way* faster than I thought it would be based

Using DeletionPolicy to roll back to previous commit point

2008-11-11 Thread mark harwood
Probably a question for Mike M. Is it possible/sensible to use IndexDeletionPolicy to remove the *newest* commit points (as opposed to the usual scenario of deleting old commit points)? I experimented with this: class RollbackDeletionPolicy implements IndexDeletionPolicy {

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 11:29:27 Michael McCandless wrote: The other part of your proposal was to somehow number term text such that term range comparisons can be implemented as fast int comparisons. ... http://fontoura.org/papers/paramsearch.pdf However that'd be quite a bit deeper

Re: Boosting results

2008-11-11 Thread Stefan Trcek
On Tuesday 11 November 2008 02:18:39 Erik Hatcher wrote: The integration won't be too painful... the main thing is that Solr requires* some configuration files, literally on the filesystem, in order to fire up and be happy. And you'll need to craft Solr's schema.xml to jive with how you

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Also, one nice optimization we could do with the term number column-stride array is to do bit packing (borrowing from the PFOR code) dynamically. I.e., since we know there are X unique terms in this segment, when populating the array that maps docID to term number we could use exactly the
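The bit-packing idea in this message can be sketched roughly as follows. This is only an illustration under assumptions: the class PackedTermNumbers and its methods are hypothetical names, not Lucene or PFOR code, and the sketch simply uses the smallest fixed bit width that can hold term numbers 0..X-1 for X unique terms.

```java
// Hypothetical sketch, not Lucene code: stores one term number per docID
// using the minimum fixed number of bits for the segment's unique term count.
final class PackedTermNumbers {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    /** Bits needed to represent term numbers 0..uniqueTerms-1 (at least 1). */
    static int bitsRequired(int uniqueTerms) {
        if (uniqueTerms < 1) {
            throw new IllegalArgumentException("need at least one term");
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(uniqueTerms - 1));
    }

    PackedTermNumbers(int docCount, int uniqueTerms) {
        this.bitsPerValue = bitsRequired(uniqueTerms);
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) docCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    /** Number of 64-bit words backing the array. */
    int longsUsed() {
        return blocks.length;
    }

    void set(int docId, int termNumber) {
        if (termNumber < 0 || termNumber > mask) {
            throw new IllegalArgumentException("term number out of range");
        }
        long bitPos = (long) docId * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        // Clear the slot, then write the low bits of the value.
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((long) termNumber << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two words: its high bits go into the next word.
            int lowBits = bitsPerValue - spill;
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | ((long) termNumber >>> lowBits);
        }
    }

    int get(int docId) {
        long bitPos = (long) docId * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // Pull the high bits back from the next word.
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (value & mask);
    }
}
```

With 5 unique terms each entry takes 3 bits instead of the 32 bits of a plain int[], at the cost of some shifting and masking on every read.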

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
It seems like for many of your examples (age, zip code, country), simply computing and storing the mapping yourself (your first option below) would actually be viable? Also: I think in fact you never need to merge the term numbering for many segments during searching? I.e., the search runs one

Re: Term numbering and range filtering

2008-11-11 Thread Paul Elschot
On Tuesday 11 November 2008 21:55:45 Michael McCandless wrote: Also, one nice optimization we could do with the term number column-stride array is to do bit packing (borrowing from the PFOR code) dynamically. I.e., since we know there are X unique terms in this segment, when populating the

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
Paul Elschot wrote: On Tuesday 11 November 2008 21:55:45 Michael McCandless wrote: Also, one nice optimization we could do with the term number column-stride array is to do bit packing (borrowing from the PFOR code) dynamically. I.e., since we know there are X unique terms in this segment, when

Re: Term numbering and range filtering

2008-11-11 Thread Michael McCandless
The other part of your proposal was to somehow number term text such that term range comparisons can be implemented as fast int comparisons. I like the idea of building dynamic filters on top of a column-stride array of field values. You could extend it to be a real Scorer, too. E.g., I could imagine
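A minimal sketch of the term-numbering idea discussed in this thread, assuming a single-valued field and one in-memory array per segment. OrdinalRangeFilter is a hypothetical name, not a Lucene class, and the sketch orders terms by Java String comparison, which is an assumption about term order. Each unique term gets its rank in sorted order, so a range check on term text becomes two int comparisons per document.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch, not a Lucene API: maps each document's term to its
// ordinal in sorted term order and answers range filters with int comparisons.
final class OrdinalRangeFilter {
    private final String[] sortedTerms; // unique terms, sorted
    private final int[] docToOrd;       // docID -> term number

    OrdinalRangeFilter(String[] fieldValuePerDoc) {
        sortedTerms = Arrays.stream(fieldValuePerDoc).distinct().sorted().toArray(String[]::new);
        docToOrd = new int[fieldValuePerDoc.length];
        for (int doc = 0; doc < fieldValuePerDoc.length; doc++) {
            docToOrd[doc] = Arrays.binarySearch(sortedTerms, fieldValuePerDoc[doc]);
        }
    }

    /** Documents whose term lies in [lower, upper], both bounds inclusive. */
    BitSet range(String lower, String upper) {
        // Translate the text bounds to ordinal bounds once per query.
        int lo = Arrays.binarySearch(sortedTerms, lower);
        if (lo < 0) lo = -lo - 1;       // first term >= lower
        int hi = Arrays.binarySearch(sortedTerms, upper);
        if (hi < 0) hi = -hi - 2;       // last term <= upper
        BitSet hits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= lo && ord <= hi) { // per-document work is pure int comparison
                hits.set(doc);
            }
        }
        return hits;
    }
}
```

For example, with per-document values "30", "25", "41", "25", "18", the call range("20", "35") sets bits 0, 1 and 3. The two binary searches happen once per query; the loop over documents never touches term text.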

IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
Hi, I'm pretty new to Lucene, so please bear with me if this has been covered before. The wiki suggests sharing a single IndexSearcher between threads for best performance (http://wiki.apache.org/lucene-java/ImproveSearchingSpeed). I've tested running the same set of queries with: multiple

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller
Nice! An 8 core machine with a test ready to go! How about trying the read only mode that was added to 2.4 on your IndexReader? And if you are on Unix and could try trunk and use the new NIOFSDirectory implementation...that would be awesome. Those two additions are our current hope for

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller
And if you are on Unix and could try trunk and use the new NIOFSDirectory implementation...that would be awesome. Woah...that made 2.4 too. A 2.4 release will allow both optimizations. Many thanks!

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Michael McCandless
Nice results, thanks! The poor disk-based scaling may be fixed by NIOFSDirectory, if you are on Unix. If you are on Windows it won't help (and will likely be worse than FSDirectory), because of an apparent bug in Sun's JVM on Windows whereby NIO positional file reads seem to share a

Re: term offsets wrong depending on analyzer

2008-11-11 Thread Michael McCandless
Just to followup... I opened these three issues: https://issues.apache.org/jira/browse/LUCENE-1441 (fixed in 2.9) https://issues.apache.org/jira/browse/LUCENE-1442 (fixed in 2.9) https://issues.apache.org/jira/browse/LUCENE-1448 (still iterating) Mike Christian Reuschling wrote: Hi

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
I re-ran the no-readonly ram tests:

      thread   shared
 1    64043    53610
 2    26999    25260
 3    27173    17265
 4    22205    13222
 5    20795    11098
 6    17593     9852
 7    17163     8987
 8    17275     9052
 9    19392    10266
10    27809    10397
11    25987    10724

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Dmitri Bichko
32 cores, actually :) I reran the test with readonly turned on (I changed how the time is measured a little, it should be more consistent):

      fs-thread   ram-thread   fs-shared   ram-shared
 1    71877       54739        73986       61595
 2    34949       26735        43719       28935
 3    25581

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller
Dmitri Bichko wrote: 32 cores, actually :) Glossed over that - even better! Killer machine to be able to test this on. I reran the test with readonly turned on (I changed how the time is measured a little, it should be more consistent): fs-thread ram-thread fs-shared

Re: IndexSearcher and multi-threaded performance

2008-11-11 Thread Mark Miller
Mark Miller wrote: That's a good point, and points out a bug in Solr trunk for me. Frankly I don't see how it's done. There is no code I can see/find to use it rather than FSDirectory. Still assuming there must be a way, but I don't see it... Ah - brain freeze. What else is new :) You have to

Re: Order the index by timestamp field and Get n documents

2008-11-11 Thread 黄成
I think you should use NumberTools to format the timestamp first, otherwise the sort will not work correctly. On Mon, Nov 10, 2008 at 8:00 PM, Cool The Breezer [EMAIL PROTECTED] wrote: I was able to do that using a range query. String end = 25337325126;//i.e. 11/30/, assume that this is max end
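The reason for the advice above is that terms are compared as strings, so raw numbers of different lengths sort in the wrong order ("9" comes after "10"). The sketch below illustrates the general fixed-width idea only. It is an assumption-laden stand-in, not the encoding NumberTools.longToString actually produces, and SortableTimestamps is a hypothetical name.

```java
import java.util.Locale;

// Hypothetical sketch, not Lucene's NumberTools: zero-pads non-negative
// timestamps to a fixed width so that string order equals numeric order.
final class SortableTimestamps {
    /** 19 digits is enough for any non-negative long. */
    static String pad(long timestamp) {
        if (timestamp < 0) {
            throw new IllegalArgumentException("sketch handles non-negative values only");
        }
        return String.format(Locale.ROOT, "%019d", timestamp);
    }
}
```

After padding, SortableTimestamps.pad(9) compares lower than SortableTimestamps.pad(10), so both sorting and range queries over the padded terms follow numeric order. The same padding has to be applied at index time and at query time.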

Grouping of Boolean opeartors in Lucene..?

2008-11-11 Thread Santosh Urs
How can I use multiple Boolean operators in a search query? For example, from the search text field, I usually get queries which look like: Any (word or phrase) and (a list of URIs). Example: rice land http\://www.wtr.org/wordlist#c_2379 http\://www.wtr.org/wordlist#c_65748

Parsing MSWord

2008-11-11 Thread dipesh
Hello, I wanted to know if there are classes in Lucene that support parsing MSWord documents. Many thanks, Dipesh Help Ever Hurt Never- Baba

Re: Parsing MSWord

2008-11-11 Thread Dave Newton
--- On Tue, 11/11/08, dipesh wrote: I wanted to know if there are classes in Lucene that support parsing MSWord documents. Searching the web might help: http://www.google.com/search?q=lucene+%2Bword The Apache Tika project (http://incubator.apache.org/tika/) might also be of interest. Dave

Re: Grouping of Boolean opeartors in Lucene..?

2008-11-11 Thread prabin meitei
Hi, You can use BooleanQuery for this. BooleanQuery is meant for combining a series of queries with boolean operators. E.g., let's say you have 3 different queries A, B, C and you want a final query which behaves as A (B || C) BooleanQuery query = new BooleanQuery(); BooleanQuery

RE: Parsing MSWord

2008-11-11 Thread John Griffin
Dipesh, Start here. http://poi.apache.org/ John G. -Original Message- From: dipesh [mailto:[EMAIL PROTECTED] Sent: Tuesday, November 11, 2008 8:38 PM To: java-user@lucene.apache.org Subject: Parsing MSWord Hello, I wanted to know if there are classes in Lucene that support parsing

Re: Parsing MSWord

2008-11-11 Thread dipesh
Thank you, It was really helpful. I also found some similar work being done in the Nutch project. Regards, Dipesh On Wed, Nov 12, 2008 at 12:52 PM, Dave Newton [EMAIL PROTECTED] wrote: --- On Tue, 11/11/08, dipesh wrote: I wanted to know if there are classes in Lucene that support parsing

Re: Grouping of Boolean opeartors in Lucene..?

2008-11-11 Thread Santosh Urs
Hi Prabin, Thanks for the suggestion, it worked for me. I wasn't aware of BooleanQuery, since I'm new to Lucene. I modified the code like this: BooleanQuery textQuery = new BooleanQuery(); BooleanQuery uriQuery = new BooleanQuery();

Re: Feasibility question

2008-11-11 Thread Otis Gospodnetic
Yes, I think it is. I think the only catch will be those log timestamps, how fine you really need them to be, and if you want them very fine what happens when you do range queries on timestamps. If you have a pile of log files lying around, it should be pretty easy to get them indexed. You