Re: TermFrequencies vector limits?

2005-11-21 Thread Paul Elschot
On Monday 21 November 2005 02:16, [EMAIL PROTECTED] wrote: Hi. I was wondering if anyone else has seen this before. I'm using lucene 1.4.3 and have indexed about 3000 text documents using the statement: doc.add(Field.Text(contents, new FileReader(f), true)); When I go and retrieve the

Re: TermFrequencies vector limits?

2005-11-21 Thread Erik Hatcher
By default, documents get truncated at 10,000 terms (maybe there is an off-by-one where it is going to 10,001 though?). To increase this, and I always do, set the max field length on your IndexWriter, and re-index. In 1.4.3, you set the maxFieldLength variable of IndexWriter directly.
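
The truncation described above can be illustrated without Lucene. The following is a minimal sketch, not Lucene code: `FieldTruncationDemo`, `termFrequencies`, and `totalOccurrences` are hypothetical names, and the only assumption taken from the thread is that tokens past the configured limit are ignored. It does not reproduce the 10,001 figure reported in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: simulates a per-field token limit; this is not Lucene's implementation.
public class FieldTruncationDemo {

    // Counts term frequencies, ignoring every token after the first maxFieldLength tokens.
    static Map<String, Integer> termFrequencies(String[] tokens, int maxFieldLength) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        int limit = Math.min(tokens.length, maxFieldLength);
        for (int i = 0; i < limit; i++) {
            freqs.merge(tokens[i], 1, Integer::sum);
        }
        return freqs;
    }

    // Sums all frequencies, i.e. the number of token occurrences that were indexed.
    static int totalOccurrences(Map<String, Integer> freqs) {
        int sum = 0;
        for (int f : freqs.values()) {
            sum += f;
        }
        return sum;
    }

    public static void main(String[] args) {
        // A synthetic "document" of 25,000 tokens drawn from 500 distinct terms.
        String[] tokens = new String[25000];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = "term" + (i % 500);
        }
        // With a limit of 10,000 the frequency sum is capped at the limit.
        System.out.println(totalOccurrences(termFrequencies(tokens, 10000)));             // prints 10000
        // With the limit raised, every occurrence is counted.
        System.out.println(totalOccurrences(termFrequencies(tokens, Integer.MAX_VALUE))); // prints 25000
    }
}
```

In this sketch the sum of the frequencies, not the number of distinct terms, is what stops growing once the limit is reached, which matches the symptom reported for large documents.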

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
Yonik, Thanks for your carefully thought out and detailed reply. On 20 Nov 2005, at 12:00, Yonik Seeley wrote: Does it make sense to add an IndexWriter setting to specify a default position increment gap to use when multiple fields are added in this way? Per-field might be nice... The good
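
To make the idea of a position increment gap concrete, here is a small sketch that does not use Lucene. `PositionGapDemo` and `positions` are hypothetical names, and the rule it applies (add the gap once before each value after the first, then number tokens consecutively) is an assumption for illustration, not the behaviour of any particular Lucene release.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: assigns token positions across several values added under one field name.
public class PositionGapDemo {

    // values[i] holds the tokens of the i-th value added to the field.
    // gap is the extra distance inserted between consecutive values (assumed semantics).
    static List<Integer> positions(String[][] values, int gap) {
        List<Integer> out = new ArrayList<>();
        int position = -1; // position of the last token emitted so far
        for (int v = 0; v < values.length; v++) {
            if (v > 0) {
                position += gap; // skip ahead before the next value starts
            }
            for (int t = 0; t < values[v].length; t++) {
                position++;
                out.add(position);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] values = { { "a", "b" }, { "c" } };
        System.out.println(positions(values, 0));   // prints [0, 1, 2]
        System.out.println(positions(values, 100)); // prints [0, 1, 102]
    }
}
```

With a gap of 0 the last token of one value and the first token of the next are adjacent, so a phrase or span query could match across the boundary. A large gap keeps the values apart.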

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 04:26, Erik Hatcher wrote: What about adding an offset to Field, setPositionOffset(int offset)? Looking at DocumentWriter, it looks like this would be the simplest thing that could work, without precluding the interesting option of modifying Analyzer to allow with flags

Grouping results on the basis of a field

2005-11-21 Thread Samarendra Pratap
Hi, I am using lucene 1.4.3. The basic functionality of the search is simple: put in the keyword “java” and it will display all the books having the java keyword. Now I have to add a feature which also shows the names of the top authors (let's say top 5 authors) with the number of

Re: TermFrequencies vector limits?

2005-11-21 Thread marigoldcc
Just to make sure that I understand this correctly, the docs say: By default, no more than 10,000 terms will be indexed for a field. Given your note, then the docs do not mean that no more than 10,000 terms will be indexed, but that some smaller number of terms will be indexed and only the

Re: TermFrequencies vector limits?

2005-11-21 Thread Michael Curtin
When I go and retrieve the term frequency vectors, for any document under about 90k, everything looks as expected. However for larger documents (I haven't found the exact point, but I know that those over 128k qualify) the sum of the term frequencies in the vector seems to max out at 10001.

Re: TermFrequencies vector limits?

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 08:37, Michael Curtin wrote: That's probably because there is a limit built into Lucene where it ignores any tokens in a field past the first 10,000. There is a property you can set to increase this limit. I don't have the source in front of me right now, but if you go

Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Oren Shir
Hi, I tried stressing Lucene in a controlled environment: one static IndexSearcher for an index that doesn't change, and in same process I create a number of Threads that call this Searcher concurrently for a limited time. I expected the number of successful queries to increase when using more

Re: TermFrequencies vector limits?

2005-11-21 Thread Michael Curtin
To get a higher limit. Of course, you could also change the Lucene source file and recompile it. Note that you CANNOT just set the property in your code, in general, as the Lucene class puts it into a static final int, meaning it examines the value of the property (once) at class
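
The point about a `static final` value being read from a system property only once can be shown in plain Java. This is a sketch under stated assumptions: `StaticPropertyDemo`, `Limits`, and the property name `demo.maxFieldLength` are hypothetical and stand in for whatever Lucene actually uses.

```java
// Illustration only: a static final initialized from a system property is evaluated once,
// when its class is initialized. Later changes to the property are not seen.
public class StaticPropertyDemo {

    // Hypothetical property name, used only for this demonstration.
    static final String PROP = "demo.maxFieldLength";

    static class Limits {
        // Evaluated exactly once, the first time Limits is initialized.
        static final int DEFAULT_MAX = Integer.getInteger(PROP, 10000);
    }

    // Returns { value at first access, value after the property changes, live property value }.
    static int[] run() {
        System.clearProperty(PROP);
        int first = Limits.DEFAULT_MAX;             // triggers initialization; the default applies
        System.setProperty(PROP, "50000");          // too late: Limits is already initialized
        int second = Limits.DEFAULT_MAX;            // unchanged
        int live = Integer.getInteger(PROP, 10000); // a fresh lookup does see the new value
        return new int[] { first, second, live };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // prints 10000 10000 50000
    }
}
```

For the property to take effect it has to be in place before the class is first used, for example via `-Ddemo.maxFieldLength=50000` on the `java` command line.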

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread gekkokid
Oren Shir wrote: I tested this in version 1.4.3 and 1.9rc1, and they are both the same in this aspect. 1.9rc1 is faster, but does not benefit from multi threading. some newbie questions i have, does 1.4.3 benefit from multi-threading? is 1.9 the version in the source repository? _gk

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
This is expected behavior: you are probably quickly becoming CPU bound (which isn't a bad thing). More threads only help when some threads are waiting on IO, or if you actually have a lot of CPUs in the box. -Yonik Now hiring -- http://forms.cnet.com/slink?231706 On 11/21/05, Oren Shir [EMAIL
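
One way to check whether a workload is CPU bound is to time the same batch of work at several thread counts. The sketch below is not Lucene code: `ThroughputDemo`, `busyWork`, and `runTasks` are hypothetical names, and `busyWork` is a purely CPU-bound stand-in for a query with no I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: runs a fixed batch of CPU-bound tasks on pools of different sizes.
public class ThroughputDemo {

    // CPU-bound stand-in for a query: arithmetic only, never blocks.
    static long busyWork(int iterations) {
        long h = 17;
        for (int i = 0; i < iterations; i++) {
            h = h * 31 + i;
        }
        return h;
    }

    // Runs `tasks` units of work on `threads` threads and returns how many completed.
    static int runTasks(int threads, int tasks, final int iterations) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            Callable<Long> job = () -> busyWork(iterations);
            for (int t = 0; t < tasks; t++) {
                pending.add(pool.submit(job));
            }
            int completed = 0;
            for (Future<Long> f : pending) {
                f.get(); // wait for the task to finish
                completed++;
            }
            return completed;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("available processors: " + cpus);
        for (int threads : new int[] { 1, 2, 4, 8, 16 }) {
            long start = System.nanoTime();
            runTasks(threads, 64, 5_000_000);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " threads: " + ms + " ms for 64 tasks");
        }
    }
}
```

If the work is CPU bound, the elapsed time would be expected to stop improving once the thread count reaches roughly the number of available processors. Timings vary by machine and JIT warm-up, so treat the numbers as indicative only.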

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Oren Shir
gekkokid, does 1.4.3 benefit from multi-threading? Sorry for not being clear. My tests show that neither version benefits from multi threading, but it is possible that I'm CPU bound, as Yonik kindly reminded me. is 1.9 the version in the source repository? 1.9 is the version in source

Re: Throughput doesn't increase when using more concurrent threads

2005-11-21 Thread Yonik Seeley
On 11/21/05, Oren Shir [EMAIL PROTECTED] wrote: It is rather sad if 10 threads reach the CPU limit. I'll check it and get back to you. It's about performance and throughput though, not about number of threads it takes to reach saturation. In a 2 CPU box, I would say that the ideal situation is

Re: Urgent - File Lock in Lucene 1.2

2005-11-21 Thread jian chen
Hi, Karl, There have been quite a few discussions regarding the too many open files problem. From my understanding, it is due to Lucene trying to open multiple segments at the same time (during search/merging segments), and the operating system wouldn't allow opening that many file handles. If

Re: Spans, appended fields, and term positions

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher [EMAIL PROTECTED] wrote: Modifying Analyzer as you have suggested would require DocumentWriter additionally keep track of the field names and note when one is used again. For position increments, it doesn't have to be tracked. The patch to DocumentWriter could also

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 12:55, Yonik Seeley wrote: On 11/21/05, Erik Hatcher [EMAIL PROTECTED] wrote: Modifying Analyzer as you have suggested would require DocumentWriter additionally keep track of the field names and note when one is used again. For position increments, it doesn't have to be

Re: TermFrequencies vector limits?

2005-11-21 Thread Chris Hostetter
: By default, no more than 10,000 terms will be : indexed for a field. : : Given your note, then the docs do not mean that no : more than 10,000 terms will be indexed, but that some : smaller number of terms will be indexed and only the : first 10,000 occurrences will be tallied. It means that

Lucene Index Changed event

2005-11-21 Thread Aigner, Thomas
Hi all, Is there an index changed event that I can jump on that will tell me when my index has been updated so I can close and reopen my searcher to get the new changes? I can't seem to find the event, but see some tools that might accomplish this (DLESE DPC software components?).

How does lucene choose a field for sort?

2005-11-21 Thread John Powers
If I sort on a field called sequence, but at document creation time I add in //create doc A doc.add(Field.Text(sequence, 32)); doc.add(Field.Text(sequence, 3)); doc.add(Field.Text(sequence, 932)); //create doc B doc.add(Field.Text(sequence, 1)); doc.add(Field.Text(sequence, 300));

Re: Spans, appended fields, and term positions

2005-11-21 Thread Erik Hatcher
On 21 Nov 2005, at 16:09, Yonik Seeley wrote: The Analyzer extensions seem fine, but much more general purpose than my need. For your need (a global increment), isn't expanding analyzer actually easier? analyser = new OldAnalyzer() { public int getPositionIncrementGap(String field) {

Re: How does lucene choose a field for sort?

2005-11-21 Thread Yonik Seeley
On 11/21/05, Erik Hatcher [EMAIL PROTECTED] wrote: Neither. It'll throw an exception. Just don't rely on it to throw an exception either though... the checking is not comprehensive. One should treat sorting on a field with more than one value per document as undefined. -Yonik Now hiring --