Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
Otis Gospodnetic wrote: The most obvious answer is that the full-text indexing features of RDBMS's are not as good (as fast) as Lucene. MySQL, PostgreSQL, Oracle, MS SQL Server etc. all have full-text indexing/searching features, but I always hear people complaining about the speed. A person

Re: Lucene vs. in-DB-full-text-searching

2005-02-24 Thread Kevin A. Burton
David Sitsky wrote: On Sat, 19 Feb 2005 09:31, Otis Gospodnetic wrote: You are right. Since there are C++ and now C ports of Lucene, it would be interesting to integrate them directly with DBs, so that the RDBMS full-text search under the hood is actually powered by one of the Lucene ports.

Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. You know... it looks like the problem is that TermInfosReader uses INDEX_INTERVAL during seeks and is probably just jumping RIGHT past the offsets that

Re: Term Weights and Clustering

2005-02-24 Thread Dawid Weiss
Hi Owen, I'm from the Carrot2 project, so I feel called to the blackboard: One source for how to do this is the thesis of Stanislaw Osinski and others like it: http://www.dcs.shef.ac.uk/teaching/eproj/msc2004/abs/m3so.htm And the Carrot2 project which uses similar techniques.

ngramj

2005-02-24 Thread Gusenbauer Stefan
Does anyone know a good tutorial or the javadoc for ngramj? I need it for guessing the language of the documents that should be indexed. Thanks, Stefan

Re: ngramj

2005-02-24 Thread petite_abeille
On Feb 24, 2005, at 14:50, Gusenbauer Stefan wrote: Does anyone know a good tutorial or the javadoc for ngramj because i need it for guessing the language of the documents which should be indexed? http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/plugin/languageidentifier/ Cheers -- PA,

RE: Custom filters document numbers

2005-02-24 Thread Vanlerberghe, Luc
An IndexReader will always see the same set of documents. Even if another process deletes some documents, adds new ones or optimizes the complete index, your IndexReader instance will not see those changes. If you detect that the Lucene index changed (e.g. by calling

Re: Custom filters document numbers

2005-02-24 Thread Stanislav Jordanov
The first statement is clear to me: I know that an IndexReader sees a 'snapshot' of the document set that was taken at the moment of the Reader's creation. What I don't know is whether this 'snapshot' also has its doc numbers fixed or whether they may change asynchronously. And another thing I don't know

Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. It looks like you're using a pre-1.4 version of Lucene. Since 1.4 this is no longer called TermInfosWriter.INDEX_INTERVAL, but rather

Re: sorted search

2005-02-24 Thread Daniel Naber
On Thursday 24 February 2005 19:01, Yura Smolsky wrote: sort.setSort(new SortField[] { new SortField(modified, SortField.STRING, true) }); You should store the date as a number, e.g. days since 1970 (or weeks if that is precise enough) and then tell the sort that it's an integer.
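
A minimal sketch of the "days since 1970" idea from the reply above, in plain Java with no Lucene dependency. The class and method names are hypothetical, and the timestamp is assumed to be epoch milliseconds in UTC. The resulting number would be stored in the sort field and the sort declared as an integer sort (SortField.INT in Lucene 1.4) instead of SortField.STRING.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Lucene: turns a timestamp into a
// compact integer key (whole days since 1970-01-01 UTC).
final class DaySortKey {

    // floorDiv keeps timestamps before 1970 on the correct (earlier) day.
    static int daysSinceEpoch(long epochMillis) {
        return (int) Math.floorDiv(epochMillis, TimeUnit.DAYS.toMillis(1));
    }

    public static void main(String[] args) {
        long modified = 1109203200000L; // 2005-02-24T00:00:00Z
        System.out.println(daysSinceEpoch(modified)); // prints 12838
    }
}
```

The trade-off is precision: every document modified on the same day gets the same key, so ordering within a day is undefined.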

Re: sorted search

2005-02-24 Thread Erik Hatcher
Sorting by String uses up lots more RAM than a numeric sort. If you use a numeric (yet lexicographically orderable) date format (e.g. MMDD) you'll see better performance most likely. Erik On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote: Hello, lucene-user. I have index with many

Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik. If I need to store hour and minute, then do I need to put the date into the following integer format: MMDDHHII ? Will it be faster than the current solution? And will I have the ability to do ranged queries (from Date A to Date B)? EH Sorting by String uses up lots more RAM than a numeric sort.
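
One way to get minute precision while keeping both an integer sort and range queries, sketched in plain Java. This is an assumption for illustration, not the answer given in the thread: it uses minutes since 1970 (UTC) as the key, which still fits in an int, and zero-pads the indexed term because term ranges are compared as strings. All names are hypothetical, and keys are assumed to be non-negative.

```java
// Hypothetical helper, not part of Lucene.
final class MinuteSortKey {

    // Whole minutes since 1970-01-01T00:00Z; throws if the value overflows an int.
    static int minutesSinceEpoch(long epochMillis) {
        return Math.toIntExact(Math.floorDiv(epochMillis, 60_000L));
    }

    // Fixed-width form so that string order matches numeric order
    // (assumes key >= 0; 10 digits covers the whole non-negative int range).
    static String padded(int key) {
        return String.format("%010d", key);
    }

    // Inclusive range test on the padded terms, mirroring a lexicographic term range.
    static boolean inRange(String term, String from, String to) {
        return from.compareTo(term) <= 0 && term.compareTo(to) <= 0;
    }

    public static void main(String[] args) {
        int key = minutesSinceEpoch(1109271660000L); // 2005-02-24T19:01:00Z
        System.out.println(key + " -> " + padded(key));
        // Without padding, "9" would sort after "10"; with padding it does not.
        System.out.println(padded(9).compareTo(padded(10)) < 0);
    }
}
```

A packed decimal layout that includes the year (12 digits for year, month, day, hour, minute) would exceed the int range, which is why this sketch counts minutes instead.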

Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote: Kevin A. Burton wrote: I finally had some time to take Doug's advice and reburn our indexes with a larger TermInfosWriter.INDEX_INTERVAL value. It looks like you're using a pre-1.4 version of Lucene. Since 1.4 this is no longer called TermInfosWriter.INDEX_INTERVAL, but

Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik. about memory usage... DateField takes string of 9 bytes in memory ('000ic64p7') How much memory will be taken by this string? How much memory will be taken by integer? EH Sorting by String uses up lots more RAM than a numeric sort. If you EH use a numeric (yet lexicographically

Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Doug Cutting
Kevin A. Burton wrote: Is this setting incompatible with older indexes burned with the lower value? Prior to 1.4, yes. After 1.4, no. What happens after 1.4? Can I take indexes burned with 256 (a greater value) in 1.3 and open them up correctly with 1.4? Not without hacking things. If your

Re: Possible to mix/match indexes with diff TermInfosWriter.INDEX_INTERVAL ??

2005-02-24 Thread Kevin A. Burton
Doug Cutting wrote: Not without hacking things. If your 1.3 indexes were generated with 256 then you can modify your version of Lucene 1.4+ to use 256 instead of 128 when reading a Lucene 1.3 format index (SegmentTermEnum.java:54 today). Prior to 1.4 this was a constant, hardwired into the

1.4.x TermInfosWriter.indexInterval not public static ?

2005-02-24 Thread Kevin A. Burton
What's the desired pattern for using TermInfosWriter.indexInterval? Do I have to compile my own version of Lucene to change this? The last API was public static final, but this is neither public nor static. I'm wondering if we should just make this a value that can be set at runtime.

Re: Not entire document being indexed?

2005-02-24 Thread [EMAIL PROTECTED]
Hi Otis Thanks for the reply, what exactly should I be looking for with Luke? What would setting the max value to maxInteger do? Is this some arbitrary value or...? -pedja Otis Gospodnetic said the following on 2/24/2005 2:24 PM: Use Luke to peek in your index and find out what really got