> file seeks instead of array lookups

I'm with you now. So you do seeks in your comparator. For a large index you might as well use java.io.RandomAccessFile for the "array", because there would be little value in buffering when the comparator is liable to jump all around the file.

This sounds very expensive, though. If you don't open a Searcher too frequently, it makes sense (in my muddled mind) to pre-sort to reduce the number of seeks. That was the half-baked idea of the third file, which essentially orders document IDs.
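(Just to check I've got the idea: something like the following is the kind of lookup I mean. It's a bare-bones sketch in plain java.io, assuming the per-document values were written beforehand as fixed-width 4-byte ints at offset 4 * docId; the class name and file layout are only my own illustration.)

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Disk-backed stand-in for an int[] indexed by document ID.
    // Assumes the file was pre-built with one 4-byte int per docId.
    public class IntFile {
        private final RandomAccessFile raf;

        public IntFile(String path) throws IOException {
            this.raf = new RandomAccessFile(path, "r");
        }

        // The comparator would call this instead of array[docId].
        public int get(int docId) throws IOException {
            raf.seek(4L * docId);   // one seek per lookup -- this is the cost in question
            return raf.readInt();
        }

        public void close() throws IOException {
            raf.close();
        }
    }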
> Bear in mind, there have been some improvements recently to the ability to
> grab individual stored fields per document....

I can't see anything like that in 2.0. Is that something in the Lucene HEAD build?

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: 01 August 2006 09:37
To: java-user@lucene.apache.org
Subject: RE: Sorting

: I take your point that Berkeley DB would be much less clumsy, but an
: application that's already using a relational database for other purposes
: might as well use that relational database, no?

If you already have some need to access data about each matching doc from a relational DB, then sure, you might as well let it sort for you -- but just because your app has some DB connections open doesn't mean that's a worthwhile reason to ask it to do the sort ... your app might have some network connections open to an IMAP server as well, but that doesn't mean you should convert the docs to email messages and ask the IMAP server to sort them :)

: I'm not really with you on the random access file, Chris. Here's where I am
: up to with my [mis-]understanding...
:
: I want to sort on 2 terms. Happily these can be ints (the first is an INT
: corresponding to a 10-minute timestamp "YYMMDDHHI" and the second INT is a
: hash of a string, used to group similar documents together within those
: 10-minute timestamps). When I initially warm up the FieldCache (first search
: after opening the Searcher), I start by generating two random access files
: with int values at offsets corresponding to document IDs for each of these;
: the first file would have ints corresponding to the timestamp and the second
: would have ints corresponding to the hash. I'd then need to generate a third
: file which is equivalent to an array dimensioned by document ID, with
: document IDs in compound sort order??

I'm not sure why you think you need the third file ... you should be able to use the two files you created exactly the way the existing code would use the two arrays if you were using an in-memory FieldCache (with file seeks instead of array lookups). I think the class you want to look at is FieldSortedHitQueue.

: In a big index, it will take a while to walk through all of the documents to
: generate the first two random access files, and the sort process required to
: generate the sorted file is going to be hard work.

Well ... yes, but that's the trade-off: the reason for the RAM-based FieldCache is speed. If you don't have that RAM to use, then doing the same things on disk gets slower.

Bear in mind, there have been some improvements recently to the ability to grab individual stored fields per document (FieldSelector is the name of the class, I think) ... I haven't tried those out yet, but they could make sorting on a stored field (which wouldn't require building up any cache, RAM or disk based) feasible regardless of the size of your result sets.

-Hoss
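PS: for what it's worth, here is roughly what I had in mind for the warm-up step. It is only a sketch: the field names ("tstamp" and "grouptext"), the ".ints" file names, and the assumption that both values are stored on every document are mine, not anything Lucene gives you. (The real FieldCache walks the indexed terms rather than stored fields, so this is the simple-minded version.)

    import java.io.IOException;
    import java.io.RandomAccessFile;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    // Walks every document once and writes its two sort keys as fixed-width
    // 4-byte ints at offset 4 * docId, one file per sort field.
    public class SortFileBuilder {

        public static void build(IndexReader reader, String dir) throws IOException {
            RandomAccessFile tstamps = new RandomAccessFile(dir + "/tstamp.ints", "rw");
            RandomAccessFile hashes  = new RandomAccessFile(dir + "/grouphash.ints", "rw");
            try {
                tstamps.setLength(0);
                hashes.setLength(0);
                int maxDoc = reader.maxDoc();
                for (int docId = 0; docId < maxDoc; docId++) {
                    int tstamp = 0;
                    int hash = 0;
                    if (!reader.isDeleted(docId)) {
                        Document doc = reader.document(docId);
                        // "YYMMDDHHI"-style 10-minute timestamp, stored as a parseable string
                        tstamp = Integer.parseInt(doc.get("tstamp"));
                        // hash of the string used to group similar documents
                        hash = doc.get("grouptext").hashCode();
                    }
                    // write one slot per docId so offsets stay aligned, even for deletions
                    tstamps.writeInt(tstamp);
                    hashes.writeInt(hash);
                }
            } finally {
                tstamps.close();
                hashes.close();
            }
        }
    }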
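And this is how I picture plugging those files in where the FieldCache arrays would normally be consulted -- again just a sketch, assuming I've read the 2.0 SortComparatorSource/ScoreDocComparator interfaces correctly, and using my own made-up file layout (one 4-byte int per docId):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.ScoreDocComparator;
    import org.apache.lucene.search.SortComparatorSource;
    import org.apache.lucene.search.SortField;

    // Custom sort source whose comparator seeks into a pre-built file of
    // 4-byte ints (one per docId) instead of looking up an int[] in RAM.
    public class FileIntSortSource implements SortComparatorSource {

        private final String dir;   // directory holding the <field>.ints files

        public FileIntSortSource(String dir) {
            this.dir = dir;
        }

        public ScoreDocComparator newComparator(IndexReader reader, String fieldname)
                throws IOException {
            // one file per sort field, written during warm-up
            final RandomAccessFile raf =
                new RandomAccessFile(dir + "/" + fieldname + ".ints", "r");
            return new ScoreDocComparator() {
                private int valueFor(int docId) {
                    try {
                        raf.seek(4L * docId);   // the file seek that replaces array[docId]
                        return raf.readInt();
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
                public int compare(ScoreDoc a, ScoreDoc b) {
                    int va = valueFor(a.doc);
                    int vb = valueFor(b.doc);
                    return va < vb ? -1 : (va == vb ? 0 : 1);
                }
                public Comparable sortValue(ScoreDoc d) {
                    return new Integer(valueFor(d.doc));
                }
                public int sortType() {
                    return SortField.CUSTOM;
                }
            };
        }
    }

If that holds up, the compound ordering would just be a Sort built from new SortField("tstamp", src) followed by new SortField("grouphash", src), letting FieldSortedHitQueue do the tie-breaking -- no pre-sorted third file needed.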