On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson <erickerick...@gmail.com> wrote: > I'm always skeptical of storing the doc IDs since they can > change out from underneath you (just delete even a single > document and optimize).
We never delete documents. Even when a feature request came in to update documents (i.e. delete the old one and add a new version), we ended up keeping the old version around, partially because we didn't want the IDs to shift (which is a bit of a recursive argument), but also because it's forensically sound to have the previous versions around so people can see what edits were made. > What is it you're doing with the doc ID that you couldn't do with the guid? > If your "guid list" > were ordered, I can imagine building filters quite quickly from > it using TermDocs.skipTo for instance.. The main problem with filters is that DocIdBitSet's iterator has to return the doc IDs in order. Even if our GUIDs are in order (they would be, as it would be the primary key on tables using them), they won't be in the same order as the IDs of the docs they came from. So for each row in the ResultSet, you need to do a TermDocs.seek(Term). This not only costs the additional I/O (and it's a lot more than the original database query was), but you have to read every row in the ResultSet just to get the first doc ID. Contrast this with using doc IDs for the database query. You don't need to hit the index at all since you already have the result. And the docs come back in order, so you don't even have to iterate the entire result set - you can read the first 100 rows and then read more rows if/when they are needed. And if the caller is using skipTo then this can be incorporated into the database query to avoid returning rows which are only going to be discarded anyway. Integer fields should have improved things a little in terms of the amount of I/O required to do the query (at least I would hope that this is the case - I haven't done any tests yet and we can't use them yet for backwards compatibility reasons) but they don't remove the problem of needing to iterate every document in the result set up-front. TX --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org