On 8/1/06, Pedro Côrte-Real <[EMAIL PROTECTED]> wrote:
> On Tue, 2006-08-01 at 09:24 +0900, David Balmain wrote:
> > How many documents and what is the date range (eg 2001-01-01 ->
> > 2006-08-01). These are the critical variables for sort performance.
> > Once I know these numbers I'll be able to replicate the task here and
> > I'll see what I can do.
>
> I have around 600_000 documents and the date range is rather large,
> something like from year 1000 to now. I don't know for sure but I can
> check if it makes a difference.
>
> But not all my sort fields are dates. I also have regular text fields
> that I have now made untokenized (by using separate fields for sorting
> and searching). Got to check if that made them faster.

Hmmm. Sounds like an interesting application. One solution would be to
cache the sort index on disk. The problem with this is that the cache
would still need to be recalculated every time you add more documents
to the index so you'll still have the long wait occasionally. I'll
look into it anyway at a later stage.

Another idea that I can implement now is to add a BYTES sort type
which would basically sort by the order the terms appear in the index.
Let's say you index dates in the format "YYYYMMDD" and you sort by
INTEGER. Every time you load the sort index you need to go through
every single date and convert it from string to integer. But this is
unnecessary since the dates are already in order in the index. A BYTES
sort type would take advantage of this. You'd get an even bigger
benefit for ASCII strings. strcoll is used to sort strings but this is
unnecessary for ASCII strings as they are already correctly ordered in
the index. Also, the index needs to keep each string in memory, which
would also be unnecessary.
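
To illustrate why the conversion is redundant, here is a minimal sketch in
plain Ruby. It does not use Ferret at all; the sample dates and variable
names are made up for the example. It only shows that fixed-width
"YYYYMMDD" strings come out in the same order whether you compare them as
raw bytes or as integers.

```ruby
# Hypothetical sample terms, all zero-padded to the same width ("YYYYMMDD").
dates = %w[20060801 10000101 19991231 20010101 15170410]

# Byte order: plain string comparison, no per-term conversion.
by_bytes = dates.sort

# Integer order: every term has to be parsed before it can be compared.
by_integer = dates.sort_by { |d| d.to_i }

# Both orderings agree, so the string-to-integer step buys nothing here.
same_order = (by_bytes == by_integer)
puts same_order
```

This only holds because every term has the same number of digits. With
variable-width numbers (say "9" and "10") byte order and integer order
disagree, so the terms would need to be zero-padded at index time.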

Sorry if this isn't very clear. I'm not sure how much it will help.
We'll have to wait and see.

Dave
_______________________________________________
Ferret-talk mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/ferret-talk
