On Thu, May 26, 2011 at 6:22 PM, Patrick Totzke <patricktotzke at googlemail.com> wrote:
> Excerpts from Austin Clements's message of Thu May 26 22:43:02 +0100 2011:
>> > > Though, Patrick, that solution doesn't address your problem. On the
>> > > other hand, it's not clear to me what concurrent access semantics
>> > > you're actually expecting. I suspect you don't want the remaining
>> > > iteration to reflect the changes, since your changes could equally
>> > > well have affected earlier iteration results.
>> > That's right.
>> > > But if you want a
>> > > consistent view of your query results, something's going to have to
>> > > materialize that iterator, and it might as well be you (or Xapian
>> > > would need more sophisticated concurrency control than it has). But
>> > > this shouldn't be expensive because all you need to materialize are
>> > > the document ids; you shouldn't need to eagerly fetch the per-thread
>> > > information.
>> > I thought so, but it seems that Query.search_threads() already
>> > caches more than the id of each item. Which is as expected
>> > because it is designed to return thread objects, not their ids.
>> > As you can see above, this _is_ too expensive for me.
>>
>> I'd forgotten that constructing threads on the C side was eager about
>> the thread tags, author list and subject (which, without Istvan's
>> proposed patch, even requires opening and parsing the message file).
>> This is probably what's killing you.
>>
>> Out of curiosity, what is your situation that you won't wind up paying
>> the cost of this iteration one way or the other and that the latency
>> of doing these tag changes matters?
>
> I'm trying to implement a terminal interface for notmuch in python
> that resembles sup.
> For the search results view, I read an initial portion from a Threads iterator
> to fill my terminal window with threadline-widgets. Obviously, for a
> large number of results I don't want to go through all of them.
> The problem arises if you toggle a tag on the selected threadline and
> afterwards continue to scroll down.
Ah, that makes sense.

>> > > Have you tried simply calling list() on your thread
>> > > iterator to see how expensive it is? My bet is that it's quite cheap,
>> > > both memory-wise and CPU-wise.
>> > Funny thing:
>> >  q=Database().create_query('*')
>> >  time tlist = list(q.search_threads())
>> > raises a NotmuchError(STATUS.NOT_INITIALIZED) exception. For some reason
>> > the list constructor must read more than once from the iterator.
>> > So this is not an option, but even if it worked, it would show
>> > the same behaviour as my above test..
>>
>> Interesting. Looks like the Threads class implements __len__ and that
>> its implementation exhausts the iterator. Which isn't a great idea in
>> itself, but it turns out that Python's implementation of list() calls
>> __len__ if it's available (presumably to pre-size the list) before
>> iterating over the object, so it exhausts the iterator before even
>> using it.
>>
>> That said, if list(q.search_threads()) did work, it wouldn't give you
>> better performance than your experiment above.
>>
>> > would it be very hard to implement a Query.search_thread_ids() ?
>> > This name is a bit off because it had to be done on a lower level.
>>
>> Lazily fetching the thread metadata on the C side would probably
>> address your problem automatically. But what are you doing that
>> doesn't require any information about the threads you're manipulating?
> Agreed. Unfortunately, there seems to be no way to get a list of thread
> ids or a reliable iterator thereof by using the current python bindings.
> It would be enough for me to have the ids because then I could
> search for the few threads I actually need individually on demand.

There's no way to do that from the C API either, so don't feel left
out. ]:--8) It seems to me that the right solution to your problem
is to make thread information lazy (effectively, everything gathered
in lib/thread.cc:_thread_add_message).
Then you could probably materialize that iterator cheaply. In fact,
it's probably worth trying a hack where you put dummy information in
the thread object from _thread_add_message and see how long it takes
just to walk the iterator (unfortunately I don't think profiling will
help much here because much of your time is probably spent waiting for
I/O). I don't think there would be any downside to doing this for
eager consumers like the CLI.
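
For what it's worth, the list()/__len__ interaction discussed above can be reproduced without notmuch at all. The following is only a minimal sketch: FakeThreads and materialize are hypothetical names made up for illustration and are not the bindings' actual code. It assumes CPython, where list() asks the object for its length before iterating, and it models a __len__ that counts by walking the one-shot iterator.

```python
class FakeThreads:
    """Hypothetical stand-in for the bindings' Threads class (assumption,
    not the real implementation)."""

    def __init__(self, items):
        # One-shot underlying iterator, like a C-side result cursor.
        self._it = iter(items)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

    def __len__(self):
        # Counting by walking the iterator consumes it as a side effect.
        return sum(1 for _ in self._it)


def materialize(threads):
    """Copy an iterator into a list without going through list(),
    so __len__ is never consulted."""
    return [t for t in threads]


# list() calls __len__ first on CPython, so nothing is left to iterate over.
broken = list(FakeThreads(["t1", "t2", "t3"]))

# A list comprehension only uses the iterator protocol.
worked = materialize(FakeThreads(["t1", "t2", "t3"]))

print(broken, worked)
```

On CPython the first result comes back empty and the second has all three items. The real bindings raise NotmuchError instead of returning an empty list, presumably because the exhausted C-side iterator is no longer valid, but the cause is the same. A comprehension sidesteps the __len__ call; it does nothing about the cost of constructing each thread object, which is the actual bottleneck here.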