On Wed, 28 May 2003, Sjoerd Mullender wrote:
> On Tue, May 27 2003 "Alexander R. Pruss" wrote:
[...]
> > > If you look at the source of the class Mapper, in the file
> > > PyPlucker/Writer.py, you will see the code which does the actual
> > > record-id assignment.
> >
> > Yes. I am looking at the fragment:
> >
> >     for (url, doc) in collection.items():
> >         self._get_id_for_doc(doc)
> >
> > _get_id_for_doc assigns record numbers sequentially as it is called. What I
> > guess I don't understand is how collection.items() gets to have the order it
> > does. I don't think the items in it are in the order in which they were
> > added to the collection.
>
> Assuming collection is a Python dictionary, the order of the tuples the
> items() method returns is basically random, although consistent.
[...]
This is correct from my memory and my reading of the code. The collection
is a dictionary and therefore the ordering of records is basically random
(some special records being the exception).
As far as I can see, the only operation called on the collection is
items(), so as a quick fix you could change Spider.py to pass in a
different object that returns the records in the desired order (I don't
know what ordering you want). Where it says:
    collection = spider.get_collected ()
you could write
    collection = MyCollection(spider.get_collected ())
and then add a new class somewhere at the top. E.g.:
    class MyCollection:
        def __init__(self, collection):
            # Copy the (url, doc) pairs and sort them by URL.
            self._items = list(collection.items())
            self._items.sort()
        def items(self):
            return self._items
That should write out the records in alphabetical order of the URLs...
If you need some other ordering you will have to do something more fancy
than the sort().
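
A rough sketch of what "more fancy" could look like, in case it helps. It
is untested against Plucker itself. The class name OrderedCollection, the
by_depth function and the example URLs are all made up for illustration,
and it assumes a Python that has sorted() with a key argument, which may
be newer than what you are running. The idea is the same as MyCollection,
except that the caller supplies the ordering:

```python
class OrderedCollection:
    """Hypothetical wrapper: exposes items() in a caller-chosen order."""

    def __init__(self, collection, key=None):
        # key receives one (url, doc) pair; the default sorts by URL.
        if key is None:
            key = lambda pair: pair[0]
        self._items = sorted(collection.items(), key=key)

    def items(self):
        return self._items


# Stand-in data; in Spider.py this would be spider.get_collected ().
collected = {
    "http://example.org/b/deep/page.html": "doc3",
    "http://example.org/index.html": "doc1",
    "http://example.org/a.html": "doc2",
}


def by_depth(pair):
    # Example ordering (an assumption, not something Plucker defines):
    # URLs with fewer slashes first, ties broken alphabetically.
    url = pair[0]
    return (url.count("/"), url)


ordered = OrderedCollection(collected, key=by_depth)
urls = [url for (url, doc) in ordered.items()]
```

With by_depth the two top-level pages come before the deeper one; without
a key you get the same plain URL ordering as MyCollection.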
Holger