Re: Is there a breadcrumb trail?

Nikolaos-Digenis Karagiannis Wed, 12 Feb 2014 21:52:09 -0800

Yes, "Referer" is a typo that survived in the HTTP header name. You can 
skip the setting in settings.py though, because RefererMiddleware is 
enabled by default:
https://scrapy.readthedocs.org/en/latest/topics/settings.html#std:setting-SPIDER_MIDDLEWARES_BASE
After seeing the above link you will probably notice the bug in your 
settings: middleware sorting keys are normally integers, but you passed 
True. Because True has a __cmp__ method, it is still accepted and used for 
sorting:
https://github.com/scrapy/scrapy/blob/c886d7459f0e259606255812102caf77e40aa7e7/scrapy/utils/conf.py#L15-L16
In a Python 2 shell try:
1 == True
sorted([2, True, '0', []])
This lets you introduce such bugs by accident, using types you never meant 
to sort. Your True did exactly that: it compares equal to 1, so it moved 
the RefererMiddleware to the bottom of the spider middleware stack.
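
To make the effect concrete, here is a minimal sketch of a merge-and-sort 
step. It is an assumption about the general shape of such a function, not 
the actual build_component_list() source, and the short middleware names 
and order values are illustrative placeholders.

```python
from operator import itemgetter


def order_components(base, custom):
    """Merge two {name: order} dicts and return the names sorted by order.

    Simplified stand-in for illustration only; not Scrapy's own code.
    """
    merged = dict(base)
    merged.update(custom)
    return [name for name, order in sorted(merged.items(), key=itemgetter(1))
            if order is not None]


# Illustrative base orders (placeholder names and values).
BASE = {
    'HttpErrorMiddleware': 50,
    'OffsiteMiddleware': 500,
    'RefererMiddleware': 700,
    'UrlLengthMiddleware': 800,
    'DepthMiddleware': 900,
}

default_order = order_components(BASE, {})
with_true = order_components(BASE, {'RefererMiddleware': True})

print(default_order.index('RefererMiddleware'))  # 2
print(with_true.index('RefererMiddleware'))      # 0, because True sorts as 1
```

Nothing in the sketch rejects the bool, so the override silently changes 
the middleware's position instead of raising an error.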
On the other hand, because build_component_list() doesn't check the types 
of the sorting keys, you can use real numbers and, in theory, have 
infinitely many positions between middlewares.


SPIDER_MIDDLEWARES = {
    'project.downloadermiddlewares.keyoccupier.Above': 740,
    'georgcantor.uncountability.InfiniteInfinities': 740.5,
    'project.downloadermiddlewares.keyoccupier.Below': 741,
}

The documentation doesn't specify a type; it only says "their values are 
the middleware orders".
You could even use classes with their own __cmp__ method and do some magic.
Classifying this as a bug or feature remains an open discussion.
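
As a rough illustration of both ideas (fractional orders and a class that 
defines its own comparison), the sketch below sorts a dict by value the 
same way. AlwaysLast and the middleware names are hypothetical. Python 3 
has no __cmp__, so the class uses the rich comparison methods __lt__ and 
__gt__ instead.

```python
from operator import itemgetter


class AlwaysLast:
    """Hypothetical order value that sorts after every number."""

    def __lt__(self, other):
        # Never smaller than anything.
        return False

    def __gt__(self, other):
        # Greater than everything except another AlwaysLast.
        return not isinstance(other, AlwaysLast)


# Hypothetical middleware names; only the order values matter here.
orders = {
    'Below': 741,
    'Last': AlwaysLast(),
    'InBetween': 740.5,
    'Above': 740,
}

ranked = [name for name, _ in sorted(orders.items(), key=itemgetter(1))]
print(ranked)  # ['Above', 'InBetween', 'Below', 'Last']
```

The float slots in between the two integers, and the custom object wins 
every comparison, which is the kind of "magic" an unchecked sorting key 
makes possible.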
On Thursday, 13 February 2014 01:14:44 UTC+2, Michael Pastore wrote:
>
> Nikolaos,
>
> Perfect! The Referer Middleware was just what I was looking for (I only 
> needed to capture the referring url and not the entire breadcrumb trail).
>
> It took me a bit of reading through posts to figure out how to actually 
> retrieve the referring url, and the basics are below:
>
> Add to your settings file:
>
> SPIDER_MIDDLEWARES = {
>
> 'scrapy.contrib.spidermiddleware.referer.RefererMiddleware': True,
> }
>
>
> Then in your spider parser use the following to access the referring url:
>
> response.request.headers.get('Referer', None) #btw: 'Referer' is the 
> correct usage, 'Referrer' will not work
>
> Thanks again!
>
> On Monday, February 10, 2014 3:00:10 PM UTC-5, Michael Pastore wrote:
>>
>> I am writing a crawling spider but for each url visited and parsed, the 
>> saved item needs to include the originating url.  
>>
>> For example, let's say the start_urls = ["http://www.A.com"] and the 
>> initial list of urls to follow that are extracted by the 
>> SgmlLinkExtractor are ["http://www.B.com", "http://www.C.com"]. The 
>> spider engine would then schedule a visit to www.B.com, then www.C.com. 
>> When the spider crawls to www.B.com and the parse method extracts some 
>> data, I need the processed item to include a field with the originating 
>> url, which in this case is www.A.com.
>>
>> Like a breadcrumb trail, for each call to the parse method I need to look 
>> back one step. Is there an existing way to get this information? 
>>
>> Much thanks
>>
>
