To catch all redirection paths, including cases where the final URL has already been crawled, I wrote a custom duplicate filter:
from scrapy.dupefilters import RFPDupeFilter

from myscraper.items import RedirectionItem


class CustomURLFilter(RFPDupeFilter):

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)
        if request_seen:
            # The request was filtered as a duplicate: capture the redirect
            # chain that led to the already-crawled URL.
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
            # ...but at this point the item goes nowhere, hence my question.
        return request_seen

Now, how can I send the RedirectionItem directly to the pipeline? Is there a way to instantiate the pipeline from the custom filter so that I can send data directly? Or should I also create a custom scheduler and get the pipeline from there, and if so, how?

Thanks!
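One direction I'm considering (untested, and it leans on Scrapy internals that may change between versions): recent Scrapy releases will build the dupefilter through a from_crawler classmethod when one is defined, which would give the filter a crawler reference, and the item could then be pushed through crawler.engine.scraper.itemproc, the internal item pipeline manager. Treat this as a sketch under those assumptions, not a confirmed API:

from scrapy.dupefilters import RFPDupeFilter

from myscraper.items import RedirectionItem


class PipelineAwareURLFilter(RFPDupeFilter):

    @classmethod
    def from_crawler(cls, crawler):
        # Newer Scrapy versions call from_crawler when building the
        # dupefilter; older ones only call from_settings, so this hook
        # may never be reached there.
        dupefilter = cls.from_settings(crawler.settings)
        dupefilter.crawler = crawler
        return dupefilter

    def request_seen(self, request):
        seen = super(PipelineAwareURLFilter, self).request_seen(request)
        if seen:
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
            # itemproc is the undocumented item pipeline manager; calling it
            # here bypasses the normal spider-to-pipeline flow and returns a
            # Deferred whose errors are not handled in this sketch.
            self.crawler.engine.scraper.itemproc.process_item(
                item, self.crawler.spider)
        return seen

Does that look sane, or is there a cleaner hook?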
On Monday, August 29, 2016 at 12:48:12 PM UTC+2, Antoine Brunel wrote:

> Hello,
>
> 1. I use scrapy to create a *linkmap* table, i.e. one listing not only all
> links on crawled pages but also each link's status:
>
>     url                  link                         anchor
>     http://example.com/  http://example.com/link_A    Link A
>     http://example.com/  http://example.com/link_B    Link B
>     http://example.com/  http://example.com/link_C    Link C
>
> To do that, I extract the links from the response, then create a request
> with a specific callback:
>
>     for href_element in response.xpath("//a[@href]"):
>         link = urljoin(response.url,
>                        href_element.xpath("@href").extract_first())
>         yield Request(link, callback=self.parse_item)
>
> When URLs and their links are crawled, they are stored in a table named
> *urls*:
>
>     url                          status
>     http://example.com/link_A    200
>     http://example.com/link_B    200
>     http://example.com/link_C    404
>
> Then I pull the status into the first table by joining on link and url:
>
>     """select linkmap.url, linkmap.link, linkmap.anchor, urls.status
>     from linkmap, urls where linkmap.link=urls.url"""
>
> And the result is the following:
>
>     url                  link                         anchor   status
>     http://example.com/  http://example.com/link_A    Link A   200
>     http://example.com/  http://example.com/link_B    Link B   200
>     http://example.com/  http://example.com/link_C    Link C   404
>
> 2a. Problems arise when redirections get in the way...
> If http://example.com/link_D has a 301 pointing to
> http://example.com/link_D1, the link URL and the final URL are different,
> so the join on the URL returns nothing and I cannot connect the status.
>
> Table *linkmap*
>
>     url                  link                          anchor
>     http://example.com/  http://example.com/*link_D*   Link D
>
> Table *urls*
>
>     url                           status
>     http://example.com/*link_D1*  200
>
> => And the query result is the following:
>
>     url                  link                         anchor   status
>     http://example.com/  http://example.com/link_D    Link D   *EMPTY*
>
> 2b. So I add the redirection path to the item with
> response.meta.get('redirect_urls'). Then, in pipelines.py, link_final is
> added to the row matching the initial url:
>
>     links = item.get('redirect_urls', [])
>     if len(links) > 0 and url != '':
>         link_final = item.get('url', '')
>         cursor.execute("""UPDATE pagemap SET link_final=%s
>             WHERE link=%s AND url=%s""", (link_final, links[0], url))
>
> And now we have:
>
> Table *linkmap*
>
>     url                  link                        link_final                    anchor
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D
>
> Table *urls*
>
>     url                          status
>     http://example.com/link_D1   200
>
> => The query is updated to:
>
>     """select linkmap.url, linkmap.link, linkmap.anchor, urls.status
>     from linkmap, urls where linkmap.link=urls.url *or
>     linkmap.link_final=urls.url*"""
>
> And the query result is now the following, and life is beautiful:
>
>     url                  link                        link_final                    anchor   status
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D   *200*
>
> 3. However, life is tough, and more problems arise when more redirections
> get in the way...
> Let's say http://example.com/link_E also redirects to
> http://example.com/link_D1, and it is crawled after
> http://example.com/link_D...
>
> Because of RFPDupeFilter, the second request to link_D1 is dropped, so
> link_final is never set for http://example.com/link_E and its linkmap row
> cannot be updated:
>
> Table *linkmap*
>
>     url                  link                        link_final                    anchor
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D
>     http://example.com/  http://example.com/link_E   *EMPTY*                       Link E
>
> Table *urls*
>
>     url                          status
>     http://example.com/link_D1   200
>
> => And the query result is the following:
>
>     url                  link                        link_final                    anchor   status
>     http://example.com/  http://example.com/link_D   http://example.com/link_D1    Link D   *200*
>     http://example.com/  http://example.com/link_E   *EMPTY*                       Link E   *EMPTY*
>
> Short-term and super ugly solution: disable RFPDupeFilter. But no, I don't
> want to do that, I want to find another way!
>
> Conclusion: this approach failed again. Still, I have the feeling that
> there is a simpler way, like capturing an already-crawled URL from
> RFPDupeFilter or something like that. I mean, Scrapy is crawling all these
> URLs, so it must be possible to create the linkmap with the status; the
> question is what the right way to do it is. Maybe I am wrong to wait until
> the info gets to the pipeline?
>
> Thanks for your help, reflections, critiques and ideas!
> Antoine.
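PS: for completeness, here is roughly how I picture the pipeline side once a RedirectionItem arrives: every source URL in the redirect chain gets its link_final set to the destination, which would also cover the link_E case above. Untested sketch; the pipeline class name and the self.cursor attribute are placeholders of mine (the cursor would be opened elsewhere, e.g. in open_spider), and I am reusing the linkmap table from the discussion above:

from myscraper.items import RedirectionItem


class RedirectionPipeline(object):

    def process_item(self, item, spider):
        # Only the items emitted for filtered duplicates carry a redirect
        # chain; everything else passes through untouched.
        if isinstance(item, RedirectionItem):
            for source in item['sources']:
                # Every URL in the chain ends up at the same destination,
                # so each one gets the same link_final.
                self.cursor.execute(
                    """UPDATE linkmap SET link_final=%s WHERE link=%s""",
                    (item['destination'], source))
        return item

With that in place, link_E would get its link_final filled in even though link_D1 itself is never re-crawled.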