To catch all redirection paths, including cases where the final URL has already been crawled, I wrote a custom duplicate filter:
from scrapy.dupefilters import RFPDupeFilter

from myscraper.items import RedirectionItem


class CustomURLFilter(RFPDupeFilter):

    def request_seen(self, request):
        request_seen = super(CustomURLFilter, self).request_seen(request)
        if request_seen:
            # The request was filtered as a duplicate: capture the redirect
            # chain that led to the already-crawled URL.
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
            # ...but at this point the item goes nowhere, hence my question.
        return request_seen

Now, how can I send the RedirectionItem directly to the pipeline? Is there a way to instantiate the pipeline from the custom filter so that I can send data directly? Or should I also create a custom scheduler and get the pipeline from there, and if so, how?

Thanks!
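One direction I'm considering (untested, and it leans on Scrapy internals that may change between versions): recent Scrapy releases will build the dupefilter through a from_crawler classmethod when one is defined, which would give the filter a crawler reference, and the item could then be pushed through crawler.engine.scraper.itemproc, the internal item pipeline manager. Treat this as a sketch under those assumptions, not a confirmed API:

from scrapy.dupefilters import RFPDupeFilter

from myscraper.items import RedirectionItem


class PipelineAwareURLFilter(RFPDupeFilter):

    @classmethod
    def from_crawler(cls, crawler):
        # Newer Scrapy versions call from_crawler when building the
        # dupefilter; older ones only call from_settings, so this hook
        # may never be reached there.
        dupefilter = cls.from_settings(crawler.settings)
        dupefilter.crawler = crawler
        return dupefilter

    def request_seen(self, request):
        seen = super(PipelineAwareURLFilter, self).request_seen(request)
        if seen:
            item = RedirectionItem()
            item['sources'] = list(request.meta.get('redirect_urls', []))
            item['destination'] = request.url
            # itemproc is the undocumented item pipeline manager; calling it
            # here bypasses the normal spider-to-pipeline flow and returns a
            # Deferred whose errors are not handled in this sketch.
            self.crawler.engine.scraper.itemproc.process_item(
                item, self.crawler.spider)
        return seen

Does that look sane, or is there a cleaner hook?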
On Monday, August 29, 2016 at 12:48:12 PM UTC+2, Antoine Brunel wrote:

> Hello,
>
> 1. I use scrapy to create a *linkmap* table, i.e. one listing not only all
> links on crawled pages but also each link's status:
>
>     url                  link                         anchor
>     http://example.com/  http://example.com/link_A    Link A
>     http://example.com/  http://example.com/link_B    Link B
>     http://example.com/  http://example.com/link_C    Link C
>
> To do that, I extract the links from the response, then create a request
> with a specific callback:
>
>     for href_element in response.xpath("//a[@href]"):
>         link = urljoin(response.url,
>                        href_element.xpath("@href").extract_first())
>         yield Request(link, callback=self.parse_item)
>
> When URLs and their links are crawled, they are stored in a table named
> *urls*:
>
>     url                          status
>     http://example.com/link_A    200
>     http://example.com/link_B    200
>     http://example.com/link_C    404
>
> Then I pull the status into the first table by joining on link and url:
>
>     """select linkmap.url, linkmap.link, linkmap.anchor, urls.status
>     from linkmap, urls where linkmap.link=urls.url"""
>
> And the result is the following:
>
>     url                  link                         anchor   status
>     http://example.com/  http://example.com/link_A    Link A   200
>     http://example.com/  http://example.com/link_B    Link B   200
>     http://example.com/  http://example.com/link_C    Link C   404
>
> 2a. Problems arise when redirections get in the way...
> If http://example.com/link_D has a 301 pointing to
> http://example.com/link_D1, the link URL and the final URL are different,
> so the join on the URL returns nothing and I cannot connect the status.
>
> Table *linkmap*
>
>     url                  link                          anchor
>     http://example.com/  http://example.com/*link_D*   Link D
>
> Table *urls*
>
>     url                           status
>     http://example.com/*link_D1*  200
>
> => And the query result is the following:
>
>     url                  link                         anchor   status
>     http://example.com/  http://example.com/link_D    Link D   *EMPTY*
>
> 2b. So I add the redirection path to the item with
> response.meta.get('redirect_urls'). Then, in pipelines.py, link_final is
> added to the row matching the initial url:
>
>     links = item.get('redirect_urls', [])
>     if len(links) > 0 and url != '':
>         link_final = item.get('url', '')
>         cursor.execute("""UPDATE pagemap SET link_final=%s
>             WHERE link=%s AND url=%s""", (link_final, links[0], url))
>
> And now we have:
>
> Table *linkmap*
>
>     url                  link                        link_final                    anchor
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D
>
> Table *urls*
>
>     url                          status
>     http://example.com/link_D1   200
>
> => The query is updated to:
>
>     """select linkmap.url, linkmap.link, linkmap.anchor, urls.status
>     from linkmap, urls where linkmap.link=urls.url *or
>     linkmap.link_final=urls.url*"""
>
> And the query result is now the following, and life is beautiful:
>
>     url                  link                        link_final                    anchor   status
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D   *200*
>
> 3. However, life is tough, and more problems arise when more redirections
> get in the way...
> Let's say http://example.com/link_E also redirects to
> http://example.com/link_D1, and it is crawled after
> http://example.com/link_D...
>
> Because of RFPDupeFilter, the second request to link_D1 is dropped, so
> link_final is never set for http://example.com/link_E and its linkmap row
> cannot be updated:
>
> Table *linkmap*
>
>     url                  link                        link_final                    anchor
>     http://example.com/  http://example.com/link_D   *http://example.com/link_D1*  Link D
>     http://example.com/  http://example.com/link_E   *EMPTY*                       Link E
>
> Table *urls*
>
>     url                          status
>     http://example.com/link_D1   200
>
> => And the query result is the following:
>
>     url                  link                        link_final                    anchor   status
>     http://example.com/  http://example.com/link_D   http://example.com/link_D1    Link D   *200*
>     http://example.com/  http://example.com/link_E   *EMPTY*                       Link E   *EMPTY*
>
> Short-term and super ugly solution: disable RFPDupeFilter. But no, I don't
> want to do that, I want to find another way!
>
> Conclusion: this approach failed again. Still, I have the feeling that
> there is a simpler way, like capturing an already-crawled URL from
> RFPDupeFilter or something like that. I mean, Scrapy is crawling all these
> URLs, so it must be possible to create the linkmap with the status; the
> question is what the right way to do it is. Maybe I am wrong to wait until
> the info gets to the pipeline?
>
> Thanks for your help, reflections, critiques and ideas!
> Antoine.
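PS: for completeness, here is roughly how I picture the pipeline side once a RedirectionItem arrives: every source URL in the redirect chain gets its link_final set to the destination, which would also cover the link_E case above. Untested sketch; the pipeline class name and the self.cursor attribute are placeholders of mine (the cursor would be opened elsewhere, e.g. in open_spider), and I am reusing the linkmap table from the discussion above:

from myscraper.items import RedirectionItem


class RedirectionPipeline(object):

    def process_item(self, item, spider):
        # Only the items emitted for filtered duplicates carry a redirect
        # chain; everything else passes through untouched.
        if isinstance(item, RedirectionItem):
            for source in item['sources']:
                # Every URL in the chain ends up at the same destination,
                # so each one gets the same link_final.
                self.cursor.execute(
                    """UPDATE linkmap SET link_final=%s WHERE link=%s""",
                    (item['destination'], source))
        return item

With that in place, link_E would get its link_final filled in even though link_D1 itself is never re-crawled.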