I have a spider for the site openlife.pl. It is my pension fund's site, where I log in and can view the history of payments, the value of funds, etc. I wanted to scrape the history of my payments and the fees taken, and succeeded: about 400 operations in total. Operations are listed in a <table> with all the basic info and a URL to a details page (unused for now). The table lists only 25 entries; the rest are on subsequent "next page" pages. I read the table, find the URL for the next page, and go there with the same handler. This works 100%.

Later I wanted to scrape data from the details URLs. It seemed straightforward and actually was; I managed to write this too. Except it didn't work 100%. When downloading details is enabled, I get only 50-100 operations (of 400), and fewer than half of them have details.

It turned out that no URL on the site leads directly to the page displayed; every request is redirected first. Example of a proper redirect:
1. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=16934891&historyType=charge'
2. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_k7B26980D-F146-E179-AD85-AAB6782F62FA'

However, a lot of URLs are redirected like so:

1. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=18711185&historyType=charge'
2. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow'
3. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow'
4. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_cDD567B60-D3D2-ACED-7DEC-72080FC1906C_k51E2C0DB-264A-4DFF-47D2-8E861A1711FF'

Steps 2 and 3 carry no details parameters, so step 4 gives the main page instead of the details page. Scraping fails there because the page does not contain the data the handler expects. Steps 2 and 3 are common to all failed items. I didn't notice such redirection when browsing manually.

What can cause such redirects? How can I avoid them?

The full source code can be viewed here: https://github.com/mateuszzz88/scrapy_funds/blob/opeartion_details/crawler/scrapy_openlife/spiders/openlife.py

The methods of interest are on_account_history and on_history_details. The code also contains my attempts to solve the issue, including two custom downloader middlewares that don't help.
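To make the difference between the two redirect chains quoted above easier to spot in logs, here is a minimal sketch that classifies a chain using only the standard library. It is not part of the spider; the helper names (flow_params, conversation_id, flow_was_restarted) are hypothetical, and it assumes that the part of _flowExecutionKey before "_k" identifies the server-side flow, which is only inferred from the shape of the URLs above. In Scrapy the chain could presumably be built from response.meta.get("redirect_urls", []) plus response.url.

```python
from urllib.parse import urlsplit, parse_qs

BASE = "portal.openlife.pl/frontend/secure/accountHistory.html"

# The two chains from the post, copied verbatim.
GOOD_CHAIN = [
    "https://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=16934891&historyType=charge",
    "http://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_k7B26980D-F146-E179-AD85-AAB6782F62FA",
]
BAD_CHAIN = [
    "https://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=18711185&historyType=charge",
    "http://" + BASE + "?_flowId=account_history-flow",
    "https://" + BASE + "?_flowId=account_history-flow",
    "http://" + BASE + "?_flowExecutionKey=_cDD567B60-D3D2-ACED-7DEC-72080FC1906C_k51E2C0DB-264A-4DFF-47D2-8E861A1711FF",
]


def flow_params(url):
    """Return the query parameters of a URL as a flat dict (first value wins)."""
    query = parse_qs(urlsplit(url).query)
    return {name: values[0] for name, values in query.items()}


def conversation_id(url):
    """Return the part of _flowExecutionKey before '_k', or None if there is no key.

    Assumption: the key has the shape '_c<conversation>_k<snapshot>',
    as in the URLs quoted in the post.
    """
    key = flow_params(url).get("_flowExecutionKey")
    if key is None:
        return None
    return key.split("_k", 1)[0]


def flow_was_restarted(chain):
    """True if a hop carries _flowId or the conversation id changes along the chain."""
    if any("_flowId" in flow_params(url) for url in chain):
        return True
    ids = {conversation_id(url) for url in chain} - {None}
    return len(ids) > 1


if __name__ == "__main__":
    print("good chain restarted:", flow_was_restarted(GOOD_CHAIN))
    print("bad chain restarted:", flow_was_restarted(BAD_CHAIN))
```

Logging the result of flow_was_restarted for every details response should show whether the failed items are exactly the ones whose chain passes through _flowId=account_history-flow, as the examples suggest.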