Of course: I fight with this issue for a month, post a request for help on the Internet, and find the cause 2 days later...
I managed to reproduce this issue in Firefox by clicking more and faster. It seems that the _flowExecutionKey parameter has a server-side limit on how many times it can be used (or how quickly; I don't know yet). I will have to work around it somehow.

On Monday, 10 October 2016 11:46:08 UTC+2, Mateusz Lewicki wrote:
>
> I have a spider for the site openlife.pl. It is my pension fund's site, where I
> log in and can view the history of payments, the value of funds, etc. I wanted to
> scrape the history of my payments and the fees taken, and succeeded: about 400
> operations in total. Operations are listed in a <table> with all the basic info
> and a url to a details page (unused for now). The table lists only 25 entries;
> the rest are on subsequent "next page" pages. I read the table, find the url
> for the next page, and go there with the same handler. Works 100%.
>
> Later I wanted to scrape data from the details urls. It seemed straightforward
> and actually was; I managed to write this too. Except it didn't work 100%.
> When downloading details is enabled, I get 50-100 operations (of 400) and
> fewer than half of them have details.
>
> It turned out that every single url on the site does not lead directly to
> the page displayed, but is first redirected. Example of a proper redirect:
>
> 1. https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=16934891&historyType=charge
> 2. http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_k7B26980D-F146-E179-AD85-AAB6782F62FA
>
> However, a lot of urls are redirected like so:
>
> 1. https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=18711185&historyType=charge
> 2. http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow
> 3. https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow
> 4. http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_cDD567B60-D3D2-ACED-7DEC-72080FC1906C_k51E2C0DB-264A-4DFF-47D2-8E861A1711FF
>
> Steps 2 and 3 carry no detail information, so step 4 gives the main page
> instead of the detail page. Scraping fails there because the data expected by
> the handler is missing. Steps 2 and 3 are common to all failed items. I didn't
> notice such redirection when browsing manually.
>
> What can cause such redirects? How can I avoid them?
>
> The source code can be viewed in full here:
> https://github.com/mateuszzz88/scrapy_funds/blob/opeartion_details/crawler/scrapy_openlife/spiders/openlife.py
>
> The methods of interest are on_account_history and on_history_details.
>
> The code also contains my attempts to solve the issue, including two custom
> downloader middlewares that don't help.
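
The two redirect chains quoted above differ in a way that can be checked mechanically: the failed chain passes through URLs that carry _flowId but no _flowExecutionKey. Below is a minimal sketch of such a check, assuming Python 3 and assuming that this pattern always means the server discarded the key and restarted the flow (an assumption based only on the two chains shown, not on anything confirmed about the site). The function names flow_was_restarted and execution_key are hypothetical and are not taken from the spider.

```python
from urllib.parse import urlsplit, parse_qs


def flow_was_restarted(url_chain):
    """Return True if any URL in the chain carries _flowId but no
    _flowExecutionKey (assumed here to mean the flow was restarted)."""
    for url in url_chain:
        params = parse_qs(urlsplit(url).query)
        if "_flowId" in params and "_flowExecutionKey" not in params:
            return True
    return False


def execution_key(url):
    """Return the _flowExecutionKey value from a URL, or None if absent."""
    values = parse_qs(urlsplit(url).query).get("_flowExecutionKey")
    return values[0] if values else None
```

In a Scrapy callback the chain could be built as response.meta.get('redirect_urls', []) + [response.url], since the redirect middleware records the URLs a request passed through under that meta key. A handler that sees flow_was_restarted(...) return True could skip parsing and re-queue the item instead of failing. Whether slowing the crawl (for example with the CONCURRENT_REQUESTS and DOWNLOAD_DELAY settings) avoids the restart is untested, because it is not yet known whether the limit is on count or on speed.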