I have a spider for the site openlife.pl. It is my pension fund's site, where I 
log in and can view the history of payments, the value of funds, etc. I wanted 
to scrape the history of my payments and the fees taken, and succeeded: about 
400 operations in total. Operations are listed in a <table> with all the basic 
info and a URL to a details page (unused for now). The table lists only 25 
entries; the rest are on subsequent "next page" pages. I read the table, find 
the URL of the next page, and go there with the same handler. Works 100%.
Later I wanted to scrape data from the details URLs. It seemed straightforward 
and actually was; I managed to write this too. Except it didn't work 100%. 
When downloading details is enabled, I get only 50-100 operations (of 400), 
and fewer than half of them have details. 
It turned out that no URL on the site leads directly to the displayed page; 
every request is redirected first. Example of a proper redirect:

   1. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=16934891&historyType=charge'
   2. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_k7B26980D-F146-E179-AD85-AAB6782F62FA'

However, a lot of URLs are redirected like so:

   1. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details&idHistory=18711185&historyType=charge'
   2. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow'
   3. 'https://portal.openlife.pl/frontend/secure/accountHistory.html?_flowId=account_history-flow'
   4. 'http://portal.openlife.pl/frontend/secure/accountHistory.html?_flowExecutionKey=_cDD567B60-D3D2-ACED-7DEC-72080FC1906C_k51E2C0DB-264A-4DFF-47D2-8E861A1711FF'

Steps 2 and 3 carry no details parameters, so step 4 gives the main page 
instead of the details page. Scraping fails there because the page has none 
of the data the handler expects. Steps 2 and 3 are common to all failed items. 
I didn't notice such redirection when browsing manually. 
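
To tell the two kinds of redirect chain apart automatically, something like 
the sketch below might work. It only encodes the observation above: failed 
chains pass through a URL that has a _flowId parameter and no 
_flowExecutionKey. The function names is_flow_restart and chain_failed are 
made up for illustration and are not part of my spider. As far as I know, 
Scrapy's RedirectMiddleware stores the intermediate URLs in 
response.meta["redirect_urls"], so inside a handler the chain could be built 
from that list plus response.url; the sketch itself uses only the standard 
library and the URLs quoted above.

```python
from urllib.parse import urlsplit, parse_qs


def is_flow_restart(url):
    """True if the URL has a _flowId parameter but no _flowExecutionKey.

    This matches steps 2 and 3 of the failing chain (hypothetical helper).
    """
    query = parse_qs(urlsplit(url).query)
    return "_flowId" in query and "_flowExecutionKey" not in query


def chain_failed(redirect_chain):
    """True if any hop in the chain looks like a flow restart."""
    return any(is_flow_restart(url) for url in redirect_chain)


BASE = "portal.openlife.pl/frontend/secure/accountHistory.html"

# The proper redirect quoted above.
good_chain = [
    "https://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96"
    "_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details"
    "&idHistory=16934891&historyType=charge",
    "http://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96"
    "_k7B26980D-F146-E179-AD85-AAB6782F62FA",
]

# The failing redirect quoted above.
bad_chain = [
    "https://" + BASE + "?_flowExecutionKey=_c647455B5-0E6D-EE90-60CF-E03446BA9D96"
    "_kA1AF3928-4351-6195-1E48-BC839F1B971B&_eventId=details"
    "&idHistory=18711185&historyType=charge",
    "http://" + BASE + "?_flowId=account_history-flow",
    "https://" + BASE + "?_flowId=account_history-flow",
    "http://" + BASE + "?_flowExecutionKey=_cDD567B60-D3D2-ACED-7DEC-72080FC1906C"
    "_k51E2C0DB-264A-4DFF-47D2-8E861A1711FF",
]

print(chain_failed(good_chain))  # False
print(chain_failed(bad_chain))   # True
```

A check like this would only flag the failed items so they can be logged or 
retried; it does not explain why the site restarts the flow.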


What can cause such redirects? How can I avoid them?


The full source code can be viewed here: 
https://github.com/mateuszzz88/scrapy_funds/blob/opeartion_details/crawler/scrapy_openlife/spiders/openlife.py

The methods of interest are on_account_history and on_history_details.

The code also contains my attempts to solve the issue, including two custom 
downloader middlewares that don't help.
