Dear Dimitris, dear all,

I am working with Massimo (the OP) on this scraping project.
I would like to elaborate a little bit on the problem description.
What we would like to do is write a spider that mirrors an entire phpBB board
and, along the way, collects information from some of its pages.
To do so we use a set of rules: some of them identify the pages from which we
want to extract information, and one matches all pages, so that every page can
be collected and stored locally.
Concretely, the rules are the following:

Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'), callback='parse_forum', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"topictitle")]'), callback='parse_topic', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"memberlist")]'), callback='parse_standard', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"mode=viewprofile")]'), callback='parse_members', follow=True),
Rule(LinkExtractor(), callback='parse_standard', follow=True)


The first four rules match the links pointing to pages containing information
we want to scrape, while the last one catches every link not matched by the
first four.
Our understanding, from the Scrapy documentation, is that when a link matches
more than one rule the first matching rule is used, so pages containing
information will be processed according to one of the first four rules, while
all other pages (those not matching any of the restrict_xpaths expressions)
will be processed according to the last one.
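
For context, the rules sit inside a CrawlSpider subclass roughly as sketched
below; the class name, board address and callback bodies are placeholders, and
only the first and last rules are repeated to keep the sketch short:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BoardSpider(CrawlSpider):
    name = 'board'
    start_urls = ['https://forum.example.com/index.php']  # placeholder address

    # Rule order matters: a memberlist link, for instance, matches both the
    # "memberlist" rule and the catch-all, but only the first match is used.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'),
             callback='parse_forum', follow=True),
        # ... the other three restricted rules, in the order listed above ...
        Rule(LinkExtractor(), callback='parse_standard', follow=True),
    )

    def parse_forum(self, response):
        pass  # extraction code omitted here

    def parse_standard(self, response):
        pass  # the page is just stored locally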

The pages matched by the first four rules require authentication, so we log in
using the technique Massimo reported (based on FormRequest.from_response),
expecting Scrapy to start following the rules only after the authentication
step has been carried out.
What we observe instead is that this is not the case: the phpBB board replies
that authorization is required before those pages can be viewed.
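
For completeness, the login step itself follows the usual
FormRequest.from_response pattern, roughly as sketched below; this is a
simplified reading of Massimo's technique rather than his exact code, and the
login URL, form field names and credentials are placeholders (it also needs
"import scrapy" at the top of the module):

    # inside the spider class sketched above

    def start_requests(self):
        # Fetch the login page first, instead of letting Scrapy request the
        # start_urls right away.
        yield scrapy.Request('https://forum.example.com/ucp.php?mode=login',
                             callback=self.login)

    def login(self, response):
        # from_response() copies the hidden fields of the phpBB login form and
        # posts our credentials; with no callback of its own, the post-login
        # response goes to CrawlSpider's default parse(), which applies the rules.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'someuser', 'password': 'secret'})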

Strangely, if we remove the last rule, i.e. we use only the first four, namely:

Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"forumtitle")]'), callback='parse_forum', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"topictitle")]'), callback='parse_topic', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"memberlist")]'), callback='parse_standard', follow=True),
Rule(LinkExtractor(restrict_xpaths='//*[contains(@href,"mode=viewprofile")]'), callback='parse_members', follow=True)


everything works as expected, that is, we do get the pages that are restricted
to authenticated users.

We suspect that in the first case (i.e., when using all FIVE rules) Scrapy
starts following links AFTER the FormRequest.from_response has been sent to
the server, but BEFORE the corresponding reply (carrying the session cookie,
or whatever other authentication info) has been received and/or processed by
Scrapy.
Could this be the case? And, if so, how can we make rule matching start only
after the authentication response has been received and processed?
If not, what else could be going wrong?
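
In case it helps to make the question concrete: would giving the FormRequest
an explicit post-login callback, and only handing the start URLs to the
crawler from there, be the right way to guarantee that no rule is evaluated
before the authenticated session is in place? Roughly (again with placeholder
credentials):

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'someuser', 'password': 'secret'},
            callback=self.after_login)  # wait for the login response first

    def after_login(self, response):
        # Only now issue the start URLs; with no explicit callback these
        # responses go through CrawlSpider's default parse(), i.e. through
        # the rules, with the session cookie already set.
        for url in self.start_urls:
            yield scrapy.Request(url)

Or is there a more idiomatic hook in CrawlSpider for this?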

Thank you very much in advance for any help you can provide.

Cosimo


