For now I took a website for testing purposes, to help me learn the 
basics of Scrapy.

The website is http://www.allosociete.ch/telephone-horaires-metier/Pressing

I would like to get the following output in CSV:
counter: a simple incrementing number
page_id: the page number
url: the URL that displays the company details
company name: the company name, collected from that URL

On http://www.allosociete.ch/telephone-horaires-metier/Pressing, 
which is page 1, I was able to collect data with the following spider:

import scrapy

class AlloSociete(scrapy.Spider):
  name = 'allosocietepressing'
  start_urls = ['http://www.allosociete.ch/telephone-horaires-metier/Pressing']
  counter = 1
  pagenum = 1

  def parse(self, response):
    # Follow every company link found on the listing page.
    for href in response.css('div.lien-ville ul li a::attr(href)'):
      full_url = response.urljoin(href.extract())
      yield scrapy.Request(full_url, callback=self.parse_lien)

  def parse_lien(self, response):
    yield {
      'count': self.counter,
      'page': self.pagenum,
      'lien': response.url,
    }
    self.counter += 1

For now I don't have a clear understanding of how to handle the 
pagination, or how to replace self.pagenum with the actual page number.
This section has only 3 pages.
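One way to approach this, sketched below under the assumption that the listing pages always use a ?p=N query parameter (with page 1 having no ?p at all): derive the page number from the URL itself instead of keeping self.pagenum on the spider, and queue the next page from parse(). The helper names (page_of, next_page) are my own, not Scrapy API.

```python
# Sketch only: assumes pagination uses ?p=N and that page 1 has no ?p.
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

def page_of(url):
    """Page number encoded in a pagination URL (1 when ?p= is absent)."""
    query = parse_qs(urlsplit(url).query)
    return int(query.get('p', ['1'])[0])

def next_page(url):
    """URL of the following page, built by bumping the ?p= parameter."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['p'] = [str(page_of(url) + 1)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))

# Inside parse(), the page number can then travel with each request via
# Request.meta instead of living on the spider:
#   yield scrapy.Request(full_url, callback=self.parse_lien,
#                        meta={'page': page_of(response.url)})
#   if page_of(response.url) < 3:  # this section has only 3 pages
#       yield scrapy.Request(next_page(response.url), callback=self.parse)
# and parse_lien() reads it back with response.meta['page'].
```

Passing the page through meta also avoids relying on spider-level counters, which get confusing because Scrapy schedules requests concurrently, not in page order.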

Thanks for helping me understand how Scrapy works; it seems very 
promising for collecting real-time data.
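For reference, the regex idea from the reply quoted below can be checked in plain Python; I've swapped Restaurant for Pressing here, which is my adaptation, not part of the original suggestion.

```python
import re

# Matches pagination links that end in ?p=<number>; the first page
# (which has no ?p at all) deliberately does not match.
PAGE_RE = re.compile(r"telephone-horaires-metier/Pressing\?p=[0-9]+$")

paginated = PAGE_RE.search(
    "http://www.allosociete.ch/telephone-horaires-metier/Pressing?p=2")
first_page = PAGE_RE.search(
    "http://www.allosociete.ch/telephone-horaires-metier/Pressing")
```

In a CrawlSpider, the same pattern would go into the allow parameter of a LinkExtractor rule.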




On Monday, August 15, 2016 at 1:14:45 AM UTC+2, WANG Ruoxi wrote:
>
> Hi Raf,
>
> Not sure that I understand your question well, but you can always use a 
> regex in the LinkExtractor to retrieve all the pagination links that you 
> need. Something like "telephone-horaires-metier\/Restaurant\?p=[0-9]+$" 
> can match the links, if your last number is always a positive integer.
>
> Regards,
>
>
>
> On Sunday, August 14, 2016 at 11:11:40 PM UTC+8, Raf Roger wrote:
>>
>> Hi,
>>
>> I'm new to Scrapy and I'm looking for a way to retrieve all links 
>> (matching the selector ul li a).
>> On each page there is pagination, and the first page URL looks like:
>> telephone-horaires-metier/Restaurant
>>
>> page 2 url is:
>> telephone-horaires-metier/Restaurant?p=2
>>
>> page 3 url is:
>> telephone-horaires-metier/Restaurant?p=3
>>
>> etc...
>>
>> The "next" URL is always the current page + 1, so if I'm on page 2 the 
>> "next" URL is telephone-horaires-metier/Restaurant?p=3.
>>
>> How can I collect all links on each page?
>>
>> thx
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at https://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
