Hi Duy,
You may replace \d+ with something like:
In [46]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-29s').group(1)
Out[46]: '29'
In [47]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-31s').group(1)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/home/wk/src/ibcrawler/<ipython console> in <module>()
AttributeError: 'NoneType' object has no attribute 'group'
In [48]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-30s').group(1)
Out[48]: '30'
In [49]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-0s').group(1)
Out[49]: '0'
In [50]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-4s').group(1)
Out[50]: '4'
In [51]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-1s').group(1)
Out[51]: '1'
On Monday, February 24, 2014 9:16:37 PM UTC+2, Duy Nguyen wrote:
>
> Hi guys,
>
> I have 2 start_urls, each of which has 100 pages matching the pattern
> "resumes/url1/page-\d+".
>
> I am only interested in the first 30 pages of each start_url. In other words,
> I want to crawl "resumes/url1/page-\d+" where *\d+ <= 30*.
>
> Is there an option I can specify under "rules"?
>
> OK to crawl: resumes/something/page-20/
>
> NOT OK to crawl: resumes/something/page-31/
>
> start_urls = [
>     "url1", "url2"
> ]
>
> rules = [
>     Rule(SgmlLinkExtractor(allow=(r"resumes/\w+/page-\d+",),
>                            restrict_xpaths=('//a[@title="Next"]',)),
>          callback="parse_items", follow=True),
> ]
>
> def parse_items(self, response):
> ........
>
> Thanks,
>
>
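
If you'd rather keep the plain \d+ in the allow pattern, another option is to filter the extracted links numerically with a process_links hook on the Rule. A rough sketch (the class and method names here are placeholders, not anything from your spider):

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ResumeSpider(CrawlSpider):  # placeholder name
        name = "resumes"
        start_urls = ["url1", "url2"]

        rules = [
            Rule(SgmlLinkExtractor(allow=(r"resumes/\w+/page-\d+",),
                                   restrict_xpaths=('//a[@title="Next"]',)),
                 process_links="limit_pages",
                 callback="parse_items", follow=True),
        ]

        def limit_pages(self, links):
            # every link here already matched page-\d+ in allow, so
            # search() cannot return None; keep page numbers <= 30 only
            return [link for link in links
                    if int(re.search(r"page-(\d+)", link.url).group(1)) <= 30]

        def parse_items(self, response):
            # your parsing code
            pass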