Hi Duy,
You may replace \d+ with something like:
In [46]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-29s').group(1)
Out[46]: '29'
In [47]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-31s').group(1)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/home/wk/src/ibcrawler/<ipython console> in <module>()
AttributeError: 'NoneType' object has no attribute 'group'
In [48]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-30s').group(1)
Out[48]: '30'
In [49]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-0s').group(1)
Out[49]: '0'
In [50]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-4s').group(1)
Out[50]: '4'
In [51]: re.search(r'[^0-9]([1-2]\d|30|\d)[^0-9]','w-1s').group(1)
Out[51]: '1'
On Monday, February 24, 2014 9:16:37 PM UTC+2, Duy Nguyen wrote:
>
> Hi guys,
>
> I have 2 start_urls, each of which has 100 pages matching the pattern
> "resumes/url1/page-\d+".
>
> I am only interested in the first 30 pages of each start_url. In other words,
> I want to crawl "resumes/url1/page-\d+" where *\d+ <= 30*.
>
> Is there an option I can specify under "rules"?
>
> OK to crawl: resumes/something/page-20/
>
> NOT OK to crawl: resumes/something/page-31/
>
> start_urls = [
>     "url1", "url2"
> ]
>
> rules = [
>     Rule(SgmlLinkExtractor(allow=(r"resumes/\w+/page-\d+",),
>                            restrict_xpaths=('//a[@title="Next"]',)),
>          callback="parse_items", follow=True),
> ]
>
> def parse_items(self, response):
> ........
>
> Thanks,
>
>
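
If you'd rather keep the plain \d+ in the allow pattern, another option is to filter the extracted links numerically with a process_links hook on the Rule. A rough sketch (the class and method names here are placeholders, not anything from your spider):

    import re

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ResumeSpider(CrawlSpider):  # placeholder name
        name = "resumes"
        start_urls = ["url1", "url2"]

        rules = [
            Rule(SgmlLinkExtractor(allow=(r"resumes/\w+/page-\d+",),
                                   restrict_xpaths=('//a[@title="Next"]',)),
                 process_links="limit_pages",
                 callback="parse_items", follow=True),
        ]

        def limit_pages(self, links):
            # every link here already matched page-\d+ in allow, so
            # search() cannot return None; keep page numbers <= 30 only
            return [link for link in links
                    if int(re.search(r"page-(\d+)", link.url).group(1)) <= 30]

        def parse_items(self, response):
            # your parsing code
            pass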