Thanks for your advice. Your first approach, a single spider, could basically achieve the goal, but it might suffer from simultaneous duplicate requests for the same content (such as the same user's information) and from scheduler performance problems (because of the poor modularity).
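(Correcting myself a little: if I read the docs right, Scrapy's scheduler already filters exact duplicate requests within a run through its default RFPDupeFilter, so the duplicate-request part may be less severe than I feared. A minimal sketch; example.com, the selector, and the spider name are placeholders, not the real site:)

import scrapy


class OwnerDedupDemo(scrapy.Spider):
    """Demo only: example.com, the selector, and the spider name are made up."""

    name = "owner_dedup_demo"
    start_urls = ["http://example.com/item/%d" % i for i in range(1, 1000)]

    def parse(self, response):
        for href in response.css('a[href*="/owner/"]::attr(href)').getall():
            # Repeated owner URLs are dropped by the default dupefilter
            # (RFPDupeFilter), so each owner page is downloaded at most
            # once per run, however many items link to it.
            yield scrapy.Request(response.urljoin(href), callback=self.parse_owner)

    def parse_owner(self, response):
        yield {"owner_url": response.url}

Though that only avoids re-downloading; it doesn't help attach the shared owner data to every item, which is where the modularity problem really bites.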
As for your second approach, it seems promising. I'll try the implementation myself, which may take some time. *What matters, I think, is that crawl jobs similar to mine must be quite common, right? Why is there no convenient way to achieve that with Scrapy alone (I mean, without integrating Redis or another library)?*

On Friday, December 18, 2015 at 1:15:24 PM UTC+8, lnxpgn wrote:
>
> If you use a single spider, let the "meta" attribute of
> scrapy.http.Request carry crawled items to the next request and continue
> crawling.
> If you use multiple spiders, serialize crawled items and put them into
> Redis or some other place; the next spider fetches and deserializes these
> items and continues.
>
> 2015-12-17 17:41 GMT+08:00 Peng Liu <myme5...@gmail.com>:
>
>> I've posted this problem
>> <http://stackoverflow.com/questions/34330372/scrapy-different-spider-for-different-type-item>
>> on stackoverflow.com; here's the content below.
>>
>> I think the framework of scrapy <http://scrapy.org/> might be a little
>> inflexible, and I can't find a good solution for my issue.
>>
>> Here's the issue I'm facing now.
>>
>> There's a website, let's say http://example.com/, that I want to scrape
>> some information from.
>>
>> It has many items whose URLs have the form
>> http://example.com/item/([0-9]+). For now I *have* the list of valid
>> ([0-9]+) values, about *3 million* index ids, so it might seem a simple
>> mission to scrape the whole site.
>>
>> *But* the structure of this mission is like this:
>>
>> - There is a lot of item data on each /item/ page. I want this
>> information; this part is simple to achieve.
>> - There are links to entities related to the item, for example the
>> item owner under the path /owner/, or the collections the item belongs
>> to under /collection/, and so on. I want all the *unique* information
>> about these entities, which is hard to achieve. They shouldn't be
>> nested items of the item, or scraped by a single spider, for the
>> reasons below:
>>    - a *single* owner has [1-n] items;
>>    - a *single* item has [1-n] owners;
>>    - the same holds for collections and items.
>> - There are links to other entities related to the item, for example
>> comments under the path /comment/ or users who like it under /user/.
>> Obviously it's wise to split comment or user information away from the
>> item and use a *key or index* to refer to each entity. This is hard to
>> achieve with a single spider.
>>
>> So I prefer to start one spider to handle the list of
>> http://example.com/item/([0-9]+), and other types of spiders to handle
>> item owners, collections, comments, and users respectively.
>>
>> *But* the problem is that I don't have lists of item owners,
>> collections, comments, or users. I can only reach all of these entities
>> by iterating over the pages http://example.com/item/([0-9]+).
>>
>> I have googled a lot but found no solution that fits my issue. Please
>> feel free to share your opinions.
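For my own notes, here is how I understand the single-spider meta suggestion above: stash the half-built item in Request.meta on the /item/ page and complete it in the /owner/ callback. Everything here (field names, selectors, example.com) is a placeholder sketch, not tested against the real site:

import scrapy


class ItemOwnerSpider(scrapy.Spider):
    """Single spider chaining /item/ -> /owner/ via Request.meta."""

    name = "item_owner_sketch"
    # Placeholder: in reality this would be built from the ~3 million known ids.
    start_urls = ["http://example.com/item/%d" % i for i in range(1, 100)]

    def parse(self, response):
        # Scrape the item fields first (selectors are made up).
        item = {
            "item_url": response.url,
            "title": response.css("h1::text").get(),
        }
        owner_href = response.css('a[href*="/owner/"]::attr(href)').get()
        if owner_href is None:
            yield item  # no owner link found: emit what we have
            return
        # Carry the half-built item to the next request via meta.
        yield scrapy.Request(
            response.urljoin(owner_href),
            callback=self.parse_owner,
            meta={"item": item},
            # Many items share an owner; without dont_filter the dupefilter
            # would drop the second visit and the item riding in its meta.
            dont_filter=True,
        )

    def parse_owner(self, response):
        item = response.meta["item"]
        item["owner_name"] = response.css("h1::text").get()
        yield item

Which makes the trade-off clear to me: either I re-fetch shared owner pages (dont_filter=True), or items lose their ride when the dupefilter drops a repeated owner request, and that's exactly the duplication I worried about in my first paragraph.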
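And my reading of the second suggestion above, with two spiders handing off through Redis (using redis-py). One deviation: I queue bare owner URLs in a Redis set for uniqueness instead of serializing whole items, since uniqueness is my main worry; everything else (key name, selectors, example.com) is a placeholder:

import redis
import scrapy

# Assumes a Redis server on localhost:6379; the key name is made up.
r = redis.Redis()


class ItemSpider(scrapy.Spider):
    """First spider: scrapes /item/ pages and queues discovered owner URLs."""

    name = "item_sketch"
    start_urls = ["http://example.com/item/%d" % i for i in range(1, 100)]

    def parse(self, response):
        yield {
            "item_url": response.url,
            "title": response.css("h1::text").get(),
        }
        for href in response.css('a[href*="/owner/"]::attr(href)').getall():
            # A Redis set gives uniqueness for free: each owner URL is
            # stored once no matter how many items link to it.
            r.sadd("owner_urls", response.urljoin(href))


class OwnerSpider(scrapy.Spider):
    """Second spider, run after the first: drains the set of unique owners."""

    name = "owner_sketch"

    def start_requests(self):
        while True:
            raw = r.spop("owner_urls")
            if raw is None:
                break  # set drained; a live pipeline would poll instead
            yield scrapy.Request(raw.decode("utf-8"), callback=self.parse_owner)

    def parse_owner(self, response):
        yield {
            "owner_url": response.url,
            "owner_name": response.css("h1::text").get(),
        }

Run the first spider to completion, then the second; the set guarantees each unique owner page is fetched exactly once across the whole job, and the same pattern should extend to collections, comments, and users.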