Hi Christoph,

I wrote that blog post and worked on the HBase storage solution at
Scrapinghub (and I'm a Scrapy veteran).

I have to agree with Jordi - it depends on many factors: how much data, how
you want to access it, etc. For example, storing data in S3 is remarkably
scalable - compress and upload a file after each Scrapy job finishes. You
can't access arbitrary scraped items in a reasonable time, though.
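To make that pattern concrete, here is a minimal sketch of the post-job step: gzip the finished feed file, then push it to S3. The bucket name is hypothetical, and the upload step assumes boto3 with AWS credentials configured; the compression part is plain standard library.

```python
import gzip
import shutil


def compress_feed(feed_path):
    """Gzip a finished Scrapy feed file (e.g. items.jl) and return the new path."""
    gz_path = feed_path + ".gz"
    with open(feed_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return gz_path


def upload_feed(gz_path, bucket="my-scraped-items"):  # hypothetical bucket name
    """Upload the compressed feed to S3. Requires boto3 and AWS credentials."""
    import boto3
    s3 = boto3.client("s3")
    s3.upload_file(gz_path, bucket, gz_path)
```

You would call `upload_feed(compress_feed("items.jl"))` from a wrapper script or a spider-closed signal handler after each job, keeping one compressed object per crawl.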

HBase is scalable and offers a good performance trade-off for many
scraping projects - I know others using it at scale. That said, it's
a lot of work to make it scale: there are many arbitrary numbers and
configuration parameters you need to tune. At Scrapinghub we've written
many extensions (coprocessors & filters) so that we can work with it
efficiently, plus scripts for maintenance (upgrading, managing regions,
schema updates, etc.). We also use it for much more than just scraped items
(e.g. stats, datasets, caches, frontiers for distributed crawls, crawl
graphs, etc.). It's been a lot of work, and quite a learning curve.
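For a sense of what the simple end of this looks like, here is a sketch of a Scrapy item pipeline writing to HBase via the happybase client. The host, table name, and column family are assumptions, and it requires a running HBase Thrift server; the hash prefix in the row key is one common way to spread writes across regions instead of hot-spotting a single region server.

```python
import hashlib


def make_row_key(spider_name, url):
    """Build a row key with a short URL-hash prefix so consecutive writes
    land on different HBase regions rather than hot-spotting one server."""
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()[:8]
    return "%s:%s:%s" % (spider_name, digest, url)


class HBasePipeline(object):
    """Minimal sketch of a Scrapy pipeline storing items in HBase."""

    def open_spider(self, spider):
        import happybase  # needs a reachable HBase Thrift server
        self.connection = happybase.Connection("localhost")  # assumed host
        self.table = self.connection.table("scraped_items")  # assumed table

    def process_item(self, item, spider):
        row_key = make_row_key(spider.name, item["url"])
        # "item" is the assumed column family; one column per item field.
        self.table.put(row_key, {
            b"item:" + field.encode("utf-8"): str(value).encode("utf-8")
            for field, value in item.items()
        })
        return item

    def close_spider(self, spider):
        self.connection.close()
```

This ignores everything that makes HBase hard at scale (schema design, region splitting, batching puts), but it shows the shape of the integration Christoph asked about.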

The fact that a single server is even an option for you means you don't
have the same scaling requirements, at least not for a while. It's likely
that your access patterns for the data are different too. What are you
building? Is there a reason Scrapinghub doesn't work for you?

Cheers,

Shane



On 24 May 2014 22:26, Jordi Llonch <llon...@gmail.com> wrote:

> Christoph,
>
> You can scale up scrapy using different approaches. It depends on many
> factors: using scrapyd, using celery, etc...
>
> I suggest you find what suits you best in terms of storage, network,
> and computational requirements, because there are big performance
> differences.
>
> To me, one of the best fits for storage is HBase, or, even better,
> Hypertable.
>
> Cheers,
> Jordi
>
>
> 2014-05-24 1:00 GMT+10:00 christoph <skage...@googlemail.com>:
>
> Hi,
>>
>> I wonder what would be a good choice of environment for a scalable Scrapy
>> project similar to Scrapinghub?
>> I'd start with a single vserver/root-server for crawling and data storage,
>> with the possibility of adding more servers when I need more scraping
>> power or database space.
>> According to a blog entry (
>> http://blog.scrapinghub.com/2013/07/26/introducing-dash/), Scrapinghub
>> is using Cloudera CDH (running on which OS?) and stores its data in
>> HBase. Is this a good choice?
>>
>> Is there any information on how to set up Scrapy in a CDH environment
>> and save data into HBase?
>>
>> Thank you,
>> Christoph
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to scrapy-users+unsubscr...@googlegroups.com.
>> To post to this group, send email to scrapy-users@googlegroups.com.
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>

