Re: [openstack-dev] [MagnetoDB] Best practices for uploading large amounts of data

Dmitriy Ukhlov Fri, 28 Mar 2014 05:34:58 -0700

On 03/28/2014 11:29 AM, Serge Kovaleff wrote:

Hi Iliia,


I would take a look into BSON http://bsonspec.org/

Cheers,
Serge Kovaleff

On Thu, Mar 27, 2014 at 8:23 PM, Illia Khudoshyn <ikhudos...@mirantis.com <mailto:ikhudos...@mirantis.com>> wrote:


    Hi, Openstackers,

    I'm currently working on adding bulk data load functionality to
    MagnetoDB. This functionality implies inserting huge amounts of
    data (billions of rows, gigabytes of data). The data being
    uploaded is a set of JSON documents (for now). The question I'm
    interested in is how to transport the data. For now I do a
    streaming HTTP POST request from the client side, with
    gevent.pywsgi on the server side.

    Could anybody suggest any (better?) approach for the
    transportation, please?
    What are the best practices for that?

    Thanks in advance.

--

    Best regards,

    Illia Khudoshyn,
    Software Engineer, Mirantis, Inc.

    38, Lenina ave. Kharkov, Ukraine

    www.mirantis.com <http://www.mirantis.ru/>

    www.mirantis.ru <http://www.mirantis.ru/>

    Skype: gluke_work

    ikhudos...@mirantis.com <mailto:ikhudos...@mirantis.com>


    _______________________________________________
    OpenStack-dev mailing list
    OpenStack-dev@lists.openstack.org
    <mailto:OpenStack-dev@lists.openstack.org>
    http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev





Hi Iliia,

I guess if we are talking about Cassandra batch loading, the fastest way is to generate SSTables locally and load them into Cassandra via JMX or sstableloader:

http://www.datastax.com/dev/blog/bulk-loading

If you want to implement bulk load via the MagnetoDB layer (not to Cassandra directly), you could try using a plain TCP socket and implementing your own binary protocol (using BSON, for example). HTTP is a text protocol, so using a TCP socket can help you avoid the overhead of base64 encoding. In my opinion, HTTP with BSON is a doubtful solution because you will need two-phase encoding and decoding: 1) "object to bson", 2) "bson to base64", 3) "base64 to bson", 4) "bson to object", instead of just 1) "object to json", 2) "json to object" in the case of HTTP + JSON.
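
To make the "simple TCP socket with your own binary protocol" idea concrete, here is a minimal sketch in Python. The wire format (a 4-byte big-endian length prefix followed by the raw payload, with an empty frame as end-of-stream marker) and all function names are my assumptions for illustration, not anything MagnetoDB defines. The payload here is UTF-8 JSON only to keep the sketch free of third-party libraries; BSON bytes could be sent the same way.

```python
import json
import socket
import struct

# Hypothetical wire format: 4-byte big-endian payload length, then raw bytes.
HEADER = struct.Struct("!I")


def send_frame(sock, payload):
    """Send one length-prefixed frame; payload is raw bytes, no base64 step."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes, or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock):
    """Receive one frame; an empty frame marks the end of the stream."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def roundtrip(rows):
    """Demo only: push a few small rows through a local socket pair.

    A real loader would read on the server side concurrently; sending
    everything first works here only because the demo data is tiny.
    """
    client, server = socket.socketpair()
    try:
        for row in rows:
            send_frame(client, json.dumps(row).encode("utf-8"))
        send_frame(client, b"")  # end-of-stream marker
        received = []
        while True:
            payload = recv_frame(server)
            if not payload:
                break
            received.append(json.loads(payload))
        return received
    finally:
        client.close()
        server.close()
```

Calling roundtrip() with a short list of dicts should return the same list. The point of the length prefix is that the receiver never has to scan the payload for delimiters, so binary data can go over the wire unmodified.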

HTTP streaming, as far as I know, is an asynchronous kind of HTTP. You can expect a performance gain because the server skips generating an HTTP response for each chunk and the client does not wait for that response. But you still need to send almost the same amount of data. So if network throughput is your bottleneck, it doesn't help. If the server side is your bottleneck, it doesn't help either.
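
As a rough illustration of the client side of such streaming, the sketch below batches rows into newline-delimited JSON chunks that a single streaming POST could send one after another. The function name, the newline-delimited layout, and the 64 KiB default are assumptions of mine, not part of MagnetoDB.

```python
import json


def ndjson_chunks(rows, max_bytes=64 * 1024):
    """Yield byte chunks of newline-delimited JSON.

    A chunk is emitted as soon as it reaches max_bytes, so chunks can be
    slightly larger than the limit; every chunk ends on a row boundary.
    """
    buf = bytearray()
    for row in rows:
        buf += json.dumps(row).encode("utf-8") + b"\n"
        if len(buf) >= max_bytes:
            yield bytes(buf)
            buf.clear()
    if buf:
        yield bytes(buf)  # flush the last partial chunk
```

In Python 3.6 and later, the standard http.client module can take an iterable like this as a request body and send it with chunked transfer encoding, so the client never waits for a per-chunk response. The total number of bytes on the wire stays essentially the same, which is why this does not help when the network or the server is the bottleneck.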

Also note that, in any case, the MagnetoDB Cassandra storage currently converts your data to a CQL query, which is also text. It would be nice to implement the MagnetoDB BatchWriteItem operation via Cassandra SSTable generation and loading via sstableloader, but unfortunately, as far as I know, this functionality is implemented only for the Java world.


--
Best regards,
Dmitriy Ukhlov
Mirantis Inc.

