[jira] [Commented] (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084871#comment-13084871 ]

Parlo Mendez commented on CASSANDRA-1735:
-----------------------------------------

The last post was some time ago. What is the current status of the MessagePack implementation in Cassandra? I think it would be very nice to have.

Parlo

Using MessagePack for reducing data size

Key: CASSANDRA-1735
URL: https://issues.apache.org/jira/browse/CASSANDRA-1735
Project: Cassandra
Issue Type: New Feature
Components: API
Affects Versions: 0.7 beta 3
Environment: Fedora11, JDK1.6.0_20
Reporter: Muga Nishizawa
Attachments: 0001-implement-a-Cassandra-RPC-part-with-MessagePack.patch, dependency_libs.zip

To improve Cassandra performance, I implemented the RPC part of Cassandra with MessagePack. The implementation details are attached as a patch. The patch works on Cassandra 0.7.0-beta3. Please check it.

MessagePack is a cross-language object serialization library like Thrift and Protocol Buffers, but it is much faster, smaller, and easier to implement. MessagePack reduces serialization cost and data size on the network and on disk.

MessagePack websites:
* website: http://msgpack.org/ (this website compares MessagePack, Thrift and JSON)
* design details: http://redmine.msgpack.org/projects/msgpack/wiki/FormatDesign
* source code: https://github.com/msgpack/msgpack/

The performance of the data serialization library is one of the most important issues when developing a distributed database in Java. If it is poor, it significantly reduces overall database performance, and Java's GC also runs more often. Cassandra has this problem as well.

To reduce the size of the data sent over the network between a client and Cassandra, I prototyped an implementation of the Cassandra RPC part with MessagePack and MessagePack-RPC. The implementation is very simple. MessagePack-RPC can reuse the existing Thrift-based CassandraServer (org.apache.cassandra.thrift.CassandraServer) while adopting MessagePack's communication protocol and data serialization.

Major features of MessagePack-RPC:
* Asynchronous RPC
* Parallel Pipelining
* Connection pooling
* Delayed return
* Event-driven I/O
* more details: http://redmine.msgpack.org/projects/msgpack/wiki/RPCDesign
* source code: https://github.com/msgpack/msgpack-rpc/

The attached patch includes a ring cache program for MessagePack and its test program. You can use them to check the behavior of the Cassandra RPC with MessagePack.

Thanks in advance,

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
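To make the data-size claim in the issue description a little more concrete, here is a minimal sketch in Python of how MessagePack encodes a few value types, compared with compact JSON. It is not the attached patch and not the official msgpack library: the `pack` function, the sample `column` record, and the `get` request are hypothetical names chosen for illustration. Only a small subset of the format is covered (nil, booleans, integers, short strings, small arrays and maps), using the type tags of the current MessagePack specification.

```python
import json
import struct


def pack(value):
    """Encode a small subset of Python values as MessagePack bytes."""
    if value is None:
        return b"\xc0"                          # nil
    if isinstance(value, bool):                 # must be tested before int
        return b"\xc3" if value else b"\xc2"
    if isinstance(value, int):
        if 0 <= value <= 0x7F:                  # positive fixint: a single byte
            return struct.pack("B", value)
        if -32 <= value < 0:                    # negative fixint: a single byte
            return struct.pack("b", value)
        if 0 <= value <= 0xFF:
            return b"\xcc" + struct.pack(">B", value)   # uint 8
        if 0 <= value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)   # uint 16
        if 0 <= value <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", value)   # uint 32
        if 0 <= value <= 0xFFFFFFFFFFFFFFFF:
            return b"\xcf" + struct.pack(">Q", value)   # uint 64
        # Remaining negatives go out as int 64 (valid, though not always the
        # shortest form); values outside 64 bits raise struct.error.
        return b"\xd3" + struct.pack(">q", value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) <= 31:                     # fixstr: length lives in the tag byte
            return bytes([0xA0 | len(data)]) + data
        if len(data) <= 0xFFFF:                 # str 16: two-byte big-endian length
            return b"\xda" + struct.pack(">H", len(data)) + data
        raise ValueError("string too long for this sketch")
    if isinstance(value, (list, tuple)):
        if len(value) > 15:
            raise ValueError("only fixarray (up to 15 items) is supported here")
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, dict):
        if len(value) > 15:
            raise ValueError("only fixmap (up to 15 pairs) is supported here")
        out = bytes([0x80 | len(value)])
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    raise TypeError("unsupported type: %r" % type(value))


# Hypothetical column-like record, not taken from the patch.
column = {"name": "email", "value": "user@example.com", "timestamp": 1290000000}
packed = pack(column)
as_json = json.dumps(column, separators=(",", ":")).encode("utf-8")
print(len(packed), "bytes as MessagePack,", len(as_json), "bytes as compact JSON")

# A MessagePack-RPC request is the array [type=0, msgid, method, params].
# The method name "get" and its argument are illustrative only.
request = pack([0, 1, "get", ["user1"]])
print(request)
```

The saving comes from replacing JSON's quotes, separators, and decimal digits with one-byte type tags and binary integers. In the request array, the msgid is what allows a client to match responses to requests when several calls are pipelined on one connection.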
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988655#comment-12988655 ]

Muga Nishizawa commented on CASSANDRA-1735:
-------------------------------------------

Hi T Jake Luciani,

I would like to let you know that we have cleared the license issues with MessagePack. As you pointed out earlier, MessagePack used to require XNIO (LGPL) for network communication. We have replaced XNIO with Apache MINA (Apache License) in MessagePack. Javassist, which was the other issue, is a dual-licensed (LGPL and MPL) module and is used by other Apache products under the MPL. So we believe that the license-related issues are cleared at the moment. Please check the URLs below for more details.

https://github.com/msgpack/msgpack/
https://github.com/msgpack/msgpack-rpc/
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12964798#action_12964798 ]

T Jake Luciani commented on CASSANDRA-1735:
-------------------------------------------

It appears msgpack requires Javassist and XNIO, both of which are LGPL. This means we can't include msgpack support in our distribution; see http://www.apache.org/legal/3party.html
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12934520#action_12934520 ]

Jonathan Ellis commented on CASSANDRA-1735:
-------------------------------------------

Gary wrote some performance tests in CASSANDRA-1765 and saw MessagePack perform worse than Thrift. Is something wrong with his code?
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932756#action_12932756 ]

Terje Marthinussen commented on CASSANDRA-1735:
-----------------------------------------------

I am very curious how the serialization in MessagePack could compete with the serialization used on the data side of Cassandra (SSTables), and how we could benefit from having the same serialization in both places. Does anyone have any thoughts?
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932822#action_12932822 ]

Muga Nishizawa commented on CASSANDRA-1735:
-------------------------------------------

Jonathan,

Thanks for your response.

> What kind of performance improvement do you see with this patch?

The performance improvements available with this patch are the following:
* Reduced serialization cost and data size
* Increased throughput between clients and a Cassandra node

I have also measured the performance of MessagePack from the viewpoints of serialization cost and throughput. Details are below.

== Reduction of serialization cost and the data size ==

(Summary)
In the test below, MessagePack proved better at reducing serialization cost and data size than the other serialization libraries.

(Test environment)
I used jvm-serializers, a well-known benchmark, and compared performance with Protocol Buffers, Thrift, and Avro. The machine used for this benchmark has a Core2 Duo 2GHz with 1GB RAM.

(Results)
          create    ser  +same  deser  +shal  +deep  total   size   +dfl
protobuf     683   6016   2973   3338   3454   3759   9775    239    149
thrift       572   6287   5565   3479   3616   3770  10057    349    197
msgpack      291   4935   4750   3468   3545   3708   8748    236    150
avro        2698   6409   3623   7480   9301  10481  16890    221    133

(Comments)
It might be better to compare serialization cost using Cassandra objects such as a Column object. But such objects and their sizes vary by user, so they are not suitable for comparing the serialization cost of various data. According to the above results, MessagePack's serialized data is slightly larger than Avro's, but MessagePack has significantly lower serialization cost than Avro and Thrift.

== Increasing throughput ==

(Summary)
I compared the MessagePack-based RPC of Cassandra with the Thrift-based one. Random read throughput of the MessagePack-based RPC is 15% higher than that of Thrift, and random write throughput is 21% higher.

(Test environment)
In this evaluation, a Cassandra node ran standalone on a machine with a Core2 Duo 2GHz and 1GB RAM. Client programs ran on two machines, both with a Core2 Duo 2GHz and 1GB RAM. The client program was based on ring cache. It created 100 threads per JVM on each machine and accessed the Cassandra node through the ring cache.

(Results)
* Thrift based RPC part of Cassandra
  * Random read: 5,200 query/sec.
  * Random write: 11,200 query/sec.
* MessagePack based RPC part of Cassandra
  * Random read: 6,000 query/sec.
  * Random write: 13,600 query/sec.

(Comments)
I measured the maximum throughput of random access (read/write) after 100 items (each item is small) were stored in the Cassandra node. I did this because I wanted the CPU to be the bottleneck for the Cassandra node. If the Cassandra node were bottlenecked on disk I/O, I thought I could not properly evaluate the maximum throughput of the RPC part. I did not directly measure the amount of data transferred over the network during the evaluation. But from the jvm-serializers benchmark results, I believe that the amount of data transferred by MessagePack-based Cassandra would be smaller than with Thrift.
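The 15% and 21% figures in the throughput summary above can be checked directly against the reported query rates. The short Python sketch below only redoes that arithmetic; the function and variable names are made up for illustration, and the numbers are the ones given in the comment, not new measurements.

```python
# Reported maximum throughput in queries per second, copied from the comment.
thrift = {"random read": 5200, "random write": 11200}
msgpack = {"random read": 6000, "random write": 13600}


def relative_gain(baseline, candidate):
    """Return how much higher candidate is than baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


gains = {op: relative_gain(thrift[op], msgpack[op]) for op in thrift}
for op, gain in gains.items():
    print(f"{op}: {gain:.1f}% higher with the MessagePack-based RPC")
```

Both results round to the percentages quoted in the summary. Note that these are gains in end-to-end request throughput with the node kept CPU-bound, so they reflect the whole RPC path and not serialization cost alone.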
[jira] Commented: (CASSANDRA-1735) Using MessagePack for reducing data size
[ https://issues.apache.org/jira/browse/CASSANDRA-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931844#action_12931844 ]

Jonathan Ellis commented on CASSANDRA-1735:
-------------------------------------------

Thanks, this is exciting! What kind of performance improvement do you see with this patch?