Re: Cutting a 0.7 release
Would you cut 0.7 or 0.6.4? I'd go with 0.6.4, as I think the next minor version change should be due to significant feature additions / changes and / or stability / scalability improvements.

Regards,
Tommaso

2014-02-24 8:47 GMT+01:00 Edward J. Yoon edwardy...@apache.org:
> Hi all, I plan on cutting a release next week. If you have any opinions, please feel free to comment here.
Re: Cutting a 0.7 release
0.6.4 or 0.7.0, both are OK with me.

Just FYI, memory efficiency has been improved significantly (by almost 2-3x) through runtime message serialization and compression. See https://wiki.apache.org/hama/Benchmarks#PageRank_Performance_0.7.0-SNAPSHOT_vs_0.6.3 (I'll attach more benchmarks and comparisons with other systems soon).

We've also fixed many bugs, e.g., in K-Means, NeuralNetwork, SemiClustering, and Graph's Combiners (HAMA-857).

According to my personal evaluations, the current system is fairly respectable. As I mentioned before, I believe we should stick to the in-memory style, since today's machines can be equipped with up to 128 GB of memory. A disk-based (or disk-hybrid) queue is optional, not a must-have.

Once we release this one, we might finally want to focus on the issues below:

* Fault tolerant job processing (checkpoint recovery)
* Support GPUs and InfiniBand

Then, I think we can release version 1.0.

On Mon, Feb 24, 2014 at 8:44 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote:
> Would you cut 0.7 or 0.6.4? I'd go with 0.6.4, as I think the next minor version change should be due to significant feature additions / changes and / or stability / scalability improvements.
>
> Regards,
> Tommaso
>
> 2014-02-24 8:47 GMT+01:00 Edward J. Yoon edwardy...@apache.org:
>> Hi all, I plan on cutting a release next week. If you have any opinions, please feel free to comment here.

--
Edward J. Yoon (@eddieyoon)
Chief Executive Officer
DataSayer, Inc.
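As a rough illustration of why runtime message serialization and compression can save memory, the sketch below keeps messages as one compressed byte stream instead of one Java object per message. This is not Hama's implementation: the class `CompressedMessageBuffer`, its methods, and the (vertex id, value) message layout are assumptions made for the example, and it relies only on `java.io` and `java.util.zip` from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical sketch, not Hama code: stores (vertexId, value) messages
// as a single compressed byte stream rather than as individual objects.
final class CompressedMessageBuffer {
  private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
  private final DataOutputStream out =
      new DataOutputStream(new DeflaterOutputStream(bytes));
  private int count = 0;

  // Serialize the message right away; no per-message object is retained.
  void add(long vertexId, double value) {
    try {
      out.writeLong(vertexId);
      out.writeDouble(value);
      count++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  int count() {
    return count;
  }

  // Finish compression and return the compact in-memory representation.
  byte[] seal() {
    try {
      out.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Decompress a sealed buffer and sum the message values.
  static double sumValues(byte[] sealed, int count) {
    try (DataInputStream in = new DataInputStream(
        new InflaterInputStream(new ByteArrayInputStream(sealed)))) {
      double sum = 0;
      for (int i = 0; i < count; i++) {
        in.readLong(); // vertex id, ignored in this example
        sum += in.readDouble();
      }
      return sum;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Each message costs 16 bytes before compression, with no object headers or boxed fields. The actual saving depends on the data and on the codec, so the 2-3x figure above should be read from the linked benchmarks, not from this sketch.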
Re: Cutting a 0.7 release
2014-02-24 13:52 GMT+01:00 Edward J. Yoon edwardy...@apache.org:
> 0.6.4 or 0.7.0, both are OK with me. Just FYI, memory efficiency has been improved significantly (by almost 2-3x) through runtime message serialization and compression. See https://wiki.apache.org/hama/Benchmarks#PageRank_Performance_0.7.0-SNAPSHOT_vs_0.6.3 (I'll attach more benchmarks and comparisons with other systems soon). We've also fixed many bugs, e.g., in K-Means, NeuralNetwork, SemiClustering, and Graph's Combiners (HAMA-857).

Sure, all of the above looks good to me.

> According to my personal evaluations, the current system is fairly respectable. As I mentioned before, I believe we should stick to the in-memory style, since today's machines can be equipped with up to 128 GB of memory. A disk-based (or disk-hybrid) queue is optional, not a must-have.

Right. The only thing I think we need to address before 0.7.0 is the OutOfMemory errors (especially when dealing with large graphs). IMHO, even if there is not enough memory to store all the graph vertices assigned to a certain peer, a scalable system should never throw OOM exceptions. It may process items more slowly (using caches / queues), but it should never throw an exception for that. But that's just my opinion.

> Once we release this one, we might finally want to focus on the issues below:
> * Fault tolerant job processing (checkpoint recovery)

+1

> * Support GPUs and InfiniBand

+1 for the former, not sure about the latter.

> Then, I think we can release version 1.0.

My 2 cents,
Tommaso
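The "slower, but never OOM" behaviour argued for in this message can be sketched as a queue that keeps a bounded number of items in memory and writes the overflow to a temporary file. This is only an illustration under stated assumptions: `SpillingLongQueue` is a hypothetical class, it stores plain `long` items instead of graph vertices, and it supports a simple add-everything-then-poll pattern rather than interleaved use.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;

// Hypothetical sketch: a FIFO queue that spills to disk instead of growing
// without bound. Usage pattern: add all items first, then poll them.
final class SpillingLongQueue implements AutoCloseable {
  private final int memoryLimit;
  private final ArrayDeque<Long> inMemory = new ArrayDeque<>();
  private File spillFile;            // created lazily on first overflow
  private DataOutputStream spillOut;
  private DataInputStream spillIn;
  private long spilled = 0;          // items on disk that are not yet read
  private boolean reading = false;

  SpillingLongQueue(int memoryLimit) {
    this.memoryLimit = memoryLimit;
  }

  void add(long item) {
    if (reading) {
      throw new IllegalStateException("this sketch supports add-then-poll only");
    }
    if (inMemory.size() < memoryLimit) {
      inMemory.add(item);            // fast path: still fits in memory
      return;
    }
    try {
      if (spillOut == null) {
        spillFile = File.createTempFile("spill", ".bin");
        spillFile.deleteOnExit();
        spillOut = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(spillFile)));
      }
      spillOut.writeLong(item);      // slow path: goes to disk
      spilled++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Returns the next item in insertion order, or null when empty.
  Long poll() {
    reading = true;
    if (!inMemory.isEmpty()) {
      return inMemory.poll();
    }
    if (spilled == 0) {
      return null;
    }
    try {
      if (spillIn == null) {
        spillOut.close();            // flush pending writes before reading
        spillIn = new DataInputStream(
            new BufferedInputStream(new FileInputStream(spillFile)));
      }
      spilled--;
      return spillIn.readLong();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public void close() {
    try {
      if (spillIn != null) {
        spillIn.close();
      } else if (spillOut != null) {
        spillOut.close();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      if (spillFile != null) {
        spillFile.delete();
      }
    }
  }
}
```

The trade-off is the one described in the thread: once the memory limit is reached, every further item pays for disk I/O, but the heap footprint of the queue stays bounded.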
Re: Cutting a 0.7 release
On 24 Feb 2014, at 3:32 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote:
>> According to my personal evaluations, the current system is fairly respectable. As I mentioned before, I believe we should stick to the in-memory style, since today's machines can be equipped with up to 128 GB of memory. A disk-based (or disk-hybrid) queue is optional, not a must-have.
>
> Right. The only thing I think we need to address before 0.7.0 is the OutOfMemory errors (especially when dealing with large graphs). IMHO, even if there is not enough memory to store all the graph vertices assigned to a certain peer, a scalable system should never throw OOM exceptions. It may process items more slowly (using caches / queues), but it should never throw an exception for that. But that's just my opinion.

I like and agree with this.

Cheers,
Anastasis
Re: Cutting a 0.7 release
1) The MapReduce model uses file-based communication, so each mapper can run separately. For example, to run an MR job on 1 GB of input data, 5 mappers will be scheduled. Even if there are only 2 task slots (a single machine), the MR job is slow but works: 2 Map tasks running, 3 Map tasks pending.

However, unlike MapReduce, BSP uses network-based communication. This means that all BSP tasks must run at once, and the number of BSP tasks is determined by the number of blocks of input. So, in that setup, you CANNOT run 1 GB of input data on a single machine. It's not a memory issue.

> throw OOM exceptions, instead it may eventually process items slower (with caches / queues) but never throw an exception for that but that's just my

I hope so too, but I think you are talking about iterative MapReduce.

2) The normal HDFS block size is 64 ~ 256 MB. If we can assume that split size = block size, I feel that the current system is enough. I don't think we have to spend time implementing something disk-based.

WDYT?

On Tue, Feb 25, 2014 at 12:19 AM, Anastasis Andronidis andronat_...@hotmail.com wrote:
> I like and agree with this.
>
> Cheers,
> Anastasis
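The scheduling argument in point 1 comes down to simple arithmetic, sketched below. The class `SlotModel` and its method names are hypothetical, and the model deliberately ignores everything except task and slot counts: MapReduce tasks can be run in successive waves because they communicate through files, while a BSP job needs a slot for every task at the same time.

```java
// Hypothetical sketch of the task/slot arithmetic; not part of Hama or Hadoop.
final class SlotModel {
  // Number of input splits when split size equals block size (rounded up).
  static long splits(long inputBytes, long blockBytes) {
    return (inputBytes + blockBytes - 1) / blockBytes;
  }

  // MapReduce: pending tasks wait for a free slot, so the job needs
  // ceil(tasks / slots) waves but can always make progress.
  static long mapReduceWaves(long tasks, long slots) {
    return (tasks + slots - 1) / slots;
  }

  // BSP: tasks exchange messages over the network, so all of them
  // must be running at once.
  static boolean bspCanStart(long tasks, long slots) {
    return slots >= tasks;
  }
}
```

With the 5 tasks and 2 slots from the example, `mapReduceWaves(5, 2)` is 3, while `bspCanStart(5, 2)` is false regardless of how much memory the machine has.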
Re: Cutting a 0.7 release
Just to let you know, I may refactor based on the following diagram:

http://people.apache.org/~chl501/diagram1.png

It sketches the basic flow required for fault tolerance (ft). I am currently evaluating the related parts, so it's subject to change.

On 24 February 2014 20:52, Edward J. Yoon edwardy...@apache.org wrote:
> Once we release this one, we might finally want to focus on the issues below:
> * Fault tolerant job processing (checkpoint recovery)
> * Support GPUs and InfiniBand
>
> Then, I think we can release version 1.0.
Re: Cutting a 0.7 release
Programmers can't control memory in Java the way they can with malloc/free in C, and there is type boxing/unboxing, etc., so it does not seem easy to estimate memory usage. So it would be good to stick to the Erlang fail-fast style. Or we could have a program that loads the data and measures the actual memory usage.

On 24 February 2014 22:32, Tommaso Teofili tommaso.teof...@gmail.com wrote:
> Right. The only thing I think we need to address before 0.7.0 is the OutOfMemory errors (especially when dealing with large graphs). IMHO, even if there is not enough memory to store all the graph vertices assigned to a certain peer, a scalable system should never throw OOM exceptions. It may process items more slowly (using caches / queues), but it should never throw an exception for that. But that's just my opinion.
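The second suggestion, a program that loads data and measures actual memory usage, could look roughly like the following. It is a best-effort sketch: `HeapUsageProbe` is a hypothetical name, the figures come from `Runtime.totalMemory()` and `Runtime.freeMemory()`, and `System.gc()` is only a hint to the JVM, so the result is an approximation and not an exact object size.

```java
import java.util.function.Supplier;

// Hypothetical sketch: estimates how much heap the loaded data retains.
final class HeapUsageProbe {
  // Heap currently in use, as reported by the JVM.
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  // Runs the loader and returns the approximate heap growth in bytes.
  static long retainedBytes(Supplier<?> loader) {
    System.gc();                       // request a collection before measuring
    long before = usedHeapBytes();
    Object data = loader.get();
    System.gc();                       // drop temporary garbage from loading
    long after = usedHeapBytes();
    // Using the result here keeps the data reachable during measurement.
    if (data == null) {
      throw new IllegalArgumentException("loader returned null");
    }
    return after - before;
  }
}
```

For example, `HeapUsageProbe.retainedBytes(() -> new long[2_000_000])` should report roughly 16 MB, whereas the same values held as boxed `Long` objects in a list would report noticeably more, which is the boxing overhead mentioned above.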
Re: Cutting a 0.7 release
That's a huge diagram :-) Do you plan to work on HAMA-505, or to create a new one?

On Tue, Feb 25, 2014 at 1:33 PM, Chia-Hung Lin cli...@googlemail.com wrote:
> Just to let you know, I may refactor based on the following diagram:
>
> http://people.apache.org/~chl501/diagram1.png
>
> It sketches the basic flow required for fault tolerance (ft). I am currently evaluating the related parts, so it's subject to change.
Cutting a 0.7 release
Hi all,

I plan on cutting a release next week. If you have any opinions, please feel free to comment here.

Sent from my iPhone