Re: Cutting a 0.7 release

2014-02-24 Thread Tommaso Teofili
Would you cut 0.7 or 0.6.4?
I'd go with 0.6.4, as I think the next minor version change should be driven by
significant feature additions/changes and/or stability/scalability
improvements.

Regards,
Tommaso


2014-02-24 8:47 GMT+01:00 Edward J. Yoon edwardy...@apache.org:

 Hi all,

 I plan on cutting a release next week. If you have any opinions, please feel
 free to comment here.

 Sent from my iPhone


Re: Cutting a 0.7 release

2014-02-24 Thread Edward J. Yoon
Either 0.6.4 or 0.7.0 is OK with me.

Just FYI,

Memory efficiency has improved significantly (almost 2-3x) thanks to
runtime message serialization and compression. See
https://wiki.apache.org/hama/Benchmarks#PageRank_Performance_0.7.0-SNAPSHOT_vs_0.6.3
(I'll attach more benchmarks and comparisons with other systems
soon). We've also fixed many bugs, e.g., in K-Means, NeuralNetwork,
SemiClustering, and Graph's Combiners (HAMA-857).
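
To illustrate why serializing and compressing messages at runtime can shrink the memory footprint, here is a minimal Java sketch. It is not Hama's actual implementation: the class MessageBuffer and its methods are hypothetical, and only java.io and java.util.zip from the JDK are used. Messages are assumed to be (int id, double value) pairs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Hama: packs (id, value) messages into a
// byte array instead of keeping one Java object per message on the heap.
final class MessageBuffer {

  // Plain serialization: 4 bytes of id plus 8 bytes of value per message.
  static byte[] serialize(int[] ids, double[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (int i = 0; i < ids.length; i++) {
        out.writeInt(ids[i]);
        out.writeDouble(values[i]);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Same layout, written through a gzip stream; closing the stream
  // flushes the remaining compressed data into the byte array.
  static byte[] serializeCompressed(int[] ids, double[] values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out =
        new DataOutputStream(new GZIPOutputStream(bytes))) {
      for (int i = 0; i < ids.length; i++) {
        out.writeInt(ids[i]);
        out.writeDouble(values[i]);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

How much compression helps depends on the data: repetitive messages (for example, many identical initial PageRank values) shrink a lot, while random values may not shrink at all.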

According to my personal evaluations, the current system is fairly
respectable. As I mentioned before, I believe we should stick to the
in-memory style, since today's machines can be equipped with up to
128 GB. A disk-based (or disk-hybrid) queue is optional, not a
must-have.

Once we release this one, we might finally want to focus on the issues below:

* Fault tolerant job processing (checkpoint recovery)
* Support GPUs and InfiniBand

Then, I think we can release version 1.0.



-- 
Edward J. Yoon (@eddieyoon)
Chief Executive Officer
DataSayer, Inc.


Re: Cutting a 0.7 release

2014-02-24 Thread Tommaso Teofili
2014-02-24 13:52 GMT+01:00 Edward J. Yoon edwardy...@apache.org:

 Either 0.6.4 or 0.7.0 is OK with me.

 Just FYI,

 Memory efficiency has improved significantly (almost 2-3x) thanks to
 runtime message serialization and compression. See
 https://wiki.apache.org/hama/Benchmarks#PageRank_Performance_0.7.0-SNAPSHOT_vs_0.6.3
 (I'll attach more benchmarks and comparisons with other systems
 soon). We've also fixed many bugs, e.g., in K-Means, NeuralNetwork,
 SemiClustering, and Graph's Combiners (HAMA-857).


Sure, all of the above looks good to me.



 According to my personal evaluations, the current system is fairly
 respectable. As I mentioned before, I believe we should stick to the
 in-memory style, since today's machines can be equipped with up to
 128 GB. A disk-based (or disk-hybrid) queue is optional, not a
 must-have.


Right, the only thing I think we need to address before 0.7.0 is the
OutOfMemory errors (especially when dealing with large graphs). For
example, IMHO, even if there is not enough memory to store all the graph
vertices assigned to a certain peer, a scalable system should never throw
OOM exceptions; instead, it may process items more slowly (with
caches/queues). But that's just my opinion.
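
A minimal Java sketch of the "slower instead of OOM" idea: a queue that keeps a bounded number of items in memory and spills the rest to a temporary file. This is only an illustration under stated assumptions, not Hama's queue implementation; the class SpillingQueue, its methods, and the choice of String items are all hypothetical. It assumes a write phase followed by a read phase, as between two supersteps.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.function.Consumer;

// Hypothetical sketch: holds at most memoryLimit items on the heap and
// appends everything beyond that to a temp file instead of growing.
final class SpillingQueue {
  private final int memoryLimit;
  private final ArrayDeque<String> inMemory = new ArrayDeque<>();
  private Path spillFile;
  private DataOutputStream spillOut;
  private int spilled;

  SpillingQueue(int memoryLimit) {
    this.memoryLimit = memoryLimit;
  }

  void add(String item) {
    if (inMemory.size() < memoryLimit) {
      inMemory.add(item);
      return;
    }
    try {
      if (spillOut == null) { // first overflow: open the spill file lazily
        spillFile = Files.createTempFile("spill", ".bin");
        spillOut = new DataOutputStream(
            new BufferedOutputStream(Files.newOutputStream(spillFile)));
      }
      spillOut.writeUTF(item); // writeUTF limits one item to 65535 bytes
      spilled++;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  int spilledCount() {
    return spilled;
  }

  // Hands every item to the consumer in insertion order: the in-memory
  // items first, then the spilled ones streamed back from disk.
  void drain(Consumer<String> consumer) {
    while (!inMemory.isEmpty()) {
      consumer.accept(inMemory.poll());
    }
    if (spillOut == null) {
      return;
    }
    try {
      spillOut.close();
      try (DataInputStream in = new DataInputStream(
          new BufferedInputStream(Files.newInputStream(spillFile)))) {
        for (int i = 0; i < spilled; i++) {
          consumer.accept(in.readUTF());
        }
      }
      Files.delete(spillFile);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    spillOut = null;
    spillFile = null;
    spilled = 0;
  }
}
```

The trade-off is the one described above: reads of spilled items go through disk I/O and are slower, but heap usage stays bounded by memoryLimit items.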



 Once we release this one, we might finally want to focus on the issues below:

 * Fault tolerant job processing (checkpoint recovery)


+1


 * Support GPUs and InfiniBand


+1 for the former, not sure about the latter.



 Then, I think we can release version 1.0.


My 2 cents,
Tommaso






Re: Cutting a 0.7 release

2014-02-24 Thread Anastasis Andronidis
On 24 Feb 2014, at 3:32 PM, Tommaso Teofili tommaso.teof...@gmail.com wrote:

 
 According to my personal evaluations, the current system is fairly
 respectable. As I mentioned before, I believe we should stick to the
 in-memory style, since today's machines can be equipped with up to
 128 GB. A disk-based (or disk-hybrid) queue is optional, not a
 must-have.
 
 
 Right, the only thing I think we need to address before 0.7.0 is the
 OutOfMemory errors (especially when dealing with large graphs). For
 example, IMHO, even if there is not enough memory to store all the graph
 vertices assigned to a certain peer, a scalable system should never throw
 OOM exceptions; instead, it may process items more slowly (with
 caches/queues). But that's just my opinion.
 

I like and agree with this.

Cheers,
Anastasis



Re: Cutting a 0.7 release

2014-02-24 Thread Edward J. Yoon
1) The MapReduce model uses file-based communication, so each mapper
can run separately. For example, to run an MR job on 1 GB of input
data, 5 mappers will be scheduled. Even if there are only 2 task
slots (a single machine), the MR job is slow but works: 2 running map
tasks, 3 pending map tasks.

However, unlike MapReduce, BSP uses network-based communication. This
means that all BSP tasks must run at once. And the number of BSP
tasks is determined by the number of input blocks. So you CANNOT
run 1 GB of input data on a single machine. It's not a memory issue.
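
The scheduling difference can be put into a few lines of arithmetic. This is only a sketch: SlotMath and its methods are hypothetical names, not Hama or Hadoop APIs, and the 256 MB block size in the last call is an assumed example value.

```java
// Hypothetical helper for illustration only; slot and block counts are
// expected to be positive.
final class SlotMath {

  // Number of tasks when one task is created per input block (rounded up).
  static long taskCount(long inputBytes, long blockBytes) {
    return (inputBytes + blockBytes - 1) / blockBytes;
  }

  // MapReduce: tasks are independent, so pending tasks wait for a free
  // slot and the job finishes in ceil(tasks / slots) waves.
  static long mapReduceWaves(long tasks, long slots) {
    return (tasks + slots - 1) / slots;
  }

  // BSP: peers exchange messages over the network and meet at a barrier,
  // so every task needs its own slot at the same time.
  static boolean bspCanStart(long tasks, long slots) {
    return tasks <= slots;
  }

  public static void main(String[] args) {
    long tasks = 5, slots = 2; // the numbers from the example above
    System.out.println(mapReduceWaves(tasks, slots)); // 3 (2 + 2 + 1 tasks)
    System.out.println(bspCanStart(tasks, slots));    // false
    // Assumed 256 MB blocks: 1 GB of input gives 4 tasks.
    System.out.println(taskCount(1L << 30, 256L << 20)); // 4
  }
}
```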

 throw OOM exceptions, instead it may eventually process items slower (with
 caches / queues) but never throw an exception for that but that's just my

I hope so too, but I think you are talking about Iterative MapReduce.

2) The normal HDFS block size is 64 ~ 256 MB. If we can assume that
the split size = block size, I feel that the current system is enough.

I don't think we have to spend time implementing something disk-based.

WDYT?




Re: Cutting a 0.7 release

2014-02-24 Thread Chia-Hung Lin
Just to let you know, I may refactor based on the following diagram.

http://people.apache.org/~chl501/diagram1.png

It sketches the basic flow required for fault tolerance (ft). I am currently
evaluating the related parts, so it's subject to change.








Re: Cutting a 0.7 release

2014-02-24 Thread Chia-Hung Lin
Programmers can't control Java memory the way malloc/free allows in C, and
with type boxing/unboxing, etc., memory usage does not seem easy to estimate.
So it would be good to stick to the Erlang fail-fast style. Or we could have
a program that loads data and measures the actual memory usage.
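
A rough sketch of such a measuring program in Java, using only Runtime from the JDK. HeapProbe and its members are hypothetical names, and the result is only an approximation: System.gc() is a hint that the JVM may ignore, so the reported difference can be noisy or even negative.

```java
import java.util.function.Supplier;

// Hypothetical helper: loads data through a caller-supplied function and
// reports roughly how much heap the loaded data occupies.
final class HeapProbe {

  // Pairs the loaded data with the approximate heap growth it caused.
  static final class Result<T> {
    final T data;
    final long approxBytes;

    Result(T data, long approxBytes) {
      this.data = data;
      this.approxBytes = approxBytes;
    }
  }

  // Heap currently in use, as seen by the JVM.
  static long usedHeap() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  static <T> Result<T> load(Supplier<T> loader) {
    System.gc(); // best effort: try to start from a clean heap
    long before = usedHeap();
    T data = loader.get();
    System.gc(); // best effort: drop temporary garbage made while loading
    long after = usedHeap();
    // data is still referenced here, so it counts as live in 'after'.
    return new Result<>(data, after - before);
  }
}
```

For example, HeapProbe.load(() -> readVertices(path)).approxBytes would give a rough per-peer figure, where readVertices stands in for whatever loading code is being evaluated.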





Re: Cutting a 0.7 release

2014-02-24 Thread Edward J. Yoon
That's a huge diagram :-) Do you plan to work on HAMA-505, or create a new one?






Cutting a 0.7 release

2014-02-23 Thread Edward J. Yoon
Hi all,

I plan on cutting a release next week. If you have any opinions, please feel free
to comment here.

Sent from my iPhone