[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-06 Thread Dan Brickley (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097840#comment-13097840
 ] 

Dan Brickley commented on GIRAPH-11:


Is there more detail somewhere on how 'range based' works?

> Improve the graph distribution of Giraph
> 
>
> Key: GIRAPH-11
> URL: https://issues.apache.org/jira/browse/GIRAPH-11
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Avery Ching
>Assignee: Avery Ching
>
> Currently, Giraph assumes that the data from the VertexInputFormat is sorted. 
>  If the user data is not sorted by the vertex id, they must first run a 
> MapReduce or Pig job to generate a sorted dataset.  This is often a bit 
> inconvenient.
> Giraph graph partitioning is currently range based and there are some 
> advantages and disadvantages of this approach.  The proposal of this JIRA 
> would be to allow for both range and hash based partitioning and provide more 
> flexibility to the user.
> Design goals for the graph distribution:
> * Allow vertices to be ordered or unordered
> * Ability to repartition
> * Select the partitioning scheme based on user needs (i.e. hash or range 
> based)
> * Ability to provide user-specific hints about partitions
> Hash-based partitioning
> * Good vertex balancing across ranges for random data
> * Bad at vertex id locality
> Range-based partitioning
> * Good at vertex id locality
> * Ability to split ranges easily
> * Can cause hotspots for hot ranges

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-11) Improve the graph distribution of Giraph

2011-09-06 Thread Avery Ching (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-11?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098098#comment-13098098
 ] 

Avery Ching commented on GIRAPH-11:
---

I'm going to assume you're asking about the current partitioning.  If I'm 
wrong, I'll address what we plan to do in the future.  The current partitioning 
is implemented by assuming that the input splits are sorted globally (e.g. two 
input splits, {A, B, C} and {D, E}).  It will break the input splits into vertex 
ranges where the boundaries will not change.  These vertex ranges can be passed 
around the workers via several different balancers.  The balancer can be set 
via setVertexRangeBalancerClass() from GiraphJob or with the right 
configuration parameter (giraph.vertexRangeBalancerClass).  We have some 
implementations for a static balancer (no vertex movement, default), and an 
auto balancer (configurable to balance based on vertices or edges).  You're 
free to implement your own as well.  Hope that answers some of the questions, 
let me know if you have more.
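
To make the range/hash distinction concrete, here is a minimal sketch in Java. It is not Giraph's code: the class and method names are hypothetical, and it only shows how a vertex id could be mapped to a partition under each scheme. The range lookup assumes each range is keyed by its largest vertex id.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: hypothetical names, not Giraph's partitioning classes. */
final class PartitionSketch {
    /**
     * Range-based: ranges are keyed by their largest vertex id, and an id
     * belongs to the first range whose maximum is greater than or equal to it.
     */
    static int rangePartition(TreeMap<Long, Integer> maxIdToRange, long vertexId) {
        Map.Entry<Long, Integer> entry = maxIdToRange.ceilingEntry(vertexId);
        if (entry == null) {
            throw new IllegalArgumentException("no range covers vertex " + vertexId);
        }
        return entry.getValue();
    }

    /** Hash-based: ids are spread by hash, so neighbouring ids usually separate. */
    static int hashPartition(long vertexId, int partitionCount) {
        return Math.floorMod(Long.hashCode(vertexId), partitionCount);
    }

    public static void main(String[] args) {
        TreeMap<Long, Integer> ranges = new TreeMap<>();
        ranges.put(100L, 0); // range 0 holds ids up to 100
        ranges.put(200L, 1); // range 1 holds ids 101..200
        for (long id = 99; id <= 102; id++) {
            System.out.println(id + " -> range " + rangePartition(ranges, id)
                + ", hash " + hashPartition(id, 4));
        }
    }
}
```

In this sketch neighbouring ids stay in the same range until a boundary is crossed, while the hash scheme scatters them, which is the locality trade-off listed in the issue description.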





[jira] [Resolved] (GIRAPH-15) Use of Jenkins for tests and builds

2011-09-06 Thread Hyunsik Choi (JIRA)

 [ 
https://issues.apache.org/jira/browse/GIRAPH-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyunsik Choi resolved GIRAPH-15.


Resolution: Fixed

> Use of Jenkins for tests and builds
> ---
>
> Key: GIRAPH-15
> URL: https://issues.apache.org/jira/browse/GIRAPH-15
> Project: Giraph
>  Issue Type: Task
>Reporter: Hyunsik Choi
>Assignee: Hyunsik Choi
>
> We can use the Jenkins server (https://builds.apache.org/) for regular builds and 
> tests. Using Jenkins requires following a few setup steps.
> Here is the FAQ on using Jenkins:
> http://wiki.apache.org/general/Hudson





[jira] [Commented] (GIRAPH-12) Investigate communication improvements

2011-09-06 Thread Avery Ching (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098512#comment-13098512
 ] 

Avery Ching commented on GIRAPH-12:
---

Jake from Twitter also recommended thinking about using Finagle.  His 
description:

"A fault tolerant, protocol-agnostic RPC system" (based on Netty [which I see is 
already under consideration], written in Scala, but with very mature Java 
bindings too).  We use it internally at Twitter for clusters of mid-tier 
servers which have many dozens of machines talking to hundreds of other 
machines, without blowing up on thread-stack or using a gazillion threads.  
It's mavenized, so it's easy to try out.

> Investigate communication improvements
> --
>
> Key: GIRAPH-12
> URL: https://issues.apache.org/jira/browse/GIRAPH-12
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Avery Ching
>Assignee: Hyunsik Choi
>Priority: Minor
>
> Currently every worker will start up a thread to communicate with every other 
> worker.  Hadoop RPC is used for communication.  For instance, if there are 
> 400 workers, each worker will create 400 threads.  This ends up using a lot 
> of memory, even with the option  
> -Dmapred.child.java.opts="-Xss64k".  
> It would be good to investigate using frameworks like Netty, or rolling our 
> own, to improve this situation.  By moving away from Hadoop RPC, we would 
> also make compatibility across different Hadoop versions easier.
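
As a rough back-of-the-envelope sketch of the thread-stack part of that cost, the arithmetic for 400 workers looks like this in Java. The 1 MiB figure is an assumed stack size used for comparison, not a value from the issue, and the class name is hypothetical. Only stack reservations are counted here; per-connection buffers and other per-thread overhead are not, which may be why memory use stays high even with -Xss64k.

```java
/** Illustrative only: estimates stack space reserved by one-thread-per-peer. */
final class ThreadStackEstimate {
    /** Stack bytes reserved on one worker that runs one thread per peer. */
    static long stackBytesPerWorker(int peerThreads, long stackBytesPerThread) {
        return (long) peerThreads * stackBytesPerThread;
    }

    public static void main(String[] args) {
        int workers = 400;                  // figure from the issue description
        long xss64k = 64L * 1024;           // -Xss64k
        long assumedLarger = 1024L * 1024;  // assumption: 1 MiB stacks, for comparison
        long mib = 1024L * 1024;

        System.out.println("64 KiB stacks: "
            + stackBytesPerWorker(workers, xss64k) / mib + " MiB per worker");
        System.out.println("1 MiB stacks:  "
            + stackBytesPerWorker(workers, assumedLarger) / mib + " MiB per worker");
    }
}
```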





[jira] [Commented] (GIRAPH-12) Investigate communication improvements

2011-09-06 Thread Hyunsik Choi (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098551#comment-13098551
 ] 

Hyunsik Choi commented on GIRAPH-12:


Jake,
Thank you for the recommendation :)

Avery,
Thank you for informing me.


Here is my progress on this issue.

Recently, I implemented and tested a lightweight RPC implementation based 
on Netty and Protocol Buffers, which resembles YarnRPC. Apparently, an 
alternative RPC can give a performance gain.

Finagle is very mature compared to my own implementation. It would be a better 
solution. I'll test my own and Finagle together, and post the results as soon 
as the tests are complete.





[jira] [Commented] (GIRAPH-12) Investigate communication improvements

2011-09-06 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098583#comment-13098583
 ] 

Jake Mannix commented on GIRAPH-12:
---

No problem Hyunsik,

  If you have any questions on how to work with Finagle, drop me a line and if 
I can't figure it out, the primary authors of it are my co-workers and I can 
get them to jump on an email thread (or JIRA comment thread) and they'd be 
happy to help out.  If you've got a git branch with your test code, I'd be 
happy to take a look as well.





[jira] [Commented] (GIRAPH-12) Investigate communication improvements

2011-09-06 Thread Hyunsik Choi (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098587#comment-13098587
 ] 

Hyunsik Choi commented on GIRAPH-12:


Jake,

Thank you for your help :)
I will ask you if I have any questions while trying Finagle.
I'll upload a git branch with my test code soon :)






[jira] [Created] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).

2011-09-06 Thread Jake Mannix (JIRA)
Improve PseudoRandomVertexInputFormat to create a more realistic synthetic 
graph (e.g. power-law distributed vertex-cardinality).
-

 Key: GIRAPH-26
 URL: https://issues.apache.org/jira/browse/GIRAPH-26
 Project: Giraph
  Issue Type: Test
  Components: benchmark
Reporter: Jake Mannix
Priority: Minor


The PageRankBenchmark class, to be a proper benchmark, should run over graphs 
that look more like data seen in the wild.  Web link graphs, social network 
graphs, and text corpora (represented as a bipartite graph) all have power-law 
distributions, so benchmarking a synthetic graph that looks more like this 
would be a nice test.  It would stress cases of uneven split distribution and 
bottlenecks in heavily connected subclusters of the graph.
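
As a minimal sketch of what "power-law distributed vertex-cardinality" could mean in code, the following Java class draws a vertex out-degree d from P(d) proportional to d^(-exponent) on 1..maxDegree, using a precomputed cumulative distribution. The class name, the exponent, and the degree cap are assumptions for illustration; this is not the PseudoRandomVertexInputFormat code, and it does not address generating a consistent graph across parallel input splits.

```java
import java.util.Random;

/** Illustrative only: samples out-degrees from a truncated discrete power law. */
final class PowerLawDegreeSketch {
    private final double[] cdf; // cdf[i] = P(degree <= i + 1)

    PowerLawDegreeSketch(int maxDegree, double exponent) {
        cdf = new double[maxDegree];
        double sum = 0.0;
        for (int d = 1; d <= maxDegree; d++) {
            sum += Math.pow(d, -exponent);
            cdf[d - 1] = sum;
        }
        for (int i = 0; i < maxDegree; i++) {
            cdf[i] /= sum; // normalise so the last entry is exactly 1.0
        }
    }

    /** Inverse-transform sampling: first degree whose CDF value reaches u. */
    int sample(Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        int lo = 0;
        int hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo + 1;
    }

    public static void main(String[] args) {
        PowerLawDegreeSketch sketch = new PowerLawDegreeSketch(1000, 2.0);
        Random rng = new Random(42L);
        for (int vertex = 0; vertex < 10; vertex++) {
            System.out.println("vertex " + vertex + " out-degree " + sketch.sample(rng));
        }
    }
}
```

Most sampled vertices get a degree of 1 or 2, while a few get very large degrees, which is the skew that would stress uneven splits.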





[jira] [Commented] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).

2011-09-06 Thread Avery Ching (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098594#comment-13098594
 ] 

Avery Ching commented on GIRAPH-26:
---

Totally agree, any chance you might have some time to work on this? =)





[jira] [Commented] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).

2011-09-06 Thread Jake Mannix (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098601#comment-13098601
 ] 

Jake Mannix commented on GIRAPH-26:
---

Yep, not at all hard to do.  I'll see if I can make a quick patch.





[jira] [Updated] (GIRAPH-25) NPE in BspServiceMaster when failing a job

2011-09-06 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/GIRAPH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated GIRAPH-25:


Attachment: GIRAPH-25.patch

Attached a basic fix.

The problem was that failing the job did everything correctly, but did not stop 
BspServiceMaster from proceeding. 

There are two choices here -- declare an exception and throw it in this case, 
and deal with that upstream; or, C-style, return a -1. I chose the latter 
because it makes the code that deals with this more succinct and it doesn't 
change a public API. But I can rewrite it if you prefer to throw an exception.
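
For readers without the patch at hand, the two options can be sketched roughly as follows. This is not the content of GIRAPH-25.patch: every name here is hypothetical, and the check is reduced to comparing worker counts.

```java
/** Illustrative only: hypothetical names, not the code in GIRAPH-25.patch. */
final class FailureSignalSketch {
    static final int FAILED = -1;

    /** Option chosen in the comment above: C-style sentinel return value. */
    static int checkWorkersOrSentinel(int checkedIn, int expected) {
        if (checkedIn < expected) {
            return FAILED; // caller must test for -1 and stop
        }
        return checkedIn;
    }

    /** Hypothetical checked exception for the alternative option. */
    static final class WorkersMissingException extends Exception {
        WorkersMissingException(String message) {
            super(message);
        }
    }

    /** Alternative: declare an exception and let upstream code handle it. */
    static int checkWorkersOrThrow(int checkedIn, int expected)
            throws WorkersMissingException {
        if (checkedIn < expected) {
            throw new WorkersMissingException(
                "only " + checkedIn + " of " + expected + " workers checked in");
        }
        return checkedIn;
    }

    public static void main(String[] args) {
        System.out.println(checkWorkersOrSentinel(3, 4));
        try {
            checkWorkersOrThrow(3, 4);
        } catch (WorkersMissingException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

The sentinel keeps method signatures unchanged, at the cost that every caller has to remember to check for -1. The exception changes the signature but cannot be silently ignored.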

No test as I wasn't sure how best to fit this into the way the tests are set up.

> NPE in BspServiceMaster when failing a job
> --
>
> Key: GIRAPH-25
> URL: https://issues.apache.org/jira/browse/GIRAPH-25
> Project: Giraph
>  Issue Type: Bug
>Reporter: Dmitriy V. Ryaboy
>Priority: Minor
> Attachments: GIRAPH-25.patch
>
>
> When BspServiceMaster times out waiting for all workers to check in, it dies 
> with a NullPointerException.
> This can perhaps be handled a bit more gracefully.





[jira] [Commented] (GIRAPH-26) Improve PseudoRandomVertexInputFormat to create a more realistic synthetic graph (e.g. power-law distributed vertex-cardinality).

2011-09-06 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13098614#comment-13098614
 ] 

Dmitriy V. Ryaboy commented on GIRAPH-26:
-

I am not so sure it's that easy in a parallel world, Jake :)
http://arxiv.org/pdf/1003.3684v1
