Re: [Announcement] Giraph talk in Berlin on May 29th

2012-05-12 Thread Avery Ching

Nice!

Avery

On 5/12/12 2:58 AM, Sebastian Schelter wrote:

Hi,

I will give a talk titled "Large Scale Graph Processing with Apache
Giraph" in Berlin on May 29th. Details are available at:

https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275

Best,
Sebastian




Re: Possible bug when resetting aggregators ? (and missing documentation)

2012-05-02 Thread Avery Ching

I think you're right that the javadoc isn't specific enough.

  /**
   * Use a registered aggregator in current superstep.
   * Even when the same aggregator should be used in the next
   * superstep, useAggregator needs to be called at the beginning
   * of that superstep in preSuperstep().
   *
   * @param name Name of aggregator
   * @return boolean (false when not registered)
   */
  boolean useAggregator(String name);

This should be augmented to say that none of the Aggregator methods 
should be called until this method is invoked.  Feel free to file a JIRA 
and fix.  Thanks!
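To make the contract concrete, here is a small self-contained mock (hypothetical names; it only imitates the behavior described in this thread, not the actual org.apache.giraph.graph API):

```java
import java.util.*;

// Minimal mock of the useAggregator contract: an aggregator must be
// "used" again each superstep (e.g. from preSuperstep()) before any
// aggregate()/getAggregatedValue() call on it. Toy stand-in, not Giraph.
public class AggregatorContract {
    private final Map<String, long[]> registered = new HashMap<>();
    private final Set<String> inUse = new HashSet<>();

    public boolean registerAggregator(String name) {
        return registered.putIfAbsent(name, new long[1]) == null;
    }

    // Returns false when the name was never registered.
    public boolean useAggregator(String name) {
        if (!registered.containsKey(name)) {
            return false;
        }
        inUse.add(name);
        return true;
    }

    // A new superstep starts: "used" status does not carry over.
    public void beginSuperstep() {
        inUse.clear();
    }

    public void aggregate(String name, long value) {
        if (!inUse.contains(name)) {
            throw new IllegalStateException("useAggregator not called: " + name);
        }
        registered.get(name)[0] += value;
    }

    public long getAggregatedValue(String name) {
        return registered.get(name)[0];
    }

    public static void main(String[] args) {
        AggregatorContract ctx = new AggregatorContract();
        ctx.registerAggregator("vertexCount");
        ctx.useAggregator("vertexCount");
        ctx.aggregate("vertexCount", 5);

        ctx.beginSuperstep();          // forgetting useAggregator here...
        try {
            ctx.aggregate("vertexCount", 1);
        } catch (IllegalStateException expected) {
            System.out.println("rejected: " + expected.getMessage());
        }
    }
}
```

The sketch just encodes the rule above: the "used" state has to be re-established at the start of every superstep.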


If you would like to, please feel free to add Aggregator documentation 
to https://cwiki.apache.org/confluence/display/GIRAPH/Index


Avery

On 5/2/12 12:15 PM, Benjamin Heitmann wrote:

Hello,

I had to use aggregators for various statistics-reporting tasks,
and I noticed that the aggregator operations need to be used in a very
specific sequence,
especially when the aggregator gets reset between supersteps.

I found that the sequence described in RandomMessageBenchmark (in the 
org.apache.giraph.benchmark package)
results in consistent counts for one aggregator across all workers.
The most important thing seems to be to call the reset method 
setAggregatedValue() in preSuperstep() of the WorkerContext class,
before calling this.useAggregator().

If I called the reset method in postSuperstep(), then every worker reported a 
different value for the aggregator.

However, the aggregator which gets reset between supersteps is still wrong.

I know this because a second aggregator counts the same thing and reports it 
after each superstep,
without being reset.

Is this a known issue? Should I file a bug report on it?


In addition, it would be great to document correct usage of the aggregators 
somewhere.
Even just in the javadoc of the aggregator interface might be enough.

Should I try to add some documentation to the aggregator interface?
(org.apache.giraph.graph.Aggregator.java)
Then the committers can correct me if that documentation is wrong, I guess.




Re: Please welcome our newest committer and PMC member, Eugene!

2012-05-01 Thread Avery Ching

Awesome!  Congrats Eugene, we're excited to have you taking on a big role.

Avery

On 5/1/12 5:18 PM, Hyunsik Choi wrote:

Congrats and welcome Eugene!
I'm looking forward to your contribution.

--
Hyunsik Choi

On Wed, May 2, 2012 at 5:39 AM, Jakob Homan jgho...@gmail.com wrote:


I'm happy to announce that the Giraph PMC has voted Eugene Koontz in
as a committer and PMC member.  Eugene has been pitching in with great
patches that have been very useful, such as helping us sort out our
terrifying munging situation (GIRAPH-168).

Welcome aboard, Eugene!

-Jakob






Re: Does Giraph support labeled graphs?

2012-04-19 Thread Avery Ching

Anyone want to work on https://issues.apache.org/jira/browse/GIRAPH-155? =)

On 4/19/12 9:22 AM, Claudio Martella wrote:

The problem with this approach is that Giraph doesn't support
multi-graphs: following RDF, you can have multiple edges connecting
the same pair of vertices.
So for methods such as getEdgeValue(I) you'd have to return something
like List<E>. For this, I'd suggest forgetting the Giraph-specific
methods and just adding your own on top, which you will call internally.
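As a sketch of that suggestion (illustrative names, not Giraph API), the edge value itself can hold the labels of every parallel edge, so one Giraph edge represents a multi-edge:

```java
import java.util.*;

// Toy model of multi-edges: vertex id -> (target id -> labels of all
// parallel edges). In Giraph terms, the per-target List<String> would
// play the role of a single edge value E.
public class MultiEdgeValue {
    private final Map<Long, Map<Long, List<String>>> adj = new HashMap<>();

    public void addEdge(long src, long dst, String label) {
        adj.computeIfAbsent(src, k -> new HashMap<>())
           .computeIfAbsent(dst, k -> new ArrayList<>())
           .add(label);
    }

    // Analogous to getEdgeValue(I), but returning all parallel edges.
    public List<String> getEdgeValues(long src, long dst) {
        return adj.getOrDefault(src, Map.of())
                  .getOrDefault(dst, List.of());
    }

    public static void main(String[] args) {
        MultiEdgeValue g = new MultiEdgeValue();
        g.addEdge(32, 303, "266");  // two RDF predicates between the
        g.addEdge(32, 303, "153");  // same pair of vertices
        System.out.println(g.getEdgeValues(32L, 303L));
    }
}
```

A custom accessor like getEdgeValues would be one of the "your own methods on top" the reply describes.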

On Thu, Apr 19, 2012 at 12:36 PM, Benjamin Heitmann
benjamin.heitm...@deri.org  wrote:

Hi Avery and Paolo,

On 11 Apr 2012, at 18:37, Avery Ching wrote:


There is no preferred way to represent labeled graphs.  A close example to 
your adjacency list idea is LongDoubleDoubleAdjacencyListVertexInputFormat.

Exactly. Giraph supports labeled graphs very easily.

My reply is a little bit late, so you probably already figured out the following:

The thing you need to do is create your own class which extends HashMapVertex,
and as the third parameter of the <I, V, E, M> signature, you provide a Text
for the edge value. AFAIK, no other code is required in that class in order to
use the edge labels.

But you will need to write a VertexInputFormat class to fill in the edges when you
parse your input.







Re: Slides for my talk at the Berlin Hadoop Get Together

2012-04-19 Thread Avery Ching
Very nice!  Will these be similar to the 'Parallel Processing beyond 
MapReduce' workshop after Berlin Buzzwords?  It would be good to add at 
least one of them to the page.


Avery

On 4/19/12 12:31 PM, Sebastian Schelter wrote:

Here are the slides of my talk "Introducing Apache Giraph for Large
Scale Graph Processing" at the Berlin Hadoop Get Together yesterday:

http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing

I reused a lot of stuff from Claudio's excellent prezi presentation.

Best,
Sebastian




Re: java.lang.RuntimeException [...] msgMap did not exist [...]

2012-04-17 Thread Avery Ching

Etienne,

There should be one task log per task.  Do you have all the task logs?  
It looks like this one failed because another one failed.


Avery

On 4/17/12 9:37 AM, Etienne Dumoulin wrote:

Avery,

I attach the file; indeed, it looks more interesting than the others. 
There is a null pointer exception:
MapAttempt TASK_TYPE=MAP 
TASKID=task_201204121825_0001_m_02 
TASK_ATTEMPT_ID=attempt_201204121825_0001_m_02_0 
TASK_STATUS=FAILED FINISH_TIME=1334251707662 
HOSTNAME=nantes ERROR=java.lang.NullPointerException
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:639)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)

Also I found this file in 
logs/history/done/version-1/rennes.local.net_1334252188432_/2012/04/13/00/job_201204121836_0003_1334307958403_hadoop_org.apache.giraph.examples.SimpleShortestPathsVert. 
I ran it on the 13th at 10am local time, but in these logs the 
date is 20120412. In addition, in the logs directory I have no 
job conf dated the 13th. Does Hadoop not use the local time 
to name the files?


Thanks,

Étienne


On 16 April 2012 19:45, Avery Ching ach...@apache.org wrote:


Etienne, the task tracker logs are not what I meant, sorry for the
confusion.  Every task produces its own output and error log. 
That is likely where we can find the issue.  Likely a task failed,
and the task logs should say why.

Avery


On 4/16/12 3:00 AM, Etienne Dumoulin wrote:

Hi Avery,

Thanks for your fast reply. I attach the forgotten file.

Regards,

Étienne

On 13 April 2012 17:40, Avery Ching ach...@apache.org wrote:

Hi Etienne,

Thanks for your questions.  Giraph uses map tasks to run its
master and workers.  Can you provide the task output logs?
 It looks like your workers failed to report status for some
reason and we need to find out why.  The datanode logs can't
help us here.

Avery


On 4/13/12 3:35 AM, Etienne Dumoulin wrote:

Hi Guys,

I tried out giraph yesterday and I have an issue to run
the shortest path example.

I am working on a toy heterogeneous cluster of 3
datanodes and 1 namenode/jobtracker, with hadoop 0.20.203.0.
One of the datanodes is a small quad-core server with 16 GB
RAM; the others are small 1-core PCs with 1 GB RAM, same OS:
ubuntu-server 10.04.

I ran into a first issue with the 0.1 version, the same
described here:
https://issues.apache.org/jira/browse/GIRAPH-114.
Before I found the patch I tried different configurations:
it works in a standalone environment, with the namenode
and the server, and with the namenode and the two small PCs.
It works neither with the entire cluster nor with
one small PC plus the server as datanodes.

Then today I downloaded the svn version; no luck, it has
the same behaviour as the 0.1 version (goes to 100%
then back to 0%) but not the same info logs.
Below is the svn version console log; nantes is the name
of the big datanode, rennes the namenode/jobtracker:

hadoop@rennes:~/test$ hadoop jar

~/project/giraph/trunk_2012_04_13/target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar
org.apache.giraph.examples.SimpleShortestPathsVertex
shortestPathsInputGraph shortestPathsOutputGraph 0 3
12/04/13 10:05:58 INFO mapred.JobClient: Running job:
job_201204121836_0003
12/04/13 10:05:59 INFO mapred.JobClient:  map 0% reduce 0%
12/04/13 10:06:18 INFO mapred.JobClient:  map 25% reduce 0%
12/04/13 10:08:55 INFO mapred.JobClient:  map 100% reduce 0%
12/04/13 10:21:28 INFO mapred.JobClient:  map 75% reduce 0%
12/04/13 10:21:33 INFO mapred.JobClient: Task Id :
attempt_201204121836_0003_m_02_0, Status : FAILED
Task attempt_201204121836_0003_m_02_0 failed to
report status for 600 seconds. Killing!
12/04/13 10:23:57 INFO mapred.JobClient: Task Id :
attempt_201204121836_0003_m_01_0, Status : FAILED
java.lang.RuntimeException: sendMessage: msgMap did not
exist for nantes:30002

Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph

2012-04-13 Thread Avery Ching

Hi Paolo,

Can you try something for me?  I was able to get the PageRankBenchmark 
to work running in local mode just fine on my side.


I think we should have some kind of a helper script (similar to 
bin/giraph) for running simple tests in LocalJobRunner.


I believe that for LocalJobRunner to run, we need to do 
-Dgiraph.SplitMasterWorker=false -Dlocal.test.mode=true.  In the case of 
PageRankBenchmark, I also have to set the workers to 1 (LocalJobRunner 
can only run one task at a time).


So I took the classpath that bin/giraph was using to run (just added an 
echo $CLASSPATH at the end) and then inserted the 
giraph-0.2-SNAPSHOT-jar-with-dependencies.jar in front of it (this is 
necessary for the ZooKeeper jar inclusion).  Then I just ran a normal 
java command and got the output below.


One thing to remember is that if you rerun it, you'll have to remove the 
_bsp directories that are created, otherwise it will think it has 
already been completed.
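A tiny helper along those lines (an assumption-laden sketch, not part of Giraph; it just clears _bsp* directories under the working directory) could look like:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.stream.Stream;

// Remove stale _bsp* directories from the working directory before
// rerunning a LocalJobRunner test, since leftovers make the previous
// run look already completed.
public class CleanBsp {
    static void deleteRecursively(Path root) throws IOException {
        try (Stream<Path> walk = Files.walk(root)) {
            // delete children before their parent directories
            for (Path p : walk.sorted(Comparator.reverseOrder()).toArray(Path[]::new)) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) {
        try (Stream<Path> entries = Files.list(Paths.get("."))) {
            for (Path p : entries.toArray(Path[]::new)) {
                if (p.getFileName().toString().startsWith("_bsp")) {
                    deleteRecursively(p);
                    System.out.println("removed " + p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Running it (or an equivalent rm -rf _bsp*) between local-mode runs avoids the "already completed" surprise.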


Hope that helps,

Avery

 java -cp 
target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar:/Users/aching/git/git_svn_giraph_trunk/conf:/Users/aching/.m2/repository/ant/ant/1.6.5/ant-1.6.5.jar:/Users/aching/.m2/repository/com/google/guava/guava/r09/guava-r09.jar:/Users/aching/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/Users/aching/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/Users/aching/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/Users/aching/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/Users/aching/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/Users/aching/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/Users/aching/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/Users/aching/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/Users/aching/.m2/repository/commons-httpclient/commons-httpclient/3.0.1/commons-httpclient-3.0.1.jar:/Users/aching/.m2/repository/commons-lang/commons-lang/2.4/commons-lang-2.4.jar:/Users/aching/.m2/repository/commons-logging/commons-logging/1.0.3/commons-logging-1.0.3.jar:/Users/aching/.m2/repository/commons-net/commons-net/1.4.1/commons-net-1.4.1.jar:/Users/aching/.m2/repository/hsqldb/hsqldb/1.8.0.10/hsqldb-1.8.0.10.jar:/Users/aching/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/Users/aching/.m2/repository/javax/mail/mail/1.4/mail-1.4.jar:/Users/aching/.m2/repository/jline/jline/0.9.94/jline-0.9.94.jar:/Users/aching/.m2/repository/junit/junit/3.8.1/junit-3.8.1.jar:/Users/aching/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar:/Users/aching/.m2/repository/net/iharder/base64/2.3.8/base64-2.3.8.jar:/Users/aching/.m2/repository/net/java/dev/jets3t/jets3t/0.7.1/jets3t-0.7.1.jar:/Users/aching/.m2/repository/net/sf/kosmosfs/kfs/0.3/kfs-0.3.jar:/Users/aching/.m2/repository/org/apache/commons/comm
ons-io/1.3.2/commons-io-1.3.2.jar:/Users/aching/.m2/repository/org/apache/commons/commons-math/2.1/commons-math-2.1.jar:/Users/aching/.m2/repository/org/apache/hadoop/hadoop-core/0.20.203.0/hadoop-core-0.20.203.0.jar:/Users/aching/.m2/repository/org/apache/mahout/mahout-collections/1.0/mahout-collections-1.0.jar:/Users/aching/.m2/repository/org/apache/zookeeper/zookeeper/3.3.3/zookeeper-3.3.3.jar:/Users/aching/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.0/jackson-core-asl-1.8.0.jar:/Users/aching/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.0/jackson-mapper-asl-1.8.0.jar:/Users/aching/.m2/repository/org/eclipse/jdt/core/3.1.1/core-3.1.1.jar:/Users/aching/.m2/repository/org/json/json/20090211/json-20090211.jar:/Users/aching/.m2/repository/org/mockito/mockito-all/1.8.5/mockito-all-1.8.5.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jetty/6.1.26/jetty-6.1.26.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jsp-2.1/6.1.14/jsp-2.1-6.1.14.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jsp-api-2.1/6.1.14/jsp-api-2.1-6.1.14.jar:/Users/aching/.m2/repository/org/mortbay/jetty/servlet-api/2.5-20081211/servlet-api-2.5-20081211.jar:/Users/aching/.m2/repository/org/mortbay/jetty/servlet-api-2.5/6.1.14/servlet-api-2.5-6.1.14.jar:/Users/aching/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar:/Users/aching/.m2/repository/tomcat/jasper-compiler/5.5.12/jasper-compiler-5.5.12.jar:/Users/aching/.m2/repository/tomcat/jasper-runtime/5.5.12/jasper-runtime-5.5.12.jar:/Users/aching/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar 
org.apache.giraph.benchmark.PageRankBenchmark 
-Dgiraph.SplitMasterWorker=false -Dlocal.test.mode=true  -c 1 -e 2 -s 2 
-V 10 -w 1


2012-04-13 09:30:27.261 java[45785:1903] Unable to load realm mapping 
info from SCDynamicStore
12/04/13 09:30:27 INFO benchmark.PageRankBenchmark: Using class 
org.apache.giraph.benchmark.PageRankBenchmark

Re: java.lang.RuntimeException [...] msgMap did not exist [...]

2012-04-13 Thread Avery Ching

Hi Etienne,

Thanks for your questions.  Giraph uses map tasks to run its master and 
workers.  Can you provide the task output logs?  It looks like your 
workers failed to report status for some reason and we need to find out 
why.  The datanode logs can't help us here.


Avery

On 4/13/12 3:35 AM, Etienne Dumoulin wrote:

Hi Guys,

I tried out giraph yesterday and I have an issue to run the shortest 
path example.


I am working on a toy heterogeneous cluster of 3 datanodes and 1 
namenode/jobtracker, with hadoop 0.20.203.0.
One of the datanodes is a small quad-core server with 16 GB RAM; the 
others are small 1-core PCs with 1 GB RAM, same OS: ubuntu-server 10.04.


I ran into a first issue with the 0.1 version, the same described here: 
https://issues.apache.org/jira/browse/GIRAPH-114.

Before I found the patch I tried different configurations:
it works in a standalone environment, with the namenode and the 
server, and with the namenode and the two small PCs.
It works neither with the entire cluster nor with one small PC 
plus the server as datanodes.


Then today I downloaded the svn version; no luck, it has the same 
behaviour as the 0.1 version (goes to 100% then back to 0%) but 
not the same info logs.
Below is the svn version console log; nantes is the name of the big 
datanode, rennes the namenode/jobtracker:


hadoop@rennes:~/test$ hadoop jar 
~/project/giraph/trunk_2012_04_13/target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar 
org.apache.giraph.examples.SimpleShortestPathsVertex 
shortestPathsInputGraph shortestPathsOutputGraph 0 3
12/04/13 10:05:58 INFO mapred.JobClient: Running job: 
job_201204121836_0003

12/04/13 10:05:59 INFO mapred.JobClient:  map 0% reduce 0%
12/04/13 10:06:18 INFO mapred.JobClient:  map 25% reduce 0%
12/04/13 10:08:55 INFO mapred.JobClient:  map 100% reduce 0%
12/04/13 10:21:28 INFO mapred.JobClient:  map 75% reduce 0%
12/04/13 10:21:33 INFO mapred.JobClient: Task Id : 
attempt_201204121836_0003_m_02_0, Status : FAILED
Task attempt_201204121836_0003_m_02_0 failed to report status for 
600 seconds. Killing!
12/04/13 10:23:57 INFO mapred.JobClient: Task Id : 
attempt_201204121836_0003_m_01_0, Status : FAILED
java.lang.RuntimeException: sendMessage: msgMap did not exist for 
nantes:30002 for vertex 2
at 
org.apache.giraph.comm.BasicRPCCommunications.sendMessageReq(BasicRPCCommunications.java:993)
at 
org.apache.giraph.graph.BasicVertex.sendMsg(BasicVertex.java:168)
at 
org.apache.giraph.examples.SimpleShortestPathsVertex.compute(SimpleShortestPathsVertex.java:104)

at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:593)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:648)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)

at org.apache.hadoop.mapred.Child.main(Child.java:253)

Task attempt_201204121836_0003_m_01_0 failed to report status for 
601 seconds. Killing!

12/04/13 10:23:58 INFO mapred.JobClient:  map 50% reduce 0%
12/04/13 10:24:01 INFO mapred.JobClient:  map 25% reduce 0%
12/04/13 10:24:06 INFO mapred.JobClient: Task Id : 
attempt_201204121836_0003_m_03_0, Status : FAILED
Task attempt_201204121836_0003_m_03_0 failed to report status for 
602 seconds. Killing!


I attached the hadoop logs for rennes namenode and jobtraker and for 
nantes the big datanode.


Has anyone already gotten this error / found a fix?

Thanks for your time,

Étienne







Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph

2012-04-11 Thread Avery Ching
It shouldn't be; your code looks very similar to the unit tests (i.e. 
TestManualCheckpoint.java).  So, are you trying to run your test with the 
local hadoop (similar to the unit tests)?  Or are you using an actual 
hadoop setup?


Avery

On 4/10/12 11:41 PM, Paolo Castagna wrote:

I am using hadoop-core-1.0.1.jar ... could that be a problem?

Paolo

Paolo Castagna wrote:

Hi Avery,
nope, no luck.

I have changed all my log.debug(...) into log.info(...). Same behavior.

I have a log4j.properties [1] file in my classpath and it has:
log4j.logger.org.apache.jena.grande=DEBUG
log4j.logger.org.apache.jena.grande.giraph=DEBUG
I also tried to change that to:
log4j.logger.org.apache.jena.grande=INFO
log4j.logger.org.apache.jena.grande.giraph=INFO
No luck.

My Giraph job has:
GiraphJob job = new GiraphJob(getConf(), getClass().getName());
job.setVertexClass(getClass());
job.setVertexInputFormatClass(TurtleVertexInputFormat.class);
job.setVertexOutputFormatClass(TurtleVertexOutputFormat.class);

But if I run in debug with a breakpoint in the TurtleVertexInputFormat
constructor, it is never instantiated. How can that be?

So perhaps the problem is not the logging, it is the fact that
my GiraphJob is not using TurtleVertexInputFormat.class and
TurtleVertexOutputFormat.class, but I don't see what I am doing
wrong. :-/

Thanks,
Paolo

  [1]
https://github.com/castagna/jena-grande/blob/master/src/test/resources/log4j.properties

Avery Ching wrote:

I think the issue might be that Hadoop only logs INFO and above messages
by default.  Can you retry with INFO level logging?

Avery

On 4/10/12 12:17 PM, Paolo Castagna wrote:

Hi,
I am still learning Giraph, so, please, be patient with me and forgive my
trivial questions.

As a simple initial use case, I want to compute the shortest paths
from a single
source in a social graph in RDF format using the FOAF [1] vocabulary.
This example also will hopefully inform GIRAPH-170 [2] and related
issues, such
as: GIRAPH-141 [3].

Here is an example in Turtle [4] format of a tiny graph using FOAF:

@prefix : <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:alice
  a           foaf:Person ;
  foaf:name   "Alice" ;
  foaf:mbox   <mailto:al...@example.org> ;
  foaf:knows  :bob ;
  foaf:knows  :charlie ;
  foaf:knows  :snoopy ;
  .

:bob
  foaf:name   "Bob" ;
  foaf:knows  :charlie ;
  .

:charlie
  foaf:name   "Charlie" ;
  foaf:knows  :alice ;
  .

This is nice, human friendly (RDF without angle brackets!), but not
easily
splittable to be processed with MapReduce (or Giraph).

Here is the same graph in N-Triples [5] format:

<http://example.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox>
<mailto:al...@example.org> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/snoopy> .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name>
"Charlie" .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/alice> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie> .

This is more verbose and ugly, but splittable.

The graph I am interested in is the graph represented by the foaf:knows
relationships/links between people (please note, the --knows-->
relationship here
has a direction; it isn't symmetric as in centralized social networking
websites such as Facebook or LinkedIn. Alice can claim to know Bob
without Bob
knowing it, and it might even be a false claim):

alice --knows-->  bob
alice --knows-->  charlie
alice --knows-->  snoopy
bob --knows-->  charlie
charlie --knows-->  alice

As a first step, I wrote a MapReduce job [6] to transform the RDF
graph above into
a sort of adjacency list using Turtle syntax; here is the output
(three lines):

<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox>
<mailto:al...@example.org>; <http://xmlns.com/foaf/0.1/name> "Alice";
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>; <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie>, <http://example.org/bob>,
<http://example.org/snoopy>; . <http://example.org/charlie>
<http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .

<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob";
<http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie>; .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/bob> .

<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name>
"Charlie"; <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice>; .
http://example.org/bob   http://xmlns.com/foaf

Re: Does Giraph support labeled graphs?

2012-04-11 Thread Avery Ching
There is no preferred way to represent labeled graphs.  A close 
example to your adjacency list idea is 
LongDoubleDoubleAdjacencyListVertexInputFormat.


Hope that helps,

Avery

On 4/11/12 10:00 AM, Paolo Castagna wrote:

Hi,
I am not sure what's the best way to represent labeled graphs in Giraph.

Here is my graph (i.e. vertex_id --edge_label_id-->  vertex_id ):

32 --62-->  115
32 --153-->  189
32 --200-->  236
32 --266-->  303
32 --266-->  331
32 --266-->  363
303 --153-->  407
303 --266-->  331
331 --153-->  394
331 --266-->  32
...

I have code to produce an adjacency list:

32 ( 62 115 ) ( 153 189 ) ( 200 236 ) ( 266 303 331 363 )
303 ( 153 407 ) ( 266 331 )
331 ( 153 394 ) ( 266 32 )
...

What's the best way to represent labeled graphs with Giraph?

Correct me if I am wrong, but none of the current VertexInputFormat(s) is good
for this, am I right?

As a workaround, it is possible to generate an unlabeled adjacency list with
just the edge type someone is interested in, say for example --266--> :

32 303 331 363
303 331
331 32
...
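For illustration, a line of that labeled adjacency format can be parsed into a label-to-targets map with a few lines of code (names here are made up, not part of Giraph):

```java
import java.util.*;

// Parse one line of the labeled adjacency format above,
// e.g. "32 ( 62 115 ) ( 266 303 331 )", into label -> targets.
// Format assumed from the example in this thread.
public class LabeledAdjacencyLine {
    public static Map<String, List<String>> parse(String line) {
        String[] tok = line.trim().split("\\s+");
        Map<String, List<String>> byLabel = new LinkedHashMap<>();
        List<String> current = null;
        for (int i = 1; i < tok.length; i++) {   // tok[0] is the source vertex id
            if (tok[i].equals("(")) {
                // first token inside the parentheses is the edge label
                current = byLabel.computeIfAbsent(tok[++i], k -> new ArrayList<>());
            } else if (!tok[i].equals(")")) {
                current.add(tok[i]);             // remaining tokens are targets
            }
        }
        return byLabel;
    }

    public static void main(String[] args) {
        System.out.println(
            parse("32 ( 62 115 ) ( 153 189 ) ( 200 236 ) ( 266 303 331 363 )"));
    }
}
```

A custom VertexInputFormat could use a parser like this to build the edge map for each vertex.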


Cheers,
Paolo


PS:
The graph above is RDF, parsed using Apache Jena's RIOT and stored in TDB.
An example of code to generate the adjacency list from TDB indexes is here:
https://github.com/castagna/jena-grande/blob/0667599264527721daea80d56ad3f99e437dcda2/src/main/java/org/apache/jena/grande/examples/RunTdbLowLevel.java




Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph

2012-04-10 Thread Avery Ching
I think the issue might be that Hadoop only logs INFO and above messages 
by default.  Can you retry with INFO level logging?


Avery

On 4/10/12 12:17 PM, Paolo Castagna wrote:

Hi,
I am still learning Giraph, so, please, be patient with me and forgive my
trivial questions.

As a simple initial use case, I want to compute the shortest paths from a single
source in a social graph in RDF format using the FOAF [1] vocabulary.
This example also will hopefully inform GIRAPH-170 [2] and related issues, such
as: GIRAPH-141 [3].

Here is an example in Turtle [4] format of a tiny graph using FOAF:

@prefix : <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:alice
 a           foaf:Person ;
 foaf:name   "Alice" ;
 foaf:mbox   <mailto:al...@example.org> ;
 foaf:knows  :bob ;
 foaf:knows  :charlie ;
 foaf:knows  :snoopy ;
 .

:bob
 foaf:name   "Bob" ;
 foaf:knows  :charlie ;
 .

:charlie
 foaf:name   "Charlie" ;
 foaf:knows  :alice ;
 .

This is nice, human friendly (RDF without angle brackets!), but not easily
splittable to be processed with MapReduce (or Giraph).

Here is the same graph in N-Triples [5] format:

<http://example.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox>
<mailto:al...@example.org> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/snoopy> .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/alice> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie> .

This is more verbose and ugly, but splittable.

The graph I am interested in is the graph represented by the foaf:knows
relationships/links between people (please note, the --knows--> relationship here
has a direction; it isn't symmetric as in centralized social networking
websites such as Facebook or LinkedIn. Alice can claim to know Bob, without Bob
knowing it, and it might even be a false claim):

alice --knows-->  bob
alice --knows-->  charlie
alice --knows-->  snoopy
bob --knows-->  charlie
charlie --knows-->  alice
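The splittability point can be made concrete with a toy grouping pass (illustrative names, not the actual MapReduce job): because each N-Triples line is a complete statement, the foaf:knows edges can be grouped by subject one line at a time, with no state carried across lines.

```java
import java.util.*;

// Toy illustration of why line-oriented N-Triples is splittable:
// every line is one complete statement, so extracting the foaf:knows
// graph needs no cross-line state, unlike multi-line Turtle.
public class KnowsAdjacency {
    static final String KNOWS = "<http://xmlns.com/foaf/0.1/knows>";

    public static Map<String, List<String>> knowsEdges(String[] lines) {
        Map<String, List<String>> adj = new TreeMap<>();
        for (String line : lines) {
            // subject predicate object "."
            String[] t = line.trim().split("\\s+");
            if (t.length >= 3 && t[1].equals(KNOWS)) {
                adj.computeIfAbsent(t[0], k -> new ArrayList<>()).add(t[2]);
            }
        }
        return adj;
    }

    public static void main(String[] args) {
        String[] ntriples = {
            "<http://example.org/alice> " + KNOWS + " <http://example.org/bob> .",
            "<http://example.org/alice> " + KNOWS + " <http://example.org/charlie> .",
            "<http://example.org/bob> " + KNOWS + " <http://example.org/charlie> .",
            "<http://example.org/charlie> " + KNOWS + " <http://example.org/alice> .",
        };
        System.out.println(knowsEdges(ntriples));
    }
}
```

In a real job the per-line independence is exactly what lets Hadoop hand arbitrary line ranges to different mappers.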

As a first step, I wrote a MapReduce job [6] to transform the RDF graph above into
a sort of adjacency list using Turtle syntax; here is the output (three lines):

<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox>
<mailto:al...@example.org>; <http://xmlns.com/foaf/0.1/name> "Alice";
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://xmlns.com/foaf/0.1/Person>; <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie>, <http://example.org/bob>,
<http://example.org/snoopy>; . <http://example.org/charlie>
<http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .

<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob";
<http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie>; .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/bob> .

<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name> "Charlie";
<http://xmlns.com/foaf/0.1/knows> <http://example.org/alice>; .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows>
<http://example.org/charlie> . <http://example.org/alice>
<http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie> .

This is legal Turtle, but it is also splittable. Each line has all the RDF
statements (i.e. edges) for a person (there are also incoming edges).

I wrote a TurtleVertexReader [7] which extends TextVertexReader<NodeWritable,
Text, NodeWritable, Text> and a TurtleVertexInputFormat [8] which extends
TextVertexInputFormat<NodeWritable, Text, NodeWritable, Text>.
I wrote (copying from the example SimpleShortestPathsVertex) a
FoafShortestPathsVertex [9] which extends EdgeListVertex<NodeWritable,
IntWritable, NodeWritable, IntWritable> and I am running it locally using these
arguments: -Dgiraph.maxWorkers=1 -Dgiraph.SplitMasterWorker=false
-DoverwriteOutput=true src/test/resources/data3.ttl target/foaf
http://example.org/alice 1

TurtleVertexReader, TurtleVertexInputFormat and FoafShortestPathsVertex are
still work in progress and I am sure there are plenty of stupid errors.
However, I do not understand why, when I run FoafShortestPathsVertex at the
DEBUG level, I see debug statements from FoafShortestPathsVertex:
19:34:44 DEBUG FoafShortestPathsVertex   :: main({-Dgiraph.maxWorkers=1,
-Dgiraph.SplitMasterWorker=false, -DoverwriteOutput=true,
src/test/resources/data3.ttl, target/foaf, http://example.org/alice, 1})
19:34:44 DEBUG FoafShortestPathsVertex   :: getConf() --  null
19:34:44 DEBUG FoafShortestPathsVertex   :: setConf(Configuration:
core-default.xml, 

Re: Announcement: 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords

2012-04-04 Thread Avery Ching

That is great news Sebastian!  Congrats, I wish I was in Berlin to attend.

Avery

On 4/4/12 2:12 AM, Sebastian Schelter wrote:

Hi everybody,

I'd like to announce the 'Parallel Processing beyond MapReduce' workshop
which will take place directly after the Berlin Buzzwords conference (
http://berlinbuzzwords.de/ ).


This workshop will discuss novel paradigms for parallel processing
beyond the traditional MapReduce paradigm offered by Apache Hadoop.

The workshop will introduce two new systems:

Apache Giraph aims at processing large graphs, runs on standard Hadoop
infrastructure and is a loose port of Google's Pregel system. Giraph
follows the bulk-synchronous parallel model relative to graphs where
vertices can send messages to other vertices during a given superstep.

Stratosphere (http://www.stratosphere.eu) is a system that is developed
in a joint research project by Technische Universität Berlin, Humboldt
Universität zu Berlin and the Hasso-Plattner-Institut in Potsdam. It is
a database inspired, large-scale data processor based on concepts of
robust and adaptive execution. Stratosphere offers the PACT programming
model that extends the MapReduce programming model with additional
second order functions. As execution platform it uses the Nephele
system, a massively parallel data flow engine which is also researched
and developed in the project.

Attendees will hear about the new possibilities of Hadoop's NextGen
MapReduce architecture (YARN) and get a detailed introduction to the
Apache Giraph and Stratosphere systems. After that there will be plenty
of time for questions, discussions and diving into source code.

As a prerequisite, attendees have to bring a notebook with:
  - a copy of Giraph downloaded with source
  - Hadoop 0.23+ source tree and JARS local
  - a copy of Stratosphere with source
  - an IDE of their choice

The workshop will take place on the 6th and 7th of June and is limited
to 15 attendees. Please register by sending an email to sebastian [DOT]
schelter [AT] tu-berlin [DOT] de

http://berlinbuzzwords.de/content/workshops-berlin-buzzwords


/s




Re: Exceptions when establishing RPC

2012-04-03 Thread Avery Ching
If you're using one master and one slave, you need to do -w 1.  Did you 
see any error about the RPC server starting up?


Avery

On 4/3/12 1:37 PM, Robert Davis wrote:

Hello,

I was trying to run Giraph on two machines (one master and one slave) 
but kept getting exceptions when establishing RPC to the slave 
machine. Does anybody have any ideas what's going wrong here? I am 
running the test with the following parameters.


hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 10 -s 2 -v -V 2000 -w 2


Thanks,
Robert

12/04/03 01:35:01 DEBUG comm.BasicRPCCommunications: 
startPeerConnectionThread: hostname 
ec2-107-20-19-131.compute-1.amazonaws.com 
http://ec2-107-20-19-131.compute-1.amazonaws.com, port 30001
12/04/03 01:35:01 DEBUG comm.BasicRPCCommunications: 
startPeerConnectionThread: Connecting to 
Worker(hostname=ec2-107-20-19-131.compute-1.amazonaws.com 
http://ec2-107-20-19-131.compute-1.amazonaws.com, MRpartition=1, 
port=30001), addr = ec2-107-20-19-131.compute-1.amazonaws.com:30001 
http://ec2-107-20-19-131.compute-1.amazonaws.com:30001 if outMsgMap 
(null) == null
12/04/03 01:35:11 WARN comm.BasicRPCCommunications: 
connectAllRPCProxys: Failed on attempt 1 of 5 to connect to 
(id=0,cur=Worker(hostname=ec2-107-20-19-131.compute-1.amazonaws.com 
http://ec2-107-20-19-131.compute-1.amazonaws.com, MRpartition=1, 
port=30001),prev=null,ckpt_file=null)
java.net.ConnectException: Call to 
ec2-107-20-19-131.compute-1.amazonaws.com:30001 
http://ec2-107-20-19-131.compute-1.amazonaws.com:30001 failed on 
connection exception: java.net.ConnectException: Connection refused

at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy3.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
at 
org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:194)
at 
org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:190)

at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at 
org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:188)
at 
org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:58)
at 
org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:678)
at 
org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:622)
at 
org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:583)
at 
org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:555)

at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:474)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:646)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)

at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)

at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)

at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
at org.apache.hadoop.ipc.Client.call(Client.java:1046)
... 25 more





Re: Incomplete output when running PageRank example

2012-03-31 Thread Avery Ching
As Benjamin mentioned, it depends on the number of map tasks your hadoop 
install is running with.  You could set it proportionally to the number 
of cores it has if you like, but try using Benjamin's suggestions to get 
it working with more map tasks.  I believe if you don't set it, the 
default is 2, which is not enough for 2 workers.


Avery

On 3/31/12 11:51 AM, Robert Davis wrote:

Thanks a lot, Benjamin.

I set the number of map tasks to 2 since I only have a dual-core 
processor (though with hyperthreading) on my laptop. I ran it again but 
it still appeared incorrect. The output is as follows.


Regards,
Robert

$ hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 
-w 2
12/03/31 11:40:08 INFO benchmark.PageRankBenchmark: Using class 
org.apache.giraph.benchmark.HashMapVertexPageRankBenchmark
12/03/31 11:40:10 WARN bsp.BspOutputFormat: checkOutputSpecs: 
ImmutableOutputCommiter will not check anything
12/03/31 11:40:11 INFO mapred.JobClient: Running job: 
job_201203301834_0004

12/03/31 11:40:12 INFO mapred.JobClient:  map 0% reduce 0%
12/03/31 11:40:38 INFO mapred.JobClient:  map 33% reduce 0%
12/03/31 11:45:44 INFO mapred.JobClient: Job complete: 
job_201203301834_0004

12/03/31 11:45:44 INFO mapred.JobClient: Counters: 5
12/03/31 11:45:44 INFO mapred.JobClient:   Job Counters
12/03/31 11:45:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=620769
12/03/31 11:45:44 INFO mapred.JobClient: Total time spent by all 
reduces waiting after reserving slots (ms)=0
12/03/31 11:45:44 INFO mapred.JobClient: Total time spent by all 
maps waiting after reserving slots (ms)=0

12/03/31 11:45:44 INFO mapred.JobClient: Launched map tasks=2
12/03/31 11:45:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=4377

On Sat, Mar 31, 2012 at 3:45 AM, Benjamin Heitmann 
benjamin.heitm...@deri.org mailto:benjamin.heitm...@deri.org wrote:



Hi Robert,

On 31 Mar 2012, at 09:42, Robert Davis wrote:

 Hello Giraphers,

 I am new to Giraph. I just checked out a version and ran it in
 single-machine mode. I got the following results, which have no Giraph
 counter information (as those in the example output). I am wondering
 what has gone wrong. The Hadoop version I am using is 1.0

it looks like your Giraph job did not actually finish the calculation.

As you say that you are new to Giraph, there might be a high
chance that you ran into the same issue which tripped me up a few
weeks ago ;)

(I am not sure where the following information should be documented,
maybe this issue should be documented on the same page which
describes how to run the pagerank benchmark)

You provide the parameter -w 30 to your job, which means that it
will use 30 workers. Maybe that's from the example on the Giraph
web page; however, there is one very important caveat for the number
of workers: the number of workers needs to be no larger than
mapred.tasktracker.map.tasks.maximum minus one.

Giraph will use one mapper task to start some sort of coordinating
worker (probably something zookeeper specific),
and then it will start the number of workers which you specified
using -w. If the total number of workers is bigger than the
maximum number of tasks,
then your Giraph job will not finish actually calculating stuff.
(There might be a config option for specifying how many workers
need to be finished in order to start the next superstep,
but I did not try that personally.)
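As a minimal illustration of the slot arithmetic described above (the class and method names here are mine, not part of Giraph): the job needs one map slot per worker, plus one slot for the coordinating master task.

```java
// Hypothetical helper, not Giraph API: checks whether a Giraph job with the
// given number of workers fits into the tasktracker's map-slot limit.
// One slot is needed per worker, plus one for the coordinating master task.
public final class SlotCheck {
    public static boolean enoughSlots(int workers, int maxMapTasks) {
        return workers + 1 <= maxMapTasks;
    }

    public static void main(String[] args) {
        // -w 30 on a single machine with 4 map slots cannot finish:
        System.out.println(enoughSlots(30, 4)); // false
        // 3 workers with mapred.tasktracker.map.tasks.maximum = 4 is fine:
        System.out.println(enoughSlots(3, 4));  // true
    }
}
```

This matches the recommendation below of 3 workers with a maximum of 4 map tasks.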

If you are running Hadoop/Giraph on your personal machine, then I
would recommend, using 3 workers, and you should edit your
conf/mapred-site.xml
to include some values for the following configuration parameters
(and restart hadoop...)

<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>







Re: Problem deploying Giraph job to hadoop cluster: onlineZooKeeperServers connection failure

2012-03-21 Thread Avery Ching
Benjamin, my guess is that your jar might not have all the ZooKeeper 
dependencies.  Can you look at the log for the process that was supposed 
to start ZooKeeper?  I'm thinking it didn't start...


Avery

On 3/20/12 1:14 PM, Benjamin Heitmann wrote:

Hello,

after getting my feet wet with the InternalVertexRunner, I tried packaging a 
Giraph job as a jar for the first time.

I am getting the following error:

==
12/03/20 17:21:04 INFO mapred.JobClient: Task Id : 
attempt_201203201422_0009_m_00_2, Status : FAILED
java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 
10 tries!
at 
org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:687)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:425)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:646)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)

attempt_201203201422_0009_m_00_2: log4j:WARN No appenders could be found 
for logger (org.apache.giraph.zk.ZooKeeperManager).
attempt_201203201422_0009_m_00_2: log4j:WARN Please initialize the log4j 
system properly.
=

Here is some more information, which hopefully might give this mailing list 
some insight into what is happening,
because I can't figure it out...

* I am using Hadoop 1.0.1 and giraph svn revision 1293545 (the last one from 
February)
* If I run the same Vertex class and Input/OutputFormat using 
InternalVertexRunner, then everything works fine. (using again Hadoop 1.0.1 and 
giraph rev 1293545)
* I package the Giraph job as a self-contained jar, and it contains the giraph 
jar, as well as the zookeeper jar in its lib dir
(I mostly used the recipe from here 
https://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/ )
* there was an error in which hadoop could not find a class. And I had to fix 
that error with:
giraphJob.setJarByClass(SimpleRDFVertex.class);
* My Vertex class extends HashMapVertex<Text, Text, Text, NullWritable>
* I followed the code example from SimpleShortestPathVertex regarding the run() 
method and using the main method to call ToolRunner.run()

Here is the code for my run() method:

==
@Override
public int run(String[] args) throws Exception {
// takes 3 args: inputDir outputDir numberOfWorkers

GiraphJob job = new GiraphJob(getConf(), getClass().getName());

job.setJarByClass(SimpleRDFVertex.class);

job.setVertexClass(SimpleRDFVertex.class);
job.setVertexInputFormatClass(SimpleRDFVertexInputFormat.class);
job.setVertexOutputFormatClass(SimpleRDFVertexOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setWorkerConfiguration(Integer.parseInt(args[2]), 
Integer.parseInt(args[2]), 100.0f);

return job.run(true) ? 0 : -1;
}
==

Am I constructing the GiraphJob in the wrong way ?

I saw the GiraphRunner class, but the giraph source tree currently does not 
seem to contain an example of how to use that class.
Is it safer to use that class for starting a GiraphJob ?
If yes, how should the job jar be assembled in order to use GiraphRunner ?


sincerely, Benjamin Heitmann.





Re: Pseudo-random number Vertex Reader

2012-03-18 Thread Avery Ching
You can use it for performance testing, although it is not a great 
simulation of real graphs.  Real graphs tend to be more power law 
distributed (see https://issues.apache.org/jira/browse/GIRAPH-26).


Hope that helps,

Avery

On 3/17/12 8:13 PM, Fleischman, Stephen (ISS SCI - Plano TX) wrote:


Avery,

I am using Giraph solely for performance characterization -- primarily 
comparing hardware platforms but also for Hadoop configuration 
tuning.  Am I correct that we could use the 
PseudoRandomVertexInputFormat, as used in the PageRank example,  to 
generate any size graphs that can then be used in the simple shortest 
path example program and thus avoiding the need to obtain actual datasets?


Best regards,

Steve Fleischman





Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?

2012-03-16 Thread Avery Ching
If you found it useful, others might find it useful as well.  Please 
feel free to add to a JIRA.


Avery

On 3/15/12 4:44 AM, Dionysis Logothetis wrote:
Ok, I've created an issue: 
https://issues.apache.org/jira/browse/GIRAPH-155

Feel free to edit if you think the description is not clear.


By the way, I have also created a vertex reader that reads adjacency 
lists but with no values for vertices and edges. That's also a format 
that I've seen in several graph data sets. The vertex reader is 
essentially a copy of the AdjacencyListVertexReader modified to handle 
this format. It's basically an abstract class and subclasses can 
override methods to provide default values for vertices and edges 
(otherwise values are initialized to null), just like Avery described 
below. If you think it's useful I can contribute this.



On Wed, Mar 14, 2012 at 7:39 AM, Avery Ching ach...@apache.org 
mailto:ach...@apache.org wrote:


Thanks for your input.  Response inline.

Avery


On 3/13/12 7:14 AM, Dionysios Logothetis wrote:

Hi all,
I'm a new Giraph user, and I'm facing a similar situation. My
input graph is basically in the form of edges defined simply as a
source and destination pair (optionally there could be an edge
value). And these edges might be distributed across multiple
files (this is actually a format I've seen in several graph data
sets).

Without having looked at the internals of Giraph, I originally
imagined that creating a MutableVertex and calling
addVertexRequest for both vertices in an edge and addEdgeRequest
from within the VertexReader would do the trick.


I agree that this idea can work, we also have to have a default
vertex value in case folks add edges to a vertex index only.



Now, this doesn't really work since there needs to be a graph
state created in advance. The graph state is not created until
all vertices have been loaded.

I wouldn't worry about graph state here since it's the input
superstep.  We can set it for all vertices after creation if need be.



There's also another implication with
potentially multiple workers trying to create the same vertex,
but I think a vertex resolver can handle this, assuming the
resolver is instantiated before the vertices are loaded.


Yup.



Is there a workaround to do this currently apart from
pre-processing the graph?


Not currently.  Can you please open a JIRA on
https://issues.apache.org/jira/browse/GIRAPH to track this
issue?  I think we should do it.



Do you think it would be useful to have such functionality?


Yes!



I think it makes sense to handle graph mutations either at the
very beginning or during a execution in a uniform way. By the
way, I'd be interested in contributing to the project.


We'd love to have your contributions, it's a great fit. =)



Looking forward to your response!

Thanks!


On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching ach...@apache.org
mailto:ach...@apache.org wrote:

Benjamin,

By the way, you're not the first to ask for a feature of this
kind.  Perhaps we should consider an alternative format for
loading input vertex data that is based on the edges or data
of the vertices rather than totally vertex-centric.  We could
load an edge, or a vertex value and join then all based on
the vertex id.  Handling conflicts could be a little
difficult, but perhaps the vertex resolver could handle this
as well.

Avery


On 3/12/12 12:41 PM, Benjamin Heitmann wrote:

On 12 Mar 2012, at 18:15, David Garcia wrote:

Not sure what you're asking about.  getCurrentVertex() should only ever
create one vertex.  Presumably it returns this vertex to the calling
function. . . which is called in loadVertices() I think.

Thanks David.

I am asking this question because I have a text input
format which is very different from a node adjacency list.
The most important difference is that each line of the
input file describes two nodes.
The other important difference is that a node might be
described on more than one line of the input.

I have multiple gigabytes of input, so it would be very
beneficial to directly load the input into Giraph.
Otherwise the overhead of converting the input to some
sort of node adjacency list is so big,
that it might be a show-stopper regarding the suitability
of Giraph.







For more details, here is the text from my previous
email:   =[snip]===

I am wondering if it would be possible to parse RDF input
files from a TextInputFormat

Please vote for our Giraph proposal for the upcoming Hadoop Summit

2012-03-16 Thread Avery Ching

Hi Giraphers,

We have a submission for the 2012 Hadoop summit and part of deciding 
whether it gets accepted is based on community voting.  It would be 
great to get more folks interested and involved in what is going on with 
Giraph so please vote!  Here's the link:


https://hadoopsummit2012.uservoice.com/forums/151413-track-1-future-of-apache-hadoop/suggestions/2663542-processing-over-a-billion-edges-on-apache-giraph

We had some great exposure at last year's Hadoop Summit and hope to be a 
part of this year's program as well.


Thanks!

Avery


Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?

2012-03-14 Thread Avery Ching

Thanks for your input.  Response inline.

Avery

On 3/13/12 7:14 AM, Dionysios Logothetis wrote:

Hi all,
I'm a new Giraph user, and I'm facing a similar situation. My input 
graph is basically in the form of edges defined simply as a source and 
destination pair (optionally there could be an edge value). And these 
edges might be distributed across multiple files (this is actually a 
format I've seen in several graph data sets).


Without having looked at the internals of Giraph, I originally 
imagined that creating a MutableVertex and calling addVertexRequest 
for both vertices in an edge and addEdgeRequest from within the 
VertexReader would do the trick.


I agree that this idea can work, we also have to have a default vertex 
value in case folks add edges to a vertex index only.


Now, this doesn't really work since there needs to be a graph state 
created in advance. The graph state is not created until all vertices 
have been loaded.
I wouldn't worry about graph state here since it's the input superstep.  
We can set it for all vertices after creation if need be.


There's also another implication with potentially multiple workers 
trying to create the same vertex, but I think a vertex resolver can 
handle this, assuming the resolver is instantiated before the vertices 
are loaded.



Yup.

Is there a workaround to do this currently apart from pre-processing 
the graph?


Not currently.  Can you please open a JIRA on 
https://issues.apache.org/jira/browse/GIRAPH to track this issue?  I 
think we should do it.



Do you think it would be useful to have such functionality?


Yes!

I think it makes sense to handle graph mutations either at the very 
beginning or during a execution in a uniform way. By the way, I'd be 
interested in contributing to the project.


We'd love to have your contributions, it's a great fit. =)


Looking forward to your response!

Thanks!


On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching ach...@apache.org 
mailto:ach...@apache.org wrote:


Benjamin,

By the way, you're not the first to ask for a feature of this
kind.  Perhaps we should consider an alternative format for
loading input vertex data that is based on the edges or data of
the vertices rather than totally vertex-centric.  We could load an
edge, or a vertex value and join them all based on the vertex id.
 Handling conflicts could be a little difficult, but perhaps the
vertex resolver could handle this as well.

Avery


On 3/12/12 12:41 PM, Benjamin Heitmann wrote:

On 12 Mar 2012, at 18:15, David Garcia wrote:

Not sure what you're asking about.  getCurrentVertex() should only ever
create one vertex.  Presumably it returns this vertex to the calling
function. . . which is called in loadVertices() I think.

Thanks David.

I am asking this question because I have a text input format
which is very different from a node adjacency list.
The most important difference is that each line of the input
file describes two nodes.
The other important difference is that a node might be
described on more than one line of the input.

I have multiple gigabytes of input, so it would be very
beneficial to directly load the input into Giraph.
Otherwise the overhead of converting the input to some sort of
node adjacency list is so big,
that it might be a show-stopper regarding the suitability of
Giraph.







For more details, here is the text from my previous email:  
=[snip]===


I am wondering if it would be possible to parse RDF input
files from a TextInputFormat class.

The most suitable text format for RDF is called NTriples,
and it has this very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a
predicate, which is a typed edge, and the object, which is
another vertex.
Then the line is terminated by a dot and a new-line.

In Giraph terms, the result of parsing the first line would be
the creation of a vertex for subject1 with an edge of type
predicate1,
and then the creation of a second vertex for object1. So two
vertices need to be created for that one line.

Now the second line contains more information about the vertex
subject1.
So in Giraph terms, the vertex which was created for subject1
needs to be retrieved/revisited and an edge of type predicate2,
which points to the new vertex object2 needs to be created.
And vertex object2 needs to be created.

Just to point it out, such RDF NTriples files are unsorted, so
information about the same vertex might appear e.g. at the
first

Re: Question about TextInputFormat pattern for parsing e.g. RDF

2012-03-12 Thread Avery Ching

Sorry for the delayed response.  Responses inline.

Avery

On 3/8/12 7:14 AM, Benjamin Heitmann wrote:

Hello again,

I am wondering if it would be possible to parse RDF input files from a 
TextInputFormat class.

The most suitable text format for RDF is called NTriples, and it has this 
very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a 
typed edge, and the object, which is another vertex.
Then the line is terminated by a dot and a new-line.
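As an aside, splitting one such line can be sketched as follows (a toy class under names of my own choosing, not Giraph or NTriples library code; it assumes simple whitespace-separated terms, while a real NTriples parser must also handle quoted literals containing spaces):

```java
// Hypothetical sketch: split one NTriples line into subject, predicate
// and object terms. Assumes whitespace-separated terms only.
public final class NTriplesLine {
    public final String subject, predicate, object;

    private NTriplesLine(String s, String p, String o) {
        subject = s;
        predicate = p;
        object = o;
    }

    public static NTriplesLine parse(String line) {
        String t = line.trim();
        // Drop the trailing " ." terminator.
        if (t.endsWith(".")) {
            t = t.substring(0, t.length() - 1).trim();
        }
        // Subject and predicate contain no whitespace; the rest is the object.
        String[] parts = t.split("\\s+", 3);
        return new NTriplesLine(parts[0], parts[1], parts[2]);
    }
}
```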

In Giraph terms, the result of parsing the first line would be the creation of 
a vertex for subject1 with an edge of type predicate1,
and then the creation of a second vertex for object1. So two vertices need to 
be created for that one line.

Now the second line contains more information about the vertex subject1.
So in Giraph terms, the vertex which was created for subject1 needs to be 
retrieved/revisited and an edge of type predicate2,
which points to the new vertex object2 needs to be created. And vertex object2 
needs to be created.

Just to point it out, such RDF NTriples files are unsorted, so information 
about the same vertex might appear e.g. at the first and at the last line
of a multiple GB big file.

Which interface can be used in a TextInputFormat/VertexReader in order to find 
an already created vertex ?


This is not possible unfortunately.  It's similar to the Hadoop 
InputFormat.  Vertices (analogous to key-value pairs) are read one at a 
time.  They are not saved for later access (just like Hadoop).



Are there any other issues when VertexReader.getCurrentVertex() creates two 
vertices at the same time ?


A second related question:
If I have multiple formats for my input files, how would I implement that ?
Just by adding a switch to the logic in getCurrentVertex() ? Or is there a 
better way to switch the input logic based on the file type ?
All my input files would result in the same kind of Vertex being created.


My motivation for doing this, in short:
I have a large amount of RDF NTriples data which is provided by DBPedia. It 
amounts to somewhere between 5 GB and 20 GB,
depending on which subset is used. Expressing this RDF data, so that each 
vertex is completely described in one text line,
would require me to load it into an RDF store first, and then reprocess the 
data. In terms of RDF stores, that is already a non-trivial amount of data
requiring quite a bit of hardware and tweaking. That is the reason why it would 
be valuable to directly load the RDF data into Giraph.



My suggestion would be the following:

Run an MR job to join all your RDF triples on the vertex key; then you can 
convert them to an easy-to-parse format with a custom VertexInputFormat 
of your choice.  If these are one-way relationships, you need not create 
the target vertex.  If they are undirected relationships, when you are 
processing your RDF triples in the MR job, add a directed relationship to both 
vertices.
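The grouping step of that MR job can be sketched in plain Java (a toy in-memory version under names of my own choosing, not the actual job): collect all (predicate, object) pairs under their subject so each vertex ends up fully described by one adjacency record. A real job would shuffle on the subject key in MapReduce, but the grouping logic is the same.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for the MR join: group triples by subject so that
// each vertex is described by a single adjacency record.
public final class TripleJoinSketch {
    /** Each triple is {subject, predicate, object}. */
    public static Map<String, List<String>> join(List<String[]> triples) {
        Map<String, List<String>> adjacency = new LinkedHashMap<>();
        for (String[] t : triples) {
            adjacency.computeIfAbsent(t[0], k -> new ArrayList<>())
                     .add(t[1] + " " + t[2]);
        }
        return adjacency;
    }
}
```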



cheers, Benjamin.





Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?

2012-03-12 Thread Avery Ching

Benjamin,

By the way, you're not the first to ask for a feature of this kind.  
Perhaps we should consider an alternative format for loading input 
vertex data that is based on the edges or data of the vertices rather 
than totally vertex-centric.  We could load an edge, or a vertex value 
and join them all based on the vertex id.  Handling conflicts could be a 
little difficult, but perhaps the vertex resolver could handle this as well.


Avery

On 3/12/12 12:41 PM, Benjamin Heitmann wrote:

On 12 Mar 2012, at 18:15, David Garcia wrote:


Not sure what you're asking about.  getCurrentVertex() should only ever
create one vertex.  Presumably it returns this vertex to the calling
function. . .which is called in loadVertices() I think.

Thanks David.

I am asking this question because I have a text input format which is very 
different from a node adjacency list.
The most important difference is that each line of the input file describes 
two nodes.
The other important difference is that a node might be described on more than 
one line of the input.

I have multiple gigabytes of input, so it would be very beneficial to directly 
load the input into Giraph.
Otherwise the overhead of converting the input to some sort of node adjacency 
list is so big,
that it might be a show-stopper regarding the suitability of Giraph.







For more details, here is the text from my previous email:   
=[snip]===

I am wondering if it would be possible to parse RDF input files from a 
TextInputFormat class.

The most suitable text format for RDF is called NTriples, and it has this 
very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a 
typed edge, and the object, which is another vertex.
Then the line is terminated by a dot and a new-line.

In Giraph terms, the result of parsing the first line would be the creation of 
a vertex for subject1 with an edge of type predicate1,
and then the creation of a second vertex for object1. So two vertices need to 
be created for that one line.

Now the second line contains more information about the vertex subject1.
So in Giraph terms, the vertex which was created for subject1 needs to be 
retrieved/revisited and an edge of type predicate2,
which points to the new vertex object2 needs to be created. And vertex object2 
needs to be created.

Just to point it out, such RDF NTriples files are unsorted, so information 
about the same vertex might appear e.g. at the first and at the last line
of a multiple GB big file.

Which interface can be used in a TextInputFormat/VertexReader in order to find 
an already created vertex ?

Are there any other issues when VertexReader.getCurrentVertex() creates two 
vertices at the same time ?


A second related question:
If I have multiple formats for my input files, how would I implement that ?
Just by adding a switch to the logic in getCurrentVertex() ? Or is there a 
better way to switch the input logic based on the file type ?
All my input files would result in the same kind of Vertex being created.


My motivation for doing this, in short:
I have a large amount of RDF NTriples data which is provided by DBPedia. It 
amounts to somewhere between 5 GB and 20 GB,
depending on which subset is used. Expressing this RDF data, so that each 
vertex is completely described in one text line,
would require me to load it into an RDF store first, and then reprocess the 
data. In terms of RDF stores, that is already a non-trivial amount of data
requiring quite a bit of hardware and tweaking. That is the reason why it would 
be valuable to directly load the RDF data into Giraph.








Re: Error in instantiating custom Vertex class via InternalVertexRunner.run

2012-03-05 Thread Avery Ching

Inline responses.  We look forward to hearing about your work Benjamin!

On 3/5/12 9:12 AM, Benjamin Heitmann wrote:

On 2 Mar 2012, at 23:15, Avery Ching wrote:


If I'm reading this right, you're using a public abstract class for the vertex. 
 The vertex class must be instantiable and cannot be abstract.

Hope that helps,


Thanks, that was the right issue to point out. I removed the abstract 
keyword, which solved the issue.
(Of course, then I found lots of other bugs in my code... ;)


Glad to hear it.


After removing the abstract keyword, I ran into some problems in overriding 
package-private methods of BasicVertex.
Almost all of the abstract methods in BasicVertex are declared as public, e.g.   
public abstract Iterable<M> getMessages();

However, there are two methods which do not have the public keyword:
abstract void putMessages(Iterable<M> messages);
abstract void releaseResources();

I am guessing that this inconsistency is just an oversight.


Actually, it is not.  =)  So the issue is that if we do make these 
methods not package-private (i.e. protected/public), then when a user 
subclasses a vertex, they will be able to shoot themselves in the foot 
by calling these methods which are only meant for internal use.  Any 
other suggestions are welcome.



However, if I understood everything correctly, then this poses problems for 
developers who want to implement BasicVertex
*outside* of the Giraph source tree. As the public keyword is missing, it is 
not possible to override these two method signatures
from another package. The result is that if I do not need IntIntNullIntVertex, 
but instead IntMyStateNullIntVertex which implements BasicVertex,
then I will need to either copy BasicVe

Is that the right reasoning, or is there some other pattern for using 
BasicVertex which I missed ?

Should I file a bug report somewhere ?


cheers, Benjamin.






Re: PageRankBenchmark failing with zooKeeper.KeeperException

2012-03-05 Thread Avery Ching

Hi Abhishek,

Nice to meet you.  Can you try it with fewer workers?  For instance -w 1 
or -w 2?  I think the likely issue is that you need to have as many map 
slots as the number of workers + at least one master.  If you don't have 
enough slots, the job will fail.  Also, you might want to dial down the 
number of vertices a bit, unless you have oodles of memory.  Please let 
us know if that helps.


Avery

On 3/5/12 9:03 PM, Abhishek Srivastava wrote:

Hi All,

I have been trying (quite unsuccessfully for a while now) to run the 
PageRankBenchmark
to play around with Giraph. I got hadoop running in a single node 
setup and hadoop
jobs and jars run just fine. When I try to run the PageRankBenchmark, 
I get this

incomprehensible error which I'm not able to diagnose.



---CUT 
HERE-
abhi@darkstar:trunk $ hadoop jar 
target/giraph-0.70-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 
-w 30

Warning: $HADOOP_HOME is deprecated.

Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankVertex
12/03/04 03:44:08 WARN bsp.BspOutputFormat: checkOutputSpecs: 
ImmutableOutputCommiter will not check anything
12/03/04 03:44:09 INFO mapred.JobClient: Running job: 
job_201203031851_0004

12/03/04 03:44:10 INFO mapred.JobClient:  map 0% reduce 0%
12/03/04 03:44:26 INFO mapred.JobClient:  map 3% reduce 0%
12/03/04 10:43:52 INFO mapred.JobClient:  map 0% reduce 0%
12/03/04 10:43:57 INFO mapred.JobClient: Task Id : 
attempt_201203031851_0004_m_00_0, Status : FAILED
Task attempt_201203031851_0004_m_00_0 failed to report status for 
24979 seconds. Killing!
12/03/04 10:44:00 INFO mapred.JobClient: Task Id : 
attempt_201203031851_0004_m_01_0, Status : FAILED
Task attempt_201203031851_0004_m_01_0 failed to report status for 
25159 seconds. Killing!

12/03/04 10:44:07 INFO mapred.JobClient:  map 3% reduce 0%
12/03/04 10:49:07 INFO mapred.JobClient:  map 0% reduce 0%
12/03/04 10:49:12 INFO mapred.JobClient: Task Id : 
attempt_201203031851_0004_m_00_1, Status : FAILED

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status 
of 1.

at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/03/04 10:49:22 INFO mapred.JobClient:  map 3% reduce 0%
12/03/04 10:54:23 INFO mapred.JobClient:  map 0% reduce 0%
12/03/04 10:54:28 INFO mapred.JobClient: Task Id : 
attempt_201203031851_0004_m_00_2, Status : FAILED

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status 
of 1.

at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

12/03/04 10:54:38 INFO mapred.JobClient:  map 3% reduce 0%
12/03/04 10:59:10 INFO mapred.JobClient: Task Id : 
attempt_201203031851_0004_m_01_1, Status : FAILED
java.lang.IllegalStateException: unregisterHealth: KeeperException - 
Couldn't delete 
/_hadoopBsp/job_201203031851_0004/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/darkstar_1
at 
org.apache.giraph.graph.BspServiceWorker.unregisterHealth(BspServiceWorker.java:727)
at 
org.apache.giraph.graph.BspServiceWorker.failureCleanup(BspServiceWorker.java:735)

at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:648)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)

at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: 
org.apache.zookeeper.KeeperException$ConnectionLossException: 
KeeperErrorCode = ConnectionLoss for 
/_hadoopBsp/job_201203031851_0004/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/darkstar_1
at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at 
org.apache.zookeeper.KeeperException.create(KeeperException.java:42)

at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
at 
org.apache.giraph.graph.BspServiceWorker.unregisterHealth(BspServiceWorker.java:721)

... 9 more

Task attempt_201203031851_0004_m_01_1 failed to report status for 
601 seconds. Killing!
attempt_201203031851_0004_m_01_1: log4j:WARN No appenders could be 
found for logger (org.apache.zookeeper.ClientCnxn).
attempt_201203031851_0004_m_01_1: log4j:WARN Please initialize the 
log4j system properly.

12/03/04 10:59:47 INFO mapred.JobClient:  map 0% reduce 0%
12/03/04 10:59:58 INFO mapred.JobClient: Job complete: 
job_201203031851_0004

12/03/04 10:59:58 INFO mapred.JobClient: Counters: 6
12/03/04 

Re: Giraph input format restrictions

2012-02-19 Thread Avery Ching
Sorry about the old documentation.  I just updated the shortest paths 
example.  Before major changes to the graph distribution, the vertex ids 
were required to be sorted.  That is no longer the case.  You can input 
vertices in any order.  The only restriction is that the vertex ids must 
be unique (no duplicate vertices).  If there are duplicates an exception 
will be thrown since duplicates are probably not expected and this is 
probably an error.  This could be relaxed in the future as well if need 
be, but we would need to decide on how to handle duplicates.
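The uniqueness rule described here can be sketched in plain Java (illustrative only, not Giraph's actual loading code): ids may arrive in any order, but a repeated id raises an exception.

```java
import java.util.*;

// Sketch of the duplicate-id check described above: vertex ids may arrive
// unsorted, but each must be unique.
public class DuplicateCheck {
    static Set<Long> loadVertexIds(List<Long> ids) {
        Set<Long> seen = new HashSet<>();
        for (long id : ids) {
            if (!seen.add(id)) { // add() returns false when id was already present
                throw new IllegalStateException("Duplicate vertex id: " + id);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Unsorted input is fine; prints 3.
        System.out.println(loadVertexIds(Arrays.asList(3L, 1L, 2L)).size());
        try {
            loadVertexIds(Arrays.asList(1L, 2L, 1L));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // duplicate rejected
        }
    }
}
```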


Thanks for all the great questions!

Avery

On 2/19/12 11:25 AM, yavuz gokirmak wrote:

Hi,

In the Shortest Paths Example it is written that "Currently there is one 
restriction on the VertexInputFormat that is not obvious. The vertices 
must be sorted." I didn't understand the reason for this restriction: 
why should the vertices be ordered?


Secondly, as I understood, we have to transform our initial data into 
a form where each line corresponds to a vertex (with edges and values if 
they exist) in the graph.


For example, I have data in which each row corresponds to an edge 
between two vertices.

format1:
a b
a c
a d
b c
b a
c d

Do I have to convert this file into a format similar to below in order 
to use with giraph algorithms?

format2:
a b c d
b c a
c d

thanks..
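For what it's worth, the conversion from format1 (one edge per line) to format2 (one adjacency list per source vertex) is mechanical; here is a minimal plain-Java sketch (names are illustrative, not a Giraph API):

```java
import java.util.*;

// Converts format1 (one edge per row) into format2 (adjacency lists),
// mirroring the transformation asked about above.
public class EdgeListToAdjacency {
    static Map<String, List<String>> toAdjacency(List<String[]> edges) {
        // LinkedHashMap keeps first-seen order of source vertices.
        Map<String, List<String>> adj = new LinkedHashMap<>();
        for (String[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
        }
        return adj;
    }

    public static void main(String[] args) {
        List<String[]> format1 = Arrays.asList(
            new String[]{"a", "b"}, new String[]{"a", "c"}, new String[]{"a", "d"},
            new String[]{"b", "c"}, new String[]{"b", "a"}, new String[]{"c", "d"});
        for (Map.Entry<String, List<String>> e : toAdjacency(format1).entrySet()) {
            System.out.println(e.getKey() + " " + String.join(" ", e.getValue()));
        }
        // prints:
        // a b c d
        // b c a
        // c d
    }
}
```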




Re: how to use SimplePageRankVertex

2012-02-18 Thread Avery Ching
IntIntNullIntTextInputFormat in the examples package (extending 
TextVertexInputFormat as David suggests) is very similar to what you 
need I think, although the types might be different for your 
application.  You can start with that perhaps.


Avery

On 2/18/12 7:48 AM, David Garcia wrote:
The easiest thing to do is to extend TextVertex and/or TextVertexInputFormat 
and/or the record reader.  The record reader will give 
you the vertices you want.  Look at the record reader for 
TextVertexInputFormat.  It's an inner class of that format class.


Sent from my HTC Inspire™ 4G on AT&T

- Reply message -
From: yavuz gokirmak <ygokir...@gmail.com>
To: giraph-user@incubator.apache.org
Subject: how to use SimplePageRankVertex
Date: Sat, Feb 18, 2012 9:08 am



Hi,

I am planning to use giraph for network analysis. First I am trying to 
fully understand the SimplePageRankVertex implementation and modify it in 
order to serve my needs.


I have a question about example,
What is the expected input format for SimplePageRankVertex? I couldn't 
understand the input format, although the SimplePageRankVertexReader class 
has only a few lines.



My input file consists of rows such as:
usera, userb
usera, userc
userc, usera
userb, userc
userc, userb
.
.
.
Each row represents a relation between two users,
*usera,userb* means that *usera clicked userb's profile*

Is it possible to do social network analysis over this kind of data 
using Giraph?

I will be glad if you can give some advice..

thanks in advance
best regards
ygokirmak




Re: counter limit question

2012-02-16 Thread Avery Ching

Yes, there is a way to disable the counters at runtime.

See GiraphJob:

  /** Use superstep counters? (boolean) */
  public static final String USE_SUPERSTEP_COUNTERS =
      "giraph.useSuperstepCounters";

and set to false.
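Putting the two pieces together, a hedged sketch of the job setup (only the property name comes from the reply above; the GiraphJob constructor arguments and getConf() call are illustrative assumptions):

```java
// Sketch only: disable per-superstep Hadoop counters for a long-running job.
// "giraph.useSuperstepCounters" is the property quoted above; the job name
// and getConf() are placeholders for your own setup.
GiraphJob job = new GiraphJob(getConf(), "thousands-of-supersteps");
job.getConfiguration().setBoolean("giraph.useSuperstepCounters", false);
```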

Avery

On 2/16/12 1:41 PM, David Garcia wrote:
I have a job that could conceivably involve thousands of supersteps. 
 I know that I can adjust this in mapred-site.xml, but what are the 
framework's limitations for the number of counters possible?  Is there 
a better way to address this (i.e. prevent Giraph from using Hadoop 
counters for every superstep)?


-David




Re: maven, hadoop, zookeeper, and giraph!

2012-02-16 Thread Avery Ching

Hi Jeffrey,

Best attempt as answers inline.

On 2/16/12 6:12 PM, Jeffrey Yunes wrote:

Hi Giraph community,
I think I followed all of the directions (for a Giraph on a psuedo-cluster), 
and it looks like


mvn clean test -Dprop.mapred.job.tracker=localhost:9001

runs fine. However, I'm new to the Hadoop infrastructure, and have a couple of 
questions about getting started with Giraph.

1)

hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar 
org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50 -w 3

gives me the error java.lang.NullPointerException at 
org.apache.giraph.benchmark.PageRankBenchmark.run(PageRankBenchmark.java:127). It 
looks like some error with configuration?


This is a bug.  I have a quick fix for it.  Sorry about that.  I opened 
an issue for it.  https://issues.apache.org/jira/browse/GIRAPH-150


diff --git a/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java b/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
index 0e76122..4d08929 100644
--- a/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
+++ b/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
@@ -124,7 +124,8 @@ public class PageRankBenchmark extends EdgeListVertex
     } else {
       job.setVertexClass(PageRankBenchmark.class);
     }
-    LOG.info("Using class " + BspUtils.getVertexClass(getConf()).getName());
+    LOG.info("Using class " +
+        BspUtils.getVertexClass(job.getConfiguration()).getName());
     job.setVertexInputFormatClass(PseudoRandomVertexInputFormat.class);
     job.setWorkerConfiguration(workers, workers, 100.0f);


2) How should I / do I enable the log4j? An appender that writes to the HDFS? 
How else could I grep all my logs for errors and things?
log4j is used by the task trackers to dump to the job logs.  If you 
click on your running job in the web page, you can then click into each 
task and look at the logs under 'Task Logs'.  You can configure the task 
tracker's log4j.properties to set the log level, but the default is info I 
believe.

3) With regard to Giraph and maven, none of the directions suggested doing local overrides. 
Therefore, why should I expect my Giraph installation to refer to libraries and configuration in 
~/Applications/hadoop or zookeeper rather than those in ~/.m2/repo?
Giraph builds a massive jar that has all the required classes and jars 
to launch ZooKeeper and interact with Hadoop.  This makes for easy 
deployment to a running cluster.



4) Why doesn't running maven for Giraph install hadoop along the way (or does 
it)?
Because there are so many versions of Hadoop, and if you are launching 
Hadoop, then the hadoop jar should be in your classpath automatically.



I'd appreciate if you'd help improve my understanding!

No problem.  Welcome to Giraph!


Thanks!
-Jeff







Re: Giraph Architecture bug in

2012-02-08 Thread Avery Ching
AFAIK we don't have any SOP for opening issues.  Maybe I'll take a crack 
at this one tonight if I find some time, unless you were planning to 
work on it David.


Avery

On 2/8/12 5:46 PM, David Garcia wrote:

I opened up

* GIRAPH-144 (https://issues.apache.org/jira/browse/GIRAPH-144)


I apologize if I didn't do it up according to project SOP's.  I haven't
had time to read it thoroughly.

-David


On 2/8/12 7:29 PM, David Garcia <dgar...@potomacfusion.com> wrote:


Yeah, I'll write something up.


On 2/8/12 7:26 PM, Avery Ching <ach...@apache.org> wrote:


Since we call waitForCompletion() (which calls submit() internally) in
GiraphJob#run(), we cannot override those methods.  A better fix would
probably be to use composition rather than inheritance (i.e.

public class GiraphJob {
 Job internalJob;
}

and expose the methods we would like as necessary.  There are other
methods we don't want the user to call, (i.e. setMapperClass(), etc.).
David, can you please open an issue for this?
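As a toy illustration of the composition idea (all class names here are hypothetical stand-ins, not the real Hadoop/Giraph classes), the wrapper exposes a single run() method while keeping the wrapped job's internals unreachable from user code:

```java
// Stand-in for org.apache.hadoop.mapreduce.Job; methods users should not
// call stay hidden behind the wrapper below.
class JobStandIn {
    boolean completed = false;
    boolean waitForCompletion() { completed = true; return true; }
    void setMapperClass(String cls) { /* internal use only */ }
}

// Composition instead of inheritance: only the chosen methods are exposed.
public class GiraphJobSketch {
    private final JobStandIn internalJob = new JobStandIn();

    // The one entry point users get; internally it drives the wrapped job.
    public boolean run() {
        internalJob.setMapperClass("org.apache.giraph.graph.GraphMapper");
        return internalJob.waitForCompletion();
    }

    public static void main(String[] args) {
        System.out.println(new GiraphJobSketch().run()); // true
    }
}
```

The point of the design: since users never hold a Job reference, they cannot call submit() or setMapperClass() and accidentally launch identity mappers.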

Avery

On 2/8/12 5:17 PM, David Garcia wrote:

This is a very subtle bug.  GiraphJob inherits from
org.apache.hadoop.mapreduce.Job.  However, the methods submit() and
waitForCompletion() are not overridden.  I assumed that they were
implemented, so when I called either one of these methods, the
framework
started up identity mappers/reducers.  A simple fix is to throw
unsupported operation exceptions or to implement these methods.
Perhaps
this has been done already?

-David

On 2/7/12 7:46 PM, David Garcia <dgar...@potomacfusion.com> wrote:


I am running into a weird error that I haven't seen yet (I suppose
I've
been lucky).  I see the following in the logging:

org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable


In the job definition, the property mapreduce.map.class is not even
defined.  For Giraph, this is usually set to
mapreduce.map.class=org.apache.giraph.graph.GraphMapper

I'm building my project with hadoop 0.20.204.

When I build the GiraphProject myself (and run my own tests with the
projects dependencies), I have no problems.  The main difference is
that
I'm using a Giraph dependency in my work project.  All input is
welcome.
Thx!!

-David





Re: running job with giraph dependency anomaly

2012-02-07 Thread Avery Ching
If you're using GiraphJob, the mapper class should be set for you.  
That's weird.


Avery

On 2/7/12 5:58 PM, David Garcia wrote:

That's interesting.  Yes, I don't need native libraries.  The problem I'm
having is that after I run job.waitForCompletion(..),
The job runs a mapper that is something other than GraphMapper.  It
doesn't complain that a Mapper isn't defined or anything.  It runs
something else.  As I mentioned below, the map-class doesn't appear to be
defined.


On 2/7/12 7:50 PM, Jakob Homan <jgho...@gmail.com> wrote:


That's not necessarily a bad thing.  Hadoop (not Giraph) has a native
code library it can use for improved performance.  You'll see this
message when running on a cluster that's not been deployed to use the
native libraries.  If I follow what you wrote, most likely your work
project cluster is so configured.  Unless you actively expect to have
the native libraries loaded, I wouldn't be concerned.


On Tue, Feb 7, 2012 at 5:46 PM, David Garcia <dgar...@potomacfusion.com> wrote:

I am running into a weird error that I haven't seen yet (I suppose I've
been lucky).  I see the following in the logging:

org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable


In the job definition, the property mapreduce.map.class is not even
defined.  For Giraph, this is usually set to
mapreduce.map.class=org.apache.giraph.graph.GraphMapper

I'm building my project with hadoop 0.20.204.

When I build the GiraphProject myself (and run my own tests with the
projects dependencies), I have no problems.  The main difference is that
I'm using a Giraph dependency in my work project.  All input is welcome.
Thx!!

-David





Re: creating non existing vertices by sending messages

2012-02-03 Thread Avery Ching
Thanks for the comments David.  The behavior of what happens is 
completely defined by the chosen VertexResolver, see 
(GiraphJob#setWorkerContextClass).  Developers can implement any 
behavior they want.  I believe the only reason to bypass was as a 
performance optimization.
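A tiny standalone sketch of that resolver policy (illustrative names only, not Giraph's VertexResolver API): the "create on message" behavior is just one branch, which is exactly what could be made configurable as suggested above.

```java
import java.util.*;

// Toy stand-in for the resolver idea: whether a message sent to a missing
// vertex creates it is a policy decision, not a fixed rule.
public class ResolverSketch {
    /** Returns the vertex id to keep, or null to drop the messages. */
    static Long resolve(Long existing, List<String> messages, boolean createIfMissing) {
        if (existing != null) return existing;                  // vertex already present
        if (!messages.isEmpty() && createIfMissing) return -1L; // create a placeholder
        return null;                                            // drop silently
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, Arrays.asList("m1"), true));  // -1 (created)
        System.out.println(resolve(null, Arrays.asList("m1"), false)); // null (dropped)
    }
}
```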


Avery

On 2/3/12 8:34 AM, Claudio Martella wrote:

Agreed, probably making the path configurable is the way to go.

On Fri, Feb 3, 2012 at 5:30 PM, David Garcia <dgar...@potomacfusion.com> wrote:

I just wanted to send this out because I remember reading a discussion on
this topic.  Currently, Giraph will create a vertex in the graph if a
message is sent to a vertexID that doesn't exist.  Personally, I really
really like this behavior.  It enables me to forgo vertex creation if I
don't need it.  If I need the vertex, I can simply send a message to
create it, and process the message that was sent.  I understand that there
are some concerns with this. . .I would suggest making this behavior
configurable at job creation.  This would be an awesome compromise, and
would not preclude either type of behavior.

-David








Re: multi-graph support in giraph

2012-02-03 Thread Avery Ching
We can diverge from the Pregel API as long as we have a good reason for 
it.  I do agree that while we can support multi-graphs with a 
user-chosen edge type, some built-in support that makes programming 
easier sounds like a good goal.  Andre or Claudio, feel free to open a 
JIRA to discuss this.


We should also figure out the appropriate APIs as well that make it the 
most convenient to use.


Avery

On 2/3/12 9:14 AM, Claudio Martella wrote:

On Fri, Feb 3, 2012 at 6:07 PM, André Kelpe <efeshundert...@googlemail.com> wrote:

2012/2/3 Claudio Martella <claudio.marte...@gmail.com>:

Hi Andre,

Hi!


As I see it, we'd basically have to move all the API about edges from
single object to Iterable (i.e. returning multiple edges for a given
vertex endpoint as you suggested), and maybe also returning multiple
vertices for a given edge(label).

If the goal of giraph is to be close to the pregel paper, then that
kind of API makes
more sense.

 From how I see it, we've already taken a distance from Pregel in many
API decisions. Personally I believe we don't have to stick to Pregel,
we definitely have just to design Giraph for it's useful. For what we
know, the API of the paper could be just the smallest subset of the
real Pregel API that could fit clearly into the paper.



I am going to look into your code and see if I can
integrated it in the
copy of giraph I use internally here right now.

Be ware that the code is not meant for general purpose but for a
specific task. The extended api methods though should be quite
general.


The single-graph, as it's implemented now compared to multi-graph,
would be a subcase of this, which internally would return
getEdgeValue().iterator().next().

That would mean, you'd have two different kinds of vertex, one
compatible with single-graphs
and one with multi-graphs. Sounds tricky to maintain on the long run,
but could be an idea.

I see the single-graph vertex as a subclass of the multi-graph vertex,
something along with what's already going on with mutablevertex, so i
don't see a problem in maintaining it.


André (@fs111)







Re: [VOTE] Release Giraph 0.1-incubating (rc0)

2012-01-31 Thread Avery Ching
To address the issues of binaries, could we release multiple binaries of 
Giraph that coincide with the different versions of Hadoop?


On 1/31/12 7:44 PM, David Garcia wrote:

I think these concerns preclude the entire idea of a release.  A release
should be something that users can use as a dependency. . .like a maven
coordinate.  I think you guys should wait until you have made these
decisions. . .and then cut a binary.

On 1/31/12 5:36 PM, Jakob Homan <jgho...@gmail.com> wrote:


Giraphers-
I've created a candidate for our first release. It's a source release
without a binary for two reasons: first, there's still discussion
going on about what needs to be done for the NOTICE and LICENSE files
for projects that bring in transitive dependencies to the binary
release
(http://www.mail-archive.com/general@incubator.apache.org/msg32693.html)
and second because we're still munging our binary against three types
of Hadoop, which would mean we'd need to release three different
binary artifacts, which seems suboptimal.  Hopefully both of these
issues will be addressed by 0.2.

I've tested the release against an unsecure 20.2 cluster.  It'd be
great to test it against other configurations.  Note that we're voting
on the tag; the files are provided as a convenience.

Release notes:
http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/RELEASE_NOTES.html

Release artifacts:
http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/

Corresponding svn tag:
http://svn.apache.org/repos/asf/incubator/giraph/tags/release-0.1-rc0/

Our signing keys (my key doesn't seem to be being picked up by
http://people.apache.org/keys/group/giraph.asc):
http://svn.apache.org/repos/asf/incubator/giraph/KEYS

The vote runs for 72 hours, until Friday 4pm PST.  After a successful
vote here, Incubator will vote on the release as well.

Thanks,
Jakob




Re: giraph stability problem

2012-01-23 Thread Avery Ching
Glad to hear you fixed your problem.  It would be great if you could 
describe any improvements that would help you have found the issues 
earlier.  Maybe we (or you) could add them =).


Avery

On 1/23/12 8:31 AM, André Kelpe wrote:

Hi all,

thanks for all the answers so far, it turns out that it actually isn't
that much of a problem: I just had some inconsistencies in my input,
which made giraph explode. I did a rerun with correct input data and
now the whole thing finishes in a few seconds.

It would of course be nice to have the described out-of-process
messaging with spill over to disk for bigger problems, but that seems
to be not necessary for the problem space I am in right now :-).

--André




Re: Scalability results for GoldenOrb and comparison with Giraph

2011-12-14 Thread Avery Ching
 algorithms display similar properties
for configurations in the regime not dominated by a framework overhead
bottleneck. And second, the GoldenOrb SSSP results being compared are also
from configurations which have reached a steady power law slope over the
range of nodes considered, for runs using the same algorithm as the Pregel
results. These two points, I feel, justify the comparisons made (though,
again, it would be better to have a standardized set of configurations for
testing to facilitate comparing results, even between algorithms). Since all
three sets of scalability tests yield fairly linear complexity plots
(execution time vs. number of vertices in the graph, slide 29 of your talk),
it also makes sense to compare weak scaling results, a proposition supported
by the consistency of the observed GoldenOrb weak scaling results for SSSP
across multiple test configurations.


As for the results found in your October 2011 talk, they are impressive
and clearly demonstrate an ability to effectively scale to large graph
problems (shown by the weak scaling slope of ~ 0.01) and to maximize the
benefit of throwing additional computational resources at a known problem
(shown by the strong scaling slope of ~ -0.93), so I'm interested to see the
results of the improvements that have been made. I'm a big proponent of
routine scalability testing using a fixed set of configurations as part of
the software testing process, as the comparable results help to quantify
improvement as the software is developed further and can often help to
identify unintended side effects of changes / find optimal configurations
for various regimes of problems, and would like to see Giraph succeed, so
let me know if there's any open issues which I might be able to dig into
(I'm on the dev mailing list as well, though haven't posted there).

Thanks,
Jon


On Dec 11, 2011, at 1:02 PM, Avery Ching wrote:


Hi Jon,

-golden...@googlegroups.com (so as to not clog up their mailing list
uninvited)

First of all, thank you for sharing this comparison.  I would like to
note a few things.  The results I posted in October 2011 were actually a bit
old (done in June 2011) and do not have several improvements that reduce
memory usage significantly (i.e. GIRAPH-12 and GIRAPH-91).  The number of
vertices loadable per worker is highly dependent on the number of edges per
worker, the amount of available heap memory, number of messages, the
balancing of the graph across the workers, etc.  In recent tests at
Facebook, I have been able to load over 10 million vertices / worker easily
with 20 edges / vertex.  I know that you wrote that the maximum per worker
was at least 1.6 million vertices for Giraph, I just wanted to let folks
know that it's in fact much higher.  We'll work on continuing to improve
that in the future as today's graph problems are in the billions of vertices
or rather hundreds of billions =).

Also, with respect to scalability, if I'm interpreting these results
correctly, does it mean that GoldenOrb is currently unable to load more than
250k vertices / cluster as observed by former Ravel developers?  If so,
given the small tests and overhead per superstep, I wouldn't expect the
scalability to be much improved by more workers.  Also, the max value and
shortest paths algorithms are highly data dependent to how many messages are
passed around per superstep and perhaps not a fair scaling comparison with
Giraph's scalability designed page rank benchmark test (equal messages per
superstep distributed evenly across vertices).  Would be nice to see an
apples-to-apples comparison if someone has the time...=)

Thanks,

Avery

On 12/10/11 3:16 PM, Jon Allen wrote:

Since GoldenOrb was released this past summer, a number of people have
asked questions regarding scalability and performance testing, as well as a
comparison of these results with those of Giraph (
http://incubator.apache.org/giraph/ ), so I went forward with running tests
to help answer some of these questions.

A full report of the scalability testing results, along with methodology
details, relevant information regarding testing and analysis, links to data
points for Pregel and Giraph, scalability testing references, and background
mathematics, can be found here:

http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/

Since this data will also be of interest to the Giraph community (for
methodology, background references, and analysis reasons), I am cross
posting to the Giraph user mailing list.

A synopsis of the scalability results for GoldenOrb, and comparison data
points for Giraph and Google's Pregel framework are provided below.

The setup and execution of GoldenOrb scalability tests were conducted by
three former Ravel (http://www.raveldata.com ) developers, including myself,
with extensive knowledge of the GoldenOrb code base and optimal system
configurations, ensuring the most optimal settings were used for scalability
testing.


RESULTS SUMMARY:


MAX CAPACITY:

Pregel (at least

Re: Packaging a Giraph application in a jar

2011-11-09 Thread Avery Ching

Would be great if you can document what you did. =)

Thanks,

Avery

On 11/8/11 3:13 PM, Claudio Martella wrote:

Sorry guys, my bad.

Was calling job.waitForCompletion() directly. I've been coding
standard mapreduce all weekend...

Anyway I got a solution for clean packaging of your own application
over giraph, and that is exactly using the maven-shade-plugin. It will
prepare the uberjar for you.

On Tue, Nov 8, 2011 at 9:33 PM, Claudio Martella
<claudio.marte...@gmail.com> wrote:

Hello list,

I'm actually having troubles as well to get my application running.

I've given the maven-shade plugin a shot, which unpacks my dependencies
and packs them all together with my classes in a new jar.

I attach the hierarchy of the jar so that somebody can maybe spot
what's missing, because I can't get it working. I get an identity
map-reduce job with jobconf complaining about no job jar being set.

Any idea?

On Sat, Nov 5, 2011 at 5:09 PM, Avery Ching <ach...@apache.org> wrote:

Hi Gianmarco,

You're right, most of us (to my knowledge) have been using Giraph with an
uberjar as you've put it.  However, Jakob has been doing some work to make
this easier.  See the below issue:

https://issues.apache.org/jira/browse/GIRAPH-64

If you can suggest a better approach, please add to the issue or create a
new one if appropriate.

Thanks,

Avery

On 11/5/11 4:11 AM, Gianmarco De Francisci Morales wrote:

Hi community,

I was wondering what is the current best practice to package an
application in a jar for deployment.
I tried the 'hadoop way' by putting giraph-*.jar in the /lib directory of
my jar, and using the -libjars option, but neither worked. It looks like
the backend classloader is making a mess and doesn't find my own
classes in the jar.

I resorted to uncompressing the giraph-*.jar and repackaging my classes
with it, all at the same level (an uber-fat jar), but even though it works
it doesn't sound like the right approach.

Any suggestions?

Thanks,
--
Gianmarco








--
 Claudio Martella
 claudio.marte...@gmail.com








Re: way to run unit tests from inside IDE?

2011-10-29 Thread Avery Ching
I use Eclipse and it's okay for running unittests, but I need to set the 
VM args in the junit run configuration for each specific test to 
-Dprop.jarLocation=target/giraph-0.70-jar-with-dependencies.jar.  I 
assume you need to do the same for Intellij.


This is done in pom.xml when doing 'mvn test' and other mvn commands.

Avery

On 10/28/11 11:21 PM, Jake Mannix wrote:

I seem to be getting weird stuff like:

setup: Using local job runner with location  for testBspCombiner
11/10/28 23:21:00 WARN mapred.JobClient: Use GenericOptionsParser for 
parsing the arguments. Applications should implement Tool for the same.
11/10/28 23:21:00 INFO mapred.JobClient: Cleaning up the staging area 
file:/tmp/hadoop-jake/mapred/staging/jake1475251079/.staging/job_local_0005


java.lang.IllegalArgumentException: Can not create a Path from an 
empty string

at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
at org.apache.hadoop.fs.Path.<init>(Path.java:90)
at 
org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:720)
at 
org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:596)

at org.apache.hadoop.mapred.JobClient.access$200(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:806)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)

at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
at org.apache.giraph.graph.GiraphJob.run(GiraphJob.java:495)
at org.apache.giraph.TestBspBasic.testBspCombiner(TestBspBasic.java:261)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at 
com.intellij.junit3.TestRunnerUtil$SuiteMethodWrapper.run(TestRunnerUtil.java:262)
at 
com.intellij.junit3.JUnit3IdeaTestRunner.doRun(JUnit3IdeaTestRunner.java:139)
at 
com.intellij.junit3.JUnit3IdeaTestRunner.startRunnerWithArgs(JUnit3IdeaTestRunner.java:52)
at 
com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:199)

at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

When I try to run in IntelliJ, but not from command line.  Anyone run 
into this?


  -jake




Re: Restriction of VertexInputFormat

2011-10-26 Thread Avery Ching

Hi Gianmarco,

Welcome to Giraph!  We definitely look forward to having your 
input/contributions.  Answers inline.


On 10/26/11 8:07 AM, Gianmarco De Francisci Morales wrote:

Hi,

First of all let me introduce myself: my name is Gianmarco and I am a 
researcher.
Second, let me congratulate the developers on the project. It 
looks very promising and I am very interested in it.


I have two questions.

1) I was trying to understand better the system, and I came across 
this sentence in the documentation:
Currently there is one restriction on the VertexInputFormat that is 
not obvious. The vertices must be sorted.

Does this still apply? And if so, could someone explain the reason to me?


Yes it still applies.  Please see 
https://issues.apache.org/jira/browse/GIRAPH-11.  I am getting closer to 
having this done, but got derailed by work.  Hopefully I'll have a patch 
by next week to finally address it (touches pretty much all the code).


2) Do the superstep times that get reported in hadoop counters at the 
end of the job include communication time or only processing time?


It includes the time of the superstep from the master's perspective 
(waiting for workers to register health, assigning work, checkpointing 
(maybe), vertex exchange (maybe), vertex processing, waiting for all 
workers to finish, etc.).




Thanks,
--
Gianmarco De Francisci Morales





Re: Message processing

2011-09-09 Thread Avery Ching
The GraphLab model is more asynchronous than BSP.  They allow you to update
your neighbors rather than the BSP model of messaging per superstep.  Rather
than one massive barrier in BSP, they implement this with vertex locking.
 They also allow a vertex to modify the state of its neighbors.  We could
certainly add something similar as an alternative computing model, perhaps
without locking.  Here's one idea:

1) No explicit supersteps (asynchronous)
2) All vertices execute compute() (and may or may not send messages)
initially
3) Vertices can examine their neighbors or any vertex in the graph (issue
RPCs to get their state)
4) When messages are received by a vertex, compute() is executed on it (and
state is locally locked to compute only)
5) Vertices still vote to halt when done, indicating the end of the
application.
6) Combiners can still be used to reduce the number of messages sent (and
the number of times compute is executed).

This could be fun.  And it would provide an interesting comparison platform:
barrier-based vs. vertex-based synchronization.
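
For what it's worth, steps 1-6 above can be played out in a few lines of plain Java (no Giraph APIs; the class and method names here are made up for illustration): a worklist of pending messages drives the per-vertex compute step, and the run ends at quiescence instead of at a barrier.

```java
import java.util.*;

/** Toy model of the asynchronous idea above: a vertex's compute step runs
 *  whenever a message arrives, with no superstep barrier.  Here each vertex
 *  keeps the minimum id it has seen and forwards improvements to neighbors. */
public class AsyncMinLabel {

    /** edges: adjacency lists keyed by vertex id; returns final vertex values. */
    public static Map<Integer, Integer> run(Map<Integer, List<Integer>> edges) {
        Map<Integer, Integer> value = new HashMap<>();
        Deque<int[]> inbox = new ArrayDeque<>();  // pending {target, message} pairs

        // Step 2: all vertices execute compute() initially and send messages.
        for (Integer v : edges.keySet()) {
            value.put(v, v);
            for (Integer nbr : edges.get(v)) {
                inbox.add(new int[]{nbr, v});
            }
        }

        // Step 4: compute() is executed on a vertex whenever a message arrives.
        while (!inbox.isEmpty()) {
            int[] msg = inbox.poll();
            int target = msg[0], received = msg[1];
            if (received < value.get(target)) {       // message improves the value
                value.put(target, received);
                for (Integer nbr : edges.get(target)) {
                    inbox.add(new int[]{nbr, received});
                }
            }
            // Step 5: a vertex that sees no improvement stays "halted".
        }
        // Quiescence (empty worklist) plays the role of all vertices voting to halt.
        return value;
    }
}
```

A combiner (step 6) would slot in where messages are enqueued, e.g. keeping only the minimum pending message per target instead of every message.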

On Fri, Sep 9, 2011 at 6:36 AM, Jake Mannix jake.man...@gmail.com wrote:



 On Fri, Sep 9, 2011 at 3:22 AM, Claudio Martella 
 claudio.marte...@gmail.com wrote:

 One misunderstanding on my side. Isn't it true that the messages have to be
 buffered, as they all have to be collected before they can be processed (by
 definition of a superstep)? So you cannot really process them as they come?


 This is the current implementation, yes, but I'm trying to see if an
 alternative is also possible in this framework, for Vertex implementations
 which are able to handle asynchronous updates.  In this model, a vertex
 would be required to be able to handle multiple calls to compute() in a
 single superstep, and would instead signal that its superstep
 computations are done at some (application-specific) point.

 I know this goes outside of the concept of a BSP model, but I didn't want
 to get into too many details before I figure out how possible it was to
 implement some of this.

-jake
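
To make the buffering point concrete: under BSP the framework effectively double-buffers messages, so anything sent during superstep S only becomes readable after the barrier, in superstep S+1. A standalone sketch (made-up class name, not Giraph code) that propagates the maximum id along a chain of vertices shows that delivery delay:

```java
import java.util.*;

/** BSP-style double buffering: messages written during superstep S go into
 *  nextInbox and are only read in superstep S+1, mirroring the barrier. */
public class BspMaxValue {

    /** Chain 0-1-...-(n-1); every vertex starts with its own id and adopts
     *  the largest value it hears about.  Returns the number of supersteps
     *  until no messages remain in flight. */
    public static int run(int n) {
        int[] value = new int[n];
        for (int i = 0; i < n; i++) value[i] = i;

        // Superstep 0: each vertex sends its value to its neighbors.
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (int nbr : neighbors(i, n)) {
                inbox.computeIfAbsent(nbr, k -> new ArrayList<>()).add(value[i]);
            }
        }

        int superstep = 0;
        while (!inbox.isEmpty()) {
            superstep++;
            Map<Integer, List<Integer>> nextInbox = new HashMap<>();
            for (Map.Entry<Integer, List<Integer>> e : inbox.entrySet()) {
                int v = e.getKey();
                int best = Collections.max(e.getValue());
                if (best > value[v]) {  // improved: tell neighbors next superstep
                    value[v] = best;
                    for (int nbr : neighbors(v, n)) {
                        nextInbox.computeIfAbsent(nbr, k -> new ArrayList<>()).add(best);
                    }
                }
            }
            inbox = nextInbox;  // the "barrier": sent messages become visible now
        }
        return superstep;
    }

    private static List<Integer> neighbors(int i, int n) {
        List<Integer> out = new ArrayList<>();
        if (i > 0) out.add(i - 1);
        if (i < n - 1) out.add(i + 1);
        return out;
    }
}
```

Because delivery lags by one superstep, even a short chain needs several barriers before the message buffers drain; an asynchronous model of the kind discussed above could instead chase the value through the chain inside one "round".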