Re: [Announcement] Giraph talk in Berlin on May 29th
Nice! Avery On 5/12/12 2:58 AM, Sebastian Schelter wrote: Hi, I will give a talk titled Large Scale Graph Processing with Apache Giraph in Berlin on May 29th. Details are available at: https://www.xing.com/events/gameduell-tech-talk-on-the-topic-large-scale-graph-processing-with-apache-giraph-1092275 Best, Sebastian
Re: Possible bug when resetting aggregators ? (and missing documentation)
I think you're right that the javadoc isn't specific enough.

/**
 * Use a registered aggregator in current superstep.
 * Even when the same aggregator should be used in the next
 * superstep, useAggregator needs to be called at the beginning
 * of that superstep in preSuperstep().
 *
 * @param name Name of aggregator
 * @return boolean (false when not registered)
 */
boolean useAggregator(String name);

This should be augmented to say that none of the Aggregator methods should be called until this method is invoked. Feel free to file a JIRA and fix. Thanks! If you would like to, please feel free to add Aggregator documentation to https://cwiki.apache.org/confluence/display/GIRAPH/Index Avery

On 5/2/12 12:15 PM, Benjamin Heitmann wrote: Hello, I had to use aggregators for various statistic reporting tasks, and I noticed that the aggregator operations need to be used in a very specific sequence, especially when the aggregator is getting a reset between supersteps. I found that the sequence described in RandomMessageBenchmark (in the org.apache.giraph.benchmark package) results in consistent counts for one aggregator across all workers. The most important thing seems to be to call the reset method setAggregatedValue() in preSuperstep() of the WorkerContext class, before calling this.useAggregator(). If I called the reset method in postSuperstep(), then every worker reported a different value for the aggregator. However, the aggregator which gets reset between supersteps is still wrong. I know this because a second aggregator counts the same thing, and reports it after each superstep, without getting reset. Is this a known issue? Should I file a bug report on it? In addition, it would be great to document correct usage of the aggregators somewhere. Even just the javadoc of the Aggregator interface might be enough. Should I try to add some documentation to the Aggregator interface? (org.apache.giraph.graph.Aggregator.java) Then the committers can correct me if that documentation is wrong, I guess.
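For anyone who lands on this thread with the same problem, the ordering that follows from the javadoc fix above is: register once in preApplication(), then in preSuperstep() call useAggregator() before any other aggregator method, and do the per-superstep reset there too, not in postSuperstep(), which gave inconsistent per-worker values above. A minimal sketch; the class name, aggregator name, and the LongSumAggregator import path are illustrative assumptions, and the signatures follow the 0.1/0.2-era API, so check them against your revision:

==
import org.apache.giraph.graph.Aggregator;
import org.apache.giraph.graph.WorkerContext;
import org.apache.hadoop.io.LongWritable;
// Assumption: adjust to wherever LongSumAggregator lives in your revision.
import org.apache.giraph.examples.LongSumAggregator;

public class StatsWorkerContext extends WorkerContext {
  // Hypothetical aggregator name, for illustration only.
  private static final String COUNT_AGG = "stats.count";

  @Override
  public void preApplication()
      throws InstantiationException, IllegalAccessException {
    // Register once for the whole application.
    registerAggregator(COUNT_AGG, LongSumAggregator.class);
  }

  @Override
  public void preSuperstep() {
    // Declare the aggregator in use for this superstep first...
    useAggregator(COUNT_AGG);
    // ...then reset it, still inside preSuperstep().
    @SuppressWarnings("unchecked")
    Aggregator<LongWritable> agg =
        (Aggregator<LongWritable>) getAggregator(COUNT_AGG);
    agg.setAggregatedValue(new LongWritable(0));
  }

  @Override
  public void postSuperstep() { }

  @Override
  public void postApplication() { }
}
==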
Re: Please welcome our newest committer and PMC member, Eugene!
Awesome! Congrats Eugene, we're excited to have you taking on a big role. Avery On 5/1/12 5:18 PM, Hyunsik Choi wrote: Congrats and welcome Eugene! I'm looking forward to your contribution. -- Hyunsik Choi On Wed, May 2, 2012 at 5:39 AM, Jakob Homan jgho...@gmail.com wrote: I'm happy to announce that the Giraph PMC has voted Eugene Koontz in as a committer and PMC member. Eugene has been pitching in with great patches that have been very useful, such as helping us sort out our terrifying munging situation (GIRAPH-168). Welcome aboard, Eugene! -Jakob
Re: Does Giraph support labeled graphs?
Anyone want to work on https://issues.apache.org/jira/browse/GIRAPH-155? =) On 4/19/12 9:22 AM, Claudio Martella wrote: The problem with this approach is that Giraph doesn't support multi-graphs. Following RDF, you can have multiple edges connecting the same pair of vertices. So for methods such as getEdgeValue(I) you'd have to return something like List<E>. For this, I'd suggest forgetting the Giraph-specific methods and just adding your own on top, which you call internally. On Thu, Apr 19, 2012 at 12:36 PM, Benjamin Heitmann benjamin.heitm...@deri.org wrote: Hi Avery and Paolo, On 11 Apr 2012, at 18:37, Avery Ching wrote: There is no preferred way to represent labeled graphs. A close example to your adjacency list idea is LongDoubleDoubleAdjacencyListVertexInputFormat. Exactly. Giraph supports labeled graphs very easily. My reply is a little bit late, so you probably already figured out the following: the thing you need to do is create your own class which extends HashMapVertex, and as the third parameter of the <I, V, E, M> signature you provide a Text for the edge parameter. No other code is required in that class in order to use the edge labels, AFAIK. But you will need to write a VertexInputFormat class to fill the edges when you parse your input.
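To make the recipe above concrete, here is a minimal sketch of such a vertex class with Text in the edge slot of the <I, V, E, M> signature. Everything here (the class name, the id/value/message types, the empty compute() body) is an illustrative assumption, and the compute() signature follows the 0.1/0.2-era API, so adapt it to your revision; pair it with a VertexInputFormat that fills in the Text edge labels:

==
import java.io.IOException;
import java.util.Iterator;

import org.apache.giraph.graph.HashMapVertex;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;

public class LabeledEdgeVertex
    extends HashMapVertex<Text, DoubleWritable, Text, DoubleWritable> {
  @Override
  public void compute(Iterator<DoubleWritable> msgIterator)
      throws IOException {
    // The label of an edge is simply its edge value:
    // getEdgeValue(targetVertexId) returns the stored Text label.
    voteToHalt();
  }
}
==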
Re: Slides for my talk at the Berlin Hadoop Get Together
Very nice! Will these be similar to the 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords? It would be good to add at least one of them to the page. Avery On 4/19/12 12:31 PM, Sebastian Schelter wrote: Here are the slides of my talk "Introducing Apache Giraph for Large Scale Graph Processing" at the Berlin Hadoop Get Together yesterday: http://www.slideshare.net/sscdotopen/introducing-apache-giraph-for-large-scale-graph-processing I reused a lot of stuff from Claudio's excellent prezi presentation. Best, Sebastian
Re: java.lang.RuntimeException [...] msgMap did not exist [...]
Etienne, There should be one task log per task. Do you have all the task logs? It looks like this one failed because another one failed. Avery

On 4/17/12 9:37 AM, Etienne Dumoulin wrote: Avery, I attach the file; indeed it looks more interesting than the others. There is a null pointer exception:

15 MapAttempt TASK_TYPE=MAP TASKID=task_201204121825_0001_m_02 TASK_ATTEMPT_ID=attempt_201204121825_0001_m_02_0 TASK_STATUS=FAILED FINISH_TIME=1334251707662 HOSTNAME=nantes ERROR=java\.lang\.NullPointerException
16 at org\.apache\.giraph\.graph\.GraphMapper\.run(GraphMapper\.java:639)
17 at org\.apache\.hadoop\.mapred\.MapTask\.runNewMapper(MapTask\.java:763)
18 at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:369)
19 at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:259)
20 at java\.security\.AccessController\.doPrivileged(Native Method)
21 at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
22 at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1059)
23 at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:253)

Also I found this file in logs/history/done/version-1/rennes.local.net_1334252188432_/2012/04/13/00/job_201204121836_0003_1334307958403_hadoop_org.apache.giraph.examples.SimpleShortestPathsVert. I ran it on the 13th at 10am local time; however, in these logs the date is 20120412. In addition, in the logs directory I have no job conf dating from the 13th. Does hadoop not use the local time to name the files? Thanks, Étienne

On 16 April 2012 19:45, Avery Ching ach...@apache.org wrote: Etienne, the task tracker logs are not what I meant, sorry for the confusion. Every task produces its own output and error log. That is likely where we can find the issue. Likely a task failed, and the task logs should say why. Avery

On 4/16/12 3:00 AM, Etienne Dumoulin wrote: Hi Avery, Thanks for your fast reply. I attach the forgotten file. Regards, Étienne

On 13 April 2012 17:40, Avery Ching ach...@apache.org wrote: Hi Etienne, Thanks for your questions. Giraph uses map tasks to run its master and workers. Can you provide the task output logs? It looks like your workers failed to report status for some reason and we need to find out why. The datanode logs can't help us here. Avery

On 4/13/12 3:35 AM, Etienne Dumoulin wrote: Hi Guys, I tried out Giraph yesterday and I have an issue running the shortest path example. I am working on a toy heterogeneous cluster of 3 datanodes and 1 namenode/jobtracker, with hadoop 0.20.203.0. One of the datanodes is a small quad-core server with 16 GB RAM, the others are small PCs with 1 core and 1 GB RAM, same OS: ubuntu-server 10.04. I ran into a first issue with the 0.1 version, the same one described here: https://issues.apache.org/jira/browse/GIRAPH-114. Before I found the patch I tried different configurations: it works in a standalone environment, with the namenode and the server, and with the namenode and the two small PCs. It does not work with the entire cluster, or with one small PC and the server as datanodes. Then I downloaded the svn version today; no luck, it has the same behaviour as the 0.1 version (goes to 100% then back to 0%) but not the same info logs.

Below is the svn version console log; nantes is the name of the big datanode, rennes the namenode/jobtracker:

hadoop@rennes:~/test$ hadoop jar ~/project/giraph/trunk_2012_04_13/target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex shortestPathsInputGraph shortestPathsOutputGraph 0 3
12/04/13 10:05:58 INFO mapred.JobClient: Running job: job_201204121836_0003
12/04/13 10:05:59 INFO mapred.JobClient: map 0% reduce 0%
12/04/13 10:06:18 INFO mapred.JobClient: map 25% reduce 0%
12/04/13 10:08:55 INFO mapred.JobClient: map 100% reduce 0%
12/04/13 10:21:28 INFO mapred.JobClient: map 75% reduce 0%
12/04/13 10:21:33 INFO mapred.JobClient: Task Id : attempt_201204121836_0003_m_02_0, Status : FAILED
Task attempt_201204121836_0003_m_02_0 failed to report status for 600 seconds. Killing!
12/04/13 10:23:57 INFO mapred.JobClient: Task Id : attempt_201204121836_0003_m_01_0, Status : FAILED
java.lang.RuntimeException: sendMessage: msgMap did not exist for nantes:30002
Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph
Hi Paolo, Can you try something for me? I was able to get the PageRankBenchmark to work running in local mode just fine on my side. I think we should have some kind of a helper script (similar to bin/giraph) to run simple tests in LocalJobRunner. I believe that for LocalJobRunner to run, we need to pass -Dgiraph.SplitMasterWorker=false -Dlocal.test.mode=true. In the case of PageRankBenchmark, I also have to set the workers to 1 (LocalJobRunner can only run one task at a time). So I got the class path that bin/giraph was using to run (just added an echo $CLASSPATH at the end) and then inserted the giraph-0.2-SNAPSHOT-jar-with-dependencies.jar in front of it (this is necessary for the ZooKeeper jar inclusion). Then I just ran a normal java command and got the output below. One thing to remember is that if you rerun it, you'll have to remove the _bsp directories that are created; otherwise it will think it has already completed. Hope that helps, Avery

java -cp target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar:/Users/aching/git/git_svn_giraph_trunk/conf:/Users/aching/.m2/repository/ant/ant/1.6.5/ant-1.6.5.jar:/Users/aching/.m2/repository/com/google/guava/guava/r09/guava-r09.jar:/Users/aching/.m2/repository/commons-beanutils/commons-beanutils/1.7.0/commons-beanutils-1.7.0.jar:/Users/aching/.m2/repository/commons-beanutils/commons-beanutils-core/1.8.0/commons-beanutils-core-1.8.0.jar:/Users/aching/.m2/repository/commons-cli/commons-cli/1.2/commons-cli-1.2.jar:/Users/aching/.m2/repository/commons-codec/commons-codec/1.4/commons-codec-1.4.jar:/Users/aching/.m2/repository/commons-collections/commons-collections/3.2.1/commons-collections-3.2.1.jar:/Users/aching/.m2/repository/commons-configuration/commons-configuration/1.6/commons-configuration-1.6.jar:/Users/aching/.m2/repository/commons-digester/commons-digester/1.8/commons-digester-1.8.jar:/Users/aching/.m2/repository/commons-el/commons-el/1.0/commons-el-1.0.jar:/Users/aching/.m2/repository/commons-httpclient/commons-httpclient/3.0.1/commons-httpclient-3.0.1.jar:/Users/aching/.m2/repository/commons-lang/commons-lang/2.4/commons-lang-2.4.jar:/Users/aching/.m2/repository/commons-logging/commons-logging/1.0.3/commons-logging-1.0.3.jar:/Users/aching/.m2/repository/commons-net/commons-net/1.4.1/commons-net-1.4.1.jar:/Users/aching/.m2/repository/hsqldb/hsqldb/1.8.0.10/hsqldb-1.8.0.10.jar:/Users/aching/.m2/repository/javax/activation/activation/1.1/activation-1.1.jar:/Users/aching/.m2/repository/javax/mail/mail/1.4/mail-1.4.jar:/Users/aching/.m2/repository/jline/jline/0.9.94/jline-0.9.94.jar:/Users/aching/.m2/repository/junit/junit/3.8.1/junit-3.8.1.jar:/Users/aching/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar:/Users/aching/.m2/repository/net/iharder/base64/2.3.8/base64-2.3.8.jar:/Users/aching/.m2/repository/net/java/dev/jets3t/jets3t/0.7.1/jets3t-0.7.1.jar:/Users/aching/.m2/repository/net/sf/kosmosfs/kfs/0.3/kfs-0.3.jar:/Users/aching/.m2/repository/org/apache/commons/commons-io/1.3.2/commons-io-1.3.2.jar:/Users/aching/.m2/repository/org/apache/commons/commons-math/2.1/commons-math-2.1.jar:/Users/aching/.m2/repository/org/apache/hadoop/hadoop-core/0.20.203.0/hadoop-core-0.20.203.0.jar:/Users/aching/.m2/repository/org/apache/mahout/mahout-collections/1.0/mahout-collections-1.0.jar:/Users/aching/.m2/repository/org/apache/zookeeper/zookeeper/3.3.3/zookeeper-3.3.3.jar:/Users/aching/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.8.0/jackson-core-asl-1.8.0.jar:/Users/aching/.m2/repository/org/codehaus/jackson/jackson-mapper-asl/1.8.0/jackson-mapper-asl-1.8.0.jar:/Users/aching/.m2/repository/org/eclipse/jdt/core/3.1.1/core-3.1.1.jar:/Users/aching/.m2/repository/org/json/json/20090211/json-20090211.jar:/Users/aching/.m2/repository/org/mockito/mockito-all/1.8.5/mockito-all-1.8.5.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jetty/6.1.26/jetty-6.1.26.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jetty-util/6.1.26/jetty-util-6.1.26.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jsp-2.1/6.1.14/jsp-2.1-6.1.14.jar:/Users/aching/.m2/repository/org/mortbay/jetty/jsp-api-2.1/6.1.14/jsp-api-2.1-6.1.14.jar:/Users/aching/.m2/repository/org/mortbay/jetty/servlet-api/2.5-20081211/servlet-api-2.5-20081211.jar:/Users/aching/.m2/repository/org/mortbay/jetty/servlet-api-2.5/6.1.14/servlet-api-2.5-6.1.14.jar:/Users/aching/.m2/repository/oro/oro/2.0.8/oro-2.0.8.jar:/Users/aching/.m2/repository/tomcat/jasper-compiler/5.5.12/jasper-compiler-5.5.12.jar:/Users/aching/.m2/repository/tomcat/jasper-runtime/5.5.12/jasper-runtime-5.5.12.jar:/Users/aching/.m2/repository/xmlenc/xmlenc/0.52/xmlenc-0.52.jar org.apache.giraph.benchmark.PageRankBenchmark -Dgiraph.SplitMasterWorker=false -Dlocal.test.mode=true -c 1 -e 2 -s 2 -V 10 -w 1

2012-04-13 09:30:27.261 java[45785:1903] Unable to load realm mapping info from SCDynamicStore
12/04/13 09:30:27 INFO benchmark.PageRankBenchmark: Using class org.apache.giraph.benchmark.PageRankBenchmark
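If you would rather set those flags in code than on the command line, the same configuration can be expressed through the GiraphJob API that appears elsewhere in this digest. A sketch; the class and job names are placeholders, and you would fill in your own vertex and format classes:

==
import org.apache.giraph.graph.GiraphJob;
import org.apache.hadoop.conf.Configuration;

public class LocalModeRunner {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The two settings passed with -D in the command above:
    conf.setBoolean("giraph.SplitMasterWorker", false);
    conf.setBoolean("local.test.mode", true);
    GiraphJob job = new GiraphJob(conf, "local-smoke-test");
    // LocalJobRunner runs one task at a time, so exactly one worker:
    job.setWorkerConfiguration(1, 1, 100.0f);
    // job.setVertexClass(...); job.setVertexInputFormatClass(...);
    // System.exit(job.run(true) ? 0 : -1);
  }
}
==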
Re: java.lang.RuntimeException [...] msgMap did not exist [...]
Hi Etienne, Thanks for your questions. Giraph uses map tasks to run its master and workers. Can you provide the task output logs? It looks like your workers failed to report status for some reason and we need to find out why. The datanode logs can't help us here. Avery

On 4/13/12 3:35 AM, Etienne Dumoulin wrote: Hi Guys, I tried out Giraph yesterday and I have an issue running the shortest path example. I am working on a toy heterogeneous cluster of 3 datanodes and 1 namenode/jobtracker, with hadoop 0.20.203.0. One of the datanodes is a small quad-core server with 16 GB RAM, the others are small PCs with 1 core and 1 GB RAM, same OS: ubuntu-server 10.04. I ran into a first issue with the 0.1 version, the same one described here: https://issues.apache.org/jira/browse/GIRAPH-114. Before I found the patch I tried different configurations: it works in a standalone environment, with the namenode and the server, and with the namenode and the two small PCs. It does not work with the entire cluster, or with one small PC and the server as datanodes. Then I downloaded the svn version today; no luck, it has the same behaviour as the 0.1 version (goes to 100% then back to 0%) but not the same info logs. Below is the svn version console log; nantes is the name of the big datanode, rennes the namenode/jobtracker:

hadoop@rennes:~/test$ hadoop jar ~/project/giraph/trunk_2012_04_13/target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.examples.SimpleShortestPathsVertex shortestPathsInputGraph shortestPathsOutputGraph 0 3
12/04/13 10:05:58 INFO mapred.JobClient: Running job: job_201204121836_0003
12/04/13 10:05:59 INFO mapred.JobClient: map 0% reduce 0%
12/04/13 10:06:18 INFO mapred.JobClient: map 25% reduce 0%
12/04/13 10:08:55 INFO mapred.JobClient: map 100% reduce 0%
12/04/13 10:21:28 INFO mapred.JobClient: map 75% reduce 0%
12/04/13 10:21:33 INFO mapred.JobClient: Task Id : attempt_201204121836_0003_m_02_0, Status : FAILED
Task attempt_201204121836_0003_m_02_0 failed to report status for 600 seconds. Killing!
12/04/13 10:23:57 INFO mapred.JobClient: Task Id : attempt_201204121836_0003_m_01_0, Status : FAILED
java.lang.RuntimeException: sendMessage: msgMap did not exist for nantes:30002 for vertex 2
at org.apache.giraph.comm.BasicRPCCommunications.sendMessageReq(BasicRPCCommunications.java:993)
at org.apache.giraph.graph.BasicVertex.sendMsg(BasicVertex.java:168)
at org.apache.giraph.examples.SimpleShortestPathsVertex.compute(SimpleShortestPathsVertex.java:104)
at org.apache.giraph.graph.GraphMapper.map(GraphMapper.java:593)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:648)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Task attempt_201204121836_0003_m_01_0 failed to report status for 601 seconds. Killing!
12/04/13 10:23:58 INFO mapred.JobClient: map 50% reduce 0%
12/04/13 10:24:01 INFO mapred.JobClient: map 25% reduce 0%
12/04/13 10:24:06 INFO mapred.JobClient: Task Id : attempt_201204121836_0003_m_03_0, Status : FAILED
Task attempt_201204121836_0003_m_03_0 failed to report status for 602 seconds. Killing!

I attached the hadoop logs for the rennes namenode and jobtracker and for nantes, the big datanode. Has anyone already gotten this error / found a fix? Thanks for your time, Étienne
Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph
It shouldn't be; your code looks very similar to the unit tests (i.e. TestManualCheckpoint.java). So, are you trying to run your test with the local hadoop (similar to the unit tests)? Or are you using an actual hadoop setup? Avery

On 4/10/12 11:41 PM, Paolo Castagna wrote: I am using hadoop-core-1.0.1.jar ... could that be a problem? Paolo

Paolo Castagna wrote: Hi Avery, nope, no luck. I have changed all my log.debug(...) into log.info(...). Same behavior. I have a log4j.properties [1] file in my classpath and it has:

log4j.logger.org.apache.jena.grande=DEBUG
log4j.logger.org.apache.jena.grande.giraph=DEBUG

I also tried to change that to:

log4j.logger.org.apache.jena.grande=INFO
log4j.logger.org.apache.jena.grande.giraph=INFO

No luck. My Giraph job has:

GiraphJob job = new GiraphJob(getConf(), getClass().getName());
job.setVertexClass(getClass());
job.setVertexInputFormatClass(TurtleVertexInputFormat.class);
job.setVertexOutputFormatClass(TurtleVertexOutputFormat.class);

But if I run in debug with a breakpoint in the TurtleVertexInputFormat constructor, it is never instantiated. How can that be? So perhaps the problem is not the logging; it is the fact that my GiraphJob is not using TurtleVertexInputFormat.class and TurtleVertexOutputFormat.class, but I don't see what I am doing wrong. :-/ Thanks, Paolo [1] https://github.com/castagna/jena-grande/blob/master/src/test/resources/log4j.properties

Avery Ching wrote: I think the issue might be that Hadoop only logs INFO and above messages by default. Can you retry with INFO level logging? Avery

On 4/10/12 12:17 PM, Paolo Castagna wrote: Hi, I am still learning Giraph, so please be patient with me and forgive my trivial questions. As a simple initial use case, I want to compute the shortest paths from a single source in a social graph in RDF format using the FOAF [1] vocabulary. This example will also hopefully inform GIRAPH-170 [2] and related issues, such as GIRAPH-141 [3]. Here is an example in Turtle [4] format of a tiny graph using FOAF:

@prefix : <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:alice a foaf:Person ;
    foaf:name "Alice" ;
    foaf:mbox <mailto:al...@example.org> ;
    foaf:knows :bob ;
    foaf:knows :charlie ;
    foaf:knows :snoopy ;
    .
:bob foaf:name "Bob" ;
    foaf:knows :charlie ;
    .
:charlie foaf:name "Charlie" ;
    foaf:knows :alice ;
    .

This is nice, human friendly (RDF without angle brackets!), but not easily splittable to be processed with MapReduce (or Giraph). Here is the same graph in N-Triples [5] format:

<http://example.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox> <mailto:al...@example.org> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/snoopy> .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie> .

This is more verbose and ugly, but splittable. The graph I am interested in is the graph represented by foaf:knows relationships/links between people (please note, the --knows-- relationship here has a direction; it isn't symmetric as in centralized social networking websites such as Facebook or LinkedIn. Alice can claim to know Bob without Bob knowing it, and/or it might even be a false claim):

alice --knows-- bob
alice --knows-- charlie
alice --knows-- snoopy
bob --knows-- charlie
charlie --knows-- alice

As a first step, I wrote a MapReduce job [6] to transform the RDF graph above into a sort of adjacency list using Turtle syntax; here is the output (three lines): http://example.org/alice http://xmlns.com/foaf/0.1/mbox mailto:al...@example.org;http://xmlns.com/foaf/0.1/name Alice; http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person;http://xmlns.com/foaf/0.1/knows http://example.org/charlie,http://example.org/bob, http://example.org/snoopy; .http://example.org/charlie http://xmlns.com/foaf/0.1/knows http://example.org/alice. http://example.org/bob http://xmlns.com/foaf/0.1/name Bob; http://xmlns.com/foaf/0.1/knows http://example.org/charlie; . http://example.org/alice http://xmlns.com/foaf/0.1/knows http://example.org/bob. http://example.org/charlie http://xmlns.com/foaf/0.1/name Charlie; http://xmlns.com/foaf/0.1/knows http://example.org/alice; . http://example.org/bob http://xmlns.com/foaf
Re: Does Giraph support labeled graphs?
There is no preferred way to represent labeled graphs. A close example to your adjacency list idea is LongDoubleDoubleAdjacencyListVertexInputFormat. Hope that helps, Avery

On 4/11/12 10:00 AM, Paolo Castagna wrote: Hi, I am not sure what's the best way to represent labeled graphs in Giraph. Here is my graph (i.e. vertex_id --edge_label_id-- vertex_id):

32 --62-- 115
32 --153-- 189
32 --200-- 236
32 --266-- 303
32 --266-- 331
32 --266-- 363
303 --153-- 407
303 --266-- 331
331 --153-- 394
331 --266-- 32
...

I have code to produce an adjacency list:

32 ( 62 115 ) ( 153 189 ) ( 200 236 ) ( 266 303 331 363 )
303 ( 153 407 ) ( 266 331 )
331 ( 153 394 ) ( 266 32 )
...

What's the best way to represent labeled graphs with Giraph? Correct me if I am wrong, but none of the current VertexInputFormat(s) is good for this, am I right? As a workaround, it is possible to generate an unlabeled adjacency list with just the edge type someone is interested in, say for example --266--:

32 303 331 363
303 331
331 32
...

Cheers, Paolo PS: The graph above is RDF, parsed using Apache Jena's RIOT and stored in TDB. An example of code to generate the adjacency list from TDB indexes is here: https://github.com/castagna/jena-grande/blob/0667599264527721daea80d56ad3f99e437dcda2/src/main/java/org/apache/jena/grande/examples/RunTdbLowLevel.java
Re: A simple use case: shortest paths on a FOAF (i.e. Friend of a Friend) graph
I think the issue might be that Hadoop only logs INFO and above messages by default. Can you retry with INFO level logging? Avery

On 4/10/12 12:17 PM, Paolo Castagna wrote: Hi, I am still learning Giraph, so please be patient with me and forgive my trivial questions. As a simple initial use case, I want to compute the shortest paths from a single source in a social graph in RDF format using the FOAF [1] vocabulary. This example will also hopefully inform GIRAPH-170 [2] and related issues, such as GIRAPH-141 [3]. Here is an example in Turtle [4] format of a tiny graph using FOAF:

@prefix : <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

:alice a foaf:Person ;
    foaf:name "Alice" ;
    foaf:mbox <mailto:al...@example.org> ;
    foaf:knows :bob ;
    foaf:knows :charlie ;
    foaf:knows :snoopy ;
    .
:bob foaf:name "Bob" ;
    foaf:knows :charlie ;
    .
:charlie foaf:name "Charlie" ;
    foaf:knows :alice ;
    .

This is nice, human friendly (RDF without angle brackets!), but not easily splittable to be processed with MapReduce (or Giraph). Here is the same graph in N-Triples [5] format:

<http://example.org/alice> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/name> "Alice" .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/mbox> <mailto:al...@example.org> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/bob> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie> .
<http://example.org/alice> <http://xmlns.com/foaf/0.1/knows> <http://example.org/snoopy> .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/name> "Charlie" .
<http://example.org/charlie> <http://xmlns.com/foaf/0.1/knows> <http://example.org/alice> .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/name> "Bob" .
<http://example.org/bob> <http://xmlns.com/foaf/0.1/knows> <http://example.org/charlie> .

This is more verbose and ugly, but splittable. The graph I am interested in is the graph represented by foaf:knows relationships/links between people (please note, the --knows-- relationship here has a direction; it isn't symmetric as in centralized social networking websites such as Facebook or LinkedIn. Alice can claim to know Bob without Bob knowing it, and/or it might even be a false claim):

alice --knows-- bob
alice --knows-- charlie
alice --knows-- snoopy
bob --knows-- charlie
charlie --knows-- alice

As a first step, I wrote a MapReduce job [6] to transform the RDF graph above into a sort of adjacency list using Turtle syntax; here is the output (three lines): http://example.org/alice http://xmlns.com/foaf/0.1/mbox mailto:al...@example.org;http://xmlns.com/foaf/0.1/name Alice; http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person;http://xmlns.com/foaf/0.1/knows http://example.org/charlie,http://example.org/bob, http://example.org/snoopy; .http://example.org/charlie http://xmlns.com/foaf/0.1/knows http://example.org/alice. http://example.org/bob http://xmlns.com/foaf/0.1/name Bob; http://xmlns.com/foaf/0.1/knows http://example.org/charlie; . http://example.org/alice http://xmlns.com/foaf/0.1/knows http://example.org/bob. http://example.org/charlie http://xmlns.com/foaf/0.1/name Charlie; http://xmlns.com/foaf/0.1/knows http://example.org/alice; . http://example.org/bob http://xmlns.com/foaf/0.1/knows http://example.org/charlie.http://example.org/alice http://xmlns.com/foaf/0.1/knows http://example.org/charlie.

This is legal Turtle, but it is also splittable. Each line has all the RDF statements (i.e. edges) for a person (there are also incoming edges). I wrote a TurtleVertexReader [7] which extends TextVertexReader<NodeWritable, Text, NodeWritable, Text> and a TurtleVertexInputFormat [8] which extends TextVertexInputFormat<NodeWritable, Text, NodeWritable, Text>. I wrote (copying from the example SimpleShortestPathsVertex) a FoafShortestPathsVertex [9] which extends EdgeListVertex<NodeWritable, IntWritable, NodeWritable, IntWritable>, and I am running it locally using these arguments: -Dgiraph.maxWorkers=1 -Dgiraph.SplitMasterWorker=false -DoverwriteOutput=true src/test/resources/data3.ttl target/foaf http://example.org/alice 1 TurtleVertexReader, TurtleVertexInputFormat and FoafShortestPathsVertex are still work in progress and I am sure there are plenty of stupid errors. However, I do not understand why, when I run FoafShortestPathsVertex with the DEBUG level, I see debug statements from FoafShortestPathsVertex:

19:34:44 DEBUG FoafShortestPathsVertex :: main({-Dgiraph.maxWorkers=1, -Dgiraph.SplitMasterWorker=false, -DoverwriteOutput=true, src/test/resources/data3.ttl, target/foaf, http://example.org/alice, 1})
19:34:44 DEBUG FoafShortestPathsVertex :: getConf() -- null
19:34:44 DEBUG FoafShortestPathsVertex :: setConf(Configuration: core-default.xml,
Re: Announcement: 'Parallel Processing beyond MapReduce' workshop after Berlin Buzzwords
That is great news Sebastian! Congrats, I wish I was in Berlin to attend. Avery

On 4/4/12 2:12 AM, Sebastian Schelter wrote: Hi everybody, I'd like to announce the 'Parallel Processing beyond MapReduce' workshop which will take place directly after the Berlin Buzzwords conference ( http://berlinbuzzwords.de/ ). This workshop will discuss novel paradigms for parallel processing beyond the traditional MapReduce paradigm offered by Apache Hadoop. The workshop will introduce two new systems: Apache Giraph aims at processing large graphs, runs on standard Hadoop infrastructure and is a loose port of Google's Pregel system. Giraph follows the bulk-synchronous parallel model relative to graphs, where vertices can send messages to other vertices during a given superstep. Stratosphere (http://www.stratosphere.eu) is a system that is developed in a joint research project by Technische Universität Berlin, Humboldt Universität zu Berlin and the Hasso-Plattner-Institut in Potsdam. It is a database-inspired, large-scale data processor based on concepts of robust and adaptive execution. Stratosphere offers the PACT programming model that extends the MapReduce programming model with additional second-order functions. As execution platform it uses the Nephele system, a massively parallel data flow engine which is also researched and developed in the project. Attendees will hear about the new possibilities of Hadoop's NextGen MapReduce architecture (YARN) and get a detailed introduction to the Apache Giraph and Stratosphere systems. After that there will be plenty of time for questions, discussions and diving into source code. As a prerequisite, attendees have to bring a notebook with:
- a copy of Giraph downloaded with source
- Hadoop 0.23+ source tree and JARs local
- a copy of Stratosphere with source
- an IDE of their choice
The workshop will take place on the 6th and 7th of June and is limited to 15 attendees. Please register by sending an email to sebastian [DOT] schelter [AT] tu-berlin [DOT] de http://berlinbuzzwords.de/content/workshops-berlin-buzzwords /s
Re: Exceptions when establishing RPC
If you're using one master and one slave, you need to do -w 1. Did you see any error about the RPC server starting up? Avery

On 4/3/12 1:37 PM, Robert Davis wrote: Hello, I was trying to run Giraph on two machines (one master and one slave) but kept getting exceptions when establishing RPC to the slave machine. Does anybody have any ideas what's going wrong here? I am running the test with the following parameters:

hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 10 -s 2 -v -V 2000 -w 2

Thanks, Robert

12/04/03 01:35:01 DEBUG comm.BasicRPCCommunications: startPeerConnectionThread: hostname ec2-107-20-19-131.compute-1.amazonaws.com, port 30001
12/04/03 01:35:01 DEBUG comm.BasicRPCCommunications: startPeerConnectionThread: Connecting to Worker(hostname=ec2-107-20-19-131.compute-1.amazonaws.com, MRpartition=1, port=30001), addr = ec2-107-20-19-131.compute-1.amazonaws.com:30001 if outMsgMap (null) == null
12/04/03 01:35:11 WARN comm.BasicRPCCommunications: connectAllRPCProxys: Failed on attempt 1 of 5 to connect to (id=0,cur=Worker(hostname=ec2-107-20-19-131.compute-1.amazonaws.com, MRpartition=1, port=30001),prev=null,ckpt_file=null)
java.net.ConnectException: Call to ec2-107-20-19-131.compute-1.amazonaws.com:30001 failed on connection exception: java.net.ConnectException: Connection refused
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
at org.apache.hadoop.ipc.Client.call(Client.java:1071)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at $Proxy3.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:370)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:420)
at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:194)
at org.apache.giraph.comm.RPCCommunications$1.run(RPCCommunications.java:190)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:188)
at org.apache.giraph.comm.RPCCommunications.getRPCProxy(RPCCommunications.java:58)
at org.apache.giraph.comm.BasicRPCCommunications.startPeerConnectionThread(BasicRPCCommunications.java:678)
at org.apache.giraph.comm.BasicRPCCommunications.connectAllRPCProxys(BasicRPCCommunications.java:622)
at org.apache.giraph.comm.BasicRPCCommunications.setup(BasicRPCCommunications.java:583)
at org.apache.giraph.graph.BspServiceWorker.setup(BspServiceWorker.java:555)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:474)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:646)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1083)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:656)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
at org.apache.hadoop.ipc.Client.call(Client.java:1046)
... 25 more
Re: Incomplete output when running PageRank example
As Benjamin mentioned, it depends on the number of map tasks your hadoop install is running with. You could set it proportionally to the number of cores it has if you like, but try using Benjamin's suggestions to get it working with more map tasks. I believe if you don't set it, the default is 2, which is not enough for 2 workers. Avery

On 3/31/12 11:51 AM, Robert Davis wrote: Thanks a lot, Benjamin. I set the number of map tasks to 2 since I only have a dual-core processor (though with hyperthreading) on my laptop. I ran it again but it still appeared incorrect. The output is as follows. Regards, Robert

$ hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 -w 2
12/03/31 11:40:08 INFO benchmark.PageRankBenchmark: Using class org.apache.giraph.benchmark.HashMapVertexPageRankBenchmark
12/03/31 11:40:10 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
12/03/31 11:40:11 INFO mapred.JobClient: Running job: job_201203301834_0004
12/03/31 11:40:12 INFO mapred.JobClient: map 0% reduce 0%
12/03/31 11:40:38 INFO mapred.JobClient: map 33% reduce 0%
12/03/31 11:45:44 INFO mapred.JobClient: Job complete: job_201203301834_0004
12/03/31 11:45:44 INFO mapred.JobClient: Counters: 5
12/03/31 11:45:44 INFO mapred.JobClient: Job Counters
12/03/31 11:45:44 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=620769
12/03/31 11:45:44 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/31 11:45:44 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/31 11:45:44 INFO mapred.JobClient: Launched map tasks=2
12/03/31 11:45:44 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=4377

On Sat, Mar 31, 2012 at 3:45 AM, Benjamin Heitmann benjamin.heitm...@deri.org wrote: Hi Robert,

On 31 Mar 2012, at 09:42, Robert Davis wrote: Hello Giraphers, I am new to Giraph. I just checked out a version and ran it in single machine mode. I got the following results, which have no Giraph counter information (unlike those in the example output). I am wondering what has gone wrong. The hadoop I am using is 1.0.

It looks like your Giraph job did not actually finish the calculation. As you say that you are new to Giraph, there might be a high chance that you ran into the same issue which tripped me up a few weeks ago ;) (I am not sure where the following information should be documented; maybe this issue should be documented on the same page which describes how to run the pagerank benchmark.) You provide the parameter -w 30 to your job, which means that it will use 30 workers. Maybe that's from the example on the Giraph web page; however, there is one very important caveat for the number of workers: the number of workers needs to be smaller than mapred.tasktracker.map.tasks.maximum minus one. Giraph will use one mapper task to start some sort of coordinating worker (probably something zookeeper specific), and then it will start the number of workers which you specified using -w. If the total number of workers is bigger than the maximum number of tasks, then your Giraph job will not actually finish calculating anything. (There might be a config option for specifying how many workers need to be finished in order to start the next superstep, but I did not try that personally.)

If you are running Hadoop/Giraph on your personal machine, then I would recommend using 3 workers, and you should edit your conf/mapred-site.xml to include some values for the following configuration parameters (and restart hadoop...):

<property>
  <name>mapred.map.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
Re: Problem deploying Giraph job to hadoop cluster: onlineZooKeeperServers connection failure
Benjamin, my guess is that your jar might not have all the ZooKeeper dependencies. Can you look at the log for the process that was supposed to start ZooKeeper? I'm thinking it didn't start... Avery

On 3/20/12 1:14 PM, Benjamin Heitmann wrote: Hello, after getting my feet wet with the InternalVertexRunner, I tried packaging a Giraph job as a jar for the first time. I am getting the following error:

==
12/03/20 17:21:04 INFO mapred.JobClient: Task Id : attempt_201203201422_0009_m_00_2, Status : FAILED
java.lang.IllegalStateException: onlineZooKeeperServers: Failed to connect in 10 tries!
at org.apache.giraph.zk.ZooKeeperManager.onlineZooKeeperServers(ZooKeeperManager.java:687)
at org.apache.giraph.graph.GraphMapper.setup(GraphMapper.java:425)
at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:646)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
attempt_201203201422_0009_m_00_2: log4j:WARN No appenders could be found for logger (org.apache.giraph.zk.ZooKeeperManager).
attempt_201203201422_0009_m_00_2: log4j:WARN Please initialize the log4j system properly.
==

Here is some more information, which hopefully might give this mailing list some insight into what is happening, because I can't figure it out...

* I am using Hadoop 1.0.1 and giraph svn revision 1293545 (the last one from February)
* If I run the same Vertex class and Input/OutputFormat using InternalVertexRunner, then everything works fine (using again Hadoop 1.0.1 and giraph rev 1293545)
* I package the giraph job as a self-contained jar, and it contains the giraph jar, as well as the zookeeper jar in its lib dir (I mostly used the recipe from here: https://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/ )
* there was an error in which hadoop could not find a class, and I had to fix that error with: giraphJob.setJarByClass(SimpleRDFVertex.class);
* My Vertex class extends HashMapVertex<Text, Text, Text, NullWritable>
* I followed the code example from SimpleShortestPathsVertex regarding the run() method and using the main method to call ToolRunner.run()

Here is the code for my run() method:

==
@Override
public int run(String[] args) throws Exception {
  // takes 3 args: inputDir outputDir numberOfWorkers
  GiraphJob job = new GiraphJob(getConf(), getClass().getName());
  job.setJarByClass(SimpleRDFVertex.class);
  job.setVertexClass(SimpleRDFVertex.class);
  job.setVertexInputFormatClass(SimpleRDFVertexInputFormat.class);
  job.setVertexOutputFormatClass(SimpleRDFVertexOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  job.setWorkerConfiguration(Integer.parseInt(args[2]),
      Integer.parseInt(args[2]), 100.0f);
  return job.run(true) ? 0 : -1;
}
==

Am I constructing the GiraphJob in the wrong way? I saw the GiraphRunner class, but the giraph source tree currently does not seem to contain an example of how to use that class. Is it safer to use that class for starting a GiraphJob? If yes, how should the job jar be assembled in order to use GiraphRunner? Sincerely, Benjamin Heitmann.
Re: Pseudo-random number Vertex Reader
You can use it for performance testing, although it is not a great simulation of real graphs. Real graphs tend to be more power-law distributed (see https://issues.apache.org/jira/browse/GIRAPH-26). Hope that helps, Avery On 3/17/12 8:13 PM, Fleischman, Stephen (ISS SCI - Plano TX) wrote: Avery, I am using Giraph solely for performance characterization -- primarily comparing hardware platforms but also for Hadoop configuration tuning. Am I correct that we could use the PseudoRandomVertexInputFormat, as used in the PageRank example, to generate graphs of any size that can then be used in the simple shortest path example program, thus avoiding the need to obtain actual datasets? Best regards, Steve Fleischman
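For readers wiring this up themselves, a fragment along these lines should do it, assuming a GiraphJob named job as in the other threads in this digest. The two giraph.pseudoRandomInputFormat.* option names are an assumption inferred from the benchmark code of this era; verify them against the constants in PseudoRandomVertexInputFormat before relying on them:

==
// Run the shortest paths example over a synthetic graph:
job.setVertexClass(SimpleShortestPathsVertex.class);
job.setVertexInputFormatClass(PseudoRandomVertexInputFormat.class);
// Assumed option names -- check PseudoRandomVertexInputFormat:
job.getConfiguration().setLong(
    "giraph.pseudoRandomInputFormat.aggregateVertices", 1000000L);
job.getConfiguration().setLong(
    "giraph.pseudoRandomInputFormat.edgesPerVertex", 10L);
==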
Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
If you found it useful, others might find it useful as well. Please feel free to add it to a JIRA. Avery

On 3/15/12 4:44 AM, Dionysis Logothetis wrote: Ok, I've created an issue: https://issues.apache.org/jira/browse/GIRAPH-155 Feel free to edit if you think the description is not clear. By the way, I have also created a vertex reader that reads adjacency lists but with no values for vertices and edges. That's also a format that I've seen in several graph data sets. The vertex reader is essentially a copy of the AdjacencyListVertexReader modified to handle this format. It's basically an abstract class, and subclasses can override methods to provide default values for vertices and edges (otherwise values are initialized to null), just like Avery described below. If you think it's useful I can contribute this.

On Wed, Mar 14, 2012 at 7:39 AM, Avery Ching ach...@apache.org wrote: Thanks for your input. Responses inline. Avery

On 3/13/12 7:14 AM, Dionysios Logothetis wrote: Hi all, I'm a new Giraph user, and I'm facing a similar situation. My input graph is basically in the form of edges defined simply as a source and destination pair (optionally there could be an edge value). And these edges might be distributed across multiple files (this is actually a format I've seen in several graph data sets). Without having looked at the internals of Giraph, I originally imagined that creating a MutableVertex and calling addVertexRequest for both vertices in an edge and addEdgeRequest from within the VertexReader would do the trick.

I agree that this idea can work; we also have to have a default vertex value in case folks add edges to a vertex index only.

Now, this doesn't really work since there needs to be a graph state created in advance. The graph state is not created until all vertices have been loaded.

I wouldn't worry about graph state here since it's the input superstep. We can set it for all vertices after creation if need be.

There's also another implication with potentially multiple workers trying to create the same vertex, but I think a vertex resolver can handle this, assuming the resolver is instantiated before the vertices are loaded.

Yup.

Is there a workaround to do this currently apart from pre-processing the graph?

Not currently. Can you please open a JIRA on https://issues.apache.org/jira/browse/GIRAPH to track this issue? I think we should do it.

Do you think it would be useful to have such functionality?

Yes! I think it makes sense to handle graph mutations either at the very beginning or during an execution in a uniform way.

By the way, I'd be interested in contributing to the project.

We'd love to have your contributions, it's a great fit. =) Looking forward to your response! Thanks!

On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching ach...@apache.org wrote: Benjamin, By the way, you're not the first to ask for a feature of this kind. Perhaps we should consider an alternative format for loading input vertex data that is based on the edges or data of the vertices rather than totally vertex-centric. We could load an edge, or a vertex value, and join them all based on the vertex id. Handling conflicts could be a little difficult, but perhaps the vertex resolver could handle this as well. Avery

On 3/12/12 12:41 PM, Benjamin Heitmann wrote: On 12 Mar 2012, at 18:15, David Garcia wrote: Not sure what you're asking about. getCurrentVertex() should only ever create one vertex. Presumably it returns this vertex to the calling function... which is called in loadVertices(), I think.

Thanks David. I am asking this question because I have a text input format which is very different from a node adjacency list. The most important difference is that each line of the input file describes two nodes. The other important difference is that a node might be described on more than one line of the input. I have multiple gigabytes of input, so it would be very beneficial to directly load the input into Giraph. Otherwise the overhead of converting the input to some sort of node adjacency list is so big that it might be a show-stopper regarding the suitability of Giraph. For more details, here is the text from my previous email: =[snip]=== I am wondering if it would be possible to parse RDF input files from a TextInputFormat
Please vote for our Giraph proposal for the upcoming Hadoop Summit
Hi Giraphers, We have a submission for the 2012 Hadoop summit and part of deciding whether it gets accepted is based on community voting. It would be great to get more folks interested and involved in what is going on with Giraph so please vote! Here's the link: https://hadoopsummit2012.uservoice.com/forums/151413-track-1-future-of-apache-hadoop/suggestions/2663542-processing-over-a-billion-edges-on-apache-giraph We had some great exposure at last year's Hadoop Summit and hope to be a part of this year's program as well. Thanks! Avery
Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
Thanks for your input. Responses inline. Avery

On 3/13/12 7:14 AM, Dionysios Logothetis wrote: Hi all, I'm a new Giraph user, and I'm facing a similar situation. My input graph is basically in the form of edges defined simply as a source and destination pair (optionally there could be an edge value). And these edges might be distributed across multiple files (this is actually a format I've seen in several graph data sets). Without having looked at the internals of Giraph, I originally imagined that creating a MutableVertex and calling addVertexRequest for both vertices in an edge and addEdgeRequest from within the VertexReader would do the trick.

I agree that this idea can work; we also have to have a default vertex value in case folks add edges to a vertex index only.

Now, this doesn't really work since there needs to be a graph state created in advance. The graph state is not created until all vertices have been loaded.

I wouldn't worry about graph state here since it's the input superstep. We can set it for all vertices after creation if need be.

There's also another implication with potentially multiple workers trying to create the same vertex, but I think a vertex resolver can handle this, assuming the resolver is instantiated before the vertices are loaded.

Yup.

Is there a workaround to do this currently apart from pre-processing the graph?

Not currently. Can you please open a JIRA on https://issues.apache.org/jira/browse/GIRAPH to track this issue? I think we should do it.

Do you think it would be useful to have such functionality?

Yes! I think it makes sense to handle graph mutations either at the very beginning or during an execution in a uniform way.

By the way, I'd be interested in contributing to the project.

We'd love to have your contributions, it's a great fit. =) Looking forward to your response! Thanks!

On Mon, Mar 12, 2012 at 9:09 PM, Avery Ching ach...@apache.org wrote: Benjamin, By the way, you're not the first to ask for a feature of this kind. Perhaps we should consider an alternative format for loading input vertex data that is based on the edges or data of the vertices rather than totally vertex-centric. We could load an edge, or a vertex value, and join them all based on the vertex id. Handling conflicts could be a little difficult, but perhaps the vertex resolver could handle this as well. Avery

On 3/12/12 12:41 PM, Benjamin Heitmann wrote: On 12 Mar 2012, at 18:15, David Garcia wrote: Not sure what you're asking about. getCurrentVertex() should only ever create one vertex. Presumably it returns this vertex to the calling function... which is called in loadVertices(), I think.

Thanks David. I am asking this question because I have a text input format which is very different from a node adjacency list. The most important difference is that each line of the input file describes two nodes. The other important difference is that a node might be described on more than one line of the input. I have multiple gigabytes of input, so it would be very beneficial to directly load the input into Giraph. Otherwise the overhead of converting the input to some sort of node adjacency list is so big that it might be a show-stopper regarding the suitability of Giraph. For more details, here is the text from my previous email: =[snip]=== I am wondering if it would be possible to parse RDF input files from a TextInputFormat class.

The most suitable text format for RDF is called NTriples, and it has this very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a typed edge, and the object, which is another vertex. Then the line is terminated by a dot and a new-line. In Giraph terms, the result of parsing the first line would be the creation of a vertex for subject1 with an edge of type predicate1, and then the creation of a second vertex for object1. So two vertices need to be created for that one line. Now the second line contains more information about the vertex subject1. So in Giraph terms, the vertex which was created for subject1 needs to be retrieved/revisited, and an edge of type predicate2, which points to the new vertex object2, needs to be created. And vertex object2 needs to be created. Just to point it out, such RDF NTriples files are unsorted, so information about the same vertex might appear e.g. at the first
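As a side note for anyone writing such a reader, splitting one NTriples line into its three terms is mechanical. Here is a sketch in plain Java (not Giraph API) that assumes URI terms in angle brackets and skips literals and blank nodes for brevity; a VertexReader could use it to emit the subject vertex with a labeled edge:

==
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class NTriplesLine {
  // <subject> <predicate> <object> .
  private static final Pattern TRIPLE =
      Pattern.compile("^<([^>]+)>\\s+<([^>]+)>\\s+<([^>]+)>\\s*\\.$");

  /** Returns {subject, predicate, object}, or null if not a URI-only triple. */
  public static String[] parse(String line) {
    Matcher m = TRIPLE.matcher(line.trim());
    if (!m.matches()) {
      return null; // literal object, blank node, or malformed line
    }
    return new String[] { m.group(1), m.group(2), m.group(3) };
  }
}
==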
Re: Question about TextInputFormat pattern for parsing e.g. RDF
Sorry for the delayed response. Responses inline. Avery

On 3/8/12 7:14 AM, Benjamin Heitmann wrote: Hello again, I am wondering if it would be possible to parse RDF input files from a TextInputFormat class. The most suitable text format for RDF is called NTriples, and it has this very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a typed edge, and the object, which is another vertex. Then the line is terminated by a dot and a new-line. In Giraph terms, the result of parsing the first line would be the creation of a vertex for subject1 with an edge of type predicate1, and then the creation of a second vertex for object1. So two vertices need to be created for that one line. Now the second line contains more information about the vertex subject1. So in Giraph terms, the vertex which was created for subject1 needs to be retrieved/revisited, and an edge of type predicate2, which points to the new vertex object2, needs to be created. And vertex object2 needs to be created. Just to point it out, such RDF NTriples files are unsorted, so information about the same vertex might appear e.g. at the first and at the last line of a multi-GB file.

Which interface can be used in a TextInputFormat/VertexReader in order to find an already created vertex?

This is not possible unfortunately. It's similar to the Hadoop InputFormat. Vertices (analogous to key-value pairs) are read one at a time. They are not saved for later access (just like Hadoop).

Are there any other issues when VertexReader.getCurrentVertex() creates two vertices at the same time?

A second related question: if I have multiple formats for my input files, how would I implement that? Just by adding a switch to the logic in getCurrentVertex()? Or is there a better way to switch the input logic based on the file type? All my input files would result in the same kind of Vertex being created. My motivation for doing this, in short: I have a large amount of RDF NTriples data which is provided by DBPedia. It amounts to somewhere between 5 GB and 20 GB, depending on which subset is used. Expressing this RDF data so that each vertex is completely described in one text line would require me to load it into an RDF store first and then reprocess the data. In terms of RDF stores, that is already a non-trivial amount of data, requiring quite a bit of hardware and tweaking. That is the reason why it would be valuable to directly load the RDF data into Giraph.

My suggestion would be the following: run a MR job to join all your RDFs on the vertex key, and convert them to an easy format to parse with a custom VertexInputFormat of your choice. If these are one-way relationships, you need not create the target vertex. If they are undirected relationships, when you are processing your RDFs in the MR job, add a directed relationship to both vertices.

Cheers, Benjamin.
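A minimal sketch of the pre-join Avery suggests, written as a plain Hadoop MapReduce job. The class names are placeholders, and the output shape (subject followed by tab-separated "predicate object" pairs) is just one convenient format a custom VertexInputFormat could then parse:

==
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public final class TripleJoin {
  /** Emit (subject, "predicate object") for each NTriples line. */
  public static class JoinMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] spo = line.toString().trim().split("\\s+", 3);
      if (spo.length == 3) {
        // Drop the trailing " ." that terminates each triple.
        String rest = spo[1] + " " + spo[2].replaceAll("\\s*\\.\\s*$", "");
        context.write(new Text(spo[0]), new Text(rest));
      }
    }
  }

  /** Concatenate all of a subject's edges into one adjacency line. */
  public static class JoinReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text subject, Iterable<Text> edges, Context context)
        throws IOException, InterruptedException {
      StringBuilder sb = new StringBuilder();
      for (Text edge : edges) {
        if (sb.length() > 0) {
          sb.append('\t');
        }
        sb.append(edge.toString());
      }
      context.write(subject, new Text(sb.toString()));
    }
  }
}
==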
Re: Calling BspUtils.createVertexResolver from a TextVertexReader ?
Benjamin, By the way, you're not the first to ask for a feature of this kind. Perhaps we should consider an alternative format for loading input vertex data that is based on the edges or data of the vertices rather than totally vertex-centric. We could load an edge, or a vertex value, and join them all based on the vertex id. Handling conflicts could be a little difficult, but perhaps the vertex resolver could handle this as well. Avery

On 3/12/12 12:41 PM, Benjamin Heitmann wrote: On 12 Mar 2012, at 18:15, David Garcia wrote: Not sure what you're asking about. getCurrentVertex() should only ever create one vertex. Presumably it returns this vertex to the calling function... which is called in loadVertices(), I think.

Thanks David. I am asking this question because I have a text input format which is very different from a node adjacency list. The most important difference is that each line of the input file describes two nodes. The other important difference is that a node might be described on more than one line of the input. I have multiple gigabytes of input, so it would be very beneficial to directly load the input into Giraph. Otherwise the overhead of converting the input to some sort of node adjacency list is so big that it might be a show-stopper regarding the suitability of Giraph. For more details, here is the text from my previous email: =[snip]=== I am wondering if it would be possible to parse RDF input files from a TextInputFormat class. The most suitable text format for RDF is called NTriples, and it has this very simple format:

subject1 predicate1 object1 .\n
subject1 predicate2 object2 .\n
...

So each line contains the subject, which is a vertex, a predicate, which is a typed edge, and the object, which is another vertex. Then the line is terminated by a dot and a new-line. In Giraph terms, the result of parsing the first line would be the creation of a vertex for subject1 with an edge of type predicate1, and then the creation of a second vertex for object1. So two vertices need to be created for that one line. Now the second line contains more information about the vertex subject1. So in Giraph terms, the vertex which was created for subject1 needs to be retrieved/revisited, and an edge of type predicate2, which points to the new vertex object2, needs to be created. And vertex object2 needs to be created. Just to point it out, such RDF NTriples files are unsorted, so information about the same vertex might appear e.g. at the first and at the last line of a multi-GB file. Which interface can be used in a TextInputFormat/VertexReader in order to find an already created vertex? Are there any other issues when VertexReader.getCurrentVertex() creates two vertices at the same time? A second related question: if I have multiple formats for my input files, how would I implement that? Just by adding a switch to the logic in getCurrentVertex()? Or is there a better way to switch the input logic based on the file type? All my input files would result in the same kind of Vertex being created. My motivation for doing this, in short: I have a large amount of RDF NTriples data which is provided by DBPedia. It amounts to somewhere between 5 GB and 20 GB, depending on which subset is used. Expressing this RDF data so that each vertex is completely described in one text line would require me to load it into an RDF store first and then reprocess the data. In terms of RDF stores, that is already a non-trivial amount of data, requiring quite a bit of hardware and tweaking.
That is the reason why it would be valuable to directly load the RDF data into Giraph.
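[Editor's note: for concreteness, below is a minimal sketch of the per-line parsing and join-by-subject logic being discussed. It is plain Java with a hypothetical class name, deliberately not using any Giraph API (the open question in this thread is exactly how to hook such logic into a VertexReader), and its in-memory map is the very aggregation step that becomes impractical at multi-gigabyte scale.]

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: groups NTriples statements by subject vertex. */
public class NTriplesGrouper {
  /** subject -> list of [predicate, object] pairs seen so far. */
  private final Map<String, List<String[]>> adjacency =
      new HashMap<String, List<String[]>>();

  /** Parses one line of the form "subject predicate object ." */
  public void addLine(String line) {
    String[] tokens = line.trim().split("\\s+");
    String subject = tokens[0];
    String predicate = tokens[1];
    String object = tokens[2];
    // tokens[3] is the terminating dot and is ignored.
    List<String[]> edges = adjacency.get(subject);
    if (edges == null) {
      edges = new ArrayList<String[]>();
      adjacency.put(subject, edges);
    }
    edges.add(new String[] { predicate, object });
    // The object must also exist as a vertex, even with no out-edges yet.
    if (!adjacency.containsKey(object)) {
      adjacency.put(object, new ArrayList<String[]>());
    }
  }
}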
Re: Error in instantiating custom Vertex class via InternalVertexRunner.run
Inline responses. We look forward to hearing about your work Benjamin!

On 3/5/12 9:12 AM, Benjamin Heitmann wrote:

On 2 Mar 2012, at 23:15, Avery Ching wrote: If I'm reading this right, you're using a public abstract class for the vertex. The vertex class must be instantiable and cannot be abstract. Hope that helps,

Thanks, that was the right issue to point out. I removed the abstract keyword, which solved the issue. (Of course, then I found lots of other bugs in my code... ;)

Glad to hear it.

After removing the abstract keyword, I ran into some problems in overriding package-private methods of BasicVertex. Almost all of the abstract methods in BasicVertex are declared as public, e.g. public abstract Iterable<M> getMessages(); However, there are two methods which do not have the public keyword: abstract void putMessages(Iterable<M> messages); and abstract void releaseResources(); I am guessing that this inconsistency is just an oversight.

Actually, it is not. =) So the issue is that if we do make these methods not package-private (i.e. protected/public), then when a user subclasses a vertex, they will be able to shoot themselves in the foot by calling these methods, which are only meant for internal use. Any other suggestions are welcome.

However, if I understood everything correctly, then this poses problems for developers who want to implement BasicVertex *outside* of the Giraph source tree. As the public keyword is missing, it is not possible to override these two method signatures from another package. The result is that if I do not need IntIntNullIntVertex, but instead an IntMyStateNullIntVertex which implements BasicVertex, then I will need to either copy BasicVertex into my own source tree or place my subclass in the org.apache.giraph.graph package. Is that the right reasoning, or is there some other pattern for using BasicVertex which I missed? Should I file a bug report somewhere? cheers, Benjamin.
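[Editor's note: for readers less familiar with the Java visibility rule at play here, a minimal sketch (hypothetical class and package names) of why a concrete subclass in another package cannot compile:]

// File 1, inside the framework's package:
package org.apache.giraph.graph;

public abstract class ExampleBasicVertex<M> {
  public abstract Iterable<M> getMessages();        // overridable anywhere
  abstract void putMessages(Iterable<M> messages);  // package-private
  abstract void releaseResources();                 // package-private
}

// File 2, in user code outside the framework's package:
package com.example;

import org.apache.giraph.graph.ExampleBasicVertex;

// Does NOT compile: putMessages() and releaseResources() are not visible
// from this package, so this class can neither override them nor satisfy
// the compiler's requirement that a concrete class implement all
// inherited abstract methods.
public class MyVertex extends ExampleBasicVertex<Integer> {
  @Override
  public Iterable<Integer> getMessages() {
    return java.util.Collections.emptyList();
  }
}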
Re: PageRankBenchmark failing with zooKeeper.KeeperException
Hi Abhishek, Nice to meet you. Can you try it with fewer workers? For instance -w 1 or -w 2? I think the likely issue is that you need to have as many map slots as the number of workers plus at least one master. If you don't have enough slots, the job will fail. Also, you might want to dial down the number of vertices a bit, unless you have oodles of memory. Please let us know if that helps. Avery

On 3/5/12 9:03 PM, Abhishek Srivastava wrote: Hi All, I have been trying (quite unsuccessfully for a while now) to run the PageRankBenchmark to play around with Giraph. I got hadoop running in a single-node setup, and hadoop jobs and jars run just fine. When I try to run the PageRankBenchmark, I get this incomprehensible error which I'm not able to diagnose.

---CUT HERE-
abhi@darkstar:trunk $ hadoop jar target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 -w 30
Warning: $HADOOP_HOME is deprecated.
Using org.apache.giraph.benchmark.PageRankBenchmark$PageRankVertex
12/03/04 03:44:08 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
12/03/04 03:44:09 INFO mapred.JobClient: Running job: job_201203031851_0004
12/03/04 03:44:10 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 03:44:26 INFO mapred.JobClient: map 3% reduce 0%
12/03/04 10:43:52 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 10:43:57 INFO mapred.JobClient: Task Id : attempt_201203031851_0004_m_00_0, Status : FAILED
Task attempt_201203031851_0004_m_00_0 failed to report status for 24979 seconds. Killing!
12/03/04 10:44:00 INFO mapred.JobClient: Task Id : attempt_201203031851_0004_m_01_0, Status : FAILED
Task attempt_201203031851_0004_m_01_0 failed to report status for 25159 seconds. Killing!
12/03/04 10:44:07 INFO mapred.JobClient: map 3% reduce 0%
12/03/04 10:49:07 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 10:49:12 INFO mapred.JobClient: Task Id : attempt_201203031851_0004_m_00_1, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/03/04 10:49:22 INFO mapred.JobClient: map 3% reduce 0%
12/03/04 10:54:23 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 10:54:28 INFO mapred.JobClient: Task Id : attempt_201203031851_0004_m_00_2, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
12/03/04 10:54:38 INFO mapred.JobClient: map 3% reduce 0%
12/03/04 10:59:10 INFO mapred.JobClient: Task Id : attempt_201203031851_0004_m_01_1, Status : FAILED
java.lang.IllegalStateException: unregisterHealth: KeeperException - Couldn't delete /_hadoopBsp/job_201203031851_0004/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/darkstar_1
    at org.apache.giraph.graph.BspServiceWorker.unregisterHealth(BspServiceWorker.java:727)
    at org.apache.giraph.graph.BspServiceWorker.failureCleanup(BspServiceWorker.java:735)
    at org.apache.giraph.graph.GraphMapper.run(GraphMapper.java:648)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /_hadoopBsp/job_201203031851_0004/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/darkstar_1
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:728)
    at org.apache.giraph.graph.BspServiceWorker.unregisterHealth(BspServiceWorker.java:721)
    ... 9 more
Task attempt_201203031851_0004_m_01_1 failed to report status for 601 seconds. Killing!
attempt_201203031851_0004_m_01_1: log4j:WARN No appenders could be found for logger (org.apache.zookeeper.ClientCnxn).
attempt_201203031851_0004_m_01_1: log4j:WARN Please initialize the log4j system properly.
12/03/04 10:59:47 INFO mapred.JobClient: map 0% reduce 0%
12/03/04 10:59:58 INFO mapred.JobClient: Job complete: job_201203031851_0004
12/03/04 10:59:58 INFO mapred.JobClient: Counters: 6
12/03/04
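[Editor's note: applying Avery's advice to the command above, a rerun on a single-node setup might look like the following; the -w value is illustrative and must not exceed the available map slots minus one slot for the master.]

abhi@darkstar:trunk $ hadoop jar target/giraph-0.70-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 5000 -w 2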
Re: Giraph input format restrictions
Sorry about the old documentation. I just updated the shortest paths example. Before major changes to the graph distribution, the vertex ids were required to be sorted. That is no longer the case. You can input vertices in any order. The only restriction is that the vertex ids must be unique (no duplicate vertices). If there are duplicates, an exception will be thrown, since duplicates are probably not expected and this is probably an error. This could be relaxed in the future as well if need be, but we would need to decide on how to handle duplicates. Thanks for all the great questions! Avery

On 2/19/12 11:25 AM, yavuz gokirmak wrote: Hi, In the Shortest Paths Example it is written that "Currently there is one restriction on the VertexInputFormat that is not obvious. The vertices must be sorted." I didn't understand the reason for this restriction: why should the vertices be sorted?

Secondly, as I understood it, we have to transform our initial data into a form where each line corresponds to a vertex (with edges and values, if they exist) in the graph. For example, I have data where each row corresponds to an edge between two vertices:

format1:
a b
a c
a d
b c
b a
c d

Do I have to convert this file into a format similar to the one below in order to use it with giraph algorithms?

format2:
a b c d
b c a
c d

thanks..
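[Editor's note: the format1-to-format2 conversion in the question is just a group-by on the source vertex. A minimal, self-contained sketch in plain Java (the class name is hypothetical; this is not a Giraph API):]

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EdgeListToAdjacency {
  public static void main(String[] args) {
    String[] edges = { "a b", "a c", "a d", "b c", "b a", "c d" };
    // LinkedHashMap preserves the first-seen order of source vertices.
    Map<String, List<String>> adjacency =
        new LinkedHashMap<String, List<String>>();
    for (String edge : edges) {
      String[] pair = edge.split(" ");
      List<String> targets = adjacency.get(pair[0]);
      if (targets == null) {
        targets = new ArrayList<String>();
        adjacency.put(pair[0], targets);
      }
      targets.add(pair[1]);
    }
    // Prints "a b c d", "b c a", "c d" -- i.e. format2.
    for (Map.Entry<String, List<String>> entry : adjacency.entrySet()) {
      StringBuilder line = new StringBuilder(entry.getKey());
      for (String target : entry.getValue()) {
        line.append(' ').append(target);
      }
      System.out.println(line);
    }
  }
}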
Re: how to use SimplePageRankVertex
IntIntNullIntTextInputFormat in the examples package (extending TextVertexInputFormat as David suggests) is very similar to what you need, I think, although the types might be different for your application. You can start with that perhaps. Avery

On 2/18/12 7:48 AM, David Garcia wrote: The easiest thing to do is to extend text vertex and/or text vertex input format and/or the record reader. The record reader will give you the vertices you want. Look at the record reader for TextVertexInputFormat. It's an inner class of this format class. Sent from my HTC Inspire™ 4G on ATT

- Reply message - From: yavuz gokirmak ygokir...@gmail.com To: giraph-user@incubator.apache.org Subject: how to use SimplePageRankVertex Date: Sat, Feb 18, 2012 9:08 am

Hi, I am planning to use giraph for network analysis. First I am trying to fully understand the SimplePageRankVertex implementation and modify it in order to serve my needs. I have a question about the example: what is the expected input format for SimplePageRankVertex? I couldn't understand the input format, although the SimplePageRankVertexReader class has only a few lines. My input file consists of rows such as:

usera, userb
usera, userc
userc, usera
userb, userc
userc, userb
...

Each row represents a relation between two users; *usera, userb* means that *usera clicked userb's profile*. Is it possible to do social network analysis over this kind of data using giraph? I will be glad if you can give advice.. thanks in advance best regards ygokirmak
Re: counter limit question
Yes, there is a way to disable the counters at runtime. See GiraphJob:

/** Use superstep counters? (boolean) */
public static final String USE_SUPERSTEP_COUNTERS = "giraph.useSuperstepCounters";

and set it to false. Avery

On 2/16/12 1:41 PM, David Garcia wrote: I have a job that could conceivably involve thousands of supersteps. I know that I can adjust this in mapred-site.xml, but what are the framework's limitations on the number of counters possible? Is there a better way to address this (i.e., prevent Giraph from using Hadoop counters for every superstep)? -David
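[Editor's note: a minimal sketch of applying this from job-setup code. It assumes the GiraphJob(String) constructor from this era and relies on GiraphJob inheriting getConfiguration() from org.apache.hadoop.mapreduce.Job; the job name is hypothetical.]

// Disable the per-superstep Hadoop counters before submitting the job.
GiraphJob job = new GiraphJob("my-giraph-job");
job.getConfiguration().setBoolean(GiraphJob.USE_SUPERSTEP_COUNTERS, false);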
Re: maven, hadoop, zookeeper, and giraph!
Hi Jeffrey, Best attempt at answers inline.

On 2/16/12 6:12 PM, Jeffrey Yunes wrote: Hi Giraph community, I think I followed all of the directions (for Giraph on a pseudo-cluster), and it looks like mvn clean test -Dprop.mapred.job.tracker=localhost:9001 runs fine. However, I'm new to the Hadoop infrastructure, and have a couple of questions about getting started with Giraph.

1) hadoop jar target/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 3 -v -V 50 -w 3 gives me the error java.lang.NullPointerException at org.apache.giraph.benchmark.PageRankBenchmark.run(PageRankBenchmark.java:127) It looks like some error with configuration?

This is a bug. I have a quick fix for it. Sorry about that. I opened an issue for it: https://issues.apache.org/jira/browse/GIRAPH-150

diff --git a/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java b/
index 0e76122..4d08929 100644
--- a/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
+++ b/src/main/java/org/apache/giraph/benchmark/PageRankBenchmark.java
@@ -124,7 +124,8 @@ public class PageRankBenchmark extends EdgeListVertex
     } else {
       job.setVertexClass(PageRankBenchmark.class);
     }
-    LOG.info("Using class " + BspUtils.getVertexClass(getConf()).getName());
+    LOG.info("Using class " +
+        BspUtils.getVertexClass(job.getConfiguration()).getName());
     job.setVertexInputFormatClass(PseudoRandomVertexInputFormat.class);
     job.setWorkerConfiguration(workers, workers, 100.0f);

2) How should I / do I enable log4j? An appender that writes to the HDFS? How else could I grep all my logs for errors and things?

log4j is used by the task trackers to dump to the job logs. If you click on your running job in the web page, you can then click into each task and look at the logs under 'Task Logs'. You can configure the task tracker log4j.properties to set the log level, but the default is info, I believe.

3) With regard to Giraph and maven, none of the directions suggested doing local overrides. Therefore, why should I expect my Giraph installation to refer to libraries and configuration in ~/Applications/hadoop or zookeeper rather than those in ~/.m2/repo?

Giraph builds a massive jar that has all the required classes and jars to launch ZooKeeper and interact with Hadoop. This makes for easy deployment to a running cluster.

4) Why doesn't running maven for Giraph install hadoop along the way (or does it)?

Because there are so many versions of Hadoop, and if you are launching Hadoop, then the hadoop jar should be in your classpath automatically.

I'd appreciate it if you'd help improve my understanding!

No problem. Welcome to Giraph! Thanks! -Jeff
Re: Giraph Architecture bug in
AFAIK we don't have any SOP for opening issues. Maybe I'll take a crack at this one tonight if I find some time, unless you were planning to work on it David. Avery

On 2/8/12 5:46 PM, David Garcia wrote: I opened up GIRAPH-144 (https://issues.apache.org/jira/browse/GIRAPH-144). I apologize if I didn't do it up according to project SOPs. I haven't had time to read them thoroughly. -David

On 2/8/12 7:29 PM, David Garcia dgar...@potomacfusion.com wrote: Yeah, I'll write something up.

On 2/8/12 7:26 PM, Avery Ching ach...@apache.org wrote: Since we call waitForCompletion() (which calls submit() internally) in GiraphJob#run(), we cannot override those methods. A better fix would probably be to use composition rather than inheritance (i.e. public class GiraphJob { Job internalJob; }) and expose the methods we would like as necessary. There are other methods we don't want the user to call (i.e. setMapperClass(), etc.). David, can you please open an issue for this? Avery

On 2/8/12 5:17 PM, David Garcia wrote: This is a very subtle bug. GiraphJob inherits from org.apache.hadoop.mapreduce.Job. However, the methods submit() and waitForCompletion() are not overridden. I assumed that they were implemented, so when I called either one of these methods, the framework started up identity mappers/reducers. A simple fix is to throw unsupported operation exceptions or to implement these methods. Perhaps this has been done already? -David

On 2/7/12 7:46 PM, David Garcia dgar...@potomacfusion.com wrote: I am running into a weird error that I haven't seen yet (I suppose I've been lucky). I see the following in the logging: org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. In the job definition, the property mapreduce.map.class is not even defined. For Giraph, this is usually set to mapreduce.map.class=org.apache.giraph.graph.GraphMapper. I'm building my project with hadoop 0.20.204. When I build the GiraphProject myself (and run my own tests with the project's dependencies), I have no problems. The main difference is that I'm using a Giraph dependency in my work project. All input is welcome. Thx!! -David
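[Editor's note: Avery's composition suggestion might look roughly like the sketch below. It is illustrative only, not the committed fix; the choice of which methods to expose is an assumption.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

/** Sketch: GiraphJob wraps Job instead of extending it. */
public class GiraphJob {
  /** The wrapped Hadoop job; never handed out directly. */
  private final Job internalJob;

  public GiraphJob(Configuration conf, String jobName) throws Exception {
    this.internalJob = new Job(conf, jobName);
  }

  /** The single supported entry point; runs the wrapped job. */
  public boolean run(boolean verbose) throws Exception {
    // ... Giraph-specific setup (mapper class, zero reduces, etc.) ...
    return internalJob.waitForCompletion(verbose);
  }

  // Deliberately no submit(), waitForCompletion(), or setMapperClass():
  // callers can no longer accidentally launch identity map/reduce jobs.
}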
Re: running job with giraph dependency anomaly
If you're using GiraphJob, the mapper class should be set for you. That's weird. Avery

On 2/7/12 5:58 PM, David Garcia wrote: That's interesting. Yes, I don't need native libraries. The problem I'm having is that after I run job.waitForCompletion(..), the job runs a mapper that is something other than GraphMapper. It doesn't complain that a Mapper isn't defined or anything. It runs something else. As I mentioned below, the map class doesn't appear to be defined.

On 2/7/12 7:50 PM, Jakob Homan jgho...@gmail.com wrote: That's not necessarily a bad thing. Hadoop (not Giraph) has a native code library it can use for improved performance. You'll see this message when running on a cluster that has not been deployed to use the native libraries. If I follow what you wrote, most likely your work project cluster is so configured. Unless you actively expect to have the native libraries loaded, I wouldn't be concerned.

On Tue, Feb 7, 2012 at 5:46 PM, David Garcia dgar...@potomacfusion.com wrote: I am running into a weird error that I haven't seen yet (I suppose I've been lucky). I see the following in the logging: org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable. In the job definition, the property mapreduce.map.class is not even defined. For Giraph, this is usually set to mapreduce.map.class=org.apache.giraph.graph.GraphMapper. I'm building my project with hadoop 0.20.204. When I build the GiraphProject myself (and run my own tests with the project's dependencies), I have no problems. The main difference is that I'm using a Giraph dependency in my work project. All input is welcome. Thx!! -David
Re: creating non existing vertices by sending messages
Thanks for the comments David. The behavior of what happens is completely defined by the chosen VertexResolver (see GiraphJob#setVertexResolverClass). Developers can implement any behavior they want. I believe the only reason to bypass it was as a performance optimization. Avery

On 2/3/12 8:34 AM, Claudio Martella wrote: Agreed, probably making the path configurable is the way to go.

On Fri, Feb 3, 2012 at 5:30 PM, David Garcia dgar...@potomacfusion.com wrote: I just wanted to send this out because I remember reading a discussion on this topic. Currently, Giraph will create a vertex in the graph if a message is sent to a vertexID that doesn't exist. Personally, I really, really like this behavior. It enables me to forgo vertex creation if I don't need it. If I need the vertex, I can simply send a message to create it, and process the message that was sent. I understand that there are some concerns with this... I would suggest making this behavior configurable at job creation. This would be an awesome compromise, and would not preclude either type of behavior. -David
Re: multi-graph support in giraph
We can diverge from the Pregel API as long as we have a good reason for it. I do agree that while we can support multi-graphs with a user-chosen edge type, some built-in support that makes programming easier sounds like a good goal. Andre or Claudio, feel free to open a JIRA to discuss this. We should also figure out the appropriate APIs that make it the most convenient to use. Avery

On 2/3/12 9:14 AM, Claudio Martella wrote:

On Fri, Feb 3, 2012 at 6:07 PM, André Kelpe efeshundert...@googlemail.com wrote:

2012/2/3 Claudio Martella claudio.marte...@gmail.com: Hi Andre,

Hi!

As I see it, we'd basically have to move all the API about edges from a single object to an Iterable (i.e. returning multiple edges for a given vertex endpoint, as you suggested), and maybe also returning multiple vertices for a given edge (label).

If the goal of giraph is to be close to the Pregel paper, then that kind of API makes more sense.

From how I see it, we've already diverged from Pregel in many API decisions. Personally, I believe we don't have to stick to Pregel; we just have to design Giraph so that it's useful. For all we know, the API in the paper could be just the smallest subset of the real Pregel API that could fit clearly into the paper.

I am going to look into your code and see if I can integrate it into the copy of giraph I use internally here right now.

Be aware that the code is not meant for general purpose but for a specific task. The extended API methods, though, should be quite general.

The single-graph case, as it's implemented now compared to multi-graph, would be a subcase of this, which internally would return getEdgeValue().iterator().next().

That would mean you'd have two different kinds of vertex, one compatible with single-graphs and one with multi-graphs. Sounds tricky to maintain in the long run, but could be an idea.

I see the single-graph vertex as a subclass of the multi-graph vertex, something along the lines of what's already going on with MutableVertex, so I don't see a problem in maintaining it.

André (@fs111)
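[Editor's note: one possible shape for the API being discussed, as a sketch; type names and method signatures are assumptions, not a committed design.]

/**
 * Sketch of a multi-graph edge API: lookups return all parallel edges
 * to a target vertex.
 */
interface MultiGraphVertex<I, E> {
  /** All parallel edges from this vertex to the given target. */
  Iterable<E> getEdgeValues(I targetVertexId);
}

/** The single-graph vertex narrows the multi-graph contract to one edge. */
interface SingleGraphVertex<I, E> extends MultiGraphVertex<I, E> {
  /** Convenience accessor, e.g. getEdgeValues(id).iterator().next(). */
  E getEdgeValue(I targetVertexId);
}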
Re: [VOTE] Release Giraph 0.1-incubating (rc0)
To address the issues of binaries, could we release multiple binaries of Giraph that coincide with the different versions of Hadoop?

On 1/31/12 7:44 PM, David Garcia wrote: I think these concerns preclude the entire idea of a release. A release should be something that users can use as a dependency... like a Maven coordinate. I think you guys should wait until you have made these decisions... and then cut a binary.

On 1/31/12 5:36 PM, Jakob Homan jgho...@gmail.com wrote: Giraphers- I've created a candidate for our first release. It's a source release without a binary for two reasons: first, there's still discussion going on about what needs to be done for the NOTICE and LICENSE files for projects that bring in transitive dependencies to the binary release (http://www.mail-archive.com/general@incubator.apache.org/msg32693.html) and second because we're still munging our binary against three types of Hadoop, which would mean we'd need to release three different binary artifacts, which seems suboptimal. Hopefully both of these issues will be addressed by 0.2. I've tested the release against an unsecured 20.2 cluster. It'd be great to test it against other configurations. Note that we're voting on the tag; the files are provided as a convenience.

Release notes: http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/RELEASE_NOTES.html
Release artifacts: http://people.apache.org/~jghoman/giraph-0.1.0-incubating-rc0/
Corresponding svn tag: http://svn.apache.org/repos/asf/incubator/giraph/tags/release-0.1-rc0/
Our signing keys (my key doesn't seem to be being picked up by http://people.apache.org/keys/group/giraph.asc): http://svn.apache.org/repos/asf/incubator/giraph/KEYS

The vote runs for 72 hours, until Friday 4pm PST. After a successful vote here, Incubator will vote on the release as well. Thanks, Jakob
Re: giraph stability problem
Glad to hear you fixed your problem. It would be great if you could describe any improvements that would have helped you find the issues earlier. Maybe we (or you) could add them =). Avery

On 1/23/12 8:31 AM, André Kelpe wrote: Hi all, thanks for all the answers so far. It turns out that it actually isn't that much of a problem: I just had some inconsistencies in my input, which made giraph explode. I did a rerun with correct input data, and now the whole thing finishes in a few seconds. It would of course be nice to have the described out-of-process messaging with spill-over to disk for bigger problems, but that seems to be not necessary for the problem space I am in right now :-). --André
Re: Scalability results for GoldenOrb and comparison with Giraph
algorithms display similar properties for configurations in the regime not dominated by a framework-overhead bottleneck. And second, the GoldenOrb SSSP results being compared are also from configurations which have reached a steady power-law slope over the range of nodes considered, for runs using the same algorithm as the Pregel results. These two points, I feel, justify the comparisons made (though, again, it would be better to have a standardized set of configurations for testing to facilitate comparing results, even between algorithms). Since all three sets of scalability tests yield fairly linear complexity plots (execution time vs. number of vertices in the graph, slide 29 of your talk), it also makes sense to compare weak scaling results, a proposition supported by the consistency of the observed GoldenOrb weak scaling results for SSSP across multiple test configurations.

As for the results found in your October 2011 talk, they are impressive and clearly demonstrate an ability to effectively scale to large graph problems (shown by the weak scaling slope of ~0.01) and to maximize the benefit of throwing additional computational resources at a known problem (shown by the strong scaling slope of ~-0.93), so I'm interested to see the results of the improvements that have been made.

I'm a big proponent of routine scalability testing using a fixed set of configurations as part of the software testing process. The comparable results help to quantify improvement as the software is developed further, can often identify unintended side effects of changes, and can help find optimal configurations for various regimes of problems. I would like to see Giraph succeed, so let me know if there are any open issues which I might be able to dig into (I'm on the dev mailing list as well, though I haven't posted there). Thanks, Jon

On Dec 11, 2011, at 1:02 PM, Avery Ching wrote: Hi Jon, -golden...@googlegroups.com (so as to not clog up their mailing list uninvited) First of all, thank you for sharing this comparison. I would like to note a few things. The results I posted in October 2011 were actually a bit old (done in June 2011) and do not include several improvements that reduce memory usage significantly (i.e. GIRAPH-12 and GIRAPH-91). The number of vertices loadable per worker is highly dependent on the number of edges per worker, the amount of available heap memory, the number of messages, the balancing of the graph across the workers, etc. In recent tests at Facebook, I have been able to load over 10 million vertices / worker easily with 20 edges / vertex. I know that you wrote that the maximum per worker was at least 1.6 million vertices for Giraph; I just wanted to let folks know that it's in fact much higher. We'll work on continuing to improve that in the future, as today's graph problems are in the billions of vertices, or rather hundreds of billions =). Also, with respect to scalability, if I'm interpreting these results correctly, does it mean that GoldenOrb is currently unable to load more than 250k vertices / cluster as observed by former Ravel developers? If so, given the small tests and overhead per superstep, I wouldn't expect the scalability to be much improved by more workers. Also, the max value and shortest paths algorithms are highly data-dependent in how many messages are passed around per superstep, and perhaps not a fair scaling comparison with Giraph's scalability-oriented PageRank benchmark test (equal messages per superstep, distributed evenly across vertices).
Would be nice to see an apples-to-apples comparison if someone has the time... =) Thanks, Avery

On 12/10/11 3:16 PM, Jon Allen wrote: Since GoldenOrb was released this past summer, a number of people have asked questions regarding scalability and performance testing, as well as a comparison of these results with those of Giraph (http://incubator.apache.org/giraph/), so I went forward with running tests to help answer some of these questions. A full report of the scalability testing results, along with methodology details, relevant information regarding testing and analysis, links to data points for Pregel and Giraph, scalability testing references, and background mathematics, can be found here: http://wwwrel.ph.utexas.edu/Members/jon/golden_orb/ Since this data will also be of interest to the Giraph community (for methodology, background references, and analysis reasons), I am cross-posting to the Giraph user mailing list. A synopsis of the scalability results for GoldenOrb, and comparison data points for Giraph and Google's Pregel framework, are provided below. The setup and execution of the GoldenOrb scalability tests were conducted by three former Ravel (http://www.raveldata.com) developers, including myself, with extensive knowledge of the GoldenOrb code base and optimal system configurations, ensuring the most optimal settings were used for scalability testing.

RESULTS SUMMARY:
MAX CAPACITY: Pregel (at least
Re: Packaging a Giraph application in a jar
Would be great if you can document what you did. =) Thanks, Avery

On 11/8/11 3:13 PM, Claudio Martella wrote: Sorry guys, my bad. I was calling job.waitForCompletion() directly. I've been coding standard mapreduce the whole weekend... Anyway, I got a solution for clean packaging of your own application over giraph, and that is exactly using the maven-shade-plugin. It will prepare the uberjar for you.

On Tue, Nov 8, 2011 at 9:33 PM, Claudio Martella claudio.marte...@gmail.com wrote: Hello list, I'm actually having trouble as well getting my application running. I've given a shot to the maven-shade plugin, which unpacks my dependencies and packs them all together with my classes in a new jar. I attach the hierarchy of the jar so that somebody can maybe spot what's missing, because I can't get it working. I get an identity map-reduce job with jobconf complaining about no job jar being set. Any idea?

On Sat, Nov 5, 2011 at 5:09 PM, Avery Ching ach...@apache.org wrote: Hi Gianmarco, You're right, most of us (to my knowledge) have been using Giraph with an uberjar, as you've put it. However, Jakob has been doing some work to make this easier. See the below issue: https://issues.apache.org/jira/browse/GIRAPH-64 If you can suggest a better approach, please add to the issue or create a new one if appropriate. Thanks, Avery

On 11/5/11 4:11 AM, Gianmarco De Francisci Morales wrote: Hi community, I was wondering what is the current best practice for packaging an application in a jar for deployment. I tried the 'hadoop way' by putting giraph-*.jar in the /lib directory of my jar, and using the -libjars option, but neither of them worked. It looks like the backend classloader is doing some mess and it doesn't find my own classes in the jar. I resorted to uncompressing the giraph-*.jar and repackaging my classes with it, all at the same level (an uber-fat jar), but even though it works it doesn't sound like the right approach. Any suggestions? Thanks, -- Gianmarco

-- Claudio Martella claudio.marte...@gmail.com
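[Editor's note: for reference, the maven-shade-plugin setup Claudio describes might look like this in the application's pom.xml. This is a minimal sketch; the plugin version and any relocation or resource-filter settings are left to the reader.]

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <!-- Build the uberjar during 'mvn package'. -->
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>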
Re: way to run unit tests from inside IDE?
I use Eclipse and it's okay for running unit tests, but I need to set the VM args in the JUnit run configuration for each specific test to -Dprop.jarLocation=target/giraph-0.70-jar-with-dependencies.jar. I assume you need to do the same for IntelliJ. This is done in pom.xml when doing 'mvn test' and other mvn commands. Avery

On 10/28/11 11:21 PM, Jake Mannix wrote: I seem to be getting weird stuff like:

setup: Using local job runner with location for testBspCombiner
11/10/28 23:21:00 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/28 23:21:00 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-jake/mapred/staging/jake1475251079/.staging/job_local_0005
java.lang.IllegalArgumentException: Can not create a Path from an empty string
    at org.apache.hadoop.fs.Path.checkPathArg(Path.java:82)
    at org.apache.hadoop.fs.Path.<init>(Path.java:90)
    at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:720)
    at org.apache.hadoop.mapred.JobClient.copyAndConfigureFiles(JobClient.java:596)
    at org.apache.hadoop.mapred.JobClient.access$200(JobClient.java:170)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:806)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:791)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:791)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:494)
    at org.apache.giraph.graph.GiraphJob.run(GiraphJob.java:495)
    at org.apache.giraph.TestBspBasic.testBspCombiner(TestBspBasic.java:261)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at com.intellij.junit3.TestRunnerUtil$SuiteMethodWrapper.run(TestRunnerUtil.java:262)
    at com.intellij.junit3.JUnit3IdeaTestRunner.doRun(JUnit3IdeaTestRunner.java:139)
    at com.intellij.junit3.JUnit3IdeaTestRunner.startRunnerWithArgs(JUnit3IdeaTestRunner.java:52)
    at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:199)
    at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

when I try to run in IntelliJ, but not from the command line. Anyone run into this? -jake
Re: Restriction of VertexInputFormat
Hi Gianmarco, Welcome to Giraph! We definitely look forward to having your input/contributions. Answers inline.

On 10/26/11 8:07 AM, Gianmarco De Francisci Morales wrote: Hi, First of all, let me introduce myself: my name is Gianmarco and I am a researcher. Second, let me congratulate the developers on the project. It looks very promising and I am very interested in it. I have two questions.

1) I was trying to understand the system better, and I came across this sentence in the documentation: "Currently there is one restriction on the VertexInputFormat that is not obvious. The vertices must be sorted." Does this still apply? And if so, could someone explain the reason to me?

Yes, it still applies. Please see https://issues.apache.org/jira/browse/GIRAPH-11. I am getting closer to having this done, but got derailed by work. Hopefully I'll have a patch by next week to finally address it (touches pretty much all the code).

2) Do the superstep times that get reported in hadoop counters at the end of the job include communication time or only processing time?

It includes the time of the superstep from the master's perspective (waiting for workers to register health, assigning work, checkpointing (maybe), vertex exchange (maybe), vertex processing, waiting for all workers to finish, etc.).

Thanks, -- Gianmarco De Francisci Morales
Re: Message processing
The GraphLab model is more asynchronous than BSP. They allow you to update your neighbors rather than the BSP model of messaging per superstep. Rather than one massive barrier as in BSP, they implement this with vertex locking. They also allow a vertex to modify the state of its neighbors. We could certainly add something similar as an alternative computing model, perhaps without locking. Here's one idea:

1) No explicit supersteps (asynchronous)
2) All vertices execute compute() (and may or may not send messages) initially
3) Vertices can examine their neighbors or any vertex in the graph (issue RPCs to get their state)
4) When messages are received by a vertex, compute() is executed on it (and its state is locally locked during compute only)
5) Vertices still vote to halt when done, indicating the end of the application
6) Combiners can still be used to reduce the number of messages sent (and the number of times compute is executed)

This could be fun. And it would provide an interesting comparison platform: barrier-based vs. vertex-based synchronization.

On Fri, Sep 9, 2011 at 6:36 AM, Jake Mannix jake.man...@gmail.com wrote:

On Fri, Sep 9, 2011 at 3:22 AM, Claudio Martella claudio.marte...@gmail.com wrote: One misunderstanding on my side. Isn't it true that the messages have to be buffered, as they all have to be collected before they can be processed (by definition of a superstep)? So you cannot really process them as they come?

This is the current implementation, yes, but I'm trying to see if an alternative is also possible in this framework, for Vertex implementations which are able to handle asynchronous updates. In this model, a vertex would be required to be able to handle multiple calls to compute() in a single superstep, and would instead signal that its superstep computations are done at some (application-specific) point. I know this goes outside of the concept of a BSP model, but I didn't want to get into too many details before I figured out how possible it was to implement some of this. -jake
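[Editor's note: a minimal sketch of what a vertex contract for the asynchronous model outlined above might look like; the names and signatures are purely illustrative and do not exist in the codebase.]

/**
 * Sketch of an asynchronously-driven vertex: compute() may be invoked
 * many times between halts, once per message delivery, with no global
 * barrier between invocations.
 */
interface AsyncVertex<I, V, M> {
  /** Invoked whenever messages arrive; the vertex state is locked only
      for the duration of this call. */
  void compute(Iterable<M> messages);

  /** Ask the framework for another vertex's current state via RPC. */
  V getRemoteState(I vertexId);

  /** Signal that this vertex is done; the application ends when all
      vertices have voted to halt and no messages are in flight. */
  void voteToHalt();
}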