Hello, 

After my experiences with Giraph and Hadoop over the last few weeks, I would 
strongly suggest that a Maven archetype for a simple Giraph job 
be made available for new developers. 

Figuring out how to change the provided Giraph examples so that they are 
error free in an IDE, 
and then how to run a unit test and an InternalVertexRunner, is manageable. 

However, deploying that same code to a real Hadoop cluster can be very time 
consuming and frustrating. 

There is a strong chance that a few people from my research unit will also need 
to learn about Giraph and Hadoop, 
and providing a Maven archetype is the way in which I would document my 
experiences for them. 


For that archetype I would suggest the following contents: 
* pom.xml which declares the Hadoop dependencies and specifies the assembly 
instructions for a jar that Hadoop can use 
(not nested jars in ./lib as everybody on the web says, but unpacked jars in / ) 
* empty vertex class which is a subclass of HashMapVertex (with comments to 
explain that other classes like BasicVertex should never be subclassed by the 
user) 
* empty TextInputFormat
* empty TextOutputFormat
* empty class with run() and ToolRunner invocation, and comments to explain 
that this is an alternative to bin/giraph, and how to use bin/giraph for the 
same effect
(also explain the more advanced things which a custom run() can do) 
* make sure that all classes can be called through bin/giraph as well (and 
debug GiraphRunner if there still is some error) 
* empty Test class using InternalVertexRunner 
* everything should be runnable via the Test, the ToolRunner or bin/giraph, 
simply without doing anything. 
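
To make the jar layout point above concrete, here is a minimal sketch of how one 
could check an assembled job jar for nested jars (the ./lib style) as opposed to 
unpacked dependencies in / . It uses only the JDK. The class name 
JobJarLayoutCheck and the method nestedJars are hypothetical names made up for 
this illustration; they are not part of Giraph, Hadoop or Maven. 

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Hypothetical helper (not part of Giraph or Hadoop): inspects a job jar's layout. */
final class JobJarLayoutCheck {

    /** Returns the sorted names of all entries that are themselves jars, e.g. "lib/dep.jar". */
    static List<String> nestedJars(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream()
                    .map(entry -> entry.getName())
                    .filter(name -> name.endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java JobJarLayoutCheck <path-to-job-jar>");
            System.exit(2);
        }
        List<String> nested = nestedJars(Paths.get(args[0]));
        if (nested.isEmpty()) {
            System.out.println("No nested jars found; dependencies appear to be unpacked.");
        } else {
            System.out.println("Nested jars found (dependencies are not unpacked):");
            nested.forEach(name -> System.out.println("  " + name));
        }
    }
}
```

Running it against the jar produced by the assembly step, with the jar path as 
the only argument, lists any dependency that was packaged as a nested jar. 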

I also consider this a good opportunity to learn about the best practices of 
using Giraph, 
and I think that I can probably work on that archetype in April. 

The archetype would be based on a cleaned up and domain/use-case agnostic 
version of my code which is currently here: 
 https://github.com/2nd-metaman/sa-rdf-giraph

I am not sure how the archetype would be distributed, probably using the same 
infrastructure 
that is required for distributing a Giraph Maven artifact to the Apache Maven 
servers anyway. 

Please let me know whether you, as the Giraph community, think this is a good 
idea, and whether you have additions and/or changes to what should go inside 
the archetype. 


cheers, Benjamin. 
