Hi Avery,

There is also the stateful Mapper/Reducer feature, which could be nice for Giraph: users could write Serializable objects in their business logic that would be automatically available, serialized/deserialized via the DistributedCache underneath.
Regarding serialization, Pangool is very efficient at intermediate serialization (the one used in shuffle/sort) and less efficient at persistence (Avro). As a downside, every custom user data type costs one extra byte (to identify its size). Besides that, I will analyze the project further to get a better picture and suggest something more specific if warranted.

On Wed, Sep 19, 2012 at 11:12 PM, Avery Ching <[email protected]> wrote:

> Thanks for contacting us Pere.
>
> We use Writable for serialization/deserialization given its speed. We
> are open to other APIs, but speed is an important concern (a lot of time is
> spent doing serialization/deserialization). We don't use the actual
> Hadoop framework for much except for scheduling, so I'm not sure how we can
> take advantage of Pangool's interesting features.
>
> Avery
>
>
> On 9/19/12 11:19 AM, Pere Ferrera wrote:
>
>> Hi to all,
>>
>> I have been taking a look at Giraph's source code. I have noticed the
>> heavy usage of Writables in it and, even though I don't know many of the
>> details of the project, I think it would be a good idea to at least
>> consider using Pangool instead of the Java Hadoop API.
>>
>> Pangool (http://pangool.net) is a low-level Java API on top of Hadoop
>> that aims to make several things easier, one of them being dealing with
>> compound types. Most of the others don't apply to Giraph, since you are
>> doing map-only jobs.
>>
>> The most interesting part of it for Giraph is that you would be able to
>> have vertices with Java classes (Integer, Float, ... or arbitrary
>> serializable Objects) without needing to worry about them being Writable.
>> This would reduce some of the code and complexity of the project, and it
>> would allow for more expressive code, decoupled from Hadoop, where user
>> functions (business logic) operate directly on Java types rather than on
>> Hadoop types.
>>
>> Pangool has been designed for performance, so it should perform in the
>> same order as plain Hadoop (we did a benchmark to show that). Pangool
>> uses Avro for persisting data. It is being used successfully in
>> production in some of our consulting projects (datasalt.com), so we
>> contribute actively to it.
>>
>> So, if this could be interesting at all, I will be glad to submit a
>> proposal in a patch and contribute. It would be a win-win situation,
>> where Pangool would benefit a lot from being actively used by a serious
>> open-source project like Giraph. Of course, many details would need to
>> be discussed. Take this as a preliminary suggestion just to see how it
>> sounds. Feel free to ask any questions or raise any concerns you may
>> have.
>>
>> Thanks,
>>
>> Pere.
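To make the Writable boilerplate concrete, here is a minimal sketch of what a Giraph user writes today for a compound vertex value. The class name and fields are hypothetical; in real code the class would declare `implements org.apache.hadoop.io.Writable` (the interface is omitted here so the sketch compiles standalone), and the hand-written `write`/`readFields` pair is exactly what Pangool aims to make unnecessary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class VertexValueSketch {
    private double rank;
    private long outDegree;

    VertexValueSketch() {}  // Writable requires a no-arg constructor

    VertexValueSketch(double rank, long outDegree) {
        this.rank = rank;
        this.outDegree = outDegree;
    }

    // Serialize the fields in a fixed order, as Writable.write() would.
    void write(DataOutput out) throws IOException {
        out.writeDouble(rank);
        out.writeLong(outDegree);
    }

    // Deserialize the fields in the same order, as Writable.readFields() would.
    void readFields(DataInput in) throws IOException {
        rank = in.readDouble();
        outDegree = in.readLong();
    }

    double getRank() { return rank; }
    long getOutDegree() { return outDegree; }

    public static void main(String[] args) throws IOException {
        // Round-trip through a byte buffer, as the Hadoop shuffle does.
        VertexValueSketch original = new VertexValueSketch(0.85, 42L);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        VertexValueSketch copy = new VertexValueSketch();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getRank() + " " + copy.getOutDegree());
        // prints "0.85 42"
    }
}
```

Every field added to such a class means touching both methods in matching order; Pangool's pitch, as described above, is that plain Java types could be used directly instead.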
