Hi Avery,

There is also the stateful Mapper/Reducer feature, which could be nice for Giraph: users could write Serializable objects in their business logic that would be automatically available, serialized/deserialized via the DistributedCache underneath.
Regarding serialization, Pangool is very efficient at intermediate serialization (the one used in shuffle/sort) and less efficient at persistence (Avro). As a downside, every custom user data type costs one extra byte (to identify its size). Besides that, I will analyze the project further to get a better picture and suggest something more specific if warranted.

On Wed, Sep 19, 2012 at 11:12 PM, Avery Ching <[email protected]> wrote:

> Thanks for contacting us Pere.
>
> We use Writable for serialization/deserialization given its speed. We
> are open to other APIs, but speed is an important concern (a lot of time is
> spent doing serialization/deserialization). We don't use the actual
> Hadoop framework for much except for scheduling, so I'm not sure how we can
> take advantage of Pangool's interesting features.
>
> Avery
>
>
> On 9/19/12 11:19 AM, Pere Ferrera wrote:
>
>> Hi to all,
>>
>> I have been taking a look at Giraph's source code. I have noticed the
>> heavy usage of Writables in it and, even though I don't know many of the
>> details of the project, I think it would be a good idea to at least
>> consider using Pangool instead of the Java Hadoop API.
>>
>> Pangool (http://pangool.net) is a low-level Java API on top of Hadoop
>> that aims to make several things easier, one of them being dealing with
>> compound types. Most of the others don't apply to Giraph, since you are
>> doing map-only jobs.
>>
>> The most interesting part of it for Giraph is that you would be able to
>> have vertices with Java classes (Integer, Float, ... or arbitrary
>> serializable Objects) without needing to worry about them being Writable.
>> This would reduce some of the code and complexity of the project, and it
>> would allow for more expressive code, decoupled from Hadoop, where user
>> functions (business logic) operate directly on Java types rather than on
>> Hadoop types.
>>
>> Pangool has been designed for performance, so it should perform in the
>> same order as plain Hadoop (we did a benchmark to show that). Pangool
>> uses Avro for persisting data. It is being used successfully in
>> production in some of our consulting projects (datasalt.com), so we
>> contribute actively to it.
>>
>> So, if this could be interesting at all, I will be glad to submit a
>> proposal in a patch and contribute. It would be a win-win situation,
>> where Pangool would benefit a lot from being actively used by a serious
>> open-source project like Giraph. Of course, many details would need to
>> be discussed. Take this as a preliminary suggestion just to see how it
>> sounds. Feel free to ask any questions or raise any concerns you may
>> have.
>>
>> Thanks,
>>
>> Pere.
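To make the Writable boilerplate concrete, here is a minimal sketch of what a Giraph user writes today for a compound vertex value. The class name and fields are hypothetical; in real code the class would declare `implements org.apache.hadoop.io.Writable` (the interface is omitted here so the sketch compiles standalone), and the hand-written `write`/`readFields` pair is exactly what Pangool aims to make unnecessary.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class VertexValueSketch {
    private double rank;
    private long outDegree;

    VertexValueSketch() {}  // Writable requires a no-arg constructor

    VertexValueSketch(double rank, long outDegree) {
        this.rank = rank;
        this.outDegree = outDegree;
    }

    // Serialize the fields in a fixed order, as Writable.write() would.
    void write(DataOutput out) throws IOException {
        out.writeDouble(rank);
        out.writeLong(outDegree);
    }

    // Deserialize the fields in the same order, as Writable.readFields() would.
    void readFields(DataInput in) throws IOException {
        rank = in.readDouble();
        outDegree = in.readLong();
    }

    double getRank() { return rank; }
    long getOutDegree() { return outDegree; }

    public static void main(String[] args) throws IOException {
        // Round-trip through a byte buffer, as the Hadoop shuffle does.
        VertexValueSketch original = new VertexValueSketch(0.85, 42L);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        VertexValueSketch copy = new VertexValueSketch();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(copy.getRank() + " " + copy.getOutDegree());
        // prints "0.85 42"
    }
}
```

Every field added to such a class means touching both methods in matching order; Pangool's pitch, as described above, is that plain Java types could be used directly instead.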
