I'm not sure of the rationale behind bytearray in Pig (other than the early need to decouple from the Hadoop dependency: Pig was expected to run against any backend, e.g. Hadoop, Dryad, local), but the direct impact is the need to serialize and deserialize between internal objects and byte[] ...
As a simple example: at times this has been a major drain on performance. A Writable-based implementation could have done smart things like deserializing once and continuing to use the internal representation until it needs to serialize at the 'end of pipeline' (either to store or to pass from map to reduce), while the current byte[]-based implementations have to serialize and deserialize at each UDF input/output.
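To make the cost difference concrete, here is a toy sketch in plain Java. It is not Pig or Hadoop code: the class, the `conversions` counter, and both pipeline methods are hypothetical names made up for illustration, and each "UDF" is just a function on a long. It only counts how often a value crosses the byte[] <-> object boundary in each style.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.function.LongUnaryOperator;

// Hypothetical illustration only; none of these names exist in Pig or Hadoop.
public class SerializationCostSketch {
    // Counts every byte[] <-> object conversion performed.
    static int conversions = 0;

    static byte[] serialize(long value) {
        conversions++;
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long deserialize(byte[] bytes) {
        conversions++;
        return ByteBuffer.wrap(bytes).getLong();
    }

    // byte[]-style pipeline: each stage decodes its input and re-encodes its output.
    static byte[] runByteArrayPipeline(byte[] input, List<LongUnaryOperator> stages) {
        byte[] current = input;
        for (LongUnaryOperator stage : stages) {
            current = serialize(stage.applyAsLong(deserialize(current)));
        }
        return current;
    }

    // Object-style pipeline: decode once, stay in object form, encode once at the end.
    static byte[] runObjectPipeline(byte[] input, List<LongUnaryOperator> stages) {
        long value = deserialize(input);
        for (LongUnaryOperator stage : stages) {
            value = stage.applyAsLong(value);
        }
        return serialize(value);
    }

    public static void main(String[] args) {
        // Three stand-in "UDFs" chained together.
        List<LongUnaryOperator> stages = List.of(x -> x + 1, x -> x * 2, x -> x - 3);
        byte[] input = ByteBuffer.allocate(Long.BYTES).putLong(5L).array();

        conversions = 0;
        long viaBytes = ByteBuffer.wrap(runByteArrayPipeline(input, stages)).getLong();
        int byteArrayCost = conversions;

        conversions = 0;
        long viaObjects = ByteBuffer.wrap(runObjectPipeline(input, stages)).getLong();
        int objectCost = conversions;

        // Same result either way; only the number of conversions differs.
        System.out.println(viaBytes + " " + viaObjects + " " + byteArrayCost + " " + objectCost);
    }
}
```

With n stages the byte[] style does 2n conversions and the object style does 2, regardless of n. Real tuples are far more expensive to convert than a single long, so the gap in practice would be larger than this toy suggests.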
The philosophy of Pig has definitely changed since the initial versions, with the Hadoop interfaces leaking out to Pig UDFs ... but it is probably too late to make this change now :-)
(And if it does support Writable, then having byte[] is just silly overhead.)

Regards,
Mridul

On Tuesday 31 May 2011 11:00 PM, Jonathan Coveney wrote:
Disclaimer: I'm still learning my way around the Pig and Hadoop internals, so this question is aimed at better understanding them and some of the Pig design choices...

Is there a reason why in Pig we are restricted to a set of types (roughly corresponding to types in Java), instead of having an abstract type like in Hadoop, i.e. Writable or WritableComparable?

I guess I got to thinking about this when thinking about the Algebraic interface... in Hadoop, if you want to have some crazy intermediate objects, you can do that easily as long as they are serializable (i.e. Writable, and WritableComparable if they are going to the reducer in the shuffle). In fact, in Hadoop there is no notion of some special class of objects which we work with: everything is simply Writable or WritableComparable. In Pig we are more limited, and I was just thinking about why that needs to be the case.

Is there any reason why we can't have abstract types at the same level as String or Integer? My guess would be that it has to do with how these objects are treated internally, but beyond that I am not sure.

Thanks for helping me think about this
Jon
