Re: Should splittable Gzip be a "core" hadoop feature?

Edward Capriolo Wed, 29 Feb 2012 09:07:03 -0800

Too bad we cannot raise the replication factor on just the first few
blocks of the file, or put it in the distributed cache.


The contrib statement is arguable. I could make a case that the
majority of stuff should not be in hadoop-core. NLineInputFormat, for
example, is nice to have, but it took a long time to get ported to the
new MapReduce API. DBInputFormat and DataDrivenDBInputFormat are sexy
for sure, but do not need to be part of core. I could see Hadoop as
just coming with TextInputFormat and SequenceFileInputFormat, with
everything else being aftermarket from GitHub.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans <ev...@yahoo-inc.com> wrote:
> I can see a use for it, but I have two concerns about it.  My biggest concern 
> is maintainability.  We have had lots of things get thrown into contrib in 
> the past, very few people use them, and inevitably they start to suffer from 
> bit rot.  I am not saying that it will happen with this, but if you have to 
> ask if people will use it and there has been no overwhelming yes, it makes me 
> nervous about it.  My second concern is with knowing when to use this.  
> Anything that adds this in would have to come with plenty of documentation 
> about how it works, how it is different from the normal gzip format, 
> explanations about what type of a load it might put on data nodes that hold 
> the start of the file, etc.
>
> From both of these I would prefer to see this as a github project for a while 
> first, and once it shows that it has a significant following, or a community 
> with it, then we can pull it in.  But if others disagree I am not going to 
> block it.  I am a -0 on pulling this in now.
>
> --Bobby
>
> On 2/29/12 10:00 AM, "Niels Basjes" <ni...@basjes.nl> wrote:
>
> Hi,
>
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> ...
>
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of making one pass over the gz files to
>> generate the split info is a drawback, but it is nice to know it is
>> there if you want it.
>>
>
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing.
> It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
>
> The other two ways of splitting gzipped files require either
> - creating some kind of "compression index" before actually using the file
> (HADOOP-6153), or
> - creating a file in a format that is generated in such a way that it is
> really a set of concatenated gzipped files (HADOOP-7909).
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
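For anyone skimming the thread, here is a rough Python sketch of two of the
ideas Niels describes above: reading a slice of a plain gzip file by
decompressing and discarding everything before it, and building a file out
of concatenated gzip members. This is only an illustration under stated
assumptions, not the code from HADOOP-7076 or HADOOP-7909. The function names
are made up, and the offsets in the first function are uncompressed byte
offsets, whereas a real HDFS split is defined on the compressed bytes.

```python
import gzip
import io


def read_range_by_discarding(gz_bytes, start, end):
    """Return uncompressed bytes [start, end) from a plain gzip stream.

    gzip has no random access, so everything before `start` has to be
    decompressed and thrown away first. (Hypothetical helper.)
    """
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        to_skip = start
        while to_skip > 0:
            chunk = stream.read(min(to_skip, 64 * 1024))
            if not chunk:  # ran out of data before reaching `start`
                return b""
            to_skip -= len(chunk)  # this work is wasted for the caller
        return stream.read(end - start)


def concatenate_members(parts):
    """Compress each part on its own and join the resulting gzip members.

    Returns the joined bytes plus the compressed offset at which each
    member starts. (Hypothetical helper.)
    """
    blob, offsets = b"", []
    for part in parts:
        offsets.append(len(blob))
        blob += gzip.compress(part)
    return blob, offsets
```

The first function shows where the extra cost comes from: a task that wants
a slice near the end of the file still decompresses nearly all of it. The
second shows why the concatenated layout avoids that: the joined bytes are
still one valid gzip file, and any member can be decompressed by itself once
its starting offset is known.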
