Too bad we cannot up the replication on the first few blocks of the file, or put it in the distributed cache.
The contrib statement is arguable. I could make a case that the majority of stuff should not be in hadoop-core. NLineInputFormat, for example, is nice to have, but it took a long time to get ported to the new MapReduce API. DBInputFormat and DataDrivenDBInputFormat are sexy for sure, but they do not need to be part of core. I could see Hadoop coming with just TextInputFormat and SequenceFileInputFormat, with everything else aftermarket from GitHub.

On Wed, Feb 29, 2012 at 11:31 AM, Robert Evans <ev...@yahoo-inc.com> wrote:
> I can see a use for it, but I have two concerns about it. My biggest concern
> is maintainability. We have had lots of things get thrown into contrib in
> the past, very few people use them, and inevitably they start to suffer from
> bit rot. I am not saying that it will happen with this, but if you have to
> ask if people will use it and there has been no overwhelming yes, it makes me
> nervous about it. My second concern is with knowing when to use this.
> Anything that adds this in would have to come with plenty of documentation
> about how it works, how it is different from the normal gzip format,
> explanations about what type of a load it might put on data nodes that hold
> the start of the file, etc.
>
> From both of these I would prefer to see this as a github project for a while
> first, and once it shows that it has a significant following, or a community
> with it, then we can pull it in. But if others disagree I am not going to
> block it. I am a -0 on pulling this in now.
>
> --Bobby
>
> On 2/29/12 10:00 AM, "Niels Basjes" <ni...@basjes.nl> wrote:
>
> Hi,
>
> On Wed, Feb 29, 2012 at 16:52, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> ...
>
>> But being able to generate split info for them and processing them
>> would be good as well. I remember that was a hot thing to do with lzo
>> back in the day. The pain of once overing the gz files to generate the
>> split info is detracting but it is nice to know it is there if you
>> want it.
>>
>
> Note that the solution I created (HADOOP-7076) does not require any
> preprocessing.
> It can split ANY gzipped file as-is.
> The downside is that this effectively costs some additional performance
> because the task has to decompress the first part of the file that is to be
> discarded.
>
> The other two ways of splitting gzipped files either require
> - creating some kind of "compression index" before actually using the file
> (HADOOP-6153)
> - creating a file in a format that is generated in such a way that it is
> really a set of concatenated gzipped files. (HADOOP-7909)
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>
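
For anyone trying to picture the cost Niels mentions, here is a rough Python sketch of the decompress-and-discard idea: every reader starts at byte 0 of the gzip stream, throws away everything before its own split, and only then keeps data. It is not the HADOOP-7076 patch. The function name `read_split` is made up for illustration, and expressing split boundaries as uncompressed byte offsets is a simplifying assumption of this sketch, not a claim about how the patch defines its splits.

```python
import gzip
import io


def read_split(gz_bytes, start, end, chunk_size=64 * 1024):
    """Hypothetical helper: return (payload, discarded).

    payload   -- uncompressed bytes in the range [start, end)
    discarded -- number of bytes decompressed and thrown away first
    """
    discarded = 0
    parts = []
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as stream:
        # A gzip stream cannot be entered mid-way, so begin at byte 0 and
        # decompress-and-drop everything that precedes this split.
        while discarded < start:
            chunk = stream.read(min(chunk_size, start - discarded))
            if not chunk:  # the file is shorter than `start`
                return b"", discarded
            discarded += len(chunk)

        # Now positioned at `start`: keep this split's own data.
        remaining = end - start
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:  # reached the end of the file
                break
            parts.append(chunk)
            remaining -= len(chunk)
    return b"".join(parts), discarded


if __name__ == "__main__":
    original = b"".join(b"record %06d\n" % i for i in range(10000))
    compressed = gzip.compress(original)
    split_size = 40000
    for start in range(0, len(original), split_size):
        payload, discarded = read_split(compressed, start, start + split_size)
        print(start, len(payload), discarded)
```

The `discarded` count grows with the split's position in the file, so later splits do more throwaway decompression, and every reader touches the start of the file. That matches the extra performance cost and the load on the data nodes holding the first blocks raised in the thread.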