Re: Best practices for handling many small files
Joydeep Sen Sarma wrote:
> There seem to be two problems with small files: 1. namenode overhead (3307 seems like _a_ solution); 2. map-reduce processing overhead and locality. It's not clear from the 3307 description how the archives interface with map-reduce. How are the splits done? Will they solve problem #2?

Yes, I think 3307 will address (2). Many small files will be packed into fewer larger files, each typically substantially larger than a block. A splitter can read the index files and then use MultiFileInputFormat, so that each split contains files that lie almost entirely within a single block. Good MapReduce performance is a requirement for the design of 3307.

Doug
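For illustration, here is a rough sketch of the splitter idea described above. The HADOOP-3307 index format is not shown in this thread, so the ArchiveIndexEntry record below is hypothetical; the point is only that index entries can be binned by the block of the packed part-file they start in, and each bin turned into a MultiFileSplit.

```java
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileSplit;

public class ArchiveBlockBinner {
  /** Hypothetical index record: a small file packed at [offset, offset+length) of an archive part-file. */
  public static class ArchiveIndexEntry {
    public final Path file;     // logical name of the small file
    public final long offset;   // where it starts inside the packed part-file
    public final long length;   // its length in bytes
    public ArchiveIndexEntry(Path file, long offset, long length) {
      this.file = file; this.offset = offset; this.length = length;
    }
  }

  /** Group index entries by the block their start offset falls into, one split per block. */
  public static List<MultiFileSplit> binByBlock(JobConf job, List<ArchiveIndexEntry> index,
                                                long blockSize) {
    Map<Long, List<ArchiveIndexEntry>> bins = new TreeMap<Long, List<ArchiveIndexEntry>>();
    for (ArchiveIndexEntry e : index) {
      long blockIndex = e.offset / blockSize;
      List<ArchiveIndexEntry> bin = bins.get(blockIndex);
      if (bin == null) { bin = new ArrayList<ArchiveIndexEntry>(); bins.put(blockIndex, bin); }
      bin.add(e);
    }
    List<MultiFileSplit> splits = new ArrayList<MultiFileSplit>();
    for (List<ArchiveIndexEntry> bin : bins.values()) {
      Path[] files = new Path[bin.size()];
      long[] lengths = new long[bin.size()];
      for (int i = 0; i < bin.size(); i++) {
        files[i] = bin.get(i).file;
        lengths[i] = bin.get(i).length;
      }
      splits.add(new MultiFileSplit(job, files, lengths));
    }
    return splits;
  }
}
```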
RE: Best practices for handling many small files
There seem to be two problems with small files: 1. namenode overhead (3307 seems like _a_ solution); 2. map-reduce processing overhead and locality. It's not clear from the 3307 description how the archives interface with map-reduce. How are the splits done? Will they solve problem #2?

To some extent, the goals of an archive and the (ideal) MultiFileInputFormat (MFIF) differ. The archive wants to preserve the identity of each of the sub-objects. MFIF offers a way for users to homogenize small objects into larger ones (while processing). When users want to use MFIF, they don't care about the identities of each of the small files. In the interest of layering, it seems we should keep these separate. 3307 can offer an efficient way to store small files. A good MFIF implementation can offer a way to run map-reduce efficiently over small files (whether those small files are regular HDFS files or are backed by an archive). This way the user will also have the option of using the regular FileInputFormat (say they _do_ care about which file is being processed).

-Original Message-
From: Konstantin Shvachko [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 25, 2008 10:46 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

Would the new archive feature HADOOP-3307 that is currently being developed help this problem? http://issues.apache.org/jira/browse/HADOOP-3307
--Konstantin
Re: Best practices for handling many small files
Would the new archive feature HADOOP-3307 that is currently being developed help this problem? http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin
Re: Best practices for handling many small files
We have actually written a custom Multi File Splitter that collapses all the small files into a single split until the DFS block size is hit. We also take care of handling big files by splitting them on the block size and adding up all the remainders (if any) into a single split.

It works great for us :-) We are working on optimizing it further to club all the small files on a single data node together so that the map can have maximum local data.

We plan to share this (provided it's found acceptable, of course) once this is done.

Regards,
Subru
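Subru's splitter itself was not posted, so the following is only a hedged sketch of the packing logic described above (small files accumulated into one split until the DFS block size is reached), written against the old org.apache.hadoop.mapred API; the large-file handling he mentions is omitted, and exact FileSystem method names may vary slightly across Hadoop versions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileSplit;

public class SmallFileBinner {
  /** Pack small files from one directory into block-sized MultiFileSplits. */
  public static List<MultiFileSplit> pack(JobConf job, Path inputDir) throws IOException {
    FileSystem fs = inputDir.getFileSystem(job);
    long blockSize = fs.getDefaultBlockSize();          // target size per split
    List<MultiFileSplit> splits = new ArrayList<MultiFileSplit>();
    List<Path> files = new ArrayList<Path>();
    List<Long> lengths = new ArrayList<Long>();
    long accumulated = 0;
    for (FileStatus stat : fs.listStatus(inputDir)) {
      files.add(stat.getPath());
      lengths.add(stat.getLen());
      accumulated += stat.getLen();
      if (accumulated >= blockSize) {                    // close this bin
        splits.add(toSplit(job, files, lengths));
        files.clear(); lengths.clear(); accumulated = 0;
      }
    }
    if (!files.isEmpty()) {                              // leftover bin
      splits.add(toSplit(job, files, lengths));
    }
    return splits;
  }

  private static MultiFileSplit toSplit(JobConf job, List<Path> files, List<Long> lengths) {
    long[] l = new long[lengths.size()];
    for (int i = 0; i < l.length; i++) l[i] = lengths.get(i);
    return new MultiFileSplit(job, files.toArray(new Path[files.size()]), l);
  }
}
```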
RE: Best practices for handling many small files
Wouldn't it be possible to take the record reader class as a config variable and then have a concrete implementation that instantiates the configured record reader (like StreamInputFormat does)?

What I meant about the splits wasn't so much the number of maps, but how the allocation of files to each map task is done. Currently the logic (in MultiFileInputFormat) doesn't take location into account. All we need to do is sort the files by location in the getSplits() method and then do the binning. That way, files in the same split will be co-located. (OK, there are multiple locations for each file, but I would think choosing _a_ location and binning based on that would be better than doing so randomly.)
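A hedged sketch of the location-aware binning suggested above: pick one replica host per file, group files by that host, and emit one MultiFileSplit per group. getFileBlockLocations() and the other FileSystem calls here may differ across Hadoop versions, and real code would also cap each bin at roughly a block's worth of data; treat this as an illustration of the idea, not a drop-in patch to MultiFileInputFormat.

```java
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileSplit;

public class LocationAwareBinner {
  public static List<MultiFileSplit> bin(JobConf job, FileStatus[] inputs) throws IOException {
    // Group files by one (arbitrary but consistent) replica host: the first host of the first block.
    Map<String, List<FileStatus>> byHost = new TreeMap<String, List<FileStatus>>();
    for (FileStatus f : inputs) {
      FileSystem fs = f.getPath().getFileSystem(job);
      BlockLocation[] locs = fs.getFileBlockLocations(f, 0, f.getLen());
      String host = (locs != null && locs.length > 0 && locs[0].getHosts().length > 0)
          ? locs[0].getHosts()[0] : "unknown";
      List<FileStatus> group = byHost.get(host);
      if (group == null) { group = new ArrayList<FileStatus>(); byHost.put(host, group); }
      group.add(f);
    }
    // One split per host group (splitting each group further by size is elided for brevity).
    List<MultiFileSplit> splits = new ArrayList<MultiFileSplit>();
    for (List<FileStatus> group : byHost.values()) {
      Path[] paths = new Path[group.size()];
      long[] lengths = new long[group.size()];
      for (int i = 0; i < group.size(); i++) {
        paths[i] = group.get(i).getPath();
        lengths[i] = group.get(i).getLen();
      }
      splits.add(new MultiFileSplit(job, paths, lengths));
    }
    return splits;
  }
}
```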
Re: Best practices for handling many small files
Stuart,

This will have the (slightly) desirable side effect of making your total disk footprint smaller. I don't suppose that matters all that much any more, but it is still a nice thought.
Re: Best practices for handling many small files
Thanks for the advice, everyone. I'm going to go with #2, packing my million files into a small number of SequenceFiles. This is slow, but only has to be done once. My "datacenter" is Amazon Web Services :), so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files

-Stuart
altlaw.org

On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys. Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
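Stuart's actual code is at the link above; what follows is only a minimal sketch of option #2 along the same lines: read each small file whole and append it to a BLOCK-compressed SequenceFile keyed by file name. The command-line arguments and single-directory layout are assumptions made for the example.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);      // directory of small files
    Path output = new Path(args[1]);        // target SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        byte[] contents = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, contents);        // files are small, so one buffer is fine
        } finally {
          in.close();
        }
        // File name as key, raw bytes as value.
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}
```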
Re: Best practices for handling many small files
A shameless attempt to defend MultiFileInputFormat:

A concrete implementation of MultiFileInputFormat is not needed, since every InputFormat relying on MultiFileInputFormat is expected to have its own custom RecordReader implementation and thus needs to override getRecordReader(). An implementation which returns (sort of) a LineRecordReader is under src/examples/.../MultiFileWordCount. However, we may include one if any generic implementation (for example, one returning a SequenceFileRecordReader) pops up.

An InputFormat returns its splits from getSplits(JobConf job, int numSplits), where numSplits is the number of maps, not the number of machines in the cluster. Last of all, the MultiFileSplit class implements the getLocations() method, which returns the files' locations, so it's the JobTracker's job to assign tasks to leverage local processing.

Coming to the original question, I think #2 is better, if the construction of the sequence file is not a bottleneck. You may, for example, create several sequence files in parallel and use all of them as input without merging.
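The shipped example mentioned above is MultiFileWordCount under src/examples; as a further illustration of the pattern, here is a hedged sketch of a concrete MultiFileInputFormat subclass (old org.apache.hadoop.mapred API) whose RecordReader emits one (path, whole-file-contents) record per small file. Generics and exact signatures may need adjusting for a particular Hadoop version.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WholeFileMultiFileInputFormat extends MultiFileInputFormat<Text, BytesWritable> {

  public RecordReader<Text, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((MultiFileSplit) split, job);
  }

  /** Emits one (path, contents) record per file in the split. */
  static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
    private final MultiFileSplit split;
    private final JobConf job;
    private int index = 0;

    WholeFileRecordReader(MultiFileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(Text key, BytesWritable value) throws IOException {
      if (index >= split.getNumPaths()) return false;
      Path path = split.getPath(index);
      long length = split.getLength(index);
      key.set(path.toString());
      byte[] contents = new byte[(int) length];
      FileSystem fs = path.getFileSystem(job);
      FSDataInputStream in = fs.open(path);
      try {
        in.readFully(0, contents);            // whole small file in one read
      } finally {
        in.close();
      }
      value.set(contents, 0, contents.length);
      index++;
      return true;
    }

    public Text createKey() { return new Text(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return index; }    // position in files, not bytes
    public float getProgress() {
      return split.getNumPaths() == 0 ? 1.0f : (float) index / split.getNumPaths();
    }
    public void close() { }
  }
}
```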
Re: Best practices for handling many small files
Are the files to be stored on HDFS long term, or do they need to be fetched from an external authoritative source?

Depending on how things are set up in your datacenter etc., you could aggregate them into a fat sequence file (or a few). Keep in mind how long it would take to fetch the files and aggregate them (this is a serial process), and how often the corpus changes (how often you will need to rebuild these sequence files).

Another option is to make a manifest (a list of docs to fetch), feed that to your mapper, and have it fetch each file individually. This would be useful if the corpus is reasonably arbitrary between runs, and could eliminate much of the load time, but it is painful if the data is external to your datacenter and the cost to refetch is high.

There really is no simple answer.

ckw

Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
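A small sketch of the manifest option described above, under stated assumptions: the job's input is a plain text file with one document URL per line, each map call fetches that document with plain java.net.URL (not a Hadoop API), and the class name is made up; retries, timeouts, and error handling are omitted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ManifestFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, BytesWritable> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, BytesWritable> output, Reporter reporter)
      throws IOException {
    String url = line.toString().trim();
    if (url.length() == 0) return;               // skip blank manifest lines
    InputStream in = new URL(url).openStream();
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try {
      byte[] chunk = new byte[8192];
      int n;
      while ((n = in.read(chunk)) > 0) {
        buf.write(chunk, 0, n);
      }
    } finally {
      in.close();
    }
    output.collect(new Text(url), new BytesWritable(buf.toByteArray()));
    reporter.progress();                          // keep the task alive on slow fetches
  }
}
```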
RE: Best practices for handling many small files
A million map processes are horrible. Aside from the overhead, don't do it if you share the cluster with other jobs (all other jobs will get killed whenever the million-map job is finished - see https://issues.apache.org/jira/browse/HADOOP-2393).

Well, even for #2, it raises the question of how the packing itself will be parallelized.

There's a MultiFileInputFormat that can be extended - it allows processing of multiple files in a single map task. It needs improvement. For one, it's an abstract class, and a concrete implementation for (at least) text files would help. Also, the splitting logic is not very smart (from what I last saw). Ideally, it should take the million files and form them into N groups (say N is the size of your cluster) where each group has files local to the Nth machine, and then process each group on that machine. Currently it doesn't do this (the groups are arbitrary). But it's still the way to go.
Re: Best practices for handling many small files
Yes. That (2) should work well.

On 4/23/08 8:55 AM, "Stuart Sierra" <[EMAIL PROTECTED]> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys. Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org