Wouldn't it be possible to take the RecordReader class as a config
variable and then have a concrete implementation that instantiates the
configured record reader? (like StreamInputFormat does)
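
Something along these lines - just a rough sketch, the class name and
config key are made up, and I haven't compiled it against the current
mapred API:

import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch: a concrete MultiFileInputFormat that instantiates whatever
// RecordReader class is named in the job config (in the spirit of
// StreamInputFormat). Class name and config key are hypothetical.
public class ConfigurableMultiFileInputFormat
    extends MultiFileInputFormat<Object, Object> {

  // Hypothetical config key naming the RecordReader implementation.
  public static final String READER_CLASS_KEY = "multifile.record.reader.class";

  @SuppressWarnings("unchecked")
  public RecordReader<Object, Object> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    Class<? extends RecordReader> readerClass =
        job.getClass(READER_CLASS_KEY, null, RecordReader.class);
    if (readerClass == null) {
      throw new IOException(READER_CLASS_KEY + " is not set");
    }
    // Assumes the configured reader has a no-arg constructor and picks up
    // the split/conf through its own initialization.
    return (RecordReader<Object, Object>)
        ReflectionUtils.newInstance(readerClass, job);
  }
}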


What I meant about the splits wasn't so much the number of
maps, but how the allocation of files to each map task is done.
Currently the logic (in MultiFileInputFormat) doesn't take location
into account. All we need to do is sort the files by location in the
getSplits() method and then do the binning. That way, files in the same
split will be co-located. (Granted, there are multiple locations for each
file, but I would think choosing _a_ location and binning based on that
would be better than doing so randomly.)
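
Roughly something like this - just a sketch, the helper names are mine,
and since real files have multiple block locations it simply picks the
first host of the first block:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Sketch of location-aware binning for getSplits(): sort the input files
// by one of their block locations, then bin neighbouring files into the
// same split so each split is (mostly) co-located on a single host.
public class LocationAwareBinning {

  /** Pick _a_ host for the file: the first host of its first block. */
  static String primaryHost(FileSystem fs, FileStatus stat) throws IOException {
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    return (blocks.length > 0 && blocks[0].getHosts().length > 0)
        ? blocks[0].getHosts()[0] : "";
  }

  /** Sort files by their primary host, then bin contiguous runs. */
  static List<List<Path>> bin(JobConf job, FileStatus[] files, int numSplits)
      throws IOException {
    final FileSystem fs = FileSystem.get(job);
    FileStatus[] sorted = files.clone();
    Arrays.sort(sorted, new Comparator<FileStatus>() {
      public int compare(FileStatus a, FileStatus b) {
        try {
          return primaryHost(fs, a).compareTo(primaryHost(fs, b));
        } catch (IOException e) {
          return 0; // fall back to arbitrary order on error
        }
      }
    });
    // Bin contiguous runs of the sorted list; co-located files end up together.
    List<List<Path>> bins = new ArrayList<List<Path>>();
    int filesPerSplit = (sorted.length + numSplits - 1) / numSplits;
    for (int i = 0; i < sorted.length; i += filesPerSplit) {
      List<Path> bin = new ArrayList<Path>();
      for (int j = i; j < Math.min(i + filesPerSplit, sorted.length); j++) {
        bin.add(sorted[j].getPath());
      }
      bins.add(bin);
    }
    return bins;
  }
}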

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 24, 2008 12:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

A shameless attempt to defend MultiFileInputFormat:

A concrete implementation of MultiFileInputFormat is not needed, since 
every InputFormat relying on MultiFileInputFormat is expected to have 
its own RecordReader implementation, and thus needs to override 
getRecordReader(). An implementation which returns (a sort of) 
LineRecordReader is under src/examples/.../MultiFileWordCount. However, 
we could include one if a generic implementation (for example, one 
returning SequenceFileRecordReader) pops up.

An InputFormat returns <numSplits> splits from getSplits(JobConf 
job, int numSplits), where numSplits is the number of maps, not the 
number of machines in the cluster.

Last of all, the MultiFileSplit class implements the getLocations() 
method, which returns the files' locations. Thus it is the JobTracker's 
job to assign tasks so as to leverage local processing.

Coming to the original question, I think #2 is better, if the 
construction of the sequence file is not a bottleneck. You may, for 
example, create several sequence files in parallel and use all of them 
as input w/o merging.
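
For reference, packing the small files into a block-compressed 
SequenceFile (option #2, file name as key, file bytes as value) is only 
a few lines. A rough sketch - the class name, the single-input-directory 
assumption, and the command-line arguments are made up:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack the small files in one directory into a single
// BLOCK-compressed SequenceFile, keyed by file name.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);   // directory of small files
    Path packed = new Path(args[1]);     // output SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) continue;
        byte[] contents = new byte[(int) stat.getLen()];
        InputStream in = fs.open(stat.getPath());
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          in.close();
        }
        // Key = file name, value = raw file bytes.
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}

Running several of these over disjoint subdirectories gives you the 
parallel construction mentioned above, without any merging.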


Joydeep Sen Sarma wrote:
> million map processes are horrible. aside from overhead - don't do it
> if u share the cluster with other jobs (all other jobs will get killed
> whenever the million map job is finished - see
> https://issues.apache.org/jira/browse/HADOOP-2393)
>
> well - even for #2 - it begs the question of how the packing itself
> will be parallelized ..
>
> There's a MultiFileInputFormat that can be extended - that allows
> processing of multiple files in a single map job. it needs improvement.
> For one - it's an abstract class - and a concrete implementation for (at
> least) text files would help. also - the splitting logic is not very
> smart (from what i last saw). ideally - it should take the million files
> and form it into N groups (say N is size of your cluster) where each
> group has files local to the Nth machine and then process them on that
> machine. currently it doesn't do this (the groups are arbitrary). But
> it's still the way to go ..
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED] on behalf of Stuart Sierra
> Sent: Wed 4/23/2008 8:55 AM
> To: core-user@hadoop.apache.org
> Subject: Best practices for handling many small files
>  
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
>
>
>   
