>>It would be as fast as the underlying filesystem goes.
I would not agree with that statement. There is overhead. If you have
a single-threaded process writing many small files, you do not get the
parallel write speed HDFS is capable of. In some testing I did, writing
one small file could take 30-300 ms. So if you have 9000 small files
(like I did) and you are single-threaded, that adds up to a long time.
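
To make that concrete, here is a minimal sketch of the sequential
pattern I mean. The path, file count, and 4 KB payload are made-up
example values; it assumes a FileSystem from the default Configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileWriteTiming {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] payload = new byte[4 * 1024]; // example "small file" contents
        long start = System.currentTimeMillis();
        for (int i = 0; i < 9000; i++) {
            // One create/write/close round trip per file. At 30-300 ms
            // each, 9000 files take roughly 5 to 45 minutes in one thread.
            Path p = new Path("/tmp/smallfiles/file-" + i); // example path
            FSDataOutputStream out = fs.create(p);
            out.write(payload);
            out.close();
        }
        System.out.println("Wrote 9000 files in "
                + (System.currentTimeMillis() - start) + " ms");
    }
}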

If you orchestrate your task to use FSDataInputStream and
FSDataOutputStream inside the map or reduce phase, then each mapper or
reducer writes one file at a time, but many tasks write in parallel.
Now that is fast.
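
As a rough sketch of what I mean (the output directory, key/value
types, and class name are placeholders I made up, and this assumes the
newer org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that writes one side file per key. Because the
// framework runs many reducers concurrently, the per-file overhead is
// paid in parallel instead of serially.
public class FilePerKeyReducer
        extends Reducer<Text, BytesWritable, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values,
            Context context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/output/files/" + key.toString()); // example layout
        FSDataOutputStream stream = fs.create(out);
        for (BytesWritable value : values) {
            stream.write(value.getBytes(), 0, value.getLength());
        }
        stream.close();
        // Record where the file landed.
        context.write(key, new Text(out.toString()));
    }
}

With, say, 50 reduce tasks, the per-file overhead is paid 50 ways at
once instead of one file after another.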

Ananth, are you doing your r/w inside a map/reduce job or are you just
using FS* in a top down program?



On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi<[email protected]> wrote:
> Ananth T. Sarathy wrote:
>>
>> I am trying to download binary files stored in Hadoop but there is like a
>> 2 minute wait on a 20mb file when I try to execute the in.read(buf).
>
> What does this mean: 2 min to pipe 20mb, or one of your in.read() calls
> took 2 minutes? Your code actually measures time for read and write
> together.
>
> There is nothing in FSInputStream to cause this slowdown. Do you think
> anyone would use Hadoop otherwise? It would be as fast as the underlying
> filesystem goes.
>
> Raghu.
>
>> is there a better way to be doing this?
>>
>>    private void pipe(InputStream in, OutputStream out) throws IOException
>>    {
>>        System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
>>        byte[] buf = new byte[1024];
>>        int read = 0;
>>        while ((read = in.read(buf)) >= 0)
>>        {
>>            out.write(buf, 0, read);
>>            System.out.println(System.currentTimeMillis() + " Piping Data");
>>        }
>>        out.flush();
>>        System.out.println(System.currentTimeMillis() + " Finished Piping Data");
>>    }
>>
>> public void readFile(String fileToRead, OutputStream out)
>>            throws IOException
>>    {
>>        System.out.println(System.currentTimeMillis() + " Start Read File");
>>        Path inFile = new Path(fileToRead);
>>        System.out.println(System.currentTimeMillis() + " Set Path");
>>        // Validate the input path before reading.
>>        if (!fs.exists(inFile))
>>        {
>>            throw new HadoopFileException("Specified file " + fileToRead
>>                    + " not found.");
>>        }
>>        if (!fs.isFile(inFile))
>>        {
>>            throw new HadoopFileException("Specified path " + fileToRead
>>                    + " is not a file.");
>>        }
>>        // Open inFile for reading.
>>        System.out.println(System.currentTimeMillis() + " Opening Data Stream");
>>        FSDataInputStream in = fs.open(inFile);
>>        System.out.println(System.currentTimeMillis() + " Opened Data Stream");
>>
>>        // Read from input stream and write to output stream until EOF.
>>        pipe(in, out);
>>
>>        // Close the streams when done.
>>        out.close();
>>        in.close();
>>    }
>> Ananth T Sarathy
>>
>
>
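
P.S. To separate the two cases Raghu raises, time each in.read() call
on its own instead of the whole read-plus-write loop. A rough sketch
(assuming in and out are the same streams readFile already has, and the
100 ms threshold is an arbitrary example):

        byte[] buf = new byte[1024];
        while (true) {
            long t0 = System.currentTimeMillis();
            int read = in.read(buf);          // time only the read
            long readMs = System.currentTimeMillis() - t0;
            if (readMs > 100) {               // arbitrary example threshold
                System.out.println("slow read: " + readMs + " ms, "
                        + read + " bytes");
            }
            if (read < 0) {
                break;
            }
            out.write(buf, 0, read);          // write cost stays outside the timer
        }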
