Right now it is just a top-down program. I am still learning this, so if I need to put this into a map/reduce job to optimize speed, I will. At the moment I am just testing certain things and getting a skeleton together to write and pull files from the S3 storage; the actual implementation is still being engineered.
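For what it's worth, below is roughly the map-phase skeleton I have in mind once I get that far, going off your FSDataOutput suggestion. It is only a sketch against the old mapred API; the class name, the write.out.dir key, the output path, and the payload are all placeholders, not my actual implementation:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch only: each map task creates its own files through the
    // FileSystem API, so many small files get written in parallel
    // instead of one at a time from a single-threaded driver.
    public class FileWriteMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private FileSystem fs;
        private Path outDir;

        public void configure(JobConf job) {
            try {
                fs = FileSystem.get(job);
                // "write.out.dir" is a made-up config key for this sketch.
                outDir = new Path(job.get("write.out.dir", "/tmp/out"));
            } catch (IOException e) {
                throw new RuntimeException("Could not get FileSystem", e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            // Treat each input line as the name of one small file to
            // write; the payload here is just the line itself, as a
            // stand-in for the real binary data.
            Path outFile = new Path(outDir, value.toString());
            FSDataOutputStream out = fs.create(outFile);
            try {
                out.write(value.getBytes(), 0, value.getLength());
            } finally {
                out.close();
            }
            output.collect(value, NullWritable.get());
            reporter.incrCounter("files", "written", 1);
        }
    }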
Ananth T Sarathy

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo <[email protected]> wrote:

> >> It would be as fast as underlying filesystem goes.
>
> I would not agree with that statement. There is overhead. If you have
> a single-threaded process writing many small files, you do not get the
> parallel write speed. In some testing I did, writing a small file can
> take 30-300 ms. So if you have 9000 small files (like I did) and you
> are single-threaded, this takes a long time.
>
> If you orchestrate your task to use FSDataInput and FSDataOutput in
> the map or reduce phase, then each mapper or reducer is writing a file
> at a time. Now that is fast.
>
> Ananth, are you doing your r/w inside a map/reduce job, or are you
> just using FS* in a top-down program?
>
> On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi <[email protected]> wrote:
> > Ananth T. Sarathy wrote:
> >>
> >> I am trying to download binary files stored in Hadoop, but there is
> >> like a 2 minute wait on a 20mb file when I try to execute the
> >> in.read(buf).
> >
> > What does this mean: 2 minutes to pipe 20mb, or one of your
> > in.read() calls took 2 minutes? Your code actually measures time
> > for read and write.
> >
> > There is nothing in FSInputStream to cause this slowdown. Do you
> > think anyone would use Hadoop otherwise? It would be as fast as the
> > underlying filesystem goes.
> >
> > Raghu.
> >
> >> is there a better way to be doing this?
> >>
> >>     private void pipe(InputStream in, OutputStream out) throws IOException
> >>     {
> >>         System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
> >>         byte[] buf = new byte[1024];
> >>         int read = 0;
> >>         while ((read = in.read(buf)) >= 0)
> >>         {
> >>             out.write(buf, 0, read);
> >>             System.out.println(System.currentTimeMillis() + " Piping Data");
> >>         }
> >>         out.flush();
> >>         System.out.println(System.currentTimeMillis() + " Finished Piping Data");
> >>     }
> >>
> >>     public void readFile(String fileToRead, OutputStream out)
> >>         throws IOException
> >>     {
> >>         System.out.println(System.currentTimeMillis() + " Start Read File");
> >>         Path inFile = new Path(fileToRead);
> >>         System.out.println(System.currentTimeMillis() + " Set Path");
> >>
> >>         // Validate the input path before reading.
> >>         if (!fs.exists(inFile))
> >>         {
> >>             throw new HadoopFileException("Specified file " + fileToRead
> >>                 + " not found.");
> >>         }
> >>         if (!fs.isFile(inFile))
> >>         {
> >>             throw new HadoopFileException("Specified path " + fileToRead
> >>                 + " is not a file.");
> >>         }
> >>
> >>         // Open inFile for reading.
> >>         System.out.println(System.currentTimeMillis() + " Opening Data Stream");
> >>         FSDataInputStream in = fs.open(inFile);
> >>         System.out.println(System.currentTimeMillis() + " Opened Data Stream");
> >>
> >>         // Read from input stream and write to output stream until EOF.
> >>         pipe(in, out);
> >>
> >>         // Close the streams when done.
> >>         out.close();
> >>         in.close();
> >>     }
> >>
> >> Ananth T Sarathy
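P.S. On the "is there a better way" question quoted above: one simpler variant I may try (just a sketch, not tested on our cluster) is to let Hadoop's own IOUtils do the copy with a larger buffer and report a single elapsed time, so the per-read println calls do not dominate the measurement:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.io.IOUtils;

    public class PipeHelper {
        // Copy with a 64 KB buffer instead of 1 KB, and report one
        // elapsed time for the whole copy rather than printing on
        // every read. The 'false' flag tells copyBytes to leave the
        // streams open, so the caller still closes them as in the
        // original readFile().
        public static void pipe(InputStream in, OutputStream out)
                throws IOException {
            long start = System.currentTimeMillis();
            IOUtils.copyBytes(in, out, 64 * 1024, false);
            out.flush();
            System.out.println((System.currentTimeMillis() - start)
                + " ms to pipe data");
        }
    }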
