Right now it is just a top-down program. I am still learning this, so if I need to put this into a map/reduce job to optimize speed, I will. At the moment I am just testing certain things and getting a skeleton together to write and pull files from the S3 storage; the actual implementation is still being engineered.
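For what it's worth, below is roughly the map-phase skeleton I have in mind once I get that far, going off your FSDataOutput suggestion. It is only a sketch against the old mapred API; the class name, the write.out.dir key, the output path, and the payload are all placeholders, not my actual implementation:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Sketch only: each map task creates its own files through the
    // FileSystem API, so many small files get written in parallel
    // instead of one at a time from a single-threaded driver.
    public class FileWriteMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {

        private FileSystem fs;
        private Path outDir;

        public void configure(JobConf job) {
            try {
                fs = FileSystem.get(job);
                // "write.out.dir" is a made-up config key for this sketch.
                outDir = new Path(job.get("write.out.dir", "/tmp/out"));
            } catch (IOException e) {
                throw new RuntimeException("Could not get FileSystem", e);
            }
        }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output,
                        Reporter reporter) throws IOException {
            // Treat each input line as the name of one small file to
            // write; the payload here is just the line itself, as a
            // stand-in for the real binary data.
            Path outFile = new Path(outDir, value.toString());
            FSDataOutputStream out = fs.create(outFile);
            try {
                out.write(value.getBytes(), 0, value.getLength());
            } finally {
                out.close();
            }
            output.collect(value, NullWritable.get());
            reporter.incrCounter("files", "written", 1);
        }
    }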
Ananth T Sarathy

On Wed, Aug 19, 2009 at 11:11 AM, Edward Capriolo <[email protected]> wrote:

> >> It would be as fast as underlying filesystem goes.
>
> I would not agree with that statement. There is overhead. If you have
> a single-threaded process writing many small files, you do not get the
> parallel write speed. In some testing I did, writing a small file can
> take 30-300 ms. So if you have 9000 small files (like I did) and you
> are single-threaded, this takes a long time.
>
> If you orchestrate your task to use FSDataInput and FSDataOutput in
> the map or reduce phase, then each mapper or reducer is writing a file
> at a time. Now that is fast.
>
> Ananth, are you doing your r/w inside a map/reduce job, or are you
> just using FS* in a top-down program?
>
> On Wed, Aug 19, 2009 at 1:26 AM, Raghu Angadi <[email protected]> wrote:
> > Ananth T. Sarathy wrote:
> >>
> >> I am trying to download binary files stored in Hadoop, but there is
> >> like a 2 minute wait on a 20mb file when I try to execute the
> >> in.read(buf).
> >
> > What does this mean: 2 minutes to pipe 20mb, or one of your
> > in.read() calls took 2 minutes? Your code actually measures time
> > for read and write.
> >
> > There is nothing in FSInputStream to cause this slowdown. Do you
> > think anyone would use Hadoop otherwise? It would be as fast as the
> > underlying filesystem goes.
> >
> > Raghu.
> >
> >> is there a better way to be doing this?
> >>
> >>     private void pipe(InputStream in, OutputStream out) throws IOException
> >>     {
> >>         System.out.println(System.currentTimeMillis() + " Starting to Pipe Data");
> >>         byte[] buf = new byte[1024];
> >>         int read = 0;
> >>         while ((read = in.read(buf)) >= 0)
> >>         {
> >>             out.write(buf, 0, read);
> >>             System.out.println(System.currentTimeMillis() + " Piping Data");
> >>         }
> >>         out.flush();
> >>         System.out.println(System.currentTimeMillis() + " Finished Piping Data");
> >>     }
> >>
> >>     public void readFile(String fileToRead, OutputStream out)
> >>         throws IOException
> >>     {
> >>         System.out.println(System.currentTimeMillis() + " Start Read File");
> >>         Path inFile = new Path(fileToRead);
> >>         System.out.println(System.currentTimeMillis() + " Set Path");
> >>
> >>         // Validate the input path before reading.
> >>         if (!fs.exists(inFile))
> >>         {
> >>             throw new HadoopFileException("Specified file " + fileToRead
> >>                 + " not found.");
> >>         }
> >>         if (!fs.isFile(inFile))
> >>         {
> >>             throw new HadoopFileException("Specified path " + fileToRead
> >>                 + " is not a file.");
> >>         }
> >>
> >>         // Open inFile for reading.
> >>         System.out.println(System.currentTimeMillis() + " Opening Data Stream");
> >>         FSDataInputStream in = fs.open(inFile);
> >>         System.out.println(System.currentTimeMillis() + " Opened Data Stream");
> >>
> >>         // Read from input stream and write to output stream until EOF.
> >>         pipe(in, out);
> >>
> >>         // Close the streams when done.
> >>         out.close();
> >>         in.close();
> >>     }
> >>
> >> Ananth T Sarathy
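P.S. On the "is there a better way" question quoted above: one simpler variant I may try (just a sketch, not tested on our cluster) is to let Hadoop's own IOUtils do the copy with a larger buffer and report a single elapsed time, so the per-read println calls do not dominate the measurement:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.io.IOUtils;

    public class PipeHelper {
        // Copy with a 64 KB buffer instead of 1 KB, and report one
        // elapsed time for the whole copy rather than printing on
        // every read. The 'false' flag tells copyBytes to leave the
        // streams open, so the caller still closes them as in the
        // original readFile().
        public static void pipe(InputStream in, OutputStream out)
                throws IOException {
            long start = System.currentTimeMillis();
            IOUtils.copyBytes(in, out, 64 * 1024, false);
            out.flush();
            System.out.println((System.currentTimeMillis() - start)
                + " ms to pipe data");
        }
    }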
