I could; however, the "small" files could grow beyond what I want to allocate memory for. I could drop the files to disk and then load them in the job as well, but that seems less efficient than just saving the files and processing them with a secondary job to create the sequence files.
Thanks.

On Mar 12, 2010, at 2:20 PM, Zak Stone wrote:

> Why not write a Hadoop map task that fetches the remote files into
> memory and then emits them as key-value pairs into a SequenceFile?
>
> Zak
>
> On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
>> Hi -
>>
>> I'd like to create a job that pulls small files from a remote server (using
>> FTP, SCP, etc.) and stores them directly to sequence files on HDFS. Looking
>> at the sequence file API, I don't see an obvious way to do this. It looks
>> like what I have to do is pull the remote file to disk, then read the file
>> into memory to place in the sequence file. Is there a better way?
>>
>> Looking at the API, am I forced to use the append method?
>>
>>     FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>     FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>     writer = SequenceFile.createWriter(context.getConfiguration(),
>>             outputStream, Text.class, BytesWritable.class, null, null);
>>
>>     // read in file to remotefilebytes
>>
>>     writer.append(filekey, remotefilebytes);
>>
>> The alternative would be to have one job pull the remote files, and a
>> secondary job write them into sequence files.
>>
>> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1.
>>
>> Thanks.
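For reference, here is a minimal sketch of the direct-write approach discussed above: fetch each remote file into memory and append it straight to a SequenceFile on HDFS, with no intermediate local copy. The fetchRemote() helper is a hypothetical placeholder for the actual FTP/SCP transfer; the SequenceFile calls are the same createWriter()/append() pair from the snippet quoted above.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RemoteToSequenceFile {

        // Hypothetical placeholder: pull one remote file (FTP, SCP, ...) fully into
        // memory, e.g. with commons-net FTPClient or JSch into a ByteArrayOutputStream.
        private static byte[] fetchRemote(String remotePath) throws IOException {
            throw new UnsupportedOperationException("remote fetch goes here");
        }

        public static void main(String[] args) throws IOException {
            // args[0] = output sequence file on HDFS, args[1..n] = remote file paths
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    hdfs, conf, new Path(args[0]), Text.class, BytesWritable.class);
            try {
                for (int i = 1; i < args.length; i++) {
                    byte[] remoteFileBytes = fetchRemote(args[i]);
                    // key = remote path, value = raw file bytes; no local temp file involved
                    writer.append(new Text(args[i]), new BytesWritable(remoteFileBytes));
                }
            } finally {
                writer.close();
            }
        }
    }

Inside a map task the same pattern applies: take the Configuration from context.getConfiguration(), open the writer in setup(), append one record per fetched file, and close it in cleanup(). Note that append() still needs each file's bytes in memory at once, which is where the size concern above comes in.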