Scott,

The code you have below should work, provided that outputPath points to an HDFS file. The trick is to get FTP/SCP access to the remote files using a Java client and read the contents into a byte buffer. You can then wrap that byte buffer in your BytesWritable and call writer.append().
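For example, here's a rough sketch using Apache Commons Net's FTPClient (the host, credentials, and file paths are placeholders, and it assumes commons-net is on your classpath):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.commons.net.ftp.FTP;
import org.apache.commons.net.ftp.FTPClient;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class FtpToSequenceFile {

    // Fetch one remote file over FTP straight into memory -- no local disk.
    static byte[] fetchViaFtp(String host, String user, String pass,
                              String remotePath) throws IOException {
        FTPClient ftp = new FTPClient();
        try {
            ftp.connect(host);
            if (!ftp.login(user, pass)) {
                throw new IOException("FTP login failed for " + user + "@" + host);
            }
            ftp.enterLocalPassiveMode();
            ftp.setFileType(FTP.BINARY_FILE_TYPE);
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            if (!ftp.retrieveFile(remotePath, buf)) {
                throw new IOException("Failed to retrieve " + remotePath);
            }
            return buf.toByteArray();
        } finally {
            if (ftp.isConnected()) {
                ftp.logout();
                ftp.disconnect();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FSDataOutputStream out = hdfs.create(new Path("/data/remote-files.seq"));

        SequenceFile.Writer writer = SequenceFile.createWriter(
                conf, out, Text.class, BytesWritable.class,
                SequenceFile.CompressionType.NONE, null);
        try {
            byte[] bytes = fetchViaFtp("ftp.example.com", "user", "secret",
                                       "/incoming/report.csv");
            // The remote path makes a reasonable key; the raw bytes are the value.
            writer.append(new Text("/incoming/report.csv"),
                          new BytesWritable(bytes));
        } finally {
            writer.close();
        }
    }
}

SCP would work the same way as long as you can stream the remote bytes into memory, e.g. with a library like JSch.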
On Fri, Mar 12, 2010 at 9:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
> Hi -
>
> I'd like to create a job that pulls small files from a remote server (using
> FTP, SCP, etc.) and stores them directly to sequence files on HDFS. Looking
> at the sequence file API, I don't see an obvious way to do this. It looks
> like what I have to do is pull the remote file to disk, then read the file
> into memory to place in the sequence file. Is there a better way?
>
> Looking at the API, am I forced to use the append method?
>
> FileSystem hdfs = FileSystem.get(context.getConfiguration());
> FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
> writer = SequenceFile.createWriter(context.getConfiguration(),
>         outputStream, Text.class, BytesWritable.class, null, null);
>
> // read in file to remotefilebytes
>
> writer.append(filekey, remotefilebytes);
>
> The alternative would be to have one job pull the remote files, and a
> secondary job write them into sequence files.
>
> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1
>
> Thanks.