Hi - I'd like to create a job that pulls small files from a remote server (over FTP, SCP, etc.) and writes them directly into sequence files on HDFS. Looking at the SequenceFile API, I don't see an obvious way to do this. It looks like I would have to pull each remote file to local disk, then read it into memory so I can place it in the sequence file. Is there a better way?
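For concreteness, here is a minimal sketch of the two-step workaround I mean. The fetchToLocalDisk helper is hypothetical (a stand-in for whatever FTP/SCP client ends up doing the transfer), and the paths and key are just examples:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RemoteToSequenceFile {

        // Hypothetical helper: downloads the remote file to local disk
        // via FTP/SCP and returns the local copy.
        static File fetchToLocalDisk(String remotePath) throws IOException {
            // ... real FTP/SCP client code would go here ...
            throw new UnsupportedOperationException("stand-in for the transfer client");
        }

        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem hdfs = FileSystem.get(conf);

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    hdfs, conf, new Path("/user/me/files.seq"),
                    Text.class, BytesWritable.class);
            try {
                // Step 1: stage the remote file on local disk.
                File local = fetchToLocalDisk("remote/path/somefile");

                // Step 2: read the whole file into memory (they're small)
                // and append it as a single key/value pair.
                byte[] bytes = new byte[(int) local.length()];
                FileInputStream in = new FileInputStream(local);
                try {
                    in.read(bytes);  // fine for small files; loop for robustness
                } finally {
                    in.close();
                }
                writer.append(new Text(local.getName()), new BytesWritable(bytes));
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

Since the files are small, buffering each one fully in memory seems fine; it's the extra trip through local disk that feels wasteful.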
Looking at the API, am I forced to use the append method?

    FileSystem hdfs = FileSystem.get(context.getConfiguration());
    FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
    SequenceFile.Writer writer = SequenceFile.createWriter(
            context.getConfiguration(), outputStream,
            Text.class, BytesWritable.class,
            SequenceFile.CompressionType.NONE, null);

    // read the remote file into remotefilebytes (a BytesWritable)
    writer.append(filekey, remotefilebytes);

The alternative would be to have one job pull the remote files and a second job write them into sequence files.

I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1. Thanks.