Hi -

I'd like to create a job that pulls small files from a remote server (using
FTP, SCP, etc.) and stores them directly in sequence files on HDFS.  Looking at
the SequenceFile API, I don't see an obvious way to do this.  It looks like
what I have to do is pull each remote file to local disk, then read it into
memory to place it in the sequence file.  Is there a better way?

Looking at the API, am I forced to use the append method?

            FileSystem hdfs = FileSystem.get(context.getConfiguration());
            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    context.getConfiguration(), outputStream,
                    Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.NONE, null);

            // read the remote file into remotefilebytes (a BytesWritable)

            writer.append(filekey, remotefilebytes);
            writer.close();
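For comparison, here is a minimal sketch of the direct-streaming version I'm
imagining: read the remote file's InputStream (however the FTP/SCP client
exposes it) straight into memory and append it, with no intermediate local
file.  The class and helper names and the buffer size are my own placeholders,
not anything from the Hadoop API:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class RemoteToSequenceFile {

        // Hypothetical helper: copy one remote file into an already-open
        // SequenceFile.Writer without touching local disk.
        static void appendRemoteFile(SequenceFile.Writer writer,
                                     String fileName,
                                     InputStream remote) throws IOException {
            // Buffer the remote stream fully in memory (the files are small).
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = remote.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            writer.append(new Text(fileName),
                          new BytesWritable(buffer.toByteArray()));
        }
    }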


The alternative would be to have one job pull the remote files onto HDFS, and a
second job write them into sequence files (sketched below).
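If I went that route, the second stage might look something like this: walk a
staging directory on HDFS and pack each small file into the sequence file.
The /staging path, outputPath, and conf are placeholders of mine, not anything
prescribed:

    FileSystem hdfs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
            conf, hdfs.create(new Path(outputPath)),
            Text.class, BytesWritable.class,
            SequenceFile.CompressionType.NONE, null);

    // Append every file under the hypothetical staging directory,
    // keyed by its file name.
    for (FileStatus stat : hdfs.listStatus(new Path("/staging"))) {
        FSDataInputStream in = hdfs.open(stat.getPath());
        byte[] bytes = new byte[(int) stat.getLen()];
        in.readFully(bytes);
        in.close();
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(bytes));
    }
    writer.close();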

I'm using the latest Cloudera release, which I believe is based on Hadoop 0.20.1.

Thanks.



