I could, but the "small" files could grow beyond what I want to allocate 
memory for.  I could drop the files to disk and then load them back in the job, 
but that seems less efficient than just saving the files and processing them with 
a secondary job to create the sequence files.
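
For concreteness, a map task along the lines Zak suggests might look like the 
sketch below.  The input is assumed to be a text file listing remote file URLs, 
one per line; the MAX_BYTES cap and the plain java.net.URL fetch are just 
placeholders for whatever memory limit and transfer method (FTP client, SCP 
library, etc.) actually apply, and SequenceFileOutputFormat does the sequence 
file writing so no Writer is managed by hand:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: each input line is a remote file URL; the file is buffered in
    // memory and emitted as a (url, bytes) pair.  With SequenceFileOutputFormat
    // set on the job, the framework writes the SequenceFile for us.
    public class FetchToSequenceFileMapper
            extends Mapper<LongWritable, Text, Text, BytesWritable> {

        // Placeholder cap so an unexpectedly large "small" file fails fast
        // instead of exhausting task memory.
        private static final long MAX_BYTES = 64L * 1024 * 1024;

        @Override
        protected void map(LongWritable offset, Text urlLine, Context context)
                throws IOException, InterruptedException {
            String url = urlLine.toString().trim();

            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            // Works for ftp:// and http:// URLs; SCP would need a library.
            InputStream in = new URL(url).openStream();
            try {
                byte[] chunk = new byte[8192];
                long total = 0;
                int read;
                while ((read = in.read(chunk)) != -1) {
                    total += read;
                    if (total > MAX_BYTES) {
                        throw new IOException("Remote file too large to buffer: " + url);
                    }
                    buffer.write(chunk, 0, read);
                }
            } finally {
                in.close();
            }

            context.write(new Text(url), new BytesWritable(buffer.toByteArray()));
        }
    }

The driver would then set job.setOutputFormatClass(SequenceFileOutputFormat.class), 
job.setOutputKeyClass(Text.class), and job.setOutputValueClass(BytesWritable.class) 
so the emitted pairs land directly in sequence files on HDFS.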

Thanks.

On Mar 12, 2010, at 2:20 PM, Zak Stone wrote:

> Why not write a Hadoop map task that fetches the remote files into
> memory and then emits them as key-value pairs into a SequenceFile?
> 
> Zak
> 
> 
> On Fri, Mar 12, 2010 at 8:22 AM, Scott Whitecross <sc...@dataxu.com> wrote:
>> Hi -
>> 
>> I'd like to create a job that pulls small files from a remote server (using 
>> FTP, SCP, etc.) and stores them directly to sequence files on HDFS.  Looking 
>> at the SequenceFile API, I don't see an obvious way to do this.  It looks 
>> like what I have to do is pull the remote file to disk, then read the file 
>> into memory to place in the sequence file.  Is there a better way?
>> 
>> Looking at the API, am I forced to use the append method?
>> 
>>            FileSystem hdfs = FileSystem.get(context.getConfiguration());
>>            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
>>            writer = SequenceFile.createWriter(context.getConfiguration(), outputStream, Text.class, BytesWritable.class, null, null);
>> 
>>            // read in file to remotefilebytes
>> 
>>            writer.append(filekey, remotefilebytes);
>> 
>> 
>> The alternative would be to have one job pull the remote files, and a 
>> secondary job write them into sequence files.
>> 
>> I'm using the latest Cloudera release, which I believe is Hadoop 0.20.1
>> 
>> Thanks.
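
As for the append question: SequenceFile.Writer exposes append() as the call 
that writes each record, so the quoted snippet already has the right shape.  
Filled out a little (imports added, and remoteFileBytes standing in for whatever 
the FTP/SCP fetch produces), a direct write to HDFS with no intermediate local 
file might look like:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Sketch: append one in-memory file straight into a SequenceFile on HDFS.
    public class DirectSequenceFileWrite {

        public static void writeOne(Configuration conf, String outputPath,
                                    String remoteName, byte[] remoteFileBytes)
                throws IOException {
            FileSystem hdfs = FileSystem.get(conf);
            FSDataOutputStream outputStream = hdfs.create(new Path(outputPath));
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    conf, outputStream, Text.class, BytesWritable.class,
                    SequenceFile.CompressionType.NONE, null);
            try {
                writer.append(new Text(remoteName), new BytesWritable(remoteFileBytes));
            } finally {
                writer.close();
                // Close the stream explicitly too, since the Writer was handed
                // an already-open stream.
                outputStream.close();
            }
        }
    }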
