You can use any file format in the distributed cache, so yes, you can
use sequence files. They should be faster to parse than most text
formats.
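
Whichever format you pick, the usual way to avoid the per-record
open-read-close cost is to load the small files into memory once per task
and do the checks against that in-memory copy. Below is a minimal sketch of
that pattern in plain Java with no Hadoop dependency. LookupCache is a
hypothetical helper, not a Hadoop class, and it assumes text files with one
key per line purely to keep the example self-contained. In a real job you
would call load() from the mapper's setup method with the localized paths
(for example from DistributedCache.getLocalCacheFiles(conf)), and for
sequence files you would swap the line-reading loop for a
SequenceFile.Reader opened on the local file system.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Hadoop: keeps the contents of small
// side files in memory so that per-record checks never touch the disk.
final class LookupCache {
    private final Set<String> keys = new HashSet<>();

    // Call once per task (e.g. from the mapper's setup method), not per record.
    // Assumption: each file is small text with one key per line.
    static LookupCache load(List<Path> localFiles) throws IOException {
        LookupCache cache = new LookupCache();
        for (Path file : localFiles) {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                String key = line.trim();
                if (!key.isEmpty()) {
                    cache.keys.add(key);
                }
            }
        }
        return cache;
    }

    // Cheap in-memory check, suitable for calling on every input record.
    boolean contains(String key) {
        return keys.contains(key);
    }

    // Number of distinct keys loaded across all files.
    int size() {
        return keys.size();
    }
}
```

With files of a few KB to a few MB, holding everything in a HashSet per
task should be cheap. If the checks need values as well as keys, a HashMap
would be the natural replacement.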

-Joey

On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki
<[email protected]> wrote:
> Thank you for the reply!
> In each map(), I need to open, read, and close these files (more than 2 in the
> general case, and maybe up to 20 or more) in order to make some checks.
> Considering the huge amount of input data, doing all these file operations
> on HDFS will kill performance! So I think it would be better to store these
> files in the distributed cache, so that the whole process is more efficient.
> I guess this is the point of using the distributed cache in the first place!
>
> My question is whether I can store sequence files in the distributed cache and
> handle them using e.g. the SequenceFile.Reader class, or whether I should only
> keep regular text files in the distributed cache and handle them using the
> usual Java API.
>
> Thank you very much
> Sofia
>
> PS: The files are small, from a few KB to a few MB at most.
>
>
>
> ________________________________
> From: Dino Kečo <[email protected]>
> To: [email protected]; Sofia Georgiakaki <[email protected]>
> Sent: Friday, August 12, 2011 11:30 AM
> Subject: Re: Hadoop--store a sequence file in distributed cache?
>
> Hi Sofia,
>
> I assume that the output of the first job is stored on HDFS. In that case I
> would read the files directly from the mappers without using the distributed
> cache. Putting a file into the distributed cache adds one more copy operation
> to your process.
>
> Thanks,
> dino
>
>
> On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
> <[email protected]> wrote:
>
>> Good morning,
>>
>> I would like to store some files in the distributed cache so that they can
>> be opened and read from the mappers.
>> The files are produced by another job and are sequence files.
>> I am not sure whether that format is suitable for the distributed cache, as
>> the files in the distributed cache are stored and read locally. Should I
>> change the format of the files in the previous job, make them text files,
>> and read them from the distributed cache using the simple Java API?
>> Or can I still handle them the usual way we handle sequence files, even
>> if they reside in the local directory? Performance is extremely important
>> for my project, so I don't know what the best solution would be.
>>
>> Thank you in advance,
>> Sofia Georgiakaki



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434
