You can use any kind of format for files in the distributed cache, so yes, you can use sequence files. They should be faster to parse than most text formats.
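
Since the files are small and are consulted in every map() call, it is probably cheaper to read them once per task and keep the contents in memory than to open, read, and close them for every record. What follows is only a rough sketch in plain Java, not Hadoop code: the `LookupCache` class, its methods, and the tab-separated text layout are all hypothetical and exist only for illustration. In a real job the load would happen in the mapper's one-time setup step, the paths would point at the local copies of the cached files, and for sequence files the line parsing would be replaced by SequenceFile.Reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Hadoop): keeps the keys of the small
// side files in memory so that map() never has to touch the disk.
class LookupCache {
    private final Set<String> keys = new HashSet<>();

    // Reads every file exactly once. Intended to be called from the task's
    // one-time setup step, not from map().
    static LookupCache load(List<Path> localFiles) throws IOException {
        LookupCache cache = new LookupCache();
        for (Path file : localFiles) {
            try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.isEmpty()) {
                        continue;
                    }
                    // Assumed layout: key<TAB>value. Only the key is kept.
                    int tab = line.indexOf('\t');
                    cache.keys.add(tab < 0 ? line : line.substring(0, tab));
                }
            }
        }
        return cache;
    }

    // In-memory check, cheap enough to call for every input record.
    boolean contains(String key) {
        return keys.contains(key);
    }

    int size() {
        return keys.size();
    }

    // Small demonstration using a temporary file in place of a cached file.
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lookup", ".txt");
        Files.write(tmp, Arrays.asList("alpha\t1", "beta\t2"), StandardCharsets.UTF_8);
        LookupCache cache = LookupCache.load(Arrays.asList(tmp));
        System.out.println(cache.contains("alpha"));
        System.out.println(cache.contains("gamma"));
        Files.delete(tmp);
    }
}
```

With up to 20 files of a few MB each, the in-memory set should fit comfortably in a task's heap, but that is an assumption worth checking against your actual data.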

-Joey

On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki <[email protected]> wrote:
> Thank you for the reply!
> In each map(), I need to open-read-close these files (more than 2 in the
> general case, and maybe up to 20 or more), in order to make some checks.
> Considering the huge amount of data in the input, making all these file
> operations on HDFS will kill the performance!!! So I think it would be better
> to store these files in distributed Cache, so that the whole process would be
> more efficient -I guess this is the point of using Distributed Cache in the
> first place!
>
> My question is, if I can store sequence files in distributed Cache and handle
> them using e.g. the SequenceFile.Reader class, or if I should only keep
> regular text files in distributed Cache and handle them using the usual java
> API.
>
> Thank you very much
> Sofia
>
> PS: The files have small size, a few KB to few MB maximum.
>
> ________________________________
> From: Dino Kečo <[email protected]>
> To: [email protected]; Sofia Georgiakaki <[email protected]>
> Sent: Friday, August 12, 2011 11:30 AM
> Subject: Re: Hadoop--store a sequence file in distributed cache?
>
> Hi Sofia,
>
> I assume that output of first job is stored on HDFS. In that case I would
> directly read file from Mappers without using distributed cache. If you put
> file into distributed cache that would add one more copy operation into your
> process.
>
> Thanks,
> dino
>
>
> On Fri, Aug 12, 2011 at 9:53 AM, Sofia Georgiakaki
> <[email protected]> wrote:
>
>> Good morning,
>>
>> I would like to store some files in the distributed cache, in order to be
>> opened and read from the mappers.
>> The files are produced by another Job and are sequence files.
>> I am not sure if that format is proper for the distributed cache, as the
>> files in distr.cache are stored and read locally. Should I change the format
>> of the files in the previous Job and make them Text Files maybe and read
>> them from the Distr.Cache using the simple Java API?
>> Or can I still handle them with the usual way we use sequence files, even
>> if they reside in the local directory? Performance is extremely important
>> for my project, so I don't know what the best solution would be.
>>
>> Thank you in advance,
>> Sofia Georgiakaki

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
