I would do it by staging the machine data into a temporary directory and then renaming the directory when it's been verified. So, data would be written into directories like this:

2012-01/02/00/stage/machine1.log.avro
2012-01/02/00/stage/machine2.log.avro
2012-01/02/00/stage/machine3.log.avro

After verification, you'd rename the 2012-01/02/00/stage directory to 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic operation, you get the guarantee that you're looking for without having to do extra I/O. There shouldn't be a benefit to merging the individual files unless they're too small.
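Something like this (a minimal sketch in Java against the Hadoop FileSystem API, assuming the stage/done layout above; the class name and the verification step are placeholders) would do the promotion:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PromoteHour {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stage = new Path("2012-01/02/00/stage");
    Path done = new Path("2012-01/02/00/done");

    // ... verify that every expected machineN.log.avro is present and complete ...

    // rename() is a single atomic metadata operation in HDFS, so readers
    // either see the whole hour under done/ or nothing at all.
    if (!fs.rename(stage, done)) {
      throw new IllegalStateException("Rename failed: " + stage + " -> " + done);
    }
  }
}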
-Joey

On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrime...@gmail.com> wrote:
> Hi Bobby,
>
> Actually, the problem we're trying to solve is one of completeness.
>
> Say we have 3 machines generating log events and putting them to HDFS on an
> hourly basis.
> e.g.
> 2012-01/01/00/machine1.log.avro
> 2012-01/01/00/machine2.log.avro
> 2012-01/01/00/machine3.log.avro
>
> Sometime after the hour, we would have a scheduled job verify that all the
> expected machines' log files are present and complete in HDFS.
>
> Before launching MapReduce jobs for a given date range, we want to verify
> that the job will run over complete data.
> If not, the query would error out.
>
> We want our query/MapReduce layer to not need to be aware of logs at the
> machine level, only the presence or absence of an hour's worth of logs.
>
> We were thinking that after verifying all the individual log files for an
> hour, they could be combined into 2012-01/01/00/log.avro.
> The presence of 2012-01-01-00.log.avro would be all that needs to be
> verified.
>
> However, we're new to both Avro and Hadoop, so we're not sure of the most
> efficient (and reliable) way to accomplish this.
>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>
> Frank,
>
> That depends on what you mean by combining. It sounds like you are trying to
> aggregate data from several days, which may involve doing a join, so I would
> say a MapReduce job is your best bet. If you are not going to do any
> processing at all, then why are you trying to combine them? Is there
> something that requires them all to be part of a single file? MapReduce
> processing should be able to handle reading in multiple files just as well
> as reading in a single file.
>
> --Bobby Evans
>
> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
>
> Hi All,
>
> I was wondering if there was an easy way to combine multiple .avro files
> efficiently,
> e.g. combining multiple hours of logs into a daily aggregate.
>
> Note that our Avro schema might evolve to have new (nullable) fields added,
> but no fields will be removed.
>
> I'd like to avoid needing to pull the data down for combining and a
> subsequent "hadoop dfs -put".
>
> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that
> automatically?
> FYI, the following seems to indicate that Avro files might be easily
> combinable: https://issues.apache.org/jira/browse/AVRO-127
>
> Or is an M/R job the best way to go for this?
>
> Thanks,
>
> Frank Grimes

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
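For the merge step Frank describes, here is a minimal sketch in Java, assuming an Avro release that provides DataFileWriter.appendAllFrom; the class name and paths are illustrative. appendAllFrom copies data blocks without deserializing them, but it requires every input to have the same writer schema, so it won't by itself absorb the schema evolution mentioned in the thread:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatHourlyLogs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path[] inputs = {
        new Path("2012-01/01/00/machine1.log.avro"),
        new Path("2012-01/01/00/machine2.log.avro"),
        new Path("2012-01/01/00/machine3.log.avro")
    };
    Path output = new Path("2012-01/01/00/log.avro");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
    Schema schema = null;
    for (Path in : inputs) {
      DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
          fs.open(in), new GenericDatumReader<GenericRecord>());
      if (schema == null) {
        // Take the schema from the first file and create the output with it.
        schema = reader.getSchema();
        writer.create(schema, fs.create(output));
      }
      // Copy the file's data blocks verbatim; no decode/re-encode cycle.
      writer.appendAllFrom(reader, false);
      reader.close();
    }
    writer.close();
  }
}

If the schemas do differ across files, a small MapReduce job along the lines Bobby suggests, reading with the newest schema and rewriting the records, is the safer route.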