I would do it by staging the machine data into a temporary directory and then renaming the directory when it's been verified. So, data would be written into directories like this:

2012-01/02/00/stage/machine1.log.avro
2012-01/02/00/stage/machine2.log.avro
2012-01/02/00/stage/machine3.log.avro

After verification, you'd rename the 2012-01/02/00/stage directory to 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic operation, you get the guarantee that you're looking for without having to do extra I/O. There shouldn't be a benefit to merging the individual files unless they're too small.
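Something like this (a minimal sketch in Java against the Hadoop FileSystem API, assuming the stage/done layout above; the class name and the verification step are placeholders) would do the promotion:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PromoteHour {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stage = new Path("2012-01/02/00/stage");
    Path done = new Path("2012-01/02/00/done");

    // ... verify that every expected machineN.log.avro is present and complete ...

    // rename() is a single atomic metadata operation in HDFS, so readers
    // either see the whole hour under done/ or nothing at all.
    if (!fs.rename(stage, done)) {
      throw new IllegalStateException("Rename failed: " + stage + " -> " + done);
    }
  }
}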
-Joey

On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrime...@gmail.com> wrote:
> Hi Bobby,
>
> Actually, the problem we're trying to solve is one of completeness.
>
> Say we have 3 machines generating log events and putting them to HDFS on an
> hourly basis.
> e.g.
> 2012-01/01/00/machine1.log.avro
> 2012-01/01/00/machine2.log.avro
> 2012-01/01/00/machine3.log.avro
>
> Sometime after the hour, we would have a scheduled job verify that all the
> expected machines' log files are present and complete in HDFS.
>
> Before launching MapReduce jobs for a given date range, we want to verify
> that the job will run over complete data.
> If not, the query would error out.
>
> We want our query/MapReduce layer to not need to be aware of logs at the
> machine level, only the presence or absence of an hour's worth of logs.
>
> We were thinking that after verifying all the individual log files for an
> hour, they could be combined into 2012-01/01/00/log.avro.
> The presence of 2012-01-01-00.log.avro would be all that needs to be
> verified.
>
> However, we're new to both Avro and Hadoop, so we're not sure of the most
> efficient (and reliable) way to accomplish this.
>
> Thanks,
>
> Frank Grimes
>
>
> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>
> Frank,
>
> That depends on what you mean by combining. It sounds like you are trying to
> aggregate data from several days, which may involve doing a join, so I would
> say a MapReduce job is your best bet. If you are not going to do any
> processing at all, then why are you trying to combine them? Is there
> something that requires them all to be part of a single file? MapReduce
> processing should be able to handle reading in multiple files just as well
> as reading in a single file.
>
> --Bobby Evans
>
> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
>
> Hi All,
>
> I was wondering if there was an easy way to combine multiple .avro files
> efficiently,
> e.g. combining multiple hours of logs into a daily aggregate.
>
> Note that our Avro schema might evolve to have new (nullable) fields added,
> but no fields will be removed.
>
> I'd like to avoid needing to pull the data down for combining and a
> subsequent "hadoop dfs -put".
>
> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that
> automatically?
> FYI, the following seems to indicate that Avro files might be easily
> combinable: https://issues.apache.org/jira/browse/AVRO-127
>
> Or is an M/R job the best way to go for this?
>
> Thanks,
>
> Frank Grimes

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
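For the merge step Frank describes, here is a minimal sketch in Java, assuming an Avro release that provides DataFileWriter.appendAllFrom; the class name and paths are illustrative. appendAllFrom copies data blocks without deserializing them, but it requires every input to have the same writer schema, so it won't by itself absorb the schema evolution mentioned in the thread:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ConcatHourlyLogs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path[] inputs = {
        new Path("2012-01/01/00/machine1.log.avro"),
        new Path("2012-01/01/00/machine2.log.avro"),
        new Path("2012-01/01/00/machine3.log.avro")
    };
    Path output = new Path("2012-01/01/00/log.avro");

    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>());
    Schema schema = null;
    for (Path in : inputs) {
      DataFileStream<GenericRecord> reader = new DataFileStream<GenericRecord>(
          fs.open(in), new GenericDatumReader<GenericRecord>());
      if (schema == null) {
        // Take the schema from the first file and create the output with it.
        schema = reader.getSchema();
        writer.create(schema, fs.create(output));
      }
      // Copy the file's data blocks verbatim; no decode/re-encode cycle.
      writer.appendAllFrom(reader, false);
      reader.close();
    }
    writer.close();
  }
}

If the schemas do differ across files, a small MapReduce job along the lines Bobby suggests, reading with the newest schema and rewriting the records, is the safer route.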