Hi Joey,

That's a very good suggestion and might suit us just fine.
However, many of the files will be much smaller than the HDFS block size. That could affect the performance of the MapReduce jobs, correct?
Also, from my understanding it would put more burden on the name node (memory usage) than is necessary.

Assuming we did want to combine the actual files... how would you suggest we might go about it? (I've pasted some rough sketches of what we had in mind at the bottom of this message, below the quoted thread.)

Thanks,

Frank Grimes


On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:

> I would do it by staging the machine data into a temporary directory
> and then renaming the directory when it's been verified. So, data
> would be written into directories like this:
>
> 2012-01/02/00/stage/machine1.log.avro
> 2012-01/02/00/stage/machine2.log.avro
> 2012-01/02/00/stage/machine3.log.avro
>
> After verification, you'd rename the 2012-01/02/00/stage directory to
> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
> operation, you get the guarantee that you're looking for without having
> to do extra IO. There shouldn't be a benefit to merging the individual
> files unless they're too small.
>
> -Joey
>
> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <frankgrime...@gmail.com> wrote:
>> Hi Bobby,
>>
>> Actually, the problem we're trying to solve is one of completeness.
>>
>> Say we have 3 machines generating log events and putting them into HDFS on an
>> hourly basis.
>> e.g.
>> 2012-01/01/00/machine1.log.avro
>> 2012-01/01/00/machine2.log.avro
>> 2012-01/01/00/machine3.log.avro
>>
>> Sometime after the hour, we would have a scheduled job verify that all the
>> expected machines' log files are present and complete in HDFS.
>>
>> Before launching MapReduce jobs for a given date range, we want to verify
>> that the job will run over complete data.
>> If not, the query would error out.
>>
>> We want our query/MapReduce layer to not need to be aware of logs at the
>> machine level, only the presence or not of an hour's worth of logs.
>>
>> We were thinking that after verifying all the individual log files for an
>> hour, they could be combined into 2012-01/01/00/log.avro.
>> The presence of 2012-01/01/00/log.avro would be all that needs to be
>> verified.
>>
>> However, we're new to both Avro and Hadoop, so we're not sure of the most
>> efficient (and reliable) way to accomplish this.
>>
>> Thanks,
>>
>> Frank Grimes
>>
>>
>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>>
>> Frank,
>>
>> That depends on what you mean by combining. It sounds like you are trying to
>> aggregate data from several days, which may involve doing a join, so I would
>> say a MapReduce job is your best bet. If you are not going to do any
>> processing at all, then why are you trying to combine them? Is there
>> something that requires them all to be part of a single file? MapReduce
>> processing should be able to handle reading in multiple files just as well
>> as reading in a single file.
>>
>> --Bobby Evans
>>
>> On 1/6/12 9:55 AM, "Frank Grimes" <frankgrime...@gmail.com> wrote:
>>
>> Hi All,
>>
>> I was wondering if there was an easy way to combine multiple .avro files
>> efficiently.
>> e.g. combining multiple hours of logs into a daily aggregate
>>
>> Note that our Avro schema might evolve to have new (nullable) fields added
>> but no fields will be removed.
>>
>> I'd like to avoid needing to pull the data down for combining and a
>> subsequent "hadoop dfs -put".
>>
>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that
>> automatically?
>>
>> FYI, the following seems to indicate that Avro files might be easily
>> combinable: https://issues.apache.org/jira/browse/AVRO-127
>>
>> Or is an M/R job the best way to go for this?
>>
>> Thanks,
>>
>> Frank Grimes
>>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
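
As mentioned above, here is a rough, untested sketch of how we pictured merging the verified hourly files into a single 2012-01/01/00/log.avro directly on HDFS: read each per-machine file with DataFileStream and append its records to one DataFileWriter writing to an HDFS stream. The paths and class name are just placeholders for our layout, and this decodes and re-encodes every record rather than concatenating blocks, so if AVRO-127 means the files can be combined more cheaply that would obviously be preferable:

// Rough sketch (untested): merge the per-machine hourly Avro files into one
// log.avro directly on HDFS. Paths below are placeholders for our layout.
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeHourlyLogs {

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path hourDir = new Path("2012-01/01/00");          // placeholder hour directory
    Path tmpOut = new Path(hourDir, "log.avro.tmp");
    Path finalOut = new Path(hourDir, "log.avro");

    DataFileWriter<GenericRecord> writer = null;

    for (FileStatus stat : fs.listStatus(hourDir)) {
      String name = stat.getPath().getName();
      if (!name.endsWith(".log.avro") || name.equals("log.avro")) {
        continue;                                      // only merge the per-machine files
      }
      InputStream in = fs.open(stat.getPath());
      DataFileStream<GenericRecord> reader =
          new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
      try {
        if (writer == null) {
          // Uses the first file's schema for the merged output; since we only
          // ever add nullable fields, we'd probably want the newest schema here.
          Schema schema = reader.getSchema();
          OutputStream out = fs.create(tmpOut, true);
          writer = new DataFileWriter<GenericRecord>(
              new GenericDatumWriter<GenericRecord>(schema));
          writer.create(schema, out);
        }
        for (GenericRecord record : reader) {
          writer.append(record);                       // decode and re-encode each record
        }
      } finally {
        reader.close();
      }
    }

    if (writer != null) {
      writer.close();                                  // flushes and closes the HDFS stream
      fs.rename(tmpOut, finalOut);                     // only expose the file once complete
    }
  }
}

One open question with this approach: since the schema will gain nullable fields over time, I assume we'd want to write the merged file with the newest schema rather than whichever file happens to be read first.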
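
And, for completeness, roughly how I pictured the verify-then-rename step you described (again untested; the machine list and paths are placeholders, and "complete" here only means present and non-empty):

// Rough sketch (untested): once every expected machine's file is present in
// the staging directory for the hour, atomically rename stage -> done.
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyHourlyLogs {

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stage = new Path("2012-01/02/00/stage");
    Path done = new Path("2012-01/02/00/done");
    List<String> machines = Arrays.asList("machine1", "machine2", "machine3");

    boolean complete = true;
    for (String machine : machines) {
      Path logFile = new Path(stage, machine + ".log.avro");
      // A stricter check could also read each Avro file through to make sure
      // it isn't truncated, rather than only checking presence and size.
      if (!fs.exists(logFile) || fs.getFileStatus(logFile).getLen() == 0) {
        complete = false;
        break;
      }
    }

    if (complete) {
      // Directory rename is atomic in HDFS, so downstream jobs can simply
      // key off the existence of .../done.
      fs.rename(stage, done);
    }
  }
}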