Thanks Jerome for the quick answer.

1 & 2. We are not sure yet whether we need more than one machine, but the data size is large, so my guess is we might need more in the future.

My personal thought is that if we have a Hadoop platform in the company, it may be helpful for other large batch processing as well. For example, we also want to do data mining, and the Apache Mahout project leverages Hadoop's capabilities for that.

The raw data is in text format, but it may or may not be loaded into a database before my module kicks in to process it. The size of the data is approximately 40-50 GB per day, and it is archived for a month or so, so the total data for a month would be around 1.2-1.5 TB.

Again, thanks for your time and efforts.

- Harshad

On Tue, Aug 24, 2010 at 1:00 PM, Jerome Boulon <[email protected]> wrote:
> If the data is on one machine, then there's probably no need to move the
> data. So the question is more:
>
> - Do you need more than one machine to do your ETL?
> - Would you ever need more than one machine?
>
> If you need more than one machine, then Chukwa could be the right answer.
> I have a tool that I could publish to transform any input file into a Chukwa
> compressed dataSink file. This could be a first step.
> Also, Hadoop has a JDBC InputReader/Writer, so you may want to take a look.
>
> Could you give more info on your data (size and ETL)?
>
> /Jerome.
>
>
> On 8/24/10 12:39 PM, "hdev ml" <[email protected]> wrote:
>
> Hi all,
>
> This question is related partly to Hadoop and partly to Chukwa.
>
> We have a huge amount of logged information sitting on one machine. I am
> not sure whether the storage is in multiple files or in a database.
>
> What we want to do is take that log information, transform it, and store
> it in some database for data mining / data warehousing / reporting
> purposes.
>
> 1. Since it is on one machine, is Chukwa the right kind of framework to do
> this ETL process?
>
> 2. I understand that Hadoop generally works on large files. But assuming
> that the data sits in a database, what if we somehow partition the data
> for Hadoop/Chukwa? Is that the right strategy?
>
> Any help will be appreciated.
>
> Thanks,
>
> Harshad
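[On the partitioning question above: Hadoop's JDBC input support works by carving the source table into contiguous row ranges that are read in parallel. A minimal sketch of that split computation, with purely illustrative names (this is not the actual Hadoop DBInputFormat code, just the idea behind it):

```java
// Sketch: dividing a table's rows into contiguous ranges so that
// several workers can each read one range in parallel, similar in
// spirit to what Hadoop's JDBC InputFormat does. Illustrative only.
public class SplitSketch {

    // Divide [0, totalRows) into numSplits contiguous half-open ranges.
    // Each split is {startRowInclusive, endRowExclusive}; any remainder
    // rows are spread one-per-split across the leading splits.
    static long[][] computeSplits(long totalRows, int numSplits) {
        long[][] splits = new long[numSplits][2];
        long chunk = totalRows / numSplits;
        long remainder = totalRows % numSplits;
        long start = 0;
        for (int i = 0; i < numSplits; i++) {
            long size = chunk + (i < remainder ? 1 : 0);
            splits[i][0] = start;        // inclusive start row
            splits[i][1] = start + size; // exclusive end row
            start += size;
        }
        return splits;
    }

    public static void main(String[] args) {
        // e.g. one day's rows spread across 4 map tasks; each range would
        // become a query like: SELECT ... LIMIT size OFFSET start
        for (long[] r : computeSplits(1_000_000L, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Each range maps to one bounded query per task, which is one concrete way to "somehow partition data" from a single database for parallel processing.]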
