If the data is on one machine, then there's probably no need to move it. So the question is more:
* Do you need more than one machine to do your ETL?
* Would you ever need more than one machine?

If you need more than one machine, then Chukwa could be the right answer. I have a tool that I could publish that transforms any input file into a Chukwa compressed data sink file; this could be a first step (a generic sketch of the underlying SequenceFile mechanics follows at the end of this message). Hadoop also has a JDBC InputFormat/OutputFormat (DBInputFormat/DBOutputFormat), so you may want to take a look (a sketch of that follows as well).

Could you give more info on your data (size and ETL)?

/Jerome

On 8/24/10 12:39 PM, "hdev ml" <[email protected]> wrote:

> Hi all,
>
> This question is related partly to Hadoop and partly to Chukwa. We have a huge amount of logged information sitting on one machine. I am not sure whether the storage is in multiple files or in a database, but what we want to do is take that log information, transform it, and store it in some database for data mining / data warehousing / reporting purposes.
>
> 1. Since it is on one machine, is Chukwa the right kind of framework for this ETL process?
> 2. I understand that Hadoop generally works on large files. But assuming that the data sits in a database, what if we somehow partition the data for Hadoop/Chukwa? Is that the right strategy?
>
> Any help will be appreciated.
>
> Thanks,
> Harshad
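
As a rough illustration of the data sink conversion idea above (this is not the tool Jerome mentions): Chukwa's data sink files are compressed Hadoop SequenceFiles, and the sketch below shows the generic mechanics of writing a plain text file into a block-compressed SequenceFile. The key/value classes here are simplified placeholders; Chukwa's actual sink files carry ChukwaArchiveKey/ChunkImpl records.

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class TextToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Block compression is the usual choice for log-style data.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]),
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new DefaultCodec());

    // Append each input line as one record, keyed by its line number.
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    long lineNo = 0;
    while ((line = in.readLine()) != null) {
      writer.append(new LongWritable(lineNo++), new Text(line));
    }
    in.close();
    writer.close();
  }
}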
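
And a minimal sketch of the JDBC route mentioned above: Hadoop's DBInputFormat runs a query against a JDBC source and splits it across map tasks, which also speaks to the partitioning question in the quoted message. The table name (log_table), column names (id, message), and connection settings below are hypothetical placeholders.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DbEtlJob {

  // One row of the hypothetical log table; Hadoop materializes it from JDBC.
  public static class LogRecord implements Writable, DBWritable {
    long id;
    String message;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      message = rs.getString("message");
    }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setLong(1, id);
      ps.setString(2, message);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      message = Text.readString(in);
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      Text.writeString(out, message);
    }
  }

  // Transform step: here it just re-emits the raw message keyed by row id.
  public static class EtlMapper
      extends Mapper<LongWritable, LogRecord, LongWritable, Text> {
    protected void map(LongWritable key, LogRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new LongWritable(row.id), new Text(row.message));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder JDBC settings -- substitute your driver, URL, and credentials.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/logdb", "user", "password");

    Job job = new Job(conf, "db-etl-sketch");
    job.setJarByClass(DbEtlJob.class);
    job.setMapperClass(EtlMapper.class);
    job.setNumReduceTasks(0);                 // map-only transform
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setInputFormatClass(DBInputFormat.class);
    // The orderBy column ("id") is what lets Hadoop split the query.
    DBInputFormat.setInput(job, LogRecord.class,
        "log_table", null /* conditions */, "id" /* orderBy */,
        "id", "message");

    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the orderBy column lets Hadoop carve the table into row ranges, each map task reads only its own slice, which is one concrete way to "partition data for Hadoop" when the source is a database rather than files.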
