Hi Kirk,

This is great news. I will come to you with questions once I have studied this more. While reading the Voldemort blog (http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/), it struck me that the logging service + Hadoop described there is equivalent to Chukwa. Chukwa and Voldemort could be a great combination. :)
Regards,
Eric

On 3/19/10 2:52 PM, "Kirk True" <k...@mustardgrain.com> wrote:

> Hi guys,
>
> I'm a committer on the Voldemort project, so perhaps I could lend a hand with some of this work should you decide to go that route.
>
> Thanks,
> Kirk
>
> Eric Yang wrote:
>>
>> The JIRA is CHUKWA-444. Voldemort looks like a good fit on paper. I will investigate. Thanks.
>>
>> Regards,
>> Eric
>>
>> On 3/19/10 12:49 PM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>
>>> Do you have a JIRA for that, so we can continue the discussion there?
>>>
>>> The reason I'm asking is that if you need to move off MySQL, I assume it's because you need to scale, and if you need to scale, you need partitioning, which Voldemort and HBase (like all the NoSQL implementations) are already working on.
>>>
>>> Voldemort index/data files can be built using Hadoop, and HBase is already using TFile.
>>>
>>> Thanks,
>>> /Jerome.
>>>
>>> On 3/19/10 12:33 PM, "Eric Yang" <ey...@yahoo-inc.com> wrote:
>>>
>>>> Hi Jerome,
>>>>
>>>> I am not planning to put SQL on top of HDFS. The Chukwa MetricDataLoader subsystem is an index builder. The replacement for the index builder is either TFile or a streaming job that builds the index, plus distributed processes that cache the index by keeping the TFile open or loading the index into memory. For aggregation, this could be replaced with a second-stage MapReduce job, or a workflow subsystem like Oozie. It could also be replaced with Hive, if the community likes that approach.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 3/19/10 11:30 AM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> Correct me if I'm wrong, but given that "the SQL portion of Chukwa is deprecated, and the HDFS-based replacement is six months out", you need a SQL-like engine, otherwise it is not a replacement. So does that mean you're planning to get a SQL-like engine working on top of HDFS in less than 6 months? If yes, do you already have some working code? And what performance are you targeting, since even if MySQL doesn't scale, you can still do a bunch of things with it...
>>>>>
>>>>> Thanks,
>>>>> /Jerome.
>>>>>
>>>>> On 3/18/10 8:59 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>
>>>>>> Hi Eric,
>>>>>>
>>>>>> Awesome - everything's working great now.
>>>>>>
>>>>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the HDFS-based replacement is six months out. What should I do to get the data from the adapters -> collectors -> HDFS -> HICC? Is the HDFS-based HICC replacement spec'ed out enough for others to contribute?
>>>>>>
>>>>>> Thanks,
>>>>>> Kirk
>>>>>>
>>>>>> Eric Yang wrote:
>>>>>>>
>>>>>>> Hi Kirk,
>>>>>>>
>>>>>>> 1. The host selector currently shows hostnames collected from the SystemMetrics table, so you need to have top, iostat, df, and sar collected to populate the SystemMetrics table correctly. The hostname is also cached in the user session, so you will need to switch to a different cluster and back, or restart HICC, to flush the cached hostnames from the user session.
>>>>>>> The hostname selector should probably pick up hostnames from a different data source in a future release.
>>>>>>>
>>>>>>> 2. The server should run in UTC. Timezone support was never implemented completely, so a server in another timezone will not work correctly.
>>>>>>>
>>>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of dbAdmin.sh; this subsystem down-samples data from the weekly tables into the monthly, yearly, and decade tables. I wrote this submodule over a weekend as a show-and-tell prototype. I strongly recommend avoiding the SQL part of Chukwa altogether.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eric
>>>>>>>
>>>>>>> On 3/18/10 1:15 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>
>>>>>>>> Hi Eric,
>>>>>>>>
>>>>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df" is being collected, parsed, stuck into HDFS, and then pulled out again and placed into MySQL. However, HICC isn't showing me my data just yet...
>>>>>>>>
>>>>>>>> The disk_2098_week table is filled out with several entries and looks great. But if I select my cluster from the "Cluster Selector" and "Last 12 Hours" from the "Time" widget, the "Disk Statistics" widget still says "No Data Available."
>>>>>>>>
>>>>>>>> It appears to be because part of the SQL query includes the host name, which is coming across in the SQL parameters as "". Since the disk_2098_week table properly includes the host name, nothing is returned by the query. Just for grins, I updated the table manually in MySQL to blank out the host names, and I get a super cool, pretty graph (which looks great, BTW).
>>>>>>>>
>>>>>>>> Additionally, if I select other time periods such as "Last 1 Hour", I see the query is using UTC or something (at 1:00 PDT, I see the query using a range of 19:00-20:00). However, the data in MySQL is based on PDT, so no matches are found. It appears that the "time_zone" session attribute contains the value "UTC". Where is this coming from and how can I change it?
>>>>>>>>
>>>>>>>> Problems:
>>>>>>>>
>>>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host name so that the generated SQL queries are correct?
>>>>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> The df command output is converted into the disk_xxxx_week table in MySQL, if I remember correctly. In MySQL, are the database tables getting created? Make sure that you have:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>chukwa.post.demux.data.loader</name>
>>>>>>>>   <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> in Chukwa-demux.conf.
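>>>>>>>>
>>>>>>>> If the tables are being created, a quick sanity check is to query the weekly table directly from the mysql client. Something along these lines should show whether MetricDataLoader is actually inserting rows (the host column name is my assumption, and disk_xxxx_week stands in for the actual table created for your cluster):
>>>>>>>>
>>>>>>>>   select host, count(*) as loaded_rows
>>>>>>>>   from disk_xxxx_week
>>>>>>>>   where timestamp > now() - interval 1 day
>>>>>>>>   group by host;
>>>>>>>>
>>>>>>>> If the table exists but stays empty, the loading step is the problem; if it has rows, the issue is on the HICC query side.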
>>>>>>>>
>>>>>>>> The rough picture of the data flow looks like this:
>>>>>>>>
>>>>>>>> 1. demux -> generates chukwa record outputs.
>>>>>>>> 2. archive -> generates bigger files by compacting data sink files (runs concurrently with step 1).
>>>>>>>> 3. postProcess -> looks up which files were generated by the demux process and dispatches them to the different data loaders.
>>>>>>>> 4. MetricDataLoaderPool -> dispatches multiple threads to load chukwa record files into the different MDLs.
>>>>>>>> 5. MetricDataLoader -> loads the sequence files into the database by record type, as defined in mdl.xml.
>>>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find the widget descriptor files in hdfs://namenode:port/chukwa/hicc/widgets; they embed the full SQL template, like:
>>>>>>>>
>>>>>>>>   Query="select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]"
>>>>>>>>
>>>>>>>> This will output all the metrics in JSON format, and the HICC graphing widget will render the graph.
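>>>>>>>>
>>>>>>>> To make the substitution concrete: [system_metrics] expands to the cluster's weekly metrics table and [start]/[end] to the time range chosen in the dashboard, so the SQL that actually reaches MySQL looks roughly like the query below. The table name, time range, and hostname are placeholders, and the host restriction in particular is my assumption about how the host selector feeds into the query:
>>>>>>>>
>>>>>>>>   select cpu_user_pcnt
>>>>>>>>   from system_metrics_xxxx_week
>>>>>>>>   where timestamp between '2010-03-17 00:00:00' and '2010-03-17 01:00:00'
>>>>>>>>   and host in ('host1.example.com');
>>>>>>>>
>>>>>>>> If the host selector has nothing selected for the cluster, that host list comes through empty and the query matches no rows, which shows up in the widget as "No Data Available."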
>>>>>>>>
>>>>>>>> If there is no data, look at postProcess.log and make sure the data loading is not throwing exceptions. Steps 3 to 6 are deprecated and will be replaced with something else. Hope this helps.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>> Hi Eric,
>>>>>>>>
>>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> I am working on a design which removes MySQL from Chukwa. I am making this departure from MySQL because the MDL framework was built for prototyping; it will not scale in a production system where Chukwa could be hosted on a large Hadoop cluster. HICC will serve data directly from HDFS in the future.
>>>>>>>>
>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with the trunk version of Chukwa. You can load ChukwaRecords using the org.apache.hadoop.chukwa.dataloader.MetricDataLoader class or mdl.sh from Chukwa 0.3.
>>>>>>>>
>>>>>>>> I'm to the point where the "df" example is working and demux is storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from 0.3.0, no data is getting updated in the database.
>>>>>>>>
>>>>>>>> My question is: what's the process to get a custom Demux implementation to be viewable in HICC? Are the database tables magically created and populated for me? Does HICC generate a widget for me?
>>>>>>>>
>>>>>>>> HICC looks very nice, but when I try to add a widget to my dashboard, the preview always reads, "No Data Available." I'm running $CHUKWA_HOME/bin/start-all.sh followed by $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the bin directory).
>>>>>>>>
>>>>>>>> What am I missing?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it will not be supported once we make the transition to Avro + TFile.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I recently switched to trunk as I was experiencing a lot of issues with 0.3.0. In 0.3.0, there was a dbAdmin.sh script that would run and try to stick data into MySQL from HDFS. However, that script is gone, and when I run the system as built from trunk, nothing is ever populated in the database. Where are the instructions for setting up the HDFS -> MySQL data migration for HICC?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk