Hi guys, I'm a committer on the Voldemort project, so perhaps I could lend a hand with some of this work should you decide to go that route.
Thanks,
Kirk

Eric Yang wrote:
> The JIRA is CHUKWA-444. Voldemort looks like a good fit on paper. I will
> investigate. Thanks.
>
> Regards,
> Eric
>
> On 3/19/10 12:49 PM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>
>> Do you have a Jira for that, so we can continue the discussion there?
>>
>> The reason I'm asking is that if you need to move off MySQL, it's
>> presumably because you need to scale. And if you need to scale, then you
>> need partitioning, which Voldemort and HBase (like all the NoSQL
>> implementations) are already working on.
>>
>> Voldemort index/data files can be built using Hadoop, and HBase is
>> already using TFile.
>>
>> Thanks,
>> /Jerome.
>>
>> On 3/19/10 12:33 PM, "Eric Yang" <ey...@yahoo-inc.com> wrote:
>>
>>> Hi Jerome,
>>>
>>> I am not planning to put SQL on top of HDFS. The Chukwa
>>> MetricDataLoader subsystem is an index builder. The replacement for the
>>> index builder is either TFile or a streaming job that builds the index,
>>> plus distributed processes that cache the index by keeping the TFile
>>> open or loading the index into memory. Aggregation could be replaced
>>> with a second-stage mapreduce job, or with a workflow subsystem like
>>> Oozie. It could also be replaced with Hive, if the community likes that
>>> approach.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 3/19/10 11:30 AM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>
>>>> Hi Eric,
>>>> Correct me if I'm wrong, but given that "the SQL portion of Chukwa is
>>>> deprecated, and the HDFS-based replacement is six months out", you
>>>> need a SQL-like engine, otherwise it's not a replacement.
>>>> So does that mean you're planning to get a SQL-like engine working on
>>>> top of HDFS in less than six months? If yes, do you already have some
>>>> working code? And what performance are you targeting? Even if MySQL
>>>> doesn't scale, you can still do a bunch of things with it...
>>>>
>>>> Thanks,
>>>> /Jerome.
>>>>
>>>> On 3/18/10 8:59 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> Awesome - everything's working great now.
>>>>>
>>>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the
>>>>> HDFS-based replacement is six months out. What should I do to get
>>>>> the data from the adapters -> collectors -> HDFS -> HICC? Is the
>>>>> HDFS-based HICC replacement spec'ed out enough for others to
>>>>> contribute?
>>>>>
>>>>> Thanks,
>>>>> Kirk
>>>>>
>>>>> Eric Yang wrote:
>>>>>
>>>>>> Hi Kirk,
>>>>>>
>>>>>> 1. The host selector currently shows hostnames collected from the
>>>>>> SystemMetrics table, so you need top, iostat, df, and sar collected
>>>>>> to populate the SystemMetrics table correctly. The hostnames are
>>>>>> also cached in the user session, so you will need to switch to a
>>>>>> different cluster and switch back, or restart HICC, to flush the
>>>>>> cached hostnames from the session. The host selector should
>>>>>> probably pick up hostnames from a different data source in a
>>>>>> future release.
>>>>>>
>>>>>> 2. The server should run in UTC. Timezone support was never
>>>>>> implemented completely, so a server in another timezone will not
>>>>>> work correctly (a possible workaround is sketched below).
>>>>>>
>>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of
>>>>>> dbAdmin.sh; it down-samples data from the weekly tables into the
>>>>>> monthly, yearly, and decade tables. I wrote this submodule over a
>>>>>> weekend as a show-and-tell prototype. I strongly recommend avoiding
>>>>>> the SQL part of Chukwa altogether.
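>>>>>> On the timezone point: if you can't reconfigure the box itself,
>>>>>> forcing just the Chukwa JVMs into UTC may be enough. A minimal
>>>>>> sketch, untested, and assuming the start script passes JAVA_OPTS
>>>>>> through to the JVM:
>>>>>>
>>>>>>   # make the shell and any child JVMs default to UTC
>>>>>>   export TZ=UTC
>>>>>>   export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"
>>>>>>   $CHUKWA_HOME/bin/start-all.sh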
>>>>>>
>>>>>> Regards,
>>>>>> Eric
>>>>>>
>>>>>> On 3/18/10 1:15 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df"
>>>>>>> is being collected, parsed, stuck into HDFS, and then pulled out
>>>>>>> again and placed into MySQL. However, HICC isn't showing me my data
>>>>>>> just yet...
>>>>>>>
>>>>>>> The disk_2098_week table is filled out with several entries and
>>>>>>> looks great. But if I select my cluster from the "Cluster Selector"
>>>>>>> and "Last 12 Hours" from the "Time" widget, the "Disk Statistics"
>>>>>>> widget still says "No Data available."
>>>>>>>
>>>>>>> It appears to be because part of the SQL query includes the host
>>>>>>> name, which is coming across in the SQL parameters as "". Since the
>>>>>>> disk_2098_week table properly includes the host name, nothing is
>>>>>>> returned by the query. Just for grins, I updated the table manually
>>>>>>> in MySQL to blank out the host names and I get a super cool, pretty
>>>>>>> graph (which looks great, BTW).
>>>>>>>
>>>>>>> Additionally, if I select other time periods such as "Last 1 Hour",
>>>>>>> I see the query is using UTC or something (at 1:00 PDT, I see the
>>>>>>> query using a range of 19:00-20:00). However, the data in MySQL is
>>>>>>> based on PDT, so no matches are found. It appears that the
>>>>>>> "time_zone" session attribute contains the value "UTC". Where is
>>>>>>> this coming from and how can I change it?
>>>>>>>
>>>>>>> Problems:
>>>>>>>
>>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host
>>>>>>> name so that the generated SQL queries are correct?
>>>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirk
>>>>>>>
>>>>>>> Eric Yang wrote:
>>>>>>>
>>>>>>>> The df command is converted into the disk_xxxx_week table in
>>>>>>>> MySQL, if I remember correctly. Are the database tables getting
>>>>>>>> created in MySQL? Make sure that you have the following in
>>>>>>>> chukwa-demux.conf:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>chukwa.post.demux.data.loader</name>
>>>>>>>>   <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> The rough picture of the data flow looks like this:
>>>>>>>>
>>>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>>>> 2. archive -> Generates bigger files by compacting data sink files
>>>>>>>> (runs concurrently with step 1).
>>>>>>>> 3. postProcess -> Looks up which files were generated by the demux
>>>>>>>> process and dispatches them to the different data loaders.
>>>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to load
>>>>>>>> chukwa record files into the different MDLs.
>>>>>>>> 5. MetricDataLoader -> Loads sequence files into the database by
>>>>>>>> record type, as defined in mdl.xml.
>>>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find
>>>>>>>> the widget descriptor files in
>>>>>>>> hdfs://namenode:port/chukwa/hicc/widgets; each embeds a full SQL
>>>>>>>> template like:
>>>>>>>>
>>>>>>>> Query="select cpu_user_pcnt from [system_metrics] where timestamp
>>>>>>>> between [start] and [end]"
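>>>>>>>> For orientation, a whole descriptor is just a small JSON document
>>>>>>>> wrapped around such a query. The field names below are from
>>>>>>>> memory, so treat this as a sketch rather than a reference, and
>>>>>>>> check an actual file under /chukwa/hicc/widgets for the real
>>>>>>>> schema:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "title": "CPU Utilization",
>>>>>>>>   "categories": "SystemMetrics,CPU",
>>>>>>>>   "query": "select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]"
>>>>>>>> }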
>>>>>>>>
>>>>>>>> The query outputs the metrics in JSON format, and the HICC
>>>>>>>> graphing widget renders the graph.
>>>>>>>>
>>>>>>>> If there is no data, look at postProcess.log and make sure the
>>>>>>>> data loading is not throwing exceptions. Steps 3 to 6 are
>>>>>>>> deprecated and will be replaced with something else. Hope this
>>>>>>>> helps.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Eric,
>>>>>>>>>
>>>>>>>>> Eric Yang wrote:
>>>>>>>>>
>>>>>>>>>> Hi Kirk,
>>>>>>>>>>
>>>>>>>>>> I am working on a design that removes MySQL from Chukwa. I am
>>>>>>>>>> making this departure from MySQL because the MDL framework was
>>>>>>>>>> built for prototyping; it will not scale in a production system
>>>>>>>>>> where Chukwa could be hosted on a large Hadoop cluster. HICC
>>>>>>>>>> will serve data directly from HDFS in the future.
>>>>>>>>>>
>>>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with
>>>>>>>>>> the trunk version of Chukwa. You can load ChukwaRecords using
>>>>>>>>>> the org.apache.hadoop.chukwa.dataloader.MetricDataLoader class
>>>>>>>>>> or mdl.sh from Chukwa 0.3.
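>>>>>>>>>> Roughly, mdl.sh just wraps that class, so loading a single
>>>>>>>>>> demux output file should look something like the line below.
>>>>>>>>>> The argument list and the repos path are from memory, so
>>>>>>>>>> double-check them against the 0.3 script before relying on it:
>>>>>>>>>>
>>>>>>>>>>   $CHUKWA_HOME/bin/mdl.sh \
>>>>>>>>>>     hdfs://namenode:port/chukwa/repos/<cluster>/<RecordType>/<file>.evt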
>>>>>>>>>
>>>>>>>>> I'm to the point where the "df" example is working and demux is
>>>>>>>>> storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from
>>>>>>>>> 0.3.0, no data is getting updated in the database.
>>>>>>>>>
>>>>>>>>> My question is: what's the process for getting a custom Demux
>>>>>>>>> implementation to be viewable in HICC? Are the database tables
>>>>>>>>> magically created and populated for me? Does HICC generate a
>>>>>>>>> widget for me?
>>>>>>>>>
>>>>>>>>> HICC looks very nice, but when I try to add a widget to my
>>>>>>>>> dashboard, the preview always reads, "No Data Available." I'm
>>>>>>>>> running $CHUKWA_HOME/bin/start-all.sh followed by
>>>>>>>>> $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the
>>>>>>>>> bin directory).
>>>>>>>>>
>>>>>>>>> What am I missing?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kirk
>>>>>>>>>
>>>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it
>>>>>>>>>> will not be supported once we make the transition to Avro +
>>>>>>>>>> TFile.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Eric
>>>>>>>>>>
>>>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I recently switched to trunk, as I was experiencing a lot of
>>>>>>>>>>> issues with 0.3.0. In 0.3.0 there was a dbAdmin.sh script that
>>>>>>>>>>> would run and try to stick data into MySQL from HDFS. However,
>>>>>>>>>>> that script is gone, and when I run the system as built from
>>>>>>>>>>> trunk, nothing is ever populated in the database. Where are the
>>>>>>>>>>> instructions for setting up the HDFS -> MySQL data migration
>>>>>>>>>>> for HICC?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kirk