Hi Eric,

I wanted to mention that for my client (Cisco), MySQL reporting is key for Chukwa adoption. It's not that they want MySQL per se, but they want something that works today :) I'm working on getting the legal OK to work on the code so that we can maintain it for our use.
That said, I'd love to see Cisco adopt Voldemort via Chukwa, but in the near term they need a solution that gives them access to their data through HICC.

Thanks,
Kirk

Eric Yang wrote:
> Hi Kirk,
>
> This is great news. I will come to you with questions once I have done
> more studies. While reading the Voldemort blog
> (http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/),
> it struck me that the logging service + Hadoop is equivalent to Chukwa.
> It looks like Chukwa and Voldemort could be a great combination. :)
>
> Regards,
> Eric
>
> On 3/19/10 2:52 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>
>> Hi guys,
>>
>> I'm a committer on the Voldemort project, so perhaps I could lend a
>> hand with some of this work should you decide to go that route.
>>
>> Thanks,
>> Kirk
>>
>> Eric Yang wrote:
>>
>>> The JIRA is CHUKWA-444. Voldemort looks like a good fit on paper.
>>> I will investigate. Thanks.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 3/19/10 12:49 PM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>
>>>> Do you have a JIRA for that, so we can continue the discussion
>>>> there?
>>>>
>>>> The reason I'm asking is that if you need to move off of MySQL,
>>>> I guess it's because you need to scale. And if you need to
>>>> scale, then you need partitioning, which Voldemort and HBase
>>>> (like all the NoSQL implementations) are already working on.
>>>>
>>>> Voldemort index/data files can be built using Hadoop, and HBase
>>>> is already using TFile.
>>>>
>>>> Thanks,
>>>> /Jerome.
>>>>
>>>> On 3/19/10 12:33 PM, "Eric Yang" <ey...@yahoo-inc.com> wrote:
>>>>
>>>>> Hi Jerome,
>>>>>
>>>>> I am not planning to put SQL on top of HDFS. The Chukwa
>>>>> MetricDataLoader subsystem is an index builder. The replacement
>>>>> for the index builder is either TFile or a streaming job that
>>>>> builds the index, plus distributed processes that cache the
>>>>> index by keeping the TFile open or loading the index into
>>>>> memory. Aggregation could be replaced with a second-stage
>>>>> mapreduce job, or with a workflow subsystem like Oozie. It
>>>>> could also be replaced with Hive, if the community likes that
>>>>> approach.
>>>>>
>>>>> Regards,
>>>>> Eric
>>>>>
>>>>> On 3/19/10 11:30 AM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>>>
>>>>>> Hi Eric,
>>>>>> Correct me if I'm wrong, but for "the SQL portion of Chukwa is
>>>>>> deprecated, and the HDFS-based replacement is six months out"
>>>>>> to hold, you need a SQL-like engine; otherwise it's not a
>>>>>> replacement. So does that mean you're planning to get a
>>>>>> SQL-like engine working on top of HDFS in less than six
>>>>>> months?
>>>>>> If yes, do you already have some working code?
>>>>>> What performance are you targeting? Even if MySQL doesn't
>>>>>> scale, you can still do a bunch of things with it...
>>>>>>
>>>>>> Thanks,
>>>>>> /Jerome.
>>>>>>
>>>>>> On 3/18/10 8:59 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> Awesome - everything's working great now.
>>>>>>>
>>>>>>> So, as you've said, the SQL portion of Chukwa is deprecated,
>>>>>>> and the HDFS-based replacement is six months out. What
>>>>>>> should I do to get the data from the
>>>>>>> adapters -> collectors -> HDFS -> HICC? Is the HDFS-based
>>>>>>> HICC replacement spec'ed out enough for others to
>>>>>>> contribute?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirk
>>>>>>>
>>>>>>> Eric Yang wrote:
>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> 1. The host selector currently shows hostnames collected
>>>>>>>> from the SystemMetrics table, so you need top, iostat, df,
>>>>>>>> and sar data collected for the SystemMetrics table to be
>>>>>>>> populated correctly. The hostname list is also cached in
>>>>>>>> the user session, so you will need to switch to a different
>>>>>>>> cluster and switch back, or restart HICC, to flush the
>>>>>>>> cached hostnames from the user session. The hostname
>>>>>>>> selector should probably pick up hostnames from a different
>>>>>>>> data source in a future release.
>>>>>>>>
>>>>>>>> 2. The server should run in UTC. Timezone support was never
>>>>>>>> implemented completely, so a server in another timezone
>>>>>>>> will not work correctly.
>>>>>>>>
>>>>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part
>>>>>>>> of dbAdmin.sh; this subsystem down-samples data from the
>>>>>>>> weekly tables into the monthly, yearly, and decade tables.
>>>>>>>> I wrote this submodule over a weekend for a prototype
>>>>>>>> show-and-tell. I strongly recommend avoiding the SQL part
>>>>>>>> of Chukwa altogether.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
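For anyone else bitten by the timezone mismatch in point 2 above: one blunt workaround is to force both the shell and the JVM into UTC before starting the Chukwa daemons. This is only a sketch; whether Chukwa's start scripts pass JAVA_OPTS through to the HICC JVM is an assumption on my part, so treat it as a starting point:

    # Run everything in UTC so HICC's "time_zone" session default matches the data.
    export TZ=UTC                                      # OS-level timezone for child processes
    export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"  # standard JVM property; pass-through is assumed
    $CHUKWA_HOME/bin/start-all.sh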
>>>>>>>>
>>>>>>>> On 3/18/10 1:15 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Eric,
>>>>>>>>>
>>>>>>>>> I believe I have most of steps 1-5 working. Data from
>>>>>>>>> "/usr/bin/df" is being collected, parsed, stuck into HDFS,
>>>>>>>>> and then pulled out again and placed into MySQL. However,
>>>>>>>>> HICC isn't showing me my data just yet...
>>>>>>>>>
>>>>>>>>> The disk_2098_week table is filled out with several entries
>>>>>>>>> and looks great. But if I select my cluster from the
>>>>>>>>> "Cluster Selector" and "Last 12 Hours" from the "Time"
>>>>>>>>> widget, the "Disk Statistics" widget still says "No Data
>>>>>>>>> available."
>>>>>>>>>
>>>>>>>>> It appears to be because part of the SQL query filters on
>>>>>>>>> the host name, which is coming across in the SQL parameters
>>>>>>>>> as "". Since the disk_2098_week table properly includes the
>>>>>>>>> host name, nothing is returned by the query. Just for
>>>>>>>>> grins, I updated the table manually in MySQL to blank out
>>>>>>>>> the host names, and I get a super cool, pretty graph (which
>>>>>>>>> looks great, BTW).
>>>>>>>>>
>>>>>>>>> Additionally, if I select other time periods such as "Last
>>>>>>>>> 1 Hour", I see the query is using UTC or something (at 1:00
>>>>>>>>> PDT, the query uses a range of 19:00-20:00). However, the
>>>>>>>>> data in MySQL is based on PDT, so no matches are found. It
>>>>>>>>> appears that the "time_zone" session attribute contains the
>>>>>>>>> value "UTC". Where is this coming from and how can I change
>>>>>>>>> it?
>>>>>>>>>
>>>>>>>>> Problems:
>>>>>>>>>
>>>>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my
>>>>>>>>> host name so that the generated SQL queries are correct?
>>>>>>>>> 2. How do I make the "time_zone" session parameter use PDT
>>>>>>>>> vs. UTC?
>>>>>>>>> 3. How do I populate the other tables, such as
>>>>>>>>> "disk_489_month"?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kirk
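For reference, the manual blank-out Kirk describes above amounts to something like the one-liner below. The column name "host" and the database/user names are guesses for illustration, not Chukwa's documented schema:

    # Hypothetical: clear the host column so HICC's empty-host filter matches rows.
    mysql -u chukwa -p -e "UPDATE disk_2098_week SET host = '';" chukwa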
>>>>>>>>>
>>>>>>>>> Eric Yang wrote:
>>>>>>>>>
>>>>>>>>>> The df command output is converted into the disk_xxxx_week
>>>>>>>>>> table in MySQL, if I remember correctly. Are the database
>>>>>>>>>> tables getting created in MySQL? Make sure that you have:
>>>>>>>>>>
>>>>>>>>>> <property>
>>>>>>>>>>   <name>chukwa.post.demux.data.loader</name>
>>>>>>>>>>   <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>>>> </property>
>>>>>>>>>>
>>>>>>>>>> in Chukwa-demux.conf.
>>>>>>>>>>
>>>>>>>>>> The rough picture of the data flow looks like this:
>>>>>>>>>>
>>>>>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>>>>>> 2. archive -> Generates bigger files by compacting data
>>>>>>>>>> sink files. (Concurrent with step 1.)
>>>>>>>>>> 3. postProcess -> Looks up which files were generated by
>>>>>>>>>> the demux process and dispatches them to the different
>>>>>>>>>> data loaders.
>>>>>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to
>>>>>>>>>> load chukwa record files into different MDLs.
>>>>>>>>>> 5. MetricDataLoader -> Loads sequence files into the
>>>>>>>>>> database by record type, as defined in mdl.xml.
>>>>>>>>>> 6. A HICC widget is defined by a descriptor written in
>>>>>>>>>> JSON. You can find the widget descriptor files in
>>>>>>>>>> hdfs://namenode:port/chukwa/hicc/widgets; each embeds a
>>>>>>>>>> full SQL template like:
>>>>>>>>>>
>>>>>>>>>> Query="select cpu_user_pcnt from [system_metrics] where
>>>>>>>>>> timestamp between [start] and [end]"
>>>>>>>>>>
>>>>>>>>>> This outputs all the metrics in JSON format, and the HICC
>>>>>>>>>> graphing widget renders the graph.
>>>>>>>>>>
>>>>>>>>>> If there is no data, look at postProcess.log and make sure
>>>>>>>>>> the data loading is not throwing exceptions. Steps 3 to 6
>>>>>>>>>> are deprecated, and will be replaced with something else.
>>>>>>>>>> Hope this helps.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Eric
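To make step 6 concrete: a widget descriptor might look roughly like the JSON below. Only the query template is taken from Eric's message; the surrounding field names are illustrative guesses, and [system_metrics], [start], and [end] are the placeholders HICC substitutes when it runs the query:

    {
      "title": "CPU User %",
      "query": "select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]"
    }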
>>>>>>>>>>
>>>>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Eric,
>>>>>>>>>>>
>>>>>>>>>>> Eric Yang wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Kirk,
>>>>>>>>>>>>
>>>>>>>>>>>> I am working on a design which removes MySQL from
>>>>>>>>>>>> Chukwa. I am making this departure from MySQL because
>>>>>>>>>>>> the MDL framework was built for prototyping; it will not
>>>>>>>>>>>> scale in a production system where Chukwa could be
>>>>>>>>>>>> hosted on a large Hadoop cluster. HICC will serve data
>>>>>>>>>>>> directly from HDFS in the future.
>>>>>>>>>>>>
>>>>>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still
>>>>>>>>>>>> compatible with the trunk version of Chukwa. You can
>>>>>>>>>>>> load ChukwaRecords using the
>>>>>>>>>>>> org.apache.hadoop.chukwa.dataloader.MetricDataLoader
>>>>>>>>>>>> class or mdl.sh from Chukwa 0.3.
>>>>>>>>>>>
>>>>>>>>>>> I'm to the point where the "df" example is working and
>>>>>>>>>>> demux is storing ChukwaRecord data in HDFS. But when I
>>>>>>>>>>> run dbAdmin.sh from 0.3.0, no data is getting updated in
>>>>>>>>>>> the database.
>>>>>>>>>>>
>>>>>>>>>>> My question is: what's the process to get a custom Demux
>>>>>>>>>>> implementation to be viewable in HICC? Are the database
>>>>>>>>>>> tables magically created and populated for me? Does HICC
>>>>>>>>>>> generate a widget for me?
>>>>>>>>>>>
>>>>>>>>>>> HICC looks very nice, but when I try to add a widget to
>>>>>>>>>>> my dashboard, the preview always reads, "No Data
>>>>>>>>>>> Available." I'm running $CHUKWA_HOME/bin/start-all.sh
>>>>>>>>>>> followed by $CHUKWA_HOME/bin/dbAdmin.sh (which I've
>>>>>>>>>>> manually copied to the bin directory).
>>>>>>>>>>>
>>>>>>>>>>> What am I missing?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kirk
>>>>>>>>>>>
>>>>>>>>>>>> The MetricDataLoader class will be marked as deprecated,
>>>>>>>>>>>> and it will not be supported once we make the transition
>>>>>>>>>>>> to Avro + TFile.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Eric
>>>>>>>>>>>>
>>>>>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I recently switched to trunk as I was experiencing a
>>>>>>>>>>>>> lot of issues with 0.3.0. In 0.3.0, there was a
>>>>>>>>>>>>> dbAdmin.sh script that would run and try to stick data
>>>>>>>>>>>>> into MySQL from HDFS. However, that script is gone, and
>>>>>>>>>>>>> when I run the system as built from trunk, nothing is
>>>>>>>>>>>>> ever populated in the database. Where are the
>>>>>>>>>>>>> instructions for setting up the HDFS -> MySQL data
>>>>>>>>>>>>> migration for HICC?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Kirk
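One last aside on step 5 in Eric's data-flow outline: the record-type-to-table mapping lives in mdl.xml, which, from memory, is a Hadoop-style configuration file along the lines below. The property name and values here are illustrative and may not match your Chukwa version, so check the mdl.xml shipped with 0.3 before relying on them:

    <!-- Illustrative guess: map the "df" record type to the "disk" table family,
         from which the disk_xxxx_week / disk_xxxx_month tables are derived. -->
    <property>
      <name>report.db.name.df</name>
      <value>disk</value>
    </property>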