Hi Kirk,

This is great news. I will come to you with questions once I have studied this more. While reading the Voldemort blog (http://project-voldemort.com/blog/2009/06/building-a-1-tb-data-cycle-at-linkedin-with-hadoop-and-project-voldemort/), it struck me that the logging service + Hadoop described there is equivalent to Chukwa. Chukwa and Voldemort could be a great combination. :)
Regards,
Eric

On 3/19/10 2:52 PM, "Kirk True" <k...@mustardgrain.com> wrote:

> Hi guys,
>
> I'm a committer on the Voldemort project, so perhaps I could lend a hand with some of this work should you decide to go that route.
>
> Thanks,
> Kirk
>
> Eric Yang wrote:
>>
>> The JIRA is CHUKWA-444. Voldemort looks like a good fit on paper. I will investigate. Thanks.
>>
>> Regards,
>> Eric
>>
>> On 3/19/10 12:49 PM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>
>>> Do you have a JIRA for that, so we can continue the discussion there?
>>>
>>> The reason I'm asking is that if you need to move off MySQL, I assume it's because you need to scale, and if you need to scale, you need partitioning, which Voldemort and HBase (like all the NoSQL implementations) are already working on.
>>>
>>> Voldemort index/data files can be built using Hadoop, and HBase is already using TFile.
>>>
>>> Thanks,
>>> /Jerome.
>>>
>>> On 3/19/10 12:33 PM, "Eric Yang" <ey...@yahoo-inc.com> wrote:
>>>
>>>> Hi Jerome,
>>>>
>>>> I am not planning to put SQL on top of HDFS. The Chukwa MetricDataLoader subsystem is an index builder. The replacement for the index builder is either TFile or a streaming job that builds the index, plus distributed processes that cache the index by keeping the TFile open or loading the index into memory. For aggregation, this could be replaced with a second-stage MapReduce job, or a workflow subsystem like Oozie. It could also be replaced with Hive, if the community likes that approach.
>>>>
>>>> Regards,
>>>> Eric
>>>>
>>>> On 3/19/10 11:30 AM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> Correct me if I'm wrong, but given that "the SQL portion of Chukwa is deprecated, and the HDFS-based replacement is six months out", you need a SQL-like engine, otherwise it is not a replacement. So does that mean you're planning to get a SQL-like engine working on top of HDFS in less than 6 months? If yes, do you already have some working code? And what performance are you targeting, since even if MySQL doesn't scale, you can still do a bunch of things with it...
>>>>>
>>>>> Thanks,
>>>>> /Jerome.
>>>>>
>>>>> On 3/18/10 8:59 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>
>>>>>> Hi Eric,
>>>>>>
>>>>>> Awesome - everything's working great now.
>>>>>>
>>>>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the HDFS-based replacement is six months out. What should I do to get the data from the adapters -> collectors -> HDFS -> HICC? Is the HDFS-based HICC replacement spec'ed out enough for others to contribute?
>>>>>>
>>>>>> Thanks,
>>>>>> Kirk
>>>>>>
>>>>>> Eric Yang wrote:
>>>>>>>
>>>>>>> Hi Kirk,
>>>>>>>
>>>>>>> 1. The host selector currently shows hostnames collected from the SystemMetrics table, so you need to have top, iostat, df, and sar collected to populate the SystemMetrics table correctly. The hostname is also cached in the user session, so you will need to switch to a different cluster and back, or restart HICC, to flush the cached hostnames from the user session.
>>>>>>> The hostname selector should probably pick up hostnames from a different data source in a future release.
>>>>>>>
>>>>>>> 2. The server should run in UTC. Timezone support was never implemented completely, so a server in another timezone will not work correctly.
>>>>>>>
>>>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of dbAdmin.sh; this subsystem down-samples data from the weekly tables into the monthly, yearly, and decade tables. I wrote this submodule over a weekend as a show-and-tell prototype. I strongly recommend avoiding the SQL part of Chukwa altogether.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Eric
>>>>>>>
>>>>>>> On 3/18/10 1:15 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>
>>>>>>>> Hi Eric,
>>>>>>>>
>>>>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df" is being collected, parsed, stuck into HDFS, and then pulled out again and placed into MySQL. However, HICC isn't showing me my data just yet...
>>>>>>>>
>>>>>>>> The disk_2098_week table is filled out with several entries and looks great. But if I select my cluster from the "Cluster Selector" and "Last 12 Hours" from the "Time" widget, the "Disk Statistics" widget still says "No Data Available."
>>>>>>>>
>>>>>>>> It appears to be because part of the SQL query includes the host name, which is coming across in the SQL parameters as "". Since the disk_2098_week table properly includes the host name, nothing is returned by the query. Just for grins, I updated the table manually in MySQL to blank out the host names, and I get a super cool, pretty graph (which looks great, BTW).
>>>>>>>>
>>>>>>>> Additionally, if I select other time periods such as "Last 1 Hour", I see the query is using UTC or something (at 1:00 PDT, I see the query using a range of 19:00-20:00). However, the data in MySQL is based on PDT, so no matches are found. It appears that the "time_zone" session attribute contains the value "UTC". Where is this coming from and how can I change it?
>>>>>>>>
>>>>>>>> Problems:
>>>>>>>>
>>>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host name so that the generated SQL queries are correct?
>>>>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> The df command output is converted into the disk_xxxx_week table in MySQL, if I remember correctly. In MySQL, are the database tables getting created? Make sure that you have:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>chukwa.post.demux.data.loader</name>
>>>>>>>>   <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> in Chukwa-demux.conf.
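>>>>>>>>
>>>>>>>> If the tables are being created, a quick sanity check is to query the weekly table directly from the mysql client. Something along these lines should show whether MetricDataLoader is actually inserting rows (the host column name is my assumption, and disk_xxxx_week stands in for the actual table created for your cluster):
>>>>>>>>
>>>>>>>>   select host, count(*) as loaded_rows
>>>>>>>>   from disk_xxxx_week
>>>>>>>>   where timestamp > now() - interval 1 day
>>>>>>>>   group by host;
>>>>>>>>
>>>>>>>> If the table exists but stays empty, the loading step is the problem; if it has rows, the issue is on the HICC query side.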
>>>>>>>>
>>>>>>>> The rough picture of the data flow looks like this:
>>>>>>>>
>>>>>>>> 1. demux -> generates chukwa record outputs.
>>>>>>>> 2. archive -> generates bigger files by compacting data sink files (runs concurrently with step 1).
>>>>>>>> 3. postProcess -> looks up which files were generated by the demux process and dispatches them to the different data loaders.
>>>>>>>> 4. MetricDataLoaderPool -> dispatches multiple threads to load chukwa record files into the different MDLs.
>>>>>>>> 5. MetricDataLoader -> loads the sequence files into the database by record type, as defined in mdl.xml.
>>>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find the widget descriptor files in hdfs://namenode:port/chukwa/hicc/widgets; they embed the full SQL template, like:
>>>>>>>>
>>>>>>>>   Query="select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]"
>>>>>>>>
>>>>>>>> This will output all the metrics in JSON format, and the HICC graphing widget will render the graph.
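>>>>>>>>
>>>>>>>> To make the substitution concrete: [system_metrics] expands to the cluster's weekly metrics table and [start]/[end] to the time range chosen in the dashboard, so the SQL that actually reaches MySQL looks roughly like the query below. The table name, time range, and hostname are placeholders, and the host restriction in particular is my assumption about how the host selector feeds into the query:
>>>>>>>>
>>>>>>>>   select cpu_user_pcnt
>>>>>>>>   from system_metrics_xxxx_week
>>>>>>>>   where timestamp between '2010-03-17 00:00:00' and '2010-03-17 01:00:00'
>>>>>>>>   and host in ('host1.example.com');
>>>>>>>>
>>>>>>>> If the host selector has nothing selected for the cluster, that host list comes through empty and the query matches no rows, which shows up in the widget as "No Data Available."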
>>>>>>>>
>>>>>>>> If there is no data, look at postProcess.log and make sure the data loading is not throwing exceptions. Steps 3 to 6 are deprecated and will be replaced with something else. Hope this helps.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>> Hi Eric,
>>>>>>>>
>>>>>>>> Eric Yang wrote:
>>>>>>>>
>>>>>>>> Hi Kirk,
>>>>>>>>
>>>>>>>> I am working on a design which removes MySQL from Chukwa. I am making this departure from MySQL because the MDL framework was built for prototyping; it will not scale in a production system where Chukwa could be hosted on a large Hadoop cluster. HICC will serve data directly from HDFS in the future.
>>>>>>>>
>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with the trunk version of Chukwa. You can load ChukwaRecords using the org.apache.hadoop.chukwa.dataloader.MetricDataLoader class or mdl.sh from Chukwa 0.3.
>>>>>>>>
>>>>>>>> I'm to the point where the "df" example is working and demux is storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from 0.3.0, no data is getting updated in the database.
>>>>>>>>
>>>>>>>> My question is: what's the process to get a custom Demux implementation to be viewable in HICC? Are the database tables magically created and populated for me? Does HICC generate a widget for me?
>>>>>>>>
>>>>>>>> HICC looks very nice, but when I try to add a widget to my dashboard, the preview always reads, "No Data Available." I'm running $CHUKWA_HOME/bin/start-all.sh followed by $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the bin directory).
>>>>>>>>
>>>>>>>> What am I missing?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk
>>>>>>>>
>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it will not be supported once we make the transition to Avro + TFile.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I recently switched to trunk as I was experiencing a lot of issues with 0.3.0. In 0.3.0, there was a dbAdmin.sh script that would run and try to stick data into MySQL from HDFS. However, that script is gone, and when I run the system as built from trunk, nothing is ever populated in the database. Where are the instructions for setting up the HDFS -> MySQL data migration for HICC?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kirk