Hi guys, I'm a committer on the Voldemort project, so perhaps I could lend a hand with some of this work should you decide to go that route.
Thanks,
Kirk

Eric Yang wrote:
> The JIRA is CHUKWA-444. Voldemort looks like a good fit on paper. I will
> investigate. Thanks.
>
> Regards,
> Eric
>
> On 3/19/10 12:49 PM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>
>> Do you have a Jira for that, so we can continue the discussion there?
>>
>> The reason I'm asking is that if you need to move off MySQL, it's
>> presumably because you need to scale. And if you need to scale, then you
>> need partitioning, which Voldemort and HBase (like all the NoSQL
>> implementations) are already working on.
>>
>> Voldemort index/data files can be built using Hadoop, and HBase is
>> already using TFile.
>>
>> Thanks,
>> /Jerome.
>>
>> On 3/19/10 12:33 PM, "Eric Yang" <ey...@yahoo-inc.com> wrote:
>>
>>> Hi Jerome,
>>>
>>> I am not planning to put SQL on top of HDFS. The Chukwa
>>> MetricDataLoader subsystem is an index builder. The replacement for the
>>> index builder is either TFile or a streaming job that builds the index,
>>> plus distributed processes that cache the index by keeping the TFile
>>> open or loading the index into memory. Aggregation could be replaced
>>> with a second-stage mapreduce job, or with a workflow subsystem like
>>> Oozie. It could also be replaced with Hive, if the community likes that
>>> approach.
>>>
>>> Regards,
>>> Eric
>>>
>>> On 3/19/10 11:30 AM, "Jerome Boulon" <jbou...@netflix.com> wrote:
>>>
>>>> Hi Eric,
>>>> Correct me if I'm wrong, but given that "the SQL portion of Chukwa is
>>>> deprecated, and the HDFS-based replacement is six months out", you
>>>> need a SQL-like engine, otherwise it's not a replacement.
>>>> So does that mean you're planning to get a SQL-like engine working on
>>>> top of HDFS in less than six months? If yes, do you already have some
>>>> working code? And what performance are you targeting? Even if MySQL
>>>> doesn't scale, you can still do a bunch of things with it...
>>>>
>>>> Thanks,
>>>> /Jerome.
>>>>
>>>> On 3/18/10 8:59 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> Awesome - everything's working great now.
>>>>>
>>>>> So, as you've said, the SQL portion of Chukwa is deprecated, and the
>>>>> HDFS-based replacement is six months out. What should I do to get
>>>>> the data from the adapters -> collectors -> HDFS -> HICC? Is the
>>>>> HDFS-based HICC replacement spec'ed out enough for others to
>>>>> contribute?
>>>>>
>>>>> Thanks,
>>>>> Kirk
>>>>>
>>>>> Eric Yang wrote:
>>>>>
>>>>>> Hi Kirk,
>>>>>>
>>>>>> 1. The host selector currently shows hostnames collected from the
>>>>>> SystemMetrics table, so you need top, iostat, df, and sar collected
>>>>>> to populate the SystemMetrics table correctly. The hostnames are
>>>>>> also cached in the user session, so you will need to switch to a
>>>>>> different cluster and switch back, or restart HICC, to flush the
>>>>>> cached hostnames from the session. The host selector should
>>>>>> probably pick up hostnames from a different data source in a
>>>>>> future release.
>>>>>>
>>>>>> 2. The server should run in UTC. Timezone support was never
>>>>>> implemented completely, so a server in another timezone will not
>>>>>> work correctly (a possible workaround is sketched below).
>>>>>>
>>>>>> 3. The SQL aggregator (deprecated, by the way) runs as part of
>>>>>> dbAdmin.sh; it down-samples data from the weekly tables into the
>>>>>> monthly, yearly, and decade tables. I wrote this submodule over a
>>>>>> weekend as a show-and-tell prototype. I strongly recommend avoiding
>>>>>> the SQL part of Chukwa altogether.
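>>>>>> On the timezone point: if you can't reconfigure the box itself,
>>>>>> forcing just the Chukwa JVMs into UTC may be enough. A minimal
>>>>>> sketch, untested, and assuming the start script passes JAVA_OPTS
>>>>>> through to the JVM:
>>>>>>
>>>>>>   # make the shell and any child JVMs default to UTC
>>>>>>   export TZ=UTC
>>>>>>   export JAVA_OPTS="$JAVA_OPTS -Duser.timezone=UTC"
>>>>>>   $CHUKWA_HOME/bin/start-all.sh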
>>>>>>
>>>>>> Regards,
>>>>>> Eric
>>>>>>
>>>>>> On 3/18/10 1:15 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>
>>>>>>> Hi Eric,
>>>>>>>
>>>>>>> I believe I have most of steps 1-5 working. Data from "/usr/bin/df"
>>>>>>> is being collected, parsed, stuck into HDFS, and then pulled out
>>>>>>> again and placed into MySQL. However, HICC isn't showing me my data
>>>>>>> just yet...
>>>>>>>
>>>>>>> The disk_2098_week table is filled out with several entries and
>>>>>>> looks great. But if I select my cluster from the "Cluster Selector"
>>>>>>> and "Last 12 Hours" from the "Time" widget, the "Disk Statistics"
>>>>>>> widget still says "No Data available."
>>>>>>>
>>>>>>> It appears to be because part of the SQL query includes the host
>>>>>>> name, which is coming across in the SQL parameters as "". Since the
>>>>>>> disk_2098_week table properly includes the host name, nothing is
>>>>>>> returned by the query. Just for grins, I updated the table manually
>>>>>>> in MySQL to blank out the host names and I get a super cool, pretty
>>>>>>> graph (which looks great, BTW).
>>>>>>>
>>>>>>> Additionally, if I select other time periods such as "Last 1 Hour",
>>>>>>> I see the query is using UTC or something (at 1:00 PDT, I see the
>>>>>>> query using a range of 19:00-20:00). However, the data in MySQL is
>>>>>>> based on PDT, so no matches are found. It appears that the
>>>>>>> "time_zone" session attribute contains the value "UTC". Where is
>>>>>>> this coming from and how can I change it?
>>>>>>>
>>>>>>> Problems:
>>>>>>>
>>>>>>> 1. How do I get the "Hosts Selector" in HICC to include my host
>>>>>>> name so that the generated SQL queries are correct?
>>>>>>> 2. How do I make the "time_zone" session parameter use PDT vs. UTC?
>>>>>>> 3. How do I populate the other tables, such as "disk_489_month"?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kirk
>>>>>>>
>>>>>>> Eric Yang wrote:
>>>>>>>
>>>>>>>> The df command is converted into the disk_xxxx_week table in
>>>>>>>> MySQL, if I remember correctly. Are the database tables getting
>>>>>>>> created in MySQL? Make sure that you have the following in
>>>>>>>> chukwa-demux.conf:
>>>>>>>>
>>>>>>>> <property>
>>>>>>>>   <name>chukwa.post.demux.data.loader</name>
>>>>>>>>   <value>org.apache.hadoop.chukwa.dataloader.MetricDataLoaderPool,org.apache.hadoop.chukwa.dataloader.FSMDataLoader</value>
>>>>>>>> </property>
>>>>>>>>
>>>>>>>> The rough picture of the data flow looks like this:
>>>>>>>>
>>>>>>>> 1. demux -> Generates chukwa record outputs.
>>>>>>>> 2. archive -> Generates bigger files by compacting data sink files
>>>>>>>> (runs concurrently with step 1).
>>>>>>>> 3. postProcess -> Looks up which files were generated by the demux
>>>>>>>> process and dispatches them to the different data loaders.
>>>>>>>> 4. MetricDataLoaderPool -> Dispatches multiple threads to load
>>>>>>>> chukwa record files into the different MDLs.
>>>>>>>> 5. MetricDataLoader -> Loads sequence files into the database by
>>>>>>>> record type, as defined in mdl.xml.
>>>>>>>> 6. HICC widgets have a descriptor language in JSON. You can find
>>>>>>>> the widget descriptor files in
>>>>>>>> hdfs://namenode:port/chukwa/hicc/widgets; each embeds a full SQL
>>>>>>>> template like:
>>>>>>>>
>>>>>>>> Query="select cpu_user_pcnt from [system_metrics] where timestamp
>>>>>>>> between [start] and [end]"
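>>>>>>>> For orientation, a whole descriptor is just a small JSON document
>>>>>>>> wrapped around such a query. The field names below are from
>>>>>>>> memory, so treat this as a sketch rather than a reference, and
>>>>>>>> check an actual file under /chukwa/hicc/widgets for the real
>>>>>>>> schema:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "title": "CPU Utilization",
>>>>>>>>   "categories": "SystemMetrics,CPU",
>>>>>>>>   "query": "select cpu_user_pcnt from [system_metrics] where timestamp between [start] and [end]"
>>>>>>>> }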
>>>>>>>>
>>>>>>>> The query outputs the metrics in JSON format, and the HICC
>>>>>>>> graphing widget renders the graph.
>>>>>>>>
>>>>>>>> If there is no data, look at postProcess.log and make sure the
>>>>>>>> data loading is not throwing exceptions. Steps 3 to 6 are
>>>>>>>> deprecated and will be replaced with something else. Hope this
>>>>>>>> helps.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Eric
>>>>>>>>
>>>>>>>> On 3/17/10 4:16 PM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Eric,
>>>>>>>>>
>>>>>>>>> Eric Yang wrote:
>>>>>>>>>
>>>>>>>>>> Hi Kirk,
>>>>>>>>>>
>>>>>>>>>> I am working on a design that removes MySQL from Chukwa. I am
>>>>>>>>>> making this departure from MySQL because the MDL framework was
>>>>>>>>>> built for prototyping; it will not scale in a production system
>>>>>>>>>> where Chukwa could be hosted on a large Hadoop cluster. HICC
>>>>>>>>>> will serve data directly from HDFS in the future.
>>>>>>>>>>
>>>>>>>>>> Meanwhile, dbAdmin.sh from Chukwa 0.3 is still compatible with
>>>>>>>>>> the trunk version of Chukwa. You can load ChukwaRecords using
>>>>>>>>>> the org.apache.hadoop.chukwa.dataloader.MetricDataLoader class
>>>>>>>>>> or mdl.sh from Chukwa 0.3.
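>>>>>>>>>> Roughly, mdl.sh just wraps that class, so loading a single
>>>>>>>>>> demux output file should look something like the line below.
>>>>>>>>>> The argument list and the repos path are from memory, so
>>>>>>>>>> double-check them against the 0.3 script before relying on it:
>>>>>>>>>>
>>>>>>>>>>   $CHUKWA_HOME/bin/mdl.sh \
>>>>>>>>>>     hdfs://namenode:port/chukwa/repos/<cluster>/<RecordType>/<file>.evt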
>>>>>>>>>
>>>>>>>>> I'm to the point where the "df" example is working and demux is
>>>>>>>>> storing ChukwaRecord data in HDFS. When I run dbAdmin.sh from
>>>>>>>>> 0.3.0, no data is getting updated in the database.
>>>>>>>>>
>>>>>>>>> My question is: what's the process for getting a custom Demux
>>>>>>>>> implementation to be viewable in HICC? Are the database tables
>>>>>>>>> magically created and populated for me? Does HICC generate a
>>>>>>>>> widget for me?
>>>>>>>>>
>>>>>>>>> HICC looks very nice, but when I try to add a widget to my
>>>>>>>>> dashboard, the preview always reads, "No Data Available." I'm
>>>>>>>>> running $CHUKWA_HOME/bin/start-all.sh followed by
>>>>>>>>> $CHUKWA_HOME/bin/dbAdmin.sh (which I've manually copied to the
>>>>>>>>> bin directory).
>>>>>>>>>
>>>>>>>>> What am I missing?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kirk
>>>>>>>>>
>>>>>>>>>> The MetricDataLoader class will be marked as deprecated, and it
>>>>>>>>>> will not be supported once we make the transition to Avro +
>>>>>>>>>> TFile.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Eric
>>>>>>>>>>
>>>>>>>>>> On 3/15/10 11:56 AM, "Kirk True" <k...@mustardgrain.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I recently switched to trunk, as I was experiencing a lot of
>>>>>>>>>>> issues with 0.3.0. In 0.3.0 there was a dbAdmin.sh script that
>>>>>>>>>>> would run and try to stick data into MySQL from HDFS. However,
>>>>>>>>>>> that script is gone, and when I run the system as built from
>>>>>>>>>>> trunk, nothing is ever populated in the database. Where are the
>>>>>>>>>>> instructions for setting up the HDFS -> MySQL data migration
>>>>>>>>>>> for HICC?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Kirk