[jira] Commented: (CHUKWA-444) Redefine Chukwa time series storage

Bill Graham (JIRA) Mon, 28 Jun 2010 14:15:48 -0700

    [ 
https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883293#action_12883293
 ]


Bill Graham commented on CHUKWA-444:
------------------------------------

I agree with Jerome: I think Chukwa should still be usable without HBase. 
If you have an HBase install and want real-time, it can be enabled. 
Ideally we would have the ability to configure which data pipeline 
to follow. I like the idea of adding an HBase component to the mix though.
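
As a rough illustration of what "configure which data pipeline to follow" 
could mean, here is a minimal sketch. Everything in it is an assumption made 
for illustration: the config key chukwa.pipeline, the function names, and the 
record shape are hypothetical and are not taken from Chukwa or from the 
attached patch. The only point is that the non-HBase path stays the default 
and the HBase path is opt-in.

```python
# Hypothetical sketch: none of these names come from Chukwa itself.
from typing import Callable, Dict, List

Record = Dict[str, str]
Pipeline = Callable[[List[Record]], List[str]]


def archive_pipeline(records: List[Record]) -> List[str]:
    """Stand-in for the existing archive path, which needs no HBase."""
    return ["archive:" + r["source"] for r in records]


def hbase_pipeline(records: List[Record]) -> List[str]:
    """Stand-in for an optional real-time path backed by HBase."""
    return ["hbase:" + r["source"] for r in records]


# Registry of the pipelines a deployment may choose from.
PIPELINES: Dict[str, Pipeline] = {
    "archive": archive_pipeline,
    "hbase": hbase_pipeline,
}


def select_pipeline(config: Dict[str, str]) -> Pipeline:
    """Pick a pipeline from an assumed config key; default to the non-HBase path."""
    name = config.get("chukwa.pipeline", "archive")
    if name not in PIPELINES:
        raise ValueError("unknown pipeline: " + name)
    return PIPELINES[name]
```

With an empty config, select_pipeline returns the archive path, so an install 
without HBase keeps working; setting the assumed key to "hbase" switches to 
the real-time path.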

> Redefine Chukwa time series storage
> -----------------------------------
>
>                 Key: CHUKWA-444
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-444
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>         Attachments: CHUKWA-444.patch
>
>
> The current Chukwa Record format is not suitable for data visualization.  It 
> is more like an archive format that combines data from multiple sources 
> (hosts) and groups them into a sorted, time-partitioned sequence file.  Most 
> people collect data for two reasons: archiving and data analysis.  The 
> current Chukwa Record format is fine for archiving, but it is not so great 
> for data analysis.  Data analysis can be further broken down into two 
> different types: 1) data that can be aggregated and summarized, such as 
> metrics, and 2) data that cannot be summarized, like job history.  Type 1 
> data is useful for visualization by graph, and type 2 data is useful for 
> plain-text viewing or searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records 
> for data analysis.  Outside of the Hadoop world, RRDtool is great for time 
> series data storage, and is optimized for metrics from a single source, i.e. 
> a host.  RRD data files fragment badly when there are hundreds of thousands 
> of sources.  Chukwa time series data storage should be able to combine 
> multiple data sources into one Chukwa file to combat the file fragmentation 
> problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
