[jira] Updated: (CHUKWA-444) Redefine Chukwa time series storage

Eric Yang (JIRA) Sat, 26 Jun 2010 12:23:14 -0700

     [ 
https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eric Yang updated CHUKWA-444:
-----------------------------

    Attachment: CHUKWA-444.patch

The current data sink structure is suboptimal for visualizing data in near real 
time.  Hence, I propose to modify the demux parsers to work on data inside the 
collector.  As data arrives in the data sink, HBase acts as the micro 
time series data indexer for small incremental updates.  Other reporting jobs 
could run in the background as mapreduce jobs to organize data as described in 
the previous comment.  Data inside HBase is immediately available to REST APIs 
like Stargate and to the visualization tool HICC.

The pipeline would look like this:
{noformat}
Adaptor -> Agent -> Collector -> HbaseWriter -> DemuxParsers -> Hbase +-> Stargate  --> HICC (For monitoring)
                                                                      +-> Mapreduce --> Hbase --> Stargate --> HICC (For reporting)
{noformat}
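
As a rough illustration of the monitoring path above, here is a minimal Python 
sketch.  It is not taken from the attached patch: InMemoryTable is a stand-in for 
an HBase table, and demux_parse, hbase_writer, the chunk layout and the row key 
format (zero-padded timestamp followed by source) are all assumptions made for 
the example.

```python
from collections import defaultdict


class InMemoryTable:
    """Stand-in for an HBase table (assumption): row key -> {column: value}."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        return [(key, self.rows[key]) for key in sorted(self.rows)
                if start_key <= key < stop_key]


def demux_parse(chunk):
    """Hypothetical parser: 'ts name=value ...' -> (ts, {name: value})."""
    ts, *pairs = chunk["data"].split()
    metrics = {}
    for pair in pairs:
        name, value = pair.split("=", 1)
        metrics[name] = float(value)
    return int(ts), metrics


def hbase_writer(table, chunk):
    """Parse a chunk inside the collector and index it right away."""
    ts, metrics = demux_parse(chunk)
    # Zero-padded timestamp first, so a lexicographic scan is a time-range scan.
    row_key = f"{ts:013d}-{chunk['source']}"
    for name, value in metrics.items():
        table.put(row_key, name, value)
```

With this layout, a chunk is queryable by time range as soon as hbase_writer 
returns, without waiting for a background job.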

The existing structure still works, but will be marked as deprecated:
{noformat}
Adaptor -> Agent -> Collector -> SeqFileWriter -> HDFS +-> Archives
                                                       +-> Demux -> Database -> HICC (For reporting)
{noformat}

It will also be possible to combine both structures, so that archiving works in 
parallel with the new structure:
{noformat}
Adaptor -> Agent -> Collector +-> PipelineWriter
                              +-> SeqFileWriter -> HDFS -> Archive
                              +-> HbaseWriter   -> DemuxParsers -> Hbase +-> Stargate  --> HICC (For monitoring)
                                                                         +-> Mapreduce --> Hbase --> Stargate --> HICC (For reporting)
{noformat}
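
One way to read the combined structure is as a fan-out: the collector hands each 
chunk to every configured writer, so archiving and indexing proceed side by 
side.  The Python sketch below only illustrates that idea.  FanOutWriter and 
ListWriter are hypothetical names for this example, not classes from Chukwa or 
from the attached patch.

```python
class ListWriter:
    """Hypothetical writer that just records the chunks it receives."""

    def __init__(self, name):
        self.name = name
        self.chunks = []

    def add(self, chunk):
        self.chunks.append(chunk)


class FanOutWriter:
    """Hypothetical fan-out: passes each chunk to every writer, in order."""

    def __init__(self, writers):
        self.writers = list(writers)

    def add(self, chunk):
        for writer in self.writers:
            writer.add(chunk)


# Stand-ins for the archive branch and the indexing branch.
archive = ListWriter("archive")
index = ListWriter("index")
collector_sink = FanOutWriter([archive, index])
collector_sink.add({"source": "host1", "data": "1000 cpu=0.5"})
```

Because each branch receives the same chunk independently, the archive branch 
can keep its existing behaviour while the indexing branch is introduced.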

> Redefine Chukwa time series storage
> -----------------------------------
>
>                 Key: CHUKWA-444
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-444
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>         Attachments: CHUKWA-444.patch
>
>
> The current Chukwa Record format is not suitable for data visualization.  It 
> is more like an archive format, which combines data from multiple sources 
> (hosts) and groups them into a sorted, time-partitioned sequence file.  Most 
> people collect data for two reasons: archiving and data analysis.  The 
> current Chukwa Record format is fine for archiving, but it is not so great for 
> data analysis.  Data analysis can be further broken down into two different 
> types: 1) data that can be aggregated and summarized, such as metrics, and 
> 2) data that cannot be summarized, like job history.  Type 1 data is useful for 
> visualization as graphs, and type 2 data is useful for plain text viewing or 
> searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records 
> for data analysis.  Outside of the Hadoop world, rrdtool is great for time 
> series data storage, and is optimized for metrics from a single source, i.e. a 
> host.  RRD data files fragment badly when there are hundreds of thousands of 
> sources.  Chukwa time series data storage should be able to combine multiple 
> data sources into one Chukwa file to combat the file fragmentation problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
