[ 
https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838970#action_12838970
 ] 

Eric Yang commented on CHUKWA-444:
----------------------------------

More refined plan:

Type 1 data:

A post-demux data loader waits for new ChukwaRecords files and merges them 
with the existing ChukwaRecords files through a second MR job.  The second 
MR job also produces a low-resolution version of the data for reports.

/chukwa/repos/TYPE/DATE <-- Original data goes here.
/chukwa/report/TYPE/[yearly,monthly,weekly,daily] <-- Summarized JSON data goes here.

The report JSON will be fixed at 300 data points per series, which is 
optimized for graphing.
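
As a rough illustration of the Type 1 flow above, the following Python sketch 
merges newly arrived records into an existing time-sorted series and then 
averages the result down to a fixed number of points.  It is a single-process 
stand-in, not the proposed MR job: the function names merge_records and 
downsample, the (timestamp, value) tuple layout, and the choice of bucket 
averaging are all assumptions made for this example.  Only the 300-point 
target comes from the plan.

```python
import heapq
from statistics import mean

REPORT_POINTS = 300  # target from the plan; everything else here is assumed


def merge_records(existing, incoming):
    """Merge two time-sorted lists of (timestamp, value) into one sorted list."""
    return list(heapq.merge(existing, incoming, key=lambda record: record[0]))


def downsample(points, target=REPORT_POINTS):
    """Reduce a time-sorted series to at most `target` points.

    The time range is split into `target` equal-width buckets and each
    non-empty bucket is replaced by its average timestamp and value.
    """
    if len(points) <= target:
        return list(points)
    start, end = points[0][0], points[-1][0]
    if end == start:
        # All samples share one timestamp: collapse them to a single point.
        return [(start, mean(value for _, value in points))]
    width = (end - start) / target
    buckets = [[] for _ in range(target)]
    for ts, value in points:
        # Clamp so the final timestamp falls into the last bucket.
        idx = min(int((ts - start) / width), target - 1)
        buckets[idx].append((ts, value))
    return [
        (mean(ts for ts, _ in bucket), mean(value for _, value in bucket))
        for bucket in buckets
        if bucket
    ]
```

Empty buckets are skipped in this sketch, so a sparse series yields fewer 
than 300 points; a real job would have to decide whether to pad or 
interpolate instead.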

Type 2 data for plain text searching:

After the data has been archived, use a full-body indexer such as Lucene to 
build searchable indexes.
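
To make the Type 2 idea concrete, here is a toy inverted index in plain 
Python.  It is not Lucene and does not use Lucene's API; it only shows the 
kind of word-to-document lookup a full-body indexer provides.  The function 
names build_index and search and the sample record ids are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical archived records keyed by an id.
archived = {
    "rec1": "Job job_1 FAILED on host a",
    "rec2": "Job job_2 SUCCEEDED on host b",
}
index = build_index(archived)
failed = search(index, "failed")
```

Here search(index, "failed") matches only the first record, while a query 
such as "job host" matches both.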

The architecture looks like this:

{noformat}
Adaptor -> Agent -> Collector |-> Archive -> Full Body Index |-> Retention
                              +-> Demux   -> Aggregation     |-> Retention
                                                             +-> Hicc
{noformat}                                             

> Redefine Chukwa time series storage
> -----------------------------------
>
>                 Key: CHUKWA-444
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-444
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>
> The current Chukwa Record format is not suitable for data visualization.  It 
> is more like an archive format that combines data from multiple sources 
> (hosts) and groups them into a sorted, time-partitioned sequence file.  Most 
> people collect data for two reasons: archiving and data analysis.  The 
> current Chukwa Record format is fine for archiving, but it is not so great 
> for data analysis.  Data analysis can be further broken down into two 
> different types: 1) data that can be aggregated and summarized, such as 
> metrics, and 2) data that cannot be summarized, such as job history.  Type 1 
> data is useful for visualization by graph, and type 2 data is useful for 
> plain-text viewing or searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records 
> for data analysis.  Outside of the Hadoop world, RRDtool is great for time 
> series data storage and is optimized for metrics from a single source, i.e. a 
> host.  RRD data files fragment badly when there are hundreds of thousands of 
> sources.  Chukwa time series data storage should be able to combine multiple 
> data sources into one Chukwa file to combat the file fragmentation problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
