[
https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838970#action_12838970
]
Eric Yang commented on CHUKWA-444:
----------------------------------
More refined plan:
Type 1 data:
Add a post-demux data loader that waits for new ChukwaRecords files and
merges them with the existing ChukwaRecords files through a second MR job.
The second MR job also produces lower-resolution data for reports.
/chukwa/repos/TYPE/DATE <-- Original data goes here.
/chukwa/report/TYPE/[yearly,monthly,weekly,daily] <-- Summarized JSON data goes
here.
The report JSON will be fixed at 300 data points per series, a size
optimized for graphing.
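As a rough illustration of the summarization step, the sketch below reduces a time-sorted series to at most 300 points by averaging equal-width time buckets and wraps the result in a JSON document. It is only a sketch under assumptions: the function names, the JSON field names, and the choice of bucket averaging are hypothetical and are not taken from Chukwa. Only the 300-point target and the /chukwa/report/TYPE/PERIOD layout come from the plan above.

```python
import json

# From the plan above: report series are capped at 300 data points.
TARGET_POINTS = 300


def downsample(points, target=TARGET_POINTS):
    """Reduce time-sorted (timestamp, value) pairs to at most `target` points.

    Assumption: each equal-width time bucket is summarized by its mean value.
    Empty buckets are skipped, so the result may hold fewer than `target` points.
    """
    if len(points) <= target:
        return list(points)
    start, end = points[0][0], points[-1][0]
    if end == start:
        # All samples share one timestamp: collapse them into a single point.
        return [(start, sum(v for _, v in points) / len(points))]
    width = (end - start) / target
    buckets = {}
    for ts, value in points:
        # Clamp so the final timestamp falls into the last bucket.
        idx = min(int((ts - start) / width), target - 1)
        buckets.setdefault(idx, []).append(value)
    return [
        (start + (idx + 0.5) * width, sum(vals) / len(vals))
        for idx, vals in sorted(buckets.items())
    ]


def report_path(data_type, period):
    """Build a report directory path (hypothetical helper)."""
    if period not in ("yearly", "monthly", "weekly", "daily"):
        raise ValueError("unknown period: " + period)
    return "/chukwa/report/%s/%s" % (data_type, period)


def to_report_json(series_name, points):
    """Serialize one downsampled series; the field names are assumptions."""
    return json.dumps({"series": series_name, "data": downsample(points)})
```

For example, to_report_json("cpu_user", samples) would produce one JSON series that a graphing front end could plot without any further reduction.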
Type 2 data for plain-text searching:
After the data has been archived, use a full-body indexer such as Lucene to
build searchable indexes.
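To make the idea of a full-body index concrete, here is a toy inverted index in Python. It does not use Lucene and does not reflect the Lucene API; the record IDs, the tokenizer, and the function names are all hypothetical. It only shows the basic technique an indexer applies to archived text: map each term to the set of records that contain it, then intersect those sets at query time.

```python
import re
from collections import defaultdict

# Assumption: tokens are lowercase runs of letters, digits, and underscores.
TOKEN_RE = re.compile(r"[a-z0-9_]+")


def tokenize(text):
    """Split text into lowercase tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(records):
    """Map each token to the set of record IDs whose body contains it.

    `records` is a dict of record ID -> body text (hypothetical shape).
    """
    index = defaultdict(set)
    for record_id, body in records.items():
        for token in tokenize(body):
            index[token].add(record_id)
    return index


def search(index, query):
    """Return the IDs of records that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

A real indexer adds ranking, phrase queries, and on-disk segment management, which is the reason to rely on an existing library rather than a structure like this one.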
The architecture looks like this:
{noformat}
Adaptor -> Agent -> Collector |-> Archive -> Full Body Index |-> Retention
                              +-> Demux -> Aggregation |-> Retention
                                                       +-> Hicc
{noformat}
> Redefine Chukwa time series storage
> -----------------------------------
>
> Key: CHUKWA-444
> URL: https://issues.apache.org/jira/browse/CHUKWA-444
> Project: Hadoop Chukwa
> Issue Type: New Feature
> Components: Data Processors
> Environment: Redhat EL 5.1, Java 6
> Reporter: Eric Yang
>
> The current Chukwa Record format is not suitable for data visualization. It
> is more like an archive format that combines data from multiple sources
> (hosts) and groups them into a sorted, time-partitioned sequence file. Most
> people collect data for two reasons: archiving and data analysis. The
> current Chukwa Record format is fine for archiving, but it is not well
> suited to data analysis. Data analysis can be further broken down into two
> types: 1) data that can be aggregated and summarized, such as metrics, and
> 2) data that cannot be summarized, such as job history. Type 1 data is
> useful for visualization as graphs, and type 2 data is useful for plain-text
> viewing or searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records
> for data analysis. Outside the Hadoop world, RRDtool is great for time
> series data storage and is optimized for metrics from a single source, i.e. a
> host. RRD data files fragment badly when there are hundreds of thousands of
> sources. Chukwa time series data storage should be able to combine multiple
> data sources into one Chukwa file to combat the file fragmentation problem.