[ 
https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896396#action_12896396
 ] 

Eric Yang commented on CHUKWA-444:
----------------------------------

Ideally the RowKey could use a better serialization library such as Avro, but 
baking an Avro schema into the RowKey seems excessive.  Hence, the current 
implementation is:

{noformat}
[time partition]-[primary key]
{noformat}

Users can encode their own dimensions in the primary key section.
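
As a rough illustration of the format above, here is a minimal Java sketch of 
building and splitting such a key.  It is not taken from the attached patches: 
the class name RowKeySketch, the one-day partition width, and the 
"cluster/host" layout of the primary key are all assumptions made for the 
example.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Chukwa: shows one way to form and parse
// a "[time partition]-[primary key]" row key.
public class RowKeySketch {
    // Assumed partition width of one day; the real width may differ.
    public static final long PARTITION_MS = TimeUnit.DAYS.toMillis(1);

    // Round a millisecond timestamp down to the start of its partition.
    public static long timePartition(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, PARTITION_MS);
    }

    // Join the time partition and the user-defined primary key with '-'.
    public static String build(long timestampMs, String primaryKey) {
        return timePartition(timestampMs) + "-" + primaryKey;
    }

    // Split on the first '-' only, so the primary key may itself contain '-'.
    // Assumes a non-negative time partition (no leading minus sign).
    public static String[] split(String rowKey) {
        return rowKey.split("-", 2);
    }

    public static void main(String[] args) {
        // The primary key layout "cluster/host" is an assumption.
        String key = build(1281052800123L, "demo-cluster/host1.example.com");
        String[] parts = split(key);
        System.out.println(key);
        System.out.println("partition=" + parts[0] + " primaryKey=" + parts[1]);
    }
}
```

Splitting with a limit of 2 leaves the primary key section opaque to the 
storage layer, so each user decides how to pack dimensions into it.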

If we use Avro, the key would look like:

{noformat}
{
  "type" : "record",
  "name" : "ChukwaKey",
  "namespace" : "org.apache.chukwa.ipc",
  "fields" : [
    { "name" : "timePartition", "type" : "long" },
    { "name" : "cluster", "type" : "string" },
    { "name" : "host", "type" : "string" }
  ]
},{ ... binary data ...}
{noformat}

Avro may be better at type checking, but every row key would repeat the same 
schema, which does not seem efficient for storing and serializing the key.  
One benefit of Avro is better handling of flexible key dimensions.  A string 
split, on the other hand, is more efficient for serving data to HICC.  Which 
method does the community prefer?


> Redefine Chukwa time series storage
> -----------------------------------
>
>                 Key: CHUKWA-444
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-444
>             Project: Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>         Attachments: CHUKWA-444-1.patch, CHUKWA-444-2.patch, CHUKWA-444.patch
>
>
> The current Chukwa Record format is not suitable for data visualization.  It 
> is more like an archive format, which combines data from multiple sources 
> (hosts) and groups them into a sorted, time-partitioned sequence file.  Most 
> people collect data for two reasons: archiving and data analysis.  The 
> current Chukwa record format is fine for archiving, but it is not so great 
> for data analysis.  Data analysis can be further broken down into two 
> types: 1) data that can be aggregated and summarized, such as metrics, and 
> 2) data that cannot be summarized, such as job history.  Type 1 data is 
> useful for visualization in graphs, and type 2 data is useful for plain-text 
> viewing or searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records 
> for data analysis.  Outside of the Hadoop world, RRDtool is great for time 
> series data storage and is optimized for metrics from a single source, i.e. a 
> host.  RRD data files fragment badly when there are hundreds of thousands of 
> sources.  Chukwa time series data storage should be able to combine multiple 
> data sources into one Chukwa file to combat the file fragmentation problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
