[ 
https://issues.apache.org/jira/browse/HADOOP-5040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665200#action_12665200
 ] 

Eric Yang commented on HADOOP-5040:
-----------------------------------

A Chukwa record contains both key and value hashes.  The short-term goal is to 
index by key, and the long-term goal is to be able to generate a full-body index 
on the value hashes.  The current design of demux creates multiple spill files 
if the same time partition already contains data.  In order to search through 
the time partition, the indexes of the multiple spill files need to be merged 
to provide a linear view of the timeline.  At the same time, the hourly or 
daily roll up reduces the number of files on disk.  This means the indexing 
system could either rewrite the index multiple times or use a time-leased 
mechanism for indexed keys.
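
To make the merge step concrete, here is a minimal Python sketch of combining 
per-spill-file key indexes into one time-ordered view.  The IndexEntry layout, 
the file names and the function names are hypothetical and are not taken from 
the Chukwa code base; the sketch also assumes each spill file's index is 
already sorted by timestamp.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple


class IndexEntry(NamedTuple):
    """Hypothetical index entry; not Chukwa's actual on-disk format."""
    timestamp: int   # record time, e.g. epoch milliseconds
    key: str         # indexed record key
    spill_file: str  # spill file that holds the record
    offset: int      # position of the record inside that file


def merge_spill_indexes(indexes: Iterable[Iterable[IndexEntry]]) -> Iterator[IndexEntry]:
    """Lazily merge per-spill-file indexes into one view ordered by timestamp.

    Assumes every input index is already sorted by timestamp.
    """
    return heapq.merge(*indexes, key=lambda entry: entry.timestamp)


def lookup(timeline: Iterable[IndexEntry], key: str) -> List[IndexEntry]:
    """Return the entries for one key, in timeline order."""
    return [entry for entry in timeline if entry.key == key]


# Two spill files written for the same time partition.
spill_a = [IndexEntry(100, "host1", "spill-0", 0), IndexEntry(300, "host2", "spill-0", 64)]
spill_b = [IndexEntry(200, "host1", "spill-1", 0), IndexEntry(400, "host1", "spill-1", 48)]

timeline = list(merge_spill_indexes([spill_a, spill_b]))
host1_hits = lookup(timeline, "host1")
```

The merge is lazy, so a query over a narrow time range would not need to 
materialize the whole partition; the list() call above is only for the example.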

Rewriting the index multiple times only works for a small set of data, because 
the scan time for Chukwa records grows linearly with the data volume.  Once the 
data reaches petabytes, it makes more sense to have a time-leased index where 
each part of the index can expire and be remerged more easily.
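
One possible reading of a time-leased index, sketched in Python: the index is 
split into segments that each cover a time window and carry a lease, and only 
the segments whose lease has run out are dropped and rebuilt.  The class names, 
fields and lease semantics below are assumptions made for illustration, not an 
existing Chukwa design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class IndexSegment:
    """Hypothetical index segment covering the window [start, end)."""
    start: int          # inclusive start of the time window
    end: int            # exclusive end of the time window
    lease_expires: int  # time after which the segment must be rebuilt
    postings: Dict[str, List[int]] = field(default_factory=dict)  # key -> record timestamps


class TimeLeasedIndex:
    def __init__(self) -> None:
        self.segments: List[IndexSegment] = []

    def add_segment(self, segment: IndexSegment) -> None:
        self.segments.append(segment)
        self.segments.sort(key=lambda s: s.start)

    def expire(self, now: int) -> List[Tuple[int, int]]:
        """Drop segments whose lease has run out.

        Returns the time windows that were dropped so the caller can
        reindex just those windows (for example after a roll up).
        """
        expired = [s for s in self.segments if s.lease_expires <= now]
        self.segments = [s for s in self.segments if s.lease_expires > now]
        return [(s.start, s.end) for s in expired]

    def search(self, key: str, start: int, end: int) -> List[int]:
        """Return timestamps for key in [start, end), skipping segments outside the range."""
        hits: List[int] = []
        for s in self.segments:
            if s.end <= start or s.start >= end:
                continue
            hits.extend(t for t in s.postings.get(key, []) if start <= t < end)
        return sorted(hits)


index = TimeLeasedIndex()
index.add_segment(IndexSegment(0, 3600, lease_expires=7200, postings={"host1": [10, 20]}))
index.add_segment(IndexSegment(3600, 7200, lease_expires=10800, postings={"host1": [4000]}))
before = index.search("host1", 0, 7200)
dropped = index.expire(now=7200)
after = index.search("host1", 0, 7200)
```

Under these assumptions, a roll up only invalidates the segments for the hours 
it touched, and the rest of the index is left alone.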

By using Katta, it may be possible to have the linear time index partitioned 
and updated on multiple servers, with a multicast search broadcast to every 
index partition so that results are retrieved more efficiently.
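
The broadcast search can be pictured with a small scatter-gather sketch in 
Python.  This does not use or imitate Katta's API; the shards are plain 
in-memory dictionaries standing in for index partitions on separate servers, 
and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Hypothetical shard: key -> record timestamps held by one index partition.
Shard = Dict[str, List[int]]


def search_shard(shard: Shard, key: str) -> List[int]:
    """Search one partition; in a real deployment this would be a remote call."""
    return shard.get(key, [])


def broadcast_search(shards: List[Shard], key: str) -> List[int]:
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, key), shards)
        return sorted(t for part in partials for t in part)


shards = [
    {"host1": [300, 100]},
    {"host1": [200], "host2": [50]},
]
result = broadcast_search(shards, "host1")
```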

> Need index for chukwa sequence files
> ------------------------------------
>
>                 Key: HADOOP-5040
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5040
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: contrib/chukwa
>         Environment: Redhat EL 5.1 and Java 6
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>
> Chukwa has the ability to collect a large volume of data, but the lack of an 
> index prevents the Chukwa front end from serving data straight from HDFS.  
> This jira is the placeholder for designing an indexing service for Chukwa.  
> The plan is to create an indexing service based on available software like 
> Lucene or Katta.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
