[
https://issues.apache.org/jira/browse/KAFKA-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310980#comment-14310980
]
Jay Kreps commented on KAFKA-1403:
----------------------------------
Ultimately, in order to be accurate, the time will actually need to be in the
message itself. Currently we use the write time, but this can be arbitrarily
inaccurate: if you delete the data on a server and restart it, it will rewrite
everything with new timestamps.
> Adding timestamp to kafka index structure
> -----------------------------------------
>
> Key: KAFKA-1403
> URL: https://issues.apache.org/jira/browse/KAFKA-1403
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Affects Versions: 0.8.1
> Reporter: Xinyao Hu
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> Right now, kafka doesn't have a timestamp per message. It makes the
> assumption that all the messages in the same file have the same timestamp,
> namely the mtime of the file. This makes it inefficient to scan all the
> messages within a time window, which is a valid use case in a lot of realtime
> data analysis. One way to hack around this is to roll a new file after a short
> period of time. However, this results in opening lots of files (KAFKA-1404),
> which eventually crashed the servers.
> My guess is that this is not implemented for efficiency reasons. It would
> cost an additional four bytes per message, which might be pinned in memory
> for fast access. There are some simple perf optimizations, such as
> differential encoding + variable-length encoding, which should bring the cost
> down to 1-2 bytes per message on average.
> Let me know if this makes sense.
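As an illustration of the differential + variable-length encoding idea above (a minimal sketch, not Kafka code; the function names and 8-byte header are assumptions for the example): store the first timestamp in full, then encode each successive timestamp as a varint-encoded delta, so closely spaced messages cost only 1-2 bytes each.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned LEB128: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamps(timestamps: list[int]) -> bytes:
    """Delta-encode a non-decreasing list of millisecond timestamps."""
    out = bytearray(timestamps[0].to_bytes(8, "big"))  # first value in full
    prev = timestamps[0]
    for ts in timestamps[1:]:
        out += encode_varint(ts - prev)  # small deltas -> 1-2 bytes each
        prev = ts
    return bytes(out)

# Messages arriving ~100 ms apart: each delta fits in one varint byte.
ts = [1_400_000_000_000 + 100 * i for i in range(1000)]
encoded = encode_timestamps(ts)
print(len(encoded))  # -> 1007: 8 bytes for the first value + 1 byte per delta
```

Deltas under 128 ms take one byte and deltas under ~16 seconds take two, which is where the 1-2 bytes average in the description comes from for typical high-throughput topics.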
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)