[jira] [Commented] (KAFKA-3726) Enable cold storage option

2016-05-23 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296495#comment-15296495
 ] 

Radoslaw Gruchalski commented on KAFKA-3726:


bq. So my interpretation of your use case, from your post, is: If you're 
ingesting data into Kafka, with the aim of getting it into file-based storage 
for offline processing, it would be simpler to just copy the Kafka data files 
directly, rather than consume them and recreate new files in cold storage.

Indeed. The goal is to bring a standard mechanism for doing so. I'd be happy to 
contribute such a thing, but it would be great to work out the direction first.
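
To make "copy the Kafka data files directly" concrete, below is a minimal 
sketch of what such a copy could look like. The log directory, topic-partition 
directory and segment name are made up for illustration; a real tool would 
also copy the matching index files and only touch segments that are no longer 
the active segment:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Minimal sketch: copy a completed Kafka log segment to a cold-storage mount.
// All paths are hypothetical examples.
public class SegmentBackup {
    public static void main(String[] args) throws IOException {
        Path segment = Paths.get("/var/kafka-logs/events-0/00000000000000000000.log");
        Path coldStorage = Paths.get("/mnt/cold-storage/events-0");
        Files.createDirectories(coldStorage);
        Files.copy(segment, coldStorage.resolve(segment.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
{code}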



[jira] [Commented] (KAFKA-3726) Enable cold storage option

2016-05-22 Thread Radoslaw Gruchalski (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295541#comment-15295541
 ] 

Radoslaw Gruchalski commented on KAFKA-3726:


[~benstopford]: This is exactly what I pointed out in the writeup. The point is 
not to commit to a particular kind of cold storage and consume the data into 
it. The point is to have the option to take a backup of the segment files and 
consume them later, when / if necessary. The raw data is already in Kafka, in 
the most optimal format for re-consumption. Having a backup of the segment 
files frees the operator from:

- consuming the data (utilizing resources for consumption) and committing to a 
particular storage format
- injecting the data back into Kafka for re-consumption

The point of this ticket is to discuss a structured, out-of-the-box method for 
taking backups of the segment files, instead of relying on the chance that the 
files are still there when the operator wants to back them up. In other words, 
it would be great if the deletion of expired segment files could be controlled 
a little more, allowing operators to back them up with 100% accuracy.
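
For illustration, this is the kind of re-injection step that a backup of the 
raw segment files would make unnecessary; the file path, topic name and 
one-record-per-line format are invented for this sketch:

{code:java}
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch of the second cost listed above: re-injecting archived data into
// Kafka for re-consumption. Path, topic and record format are hypothetical.
public class ReinjectFromColdStorage {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String line : Files.readAllLines(Paths.get("/mnt/cold-storage/events.txt"))) {
                producer.send(new ProducerRecord<>("events-replay", line));
            }
        }
    }
}
{code}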



[jira] [Created] (KAFKA-3726) Enable cold storage option

2016-05-18 Thread Radoslaw Gruchalski (JIRA)
Radoslaw Gruchalski created KAFKA-3726:
--

 Summary: Enable cold storage option
 Key: KAFKA-3726
 URL: https://issues.apache.org/jira/browse/KAFKA-3726
 Project: Kafka
  Issue Type: Wish
Reporter: Radoslaw Gruchalski


This JIRA builds on the cold storage article I have published on Medium. A 
copy of the article is attached here.

The need for cold storage, or an "indefinite" log, comes up quite often on the 
user mailing list.

The cold storage idea would give the operator the option to keep the raw Kafka 
segment files in third-party storage and to retrieve the data back later for 
re-consumption.

Two possible options for enabling such functionality, taken from the article, are:

First approach: if Kafka provided a notification mechanism and could trigger a 
program when a segment file is to be discarded, it would become feasible to 
provide a standard method of moving data to cold storage in reaction to those 
events. Once the program finishes backing the segments up, it could tell Kafka 
“it is now safe to delete these segments”.
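
To make the first approach concrete, a purely hypothetical sketch of what such 
a notification hook could look like follows; no such interface exists in 
Kafka, and all names are invented:

{code:java}
import java.io.File;

// Purely hypothetical notification contract for the first approach; no such
// interface exists in Kafka. The broker would call the listener before
// discarding a segment and wait for an acknowledgement before deleting it.
public interface SegmentEventListener {

    // Invoked when a segment is rolled and eligible for deletion.
    // The implementation backs the file up, then acknowledges.
    void onSegmentEligibleForDeletion(String topic, int partition,
                                      File segmentFile, DeletionAck ack);

    interface DeletionAck {
        // "It is now safe to delete these segments."
        void safeToDelete();
    }
}
{code}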

The second option is to provide an additional value for the 
{{log.cleanup.policy}} setting, call it {{cold-storage}}. With this value, 
Kafka would move the segment files, which would otherwise be deleted, to 
another destination on the server. They can be picked up from there and moved 
to cold storage.
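
A hypothetical broker configuration for this option could look as follows; 
{{cold-storage}} is not a valid {{log.cleanup.policy}} value in Kafka, and the 
destination property is invented for illustration:

{code}
# Hypothetical configuration -- not supported by Kafka today.
# Instead of deleting expired segments, move them aside for pickup.
log.cleanup.policy=cold-storage
# Invented property: where moved segments would be placed on the broker.
log.cold.storage.dir=/var/kafka-cold-storage
{code}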

Both have their limitations. The former is simply a mechanism exposed to allow 
the operator to build the tooling necessary to enable this. Events could be 
published in a manner similar to the Marathon event bus 
(https://mesosphere.github.io/marathon/docs/event-bus.html), or Kafka itself 
could provide a control topic on which such info would be published. The 
outcome is that the operator can subscribe to the event bus and get notified 
about, at least, two events (a consumer sketch follows the list below):

- log segment is complete and can be backed up
- partition leader changed
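
If such a control topic existed, the operator's tooling could watch it roughly 
as sketched below; the {{__segment_events}} topic name and the event payloads 
are invented for illustration:

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical: watch a broker-provided control topic for segment lifecycle
// events. Neither the "__segment_events" topic nor the event format exists
// in Kafka; both are invented to illustrate the proposal.
public class SegmentEventWatcher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cold-storage-tooling");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("__segment_events"));
            while (true) {
                ConsumerRecords<String, String> events = consumer.poll(1000);
                for (ConsumerRecord<String, String> event : events) {
                    // e.g. {"type":"SEGMENT_COMPLETE",...} or {"type":"LEADER_CHANGED",...}
                    System.out.println("segment event: " + event.value());
                }
            }
        }
    }
}
{code}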

These two, together with an option to keep the log segment safe from compaction 
for a certain amount of time, would be sufficient to reliably implement cold 
storage.

The latter option, the {{log.cleanup.policy}} setting, would be a more 
complete feature, but it is also much more difficult to implement. All brokers 
would have to keep a backup of their data for cold storage, significantly 
increasing the size requirements; also, de-duplication of the replicated data 
would be left completely to the operator.

In any case, the thing to stay away from is having Kafka deal with the 
physical aspect of moving the data to and back from cold storage. This is not 
Kafka's task. The intent is to provide a method for reliable cold storage.





[jira] [Updated] (KAFKA-3726) Enable cold storage option

2016-05-18 Thread Radoslaw Gruchalski (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Radoslaw Gruchalski updated KAFKA-3726:
---
Attachment: kafka-cold-storage.txt

A text version of the article mentioned above.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)