GitHub user harishreedharan opened a pull request: https://github.com/apache/spark/pull/1192

    SPARK-1730. Make receiver store data reliably to avoid data-loss on executor failures.

    Added a new method in Receiver, ReceiverSupervisor, ReceiverSupervisorImpl to store the data and callback a supplied function with a given argument.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/harishreedharan/spark persist-data

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1192.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1192

----
commit 6d6776a45f30e3594a15bda2582f99819c28a583
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-09T06:16:56Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Currently Spark uses Flume's internal Avro Protocol to ingest data from Flume. If the executor running the
    receiver fails, it currently has to be restarted on the same node to be able to receive data.

    This commit adds a new Sink which can be deployed to a Flume agent. This sink can be polled by a new
    DStream that is also included in this commit. This model ensures that data can be pulled into Spark from
    Flume even if the receiver is restarted on a new node. This also allows the receiver to receive data on
    multiple threads for better performance.

commit d24d9d47795fe0a81fa2d70a4f81c24d2efd8914
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-18T07:58:45Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Update to the previous patch fixing some error cases and also excluding Netty dependencies. Also updated the unit tests.

commit 08176adc2a1a4f17562f486e0f897abfb7eba84d
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-18T08:06:22Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Exclude IO Netty in the Flume sink.

commit 03d6c1c45bb5e1e00ba0a3618b920481ec3ec51a
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-19T16:24:55Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Removing previousArtifact from build spec, so that the build runs fine.

commit 8df37e4911f74253a901502c9232c3db26dc8856
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-20T06:09:02Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Updated Maven build to be equivalent of the sbt build.

commit 87775aa52e21804680ed43dc4f789adf718ddb6c
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-21T00:42:40Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Fix build with maven.

commit 0f10788487f10234aa39277d4c20556f7c846796
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-05-24T08:32:32Z

    SPARK-1729. Make Flume pull data from source, rather than the current push model

    Added support for polling several Flume agents from a single receiver.

commit c604a3c0fee085679967460f50b563a8d58aedf1
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-06-05T16:17:05Z

    SPARK-1729. Optimize imports.

commit 9741683173c5dad3148c77d1a0f47b92387b8bdc
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-06-06T06:38:12Z

    SPARK-1729. Fixes based on review.

commit e7da5128be13130538e41fb5e976089e93f1e149
Author: Hari Shreedharan <hshreedha...@apache.org>
Date:   2014-06-06T06:43:13Z

    SPARK-1729. Fixing import order

commit d6fa3aa25e21be508c695067a858afd0d3ddbd64
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-10T05:27:19Z

    SPARK-1729. New Flume-Spark integration.

    Made the Flume Sink considerably simpler. Added a lot of documentation.

commit 70bcc2ad5b117324652e41f0331eb974ab696966
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-10T05:34:40Z

    SPARK-1729. New Flume-Spark integration.

    Renamed the SparkPollingEvent to SparkFlumePollingEvent.

commit 3c23c182fd8655e0f1a64cee64641f1cc803f7c2
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-10T23:20:40Z

    SPARK-1729. New Spark-Flume integration.

    Minor formatting changes.

commit 0d69604ae319610b9fde1b3a77fd8130f70b4ec2
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-16T19:44:12Z

    FLUME-1729. Better Flume-Spark integration.

    Use readFully instead of read in EventTransformer.

commit bda01fc18daae511603a526ca5fcd2ada97a3de4
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-17T22:15:36Z

    FLUME-1729. Flume-Spark integration.

    Refactoring classes into new files and minor changes in protocol.

commit 4b0c7fcdf654023f56d3e85b8d52ee1d049d8c65
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-18T05:47:49Z

    FLUME-1729. New Flume-Spark integration.

    Avro does not support inheritance, so the error message needs to be part of the message itself.

commit 205034dc78a8bda62e373101275cae1870875a21
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-18T06:32:01Z

    Merging master in

commit e13fab50a38d88f11021282e0da55dcaeab5a20c
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-24T07:41:20Z

    SPARK-1730. Make receiver store data reliably to avoid data-loss on executor failures.

    Added a new method in Receiver, ReceiverSupervisor, ReceiverSupervisorImpl to store the data and callback a supplied function with a given argument.

commit 038b644f1b35ffe10d13ef830e0baa02d5ef7bef
Author: Hari Shreedharan <harishreedha...@gmail.com>
Date:   2014-06-24T07:44:23Z

    Merge remote-tracking branch 'origin/master' into persist-data

----

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
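
For readers skimming the summary of SPARK-1730 above, the "store the data and callback a supplied function with a given argument" idea can be pictured with the rough sketch below. It is not the code in this pull request: the patch itself is Scala inside Spark, and every name here (InMemoryBlockStore, SketchSupervisor, store_reliably) is hypothetical and chosen only for illustration. The one property the sketch tries to show is ordering: the callback runs only after the store call has returned, so a source such as a Flume sink would be acknowledged only once the data is held.

```python
from typing import Any, Callable, Dict, List


class InMemoryBlockStore:
    """Hypothetical stand-in for whatever storage sits behind the supervisor."""

    def __init__(self) -> None:
        self.blocks: Dict[int, List[Any]] = {}
        self._next_id = 0

    def put(self, records: List[Any]) -> int:
        # Copy the records so later mutation by the caller does not affect the block.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = list(records)
        return block_id


class SketchSupervisor:
    """Hypothetical supervisor exposing a store-then-callback method."""

    def __init__(self, store: InMemoryBlockStore) -> None:
        self.store = store

    def store_reliably(
        self,
        records: List[Any],
        callback: Callable[[Any], None],
        arg: Any,
    ) -> int:
        # Store first. If put() raises, the callback below never runs,
        # so the source is not told that the data is safe.
        block_id = self.store.put(records)
        # Only after the store call has returned do we invoke the
        # supplied function with the supplied argument.
        callback(arg)
        return block_id


# Usage: acknowledge a (hypothetical) transaction id once its events are stored.
acked: List[str] = []
supervisor = SketchSupervisor(InMemoryBlockStore())
block_id = supervisor.store_reliably(["event-1", "event-2"], acked.append, "txn-42")
```

In this sketch the argument is an opaque token (here a made-up transaction id), which is one plausible reading of "a given argument" in the summary; the actual method signatures are in the patch linked above.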