Thanks for the replies. Regarding skipping the WAL, it's not just about optimization. If you actually want exactly-once semantics, you need control of Kafka offsets as well, including the ability to not use ZooKeeper as the system of record for offsets. Kafka already is a reliable system that has strong ordering guarantees (within a partition) and does not mandate the use of ZooKeeper to store offsets. I think there should be a Spark API that acts as a very simple intermediary between Kafka and the user's choice of downstream store.
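To make that concrete, here's a rough sketch in Scala of what I mean by letting the downstream store, rather than ZooKeeper, be the system of record for offsets. The helper names (fetchFromKafka, transform, insertResults, updateOffsets) and the JDBC URL are placeholders, not an existing Spark or Kafka API:

// Sketch only: the fetch/transform/insert helpers below are placeholders
// for whatever consumer and downstream store you use, not a real API.
import java.sql.DriverManager

case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// placeholders -- fill in with a Kafka simple consumer and your own logic
def fetchFromKafka(ranges: Seq[OffsetRange]): Seq[Array[Byte]] = ???
def transform(msg: Array[Byte]): String = ???
def insertResults(conn: java.sql.Connection, rows: Seq[String]): Unit = ???
def updateOffsets(conn: java.sql.Connection, ranges: Seq[OffsetRange]): Unit = ???

def processBatch(ranges: Seq[OffsetRange]): Unit = {
  val messages = fetchFromKafka(ranges)        // read an explicit, bounded offset range
  val results  = messages.map(transform)

  val conn = DriverManager.getConnection("jdbc:postgresql://db/app")
  conn.setAutoCommit(false)
  try {
    insertResults(conn, results)               // write the output
    updateOffsets(conn, ranges)                // store untilOffsets in the same transaction
    conn.commit()                              // both or neither
  } catch {
    case e: Exception => conn.rollback(); throw e
  } finally {
    conn.close()
  }
}

The important part is the transaction boundary: results and offsets commit together. On restart you read the offsets table, not ZooKeeper, to find out where to resume; a batch that failed before commit left the stored offsets untouched, so the same range is simply re-read and re-written, and the downstream store never double-counts.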
Take a look at the links I posted - if there have already been two independent implementations of the idea, chances are it's something people need.

On Thu, Dec 18, 2014 at 1:44 PM, Hari Shreedharan <hshreedha...@cloudera.com> wrote:
>
> Hi Cody,
>
> I am an absolute +1 on SPARK-3146. I think we can implement something pretty simple and lightweight for that one.
>
> For the Kafka DStream skipping the WAL implementation - this is something I discussed with TD a few weeks ago. Though it is a good idea to implement this to avoid unnecessary HDFS writes, it is an optimization. For that reason, we must be careful in implementation. There are a couple of issues that we need to ensure work properly - specifically ordering. To ensure we pull messages from different topics and partitions in the same order after failure, we'd still have to persist the metadata to HDFS (or some other system) - this metadata must contain the order of messages consumed, so we know how to re-read the messages. I am planning to explore this once I have some time (probably in Jan). In addition, we must also ensure bucketing functions work fine. I will file a placeholder jira for this one.
>
> I also wrote an API to write data back to Kafka a while back - https://github.com/apache/spark/pull/2994 . I am hoping that this will get pulled in soon, as this is something I know people want. I am open to feedback on that - anything that I can do to make it better.
>
> Thanks,
> Hari
>
>
> On Thu, Dec 18, 2014 at 11:14 AM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Hey Cody,
>>
>> Thanks for reaching out with this. The lead on streaming is TD - he is traveling this week though so I can respond a bit. To the high-level point of whether Kafka is important - it definitely is. Something like 80% of Spark Streaming deployments (anecdotally) ingest data from Kafka. Also, good support for Kafka is something we generally want in Spark and not a library. In some cases IIRC there were user libraries that used unstable Kafka APIs and we were somewhat waiting on Kafka to stabilize them to merge things upstream. Otherwise users wouldn't be able to use newer Kafka versions. This is a high-level impression only though; I haven't talked to TD about this recently, so it's worth revisiting given the developments in Kafka.
>>
>> Please do bring things up like this on the dev list if there are blockers for your usage - thanks for pinging it.
>>
>> - Patrick
>>
>> On Thu, Dec 18, 2014 at 7:07 AM, Cody Koeninger <c...@koeninger.org> wrote:
>> > Now that 1.2 is finalized... who are the go-to people to get some long-standing Kafka-related issues resolved?
>> >
>> > The existing API is not sufficiently safe or flexible for our production use. I don't think we're alone in this viewpoint, because I've seen several different patches and libraries to fix the same things we've been running into.
>> >
>> > Regarding flexibility
>> >
>> > https://issues.apache.org/jira/browse/SPARK-3146
>> >
>> > has been outstanding since August, and IMHO an equivalent of this is absolutely necessary. We wrote a similar patch ourselves, then found that PR and have been running it in production. We wouldn't be able to get our jobs done without it. It also allows users to solve a whole class of problems for themselves (e.g. SPARK-2388, arbitrary delay of messages, etc).
>> >
>> > Regarding safety, I understand the motivation behind WriteAheadLog as a general solution for streaming unreliable sources, but Kafka already is a reliable source. I think there's a need for an API that treats it as such. Even aside from the performance issues of duplicating the write-ahead log in Kafka into another write-ahead log in HDFS, I need exactly-once semantics in the face of failure (I've had failures that prevented reloading a Spark Streaming checkpoint, for instance).
>> >
>> > I've got an implementation I've been using:
>> >
>> > https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka
>> >
>> > Tresata has something similar at https://github.com/tresata/spark-kafka, and I know there were earlier attempts based on Storm code.
>> >
>> > Trying to distribute these kinds of fixes as libraries rather than patches to Spark is problematic, because large portions of the implementation are private[spark].
>> >
>> > I'd like to help, but I need to know whose attention to get.