Re: Am I crazy, or does the binary distro not have Kafka integration?
Yes, it's a reasonable argument that putting N more external integration modules on the default spark-submit classpath might bring in third-party dependencies that clash. I don't think the convenience factor is a big deal; users can just declare a dependency on said module in their own app, once. It does seem like we could at least *ship* the binary bits in "external-jars/" or something; they're not even compiled in the binary distro. It also means users have to make sure the version of spark-kafka they integrate works with their cluster: not just that their app matches the user-facing API of spark-kafka, but that the spark-kafka module's interface to Spark works -- whatever internal details there may be there.

On Sat, Aug 4, 2018 at 9:15 PM Matei Zaharia wrote:
Re: Am I crazy, or does the binary distro not have Kafka integration?
I think that traditionally, the reason *not* to include these has been if they brought additional dependencies that users don’t really need, but that might clash with what the users have in their own app. Maybe this used to be the case for Kafka. We could analyze it and include it by default, or perhaps make it easier to add it in spark-submit and spark-shell. I feel that in an IDE, it won’t be a huge problem because you just add it once, but it is annoying for spark-submit.

Matei

On Aug 4, 2018, at 2:19 PM, Sean Owen wrote:
Re: Am I crazy, or does the binary distro not have Kafka integration?
Hm OK, I am crazy then. I think I never noticed it because I had always used a distro that did actually supply this on the classpath.

Well ... I think it would be reasonable to include these things (at least, Kafka integration) by default in the binary distro. I'll update the JIRA to reflect that this is at best a Wish.

On Sat, Aug 4, 2018 at 4:17 PM Jacek Laskowski wrote:
Re: Am I crazy, or does the binary distro not have Kafka integration?
Hi Sean,

For years, I'd say, you've had to specify --packages to get the Kafka-related jars on the classpath. I simply got used to this annoyance (as did others). Could it be because it's an external package (although an integral part of Spark)?!

I'm very glad you've brought it up, since I think the Kafka data source is so important that it should be included in spark-shell and spark-submit by default. THANKS!

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Sat, Aug 4, 2018 at 9:56 PM, Sean Owen wrote:
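The --packages workaround described above looks roughly like this in practice. A sketch, not a definitive invocation: the application class and jar name are placeholders, and the coordinate assumes the Scala 2.11 build of Spark 2.3.1.

```shell
# Fetch the Kafka integration module (and its transitive dependencies) from
# Maven Central at submit time, since the jar is not bundled in the distro.
# "com.example.MyStreamingApp" and "my-streaming-app.jar" are placeholders.
spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```

The same --packages flag works with spark-shell, which is why the annoyance shows up in both.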
Re: Am I crazy, or does the binary distro not have Kafka integration?
Let's take this to https://issues.apache.org/jira/browse/SPARK-25026 -- I provisionally marked this a Blocker, as if it's correct, then the release is missing an important piece and we'll want to remedy that ASAP. I still have this feeling I am missing something. The classes really aren't there in the release but ... *nobody* noticed all this time? I guess maybe Spark-Kafka users may be using a vendor distro that does package these bits.

On Sat, Aug 4, 2018 at 10:48 AM Sean Owen wrote:
Am I crazy, or does the binary distro not have Kafka integration?
I was debugging why a Kafka-based streaming app doesn't seem to find the Kafka-related integration classes when run standalone from our latest 2.3.1 release, and noticed that there don't seem to be any Kafka-related jars from Spark in the distro. In jars/, I see:

spark-catalyst_2.11-2.3.1.jar
spark-core_2.11-2.3.1.jar
spark-graphx_2.11-2.3.1.jar
spark-hive-thriftserver_2.11-2.3.1.jar
spark-hive_2.11-2.3.1.jar
spark-kubernetes_2.11-2.3.1.jar
spark-kvstore_2.11-2.3.1.jar
spark-launcher_2.11-2.3.1.jar
spark-mesos_2.11-2.3.1.jar
spark-mllib-local_2.11-2.3.1.jar
spark-mllib_2.11-2.3.1.jar
spark-network-common_2.11-2.3.1.jar
spark-network-shuffle_2.11-2.3.1.jar
spark-repl_2.11-2.3.1.jar
spark-sketch_2.11-2.3.1.jar
spark-sql_2.11-2.3.1.jar
spark-streaming_2.11-2.3.1.jar
spark-tags_2.11-2.3.1.jar
spark-unsafe_2.11-2.3.1.jar
spark-yarn_2.11-2.3.1.jar

I checked make-distribution.sh, and it copies a bunch of JARs into the distro, but does not seem to touch the Kafka modules.

Am I crazy or missing something obvious -- those should be in the release, right?
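The observation above can be checked directly against an unpacked binary distribution. A minimal sketch, assuming SPARK_HOME points at the unpacked 2.3.1 release directory:

```shell
# List the bundled Spark jars and search for any Kafka integration module.
# Against the stock 2.3.1 binary distro this finds nothing, matching the
# observation above; a vendor distro may package these jars and behave
# differently.
ls "$SPARK_HOME/jars" | grep -i kafka || echo "no Kafka jars bundled"
```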