Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15407
This doesn't affect correctness (only the highest offset for a given
partition is used in any case), just memory leaks. I'm not sure what a good
way to unit test memory leaks is, short
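For context, a minimal Scala sketch of the iterate-versus-drain distinction being described; the queue element type and names are placeholders, not the actual DStream code:

import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.mutable

// Placeholder queue of (partition, offset) pairs standing in for the commit queue.
val commitQueue = new ConcurrentLinkedQueue[(String, Long)]()
commitQueue.add(("topic-0", 10L))
commitQueue.add(("topic-0", 42L))

val highest = mutable.Map[String, Long]()

// Iterating would read the entries but leave them in the queue, so they
// accumulate across batches; poll() removes each entry as it is read.
var entry = commitQueue.poll()
while (entry != null) {
  val (tp, off) = entry
  // only the highest offset per partition matters for the eventual commit
  highest(tp) = math.max(off, highest.getOrElse(tp, off))
  entry = commitQueue.poll()
}
// highest now holds the greatest offset seen per partition and commitQueue is empty.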
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15504#discussion_r83551653
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -232,6 +232,42 @@ private[kafka010] case class
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15504
[SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for
structured stream
## What changes were proposed in this pull request?
startingOffsets takes specific per-topicpartition
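As a rough illustration of what per-topicpartition starting offsets could look like from the user side, assuming a JSON-style option on the structured streaming Kafka source; the option names and JSON shape here are an assumption based on the PR title, not necessarily the final syntax:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-starting-offsets-sketch").getOrCreate()

// Hypothetical usage: assign two partitions of topicA and start partition 0
// at offset 23 while letting partition 1 start from the latest offset (-1).
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("assign", """{"topicA":[0,1]}""")
  .option("startingOffsets", """{"topicA":{"0":23,"1":-1}}""")
  .load()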
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15397
LGTM, thanks for talking it through
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15397#discussion_r83289086
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -270,7 +281,7 @@ private[kafka010] case class
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15397#discussion_r83289072
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -256,8 +269,6 @@ private[kafka010] case class
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15397
My main point is that whoever implements SPARK-17812 is going to have to
deal with the issue shown in SPARK-17782, which means much of this patch is
going to need to be changed anyway
Github user koeninger closed the pull request at:
https://github.com/apache/spark/pull/15387
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15401
@zsxwing so poll is only called in the consumer strategy when starting
offsets have been provided, and seek is called immediately thereafter
for those offsets. What is the specific
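A minimal sketch of the poll-then-seek sequence being described, assuming a plain KafkaConsumer and a user-supplied offsets map (placeholders, not the PR's code):

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def seekToUserOffsets(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    offsets: Map[TopicPartition, Long]): Unit = {
  // poll(0) establishes the assignment without intending to keep any records...
  consumer.poll(0)
  // ...and seek immediately overrides whatever position poll left behind.
  offsets.foreach { case (tp, off) => consumer.seek(tp, off) }
}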
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15442
[SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is
bad
## What changes were proposed in this pull request?
Documentation fix to make it clear that reusing group
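A generic example of the documented advice, assuming a DStream-style parameter map with a group.id that is unique per stream; the map itself is illustrative, not text from the PR:

import java.util.UUID
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "host1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  // unique per stream, so separate jobs never share consumer group state
  "group.id" -> s"my-stream-${UUID.randomUUID()}"
)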
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r82857280
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
---
@@ -0,0 +1,244 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
If you're worried about it then accept the alternative PR I linked.
On Sun, Oct 9, 2016 at 11:37 PM, Shixiong Zhu <notificati...@github.com>
wrote:
> During the
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15397
Look at the poll/seek implementation in the DStream's Subscribe and
SubscribePattern strategies when user offsets are provided, i.e. the problem that
triggered this ticket to begin with. You're going
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15407
[SPARK-17841][STREAMING][KAFKA] drain commitQueue
## What changes were proposed in this pull request?
Actually drain commit queue rather than just iterating it
## How
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
Let me know if you guys like that alternative PR better
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15401
[SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of
poll twice
## What changes were proposed in this pull request?
Alternative approach to https://github.com/apache
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
If the concern is TD's comment,
"Future calls to {@link #poll(long)} will not return any records from these
partitions until they have been resumed using {@link #resume(Colle
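A short sketch of the poll-0-then-pause idea being debated here, assuming a plain KafkaConsumer on the driver (not the patch's actual code):

import org.apache.kafka.clients.consumer.KafkaConsumer

def paranoidPoll(consumer: KafkaConsumer[Array[Byte], Array[Byte]]): Unit = {
  consumer.poll(0)                       // join the group / refresh assignment
  consumer.pause(consumer.assignment())  // paused partitions return no records
  // positions can still be inspected or moved with position()/seek() while
  // paused, and partitions can later be resumed with consumer.resume(...)
}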
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15397
How is this going to work with assign? It seems like it's just avoiding
the problem, not fixing it.
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
Poll also isn't going to return you just messages for a single
topicpartition, so to do what you're suggesting you'd have to go through
all the messages and do additional processing
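To illustrate the point, a hedged sketch of what isolating one partition from a poll result would involve (placeholder consumer and types):

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def valuesForOnePartition(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    wanted: TopicPartition): Seq[Array[Byte]] = {
  val batch = consumer.poll(100)                       // may span many partitions
  batch.records(wanted).asScala.map(_.value()).toSeq   // extra per-partition filtering
}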
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
You don't want poll consuming messages; it's not just about offset
correctness, the driver shouldn't be spending time or bandwidth doing that.
What is the substantive concern
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
I set auto commit to false, and still recreated the test failure.
That makes sense to me; the consumer position should still be getting updated
in memory even if it isn't saved to storage
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15387
I'm not going to say anything is impossible, which is the point of the
assert. If it does somehow happen, it will be at startup, so it should be
obvious.
The whole poll 0 / pause thing
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15367
Does backporting reduce the likelihood of change if user feedback indicates
we got it wrong?
My technical concerns were largely addressed, that's my big remaining
organizational concern
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15355
@zsxwing good eye, thanks. It's not that auto.offset.reset=earliest
doesn't work, it's that there's a potential race condition where poll gets
called twice slowly enough for consumer position
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15387
[SPARK-17782][STREAMING][KAFKA] eliminate race condition of poll twice
## What changes were proposed in this pull request?
Kafka consumers can't subscribe or maintain heartbeat without
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15132
@jnadler Have you had a chance to try this out and see whether it addresses
your issue?
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15355
I have generally been unable to reproduce these kinds of test failures in
my local environment, and don't have access to the build server, so trying to fix
without a repro is pretty much shooting
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r81642196
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
---
@@ -0,0 +1,252 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r81623277
--- Diff: docs/structured-streaming-kafka-integration.md ---
@@ -0,0 +1,185 @@
+---
+layout: global
+title: Structured Streaming + Kafka
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r81622511
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,281 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15307
@tdas I wasn't asking if this patch was intended to implement rate
limiting; I was asking if it was intended to be used to support the
implementation of rate limiting in the future. In other
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r81615342
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
---
@@ -185,25 +203,36 @@ class StreamExecution
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r8163
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
---
@@ -0,0 +1,252 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r81590359
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetrics.scala
---
@@ -0,0 +1,252 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r81584577
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
---
@@ -273,8 +304,14 @@ class StreamExecution
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15307#discussion_r81582674
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala
---
@@ -185,25 +203,36 @@ class StreamExecution
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> It would be nice to be able to do something other than earliest/latest.
That's what Assign and the starting offset arguments to the Subscribe
strateg
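For reference, a sketch of how the existing DStream-side strategies already accept explicit per-partition starting offsets (parameter values are placeholders):

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "host1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// start topicA partition 0 at offset 23 instead of earliest/latest
val offsets = Map(new TopicPartition("topicA", 0) -> 23L)
val strategy = ConsumerStrategies.Assign[String, String](offsets.keys.toList, kafkaParams, offsets)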
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
Ok, so this kind of thing is why I was concerned about the
copy-paste-and-randomly-change-things approach to developing this module.
> (5) Topics are deleted when a Spark job is runi
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80793355
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/13998
I ran that test 100 times locally w/out error... do you have any suggestions
on a repro?
On Mon, Sep 26, 2016 at 6:40 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80617904
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,344 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
Ok, finished a line-by-line compare + comment.
The biggest thing I'm having trouble reconciling is the stated emphasis on
limiting user options in order to give guarantees, yet throwing
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80617145
--- Diff: external/kafka-0-10-sql/pom.xml ---
@@ -0,0 +1,82 @@
+
+
+
+<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80617036
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala
---
@@ -0,0 +1,163 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80616878
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala
---
@@ -0,0 +1,163 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80616386
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala
---
@@ -0,0 +1,163 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80616098
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceRDD.scala
---
@@ -0,0 +1,163 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80615842
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80615307
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80614899
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80614481
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80613358
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80613155
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSourceProvider.scala
---
@@ -0,0 +1,263 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80612809
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,344 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80612619
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80612293
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80611556
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80611465
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80610814
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80610748
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80610562
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80610210
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80609930
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80609718
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80609252
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80609099
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,366 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80607976
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80607882
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80607795
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80607406
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/CachedKafkaConsumer.scala
---
@@ -0,0 +1,155 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80606631
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -149,6 +160,14 @@ private[kafka010] case class
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r80605915
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,344 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/13998
Sure I'll give it another look
On Sep 26, 2016 3:46 PM, "Tathagata Das" <notificati...@github.com> wrote:
> @koeninger <https://github.com/koeninger&
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
Source.getOffset returns an SQL Offset, not a Kafka offset. At this point
the current plan is to remove the ordering requirement for SQL Offset,
which makes the whole discussion
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
We aren't comparing ordering of offsets across partitions, and I don't
think that was ever in consideration. At this point the most likely
candidate for the global ordering is implicit
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15207
LGTM.
You probably already checked this, but FWIW I verified the kafka topic
deletion test does pass once this is merged:
https://github.com/koeninger/spark-1/tree/kafka-source
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> I agree that if/when we add that ability to add existing partitions
midstream we'd probably need to add two offsets in to the SQL offset for new
partitions.
It's not just exist
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
@tdas I think as long as marmbrus' PR to remove comparable from the
interface works for sane variations of subscription changes it's the best way
to go. I'm honestly fine with someone getting
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> For streaming you already know what the global order is, because you know
when you asked for A and B. I agree that we should probably remove the
comparable requirement from Offset in fa
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
@tdas moving this conversation back to the PR that's linked from the public
jira
> yeah, i am trying to figure out all the options and write up something to
so that we are cl
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
This is pretty much the fundamental issue. Kafka offsets alone aren't
capable of meeting the SQL Offset interface as defined. I think that means
the Offset interface needs
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> I'd want to see some test cases though that show why the current
implementation is wrong from an end-user perspective if it needs to block
merging initial kafka support.
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> We are not giving the developer the option to manually configure a
consumer in this way for this PR precisely because I don't think we can while
still maintaining the semantics that structu
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
My fork is not following auto.offset.reset, it's following what the
(potentially user-provided) consumer does when it sees a new partition.
Maybe that's auto.offset.reset, maybe it's
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
Absent a direct answer, I'm going to read that as "Yes I've made up my
mind, no I will not consider PRs to the contrary."
If you want pointers to what I think the correct impl
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r79680339
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,446 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
You should not be assuming 0 for a starting offset for partitions you've
just learned about. You should be asking the underlying driver consumer what
its position is. This is yet another
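A hedged sketch of the suggestion, assuming the driver consumer is asked for the new partition's position rather than defaulting to 0; names are placeholders, and the real behavior depends on how the user configured the consumer:

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

def startingOffsetFor(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    tp: TopicPartition): Long = {
  // assumes tp is already assigned to this consumer; position() reflects
  // auto.offset.reset, committed offsets, or a prior seek
  consumer.position(tp)
}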
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15132
I personally wouldn't encourage people to use receivers unless they had a
very specific reason to. The kafka-0-10 submodule doesn't even have a way
to create receiver based streams, so I'm
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
I'm not concerned about people deleting partitions before messages have
been processed, because they can take care of that problem themselves, by
not deleting things until consuming has
GitHub user koeninger opened a pull request:
https://github.com/apache/spark/pull/15132
[SPARK-17510][STREAMING][KAFKA] config max rate on a per-partition basis
## What changes were proposed in this pull request?
Allow configuration of max rate on a per-topicpartition basis
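A rough sketch of what a per-topicpartition rate limit could look like; the trait and method names here are illustrative of the idea, not quoted from the patch:

import org.apache.kafka.common.TopicPartition

trait ExamplePerPartitionRate {
  def maxRatePerPartition(tp: TopicPartition): Long  // max messages per second for tp
}

// e.g. throttle one hot partition harder than the default
val example = new ExamplePerPartitionRate {
  def maxRatePerPartition(tp: TopicPartition): Long =
    if (tp.topic == "hot-topic" && tp.partition == 0) 100L else 1000L
}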
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
> I pushed for this code to be copied rather than refactored because I
think this is the right direction long term. While it is nice to minimize
inter-project dependencies, that is not rea
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r79094763
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,446 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r79093876
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,446 @@
+/*
+ * Licensed
Github user koeninger commented on a diff in the pull request:
https://github.com/apache/spark/pull/15102#discussion_r79092976
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaSource.scala
---
@@ -0,0 +1,446 @@
+/*
+ * Licensed
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
* This already does depend on most of the existing Kafka DStream
implementation. The fact that most of it was copied wholesale proves that. If
you're just saying that you don't want it to have
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/15102
Just from a very brief look, this duplicates egregious amounts of code from
the existing Kafka submodule and doesn't handle offset ordering correctly when
topicpartitions change.
I'd -1
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/14981
Yeah, I don't know of an easier workaround.
My understanding is also that the asf concern is tied to distribution, so
not publishing to maven should be sufficient.
On Sep 14
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/14981
Yeah, if it's cleaner to remove the kinesis-asl-assembly module I don't
think it's a serious hardship to users.
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/14981
Isn't that mostly down to whether someone wants to just put the whole
assembly on their classpath, vs install the project and depend on it in their
build tool? I can see why someone would want
Github user koeninger commented on the issue:
https://github.com/apache/spark/pull/14981
My 2 cents are that we should make things as easy as possible for users,
within the bounds of what ASF legal is willing to tolerate ;) Which probably
means having it exist, but not published