Would you mind if I ask what the criteria are for being a public API? The Source and Sink traits are not marked as @DeveloperApi, but they are defined as public and live in sql-core, so they are not even semantically private (unlike catalyst), which easily signals that they are public APIs.
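For context, a bare skeleton of such a source against the 2.4 branch looks roughly like the sketch below (DemoSource and com.example.streaming are made-up names; only Source, Offset and the sql types are real Spark APIs):

// Bare-bones skeleton of a custom DSv1 streaming Source against Spark 2.4.x.
// DemoSource and com.example.streaming are hypothetical; Source and Offset are
// public classes in sql-core even though they carry no API annotation.
package com.example.streaming

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class DemoSource(sqlContext: SQLContext) extends Source {

  override def schema: StructType = StructType(Seq(StructField("value", LongType)))

  // No data yet; a real source would track its latest available offset here.
  override def getOffset: Option[Offset] = None

  // The hard part: the DataFrame returned here must have isStreaming == true
  // (see the assert from MicroBatchExecution quoted at the bottom of this
  // thread), and in 2.4 there is no public API to build such a DataFrame.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = ???

  override def stop(): Unit = {}
}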
Also, unless I'm missing something, creating a streaming DataFrame from an RDD[Row] is not possible even with the private APIs. There are a couple of approaches that use private APIs: 1) SQLContext.internalCreateDataFrame - since it requires an RDD[InternalRow], implementers also have to depend on catalyst and deal with InternalRow, which the Spark community apparently wants to change eventually; 2) Dataset.ofRows - it requires a LogicalPlan, which is also in catalyst. So implementers not only have to apply the "package hack" but also have to depend on catalyst (a rough sketch of both approaches is appended at the bottom of this thread).

On Mon, Oct 7, 2019 at 9:45 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> AFAIK there is no public streaming data source API before DS v2. The Source
> and Sink API is private and is only for built-in streaming sources. Advanced
> users can still implement custom stream sources with private Spark APIs (you
> can put your classes under the org.apache.spark.sql package to access the
> private methods).
>
> That said, DS v2 is the first public streaming data source API. It's really
> hard to design a stable, efficient and flexible data source API that is
> unified between batch and streaming. DS v2 has evolved a lot in the master
> branch and hopefully there will be no big breaking changes anymore.
>
> On Sat, Oct 5, 2019 at 12:24 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> I remembered an actual case from a developer who implements a custom data
>> source:
>>
>> https://lists.apache.org/thread.html/c1a210510b48bb1fea89828c8e2f5db8c27eba635e0079a97b0c7faf@%3Cdev.spark.apache.org%3E
>>
>> Quoting here:
>> We started implementing DSv2 in the 2.4 branch, but quickly discovered
>> that the DSv2 in 3.0 was a complete breaking change (to the point where it
>> could have been named DSv3 and it wouldn’t have come as a surprise). Since
>> the DSv2 in 3.0 has a compatibility layer for DSv1 datasources, we decided
>> to fall back to DSv1 in order to ease the future transition to Spark 3.
>>
>> Given that DSv2 for Spark 2.x and 3.x have diverged a lot, a realistic way
>> of dealing with the DSv2 breaking change is to keep DSv1 as a temporary
>> solution, even once DSv2 for 3.x is available. They need some time to make
>> the transition.
>>
>> I'll file an issue to support streaming data sources in DSv1 and submit a
>> patch unless someone objects.
>>
>> On Wed, Oct 2, 2019 at 4:08 PM Jacek Laskowski <ja...@japila.pl> wrote:
>>
>>> Hi Jungtaek,
>>>
>>> Thanks a lot for your very prompt response!
>>>
>>> > Looks like it's missing, or it's intended to force custom streaming
>>> > sources to be implemented as DSv2.
>>>
>>> That's exactly my understanding = no more DSv1 data sources. That however
>>> is not consistent with the official message, is it? Spark 2.4.4 does not
>>> actually say "we're abandoning DSv1", and people wouldn't really want to
>>> jump on DSv2 since it's not recommended (unless I missed that).
>>>
>>> I love surprises (as that's where people pay more for consulting :)), but
>>> not necessarily before public talks (with one at SparkAISummit in two
>>> weeks!) It's going to be challenging! I hope I won't spread the wrong word.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> ----
>>> https://about.me/JacekLaskowski
>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>> The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>> On Wed, Oct 2, 2019 at 6:16 AM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>>>
>>>> Looks like it's missing, or it's intended to force custom streaming
>>>> sources to be implemented as DSv2.
>>>>
>>>> I'm not sure the Spark community wants to expand the DSv1 API; I could
>>>> propose the change if we get some support here.
>>>>
>>>> To the Spark community: given that we are bringing major changes to DSv2,
>>>> some users will want to rely on DSv1 while the transition from the old
>>>> DSv2 to the new DSv2 happens and the new DSv2 stabilizes. Would we like
>>>> to provide the necessary changes in DSv1?
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Wed, Oct 2, 2019 at 4:27 AM Jacek Laskowski <ja...@japila.pl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I think I've gotten stuck, and without your help I won't get any further.
>>>>> Please help.
>>>>>
>>>>> I'm on Spark 2.4.4, developing a streaming Source (DSv1, MicroBatch). In
>>>>> the getBatch phase, when the source is asked for a DataFrame, there is
>>>>> this assert [1] that I can't seem to get past with any DataFrame I've
>>>>> managed to create, as none of them are streaming.
>>>>>
>>>>> assert(batch.isStreaming,
>>>>>   s"DataFrame returned by getBatch from $source did not have isStreaming=true\n" +
>>>>>     s"${batch.queryExecution.logical}")
>>>>>
>>>>> [1] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala#L439-L441
>>>>>
>>>>> All I could find is private[sql], e.g. SQLContext.internalCreateDataFrame(...,
>>>>> isStreaming = true) [2] or [3].
>>>>>
>>>>> [2] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L422-L428
>>>>> [3] https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L62-L81
>>>>>
>>>>> Pozdrawiam,
>>>>> Jacek Laskowski
>>>>> ----
>>>>> https://about.me/JacekLaskowski
>>>>> The Internals of Spark SQL https://bit.ly/spark-sql-internals
>>>>> The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming
>>>>> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
>>>>> Follow me at https://twitter.com/jaceklaskowski
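For completeness, here is a rough sketch (an illustration only, not a recommendation) of the "package hack" discussed above: a helper object placed under org.apache.spark.sql so that the private[sql] members referenced in [2] and [3] become reachable. Only internalCreateDataFrame, Dataset.ofRows and LogicalRDD are real Spark 2.4 APIs; the object and method names are made up.

// Hypothetical helper demonstrating the "package hack": because it lives in
// org.apache.spark.sql, the private[sql] members become accessible. Object and
// method names are made up; only the Spark calls themselves are real 2.4 APIs.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.types.StructType

object StreamingDataFrameBridge {

  // Approach 1: SQLContext.internalCreateDataFrame with isStreaming = true [2].
  // Note the RDD[InternalRow] parameter, which is why callers still end up
  // depending on catalyst.
  def viaInternalCreateDataFrame(
      sqlContext: SQLContext,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame =
    sqlContext.internalCreateDataFrame(rows, schema, isStreaming = true)

  // Approach 2: Dataset.ofRows [3] over a LogicalRDD flagged as streaming.
  // LogicalRDD sits in sql-core, but the plan and attribute types are catalyst.
  def viaOfRows(
      spark: SparkSession,
      rows: RDD[InternalRow],
      schema: StructType): DataFrame =
    Dataset.ofRows(spark, LogicalRDD(schema.toAttributes, rows, isStreaming = true)(spark))
}

Either way the implementer ends up handling RDD[InternalRow] and catalyst plan types, which is exactly the dependency concern raised at the top of this thread.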