We don't support this yet, but I've opened this JIRA as it sounds generally useful: https://issues.apache.org/jira/browse/SPARK-19031
In the meantime you could try implementing your own Source, but that is pretty low level and is not yet a stable API.

On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <yyz1...@gmail.com> wrote:

> Hi all,
>
> Thanks a lot for your contributions to bring us new technologies.
>
> I don't want to waste your time, so before I write to you, I googled,
> checked stackoverflow and mailing list archive with keywords "streaming"
> and "jdbc". But I was not able to get any solution to my use case. I hope I
> can get some clarification from you.
>
> The use case is quite straightforward, I need to harvest a relational
> database via jdbc, do something with data, and store result into Kafka. I
> am stuck at the first step, and the difficulty is as follows:
>
> 1. The database is too large to ingest with one thread.
> 2. The database is dynamic and time series data comes in constantly.
>
> Then an ideal workflow is that multiple workers process partitions of data
> incrementally according to a time window. For example, the processing
> starts from the earliest data with each batch containing data for one hour.
> If data ingestion speed is faster than data production speed, then
> eventually the entire database will be harvested and those workers will
> start to "tail" the database for new data streams and the processing
> becomes real time.
>
> With Spark SQL I can ingest data from a JDBC source with partitions
> divided by time windows, but how can I dynamically increment the time
> windows during execution? Assume that there are two workers ingesting data
> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
> for 2017-01-03. But I am not able to find out how to increment those values
> during execution.
>
> Then I looked into Structured Streaming. It looks much more promising
> because window operations based on event time are considered during
> streaming, which could be the solution to my use case. However, from
> documentation and code example I did not find anything related to streaming
> data from a growing database. Is there anything I can read to achieve my
> goal?
>
> Any suggestion is highly appreciated. Thank you very much and have a nice
> day.
>
> Best regards,
> Yang
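
For the batch approach described in the quoted message (JDBC partitions divided by time windows), here is a minimal sketch of one possible workaround, not a built-in streaming source. It builds one SQL predicate per one-hour window and passes the list to the `predicates` argument of `DataFrameReader.jdbc`, so that each window becomes one partition. The column name `event_time`, the helper names `hourly_predicates` and `read_window`, and the timestamp literal format are assumptions made for illustration; the literal syntax in particular depends on the database dialect.

```python
from datetime import datetime, timedelta


def hourly_predicates(start, end, column="event_time"):
    """Build one SQL predicate per one-hour window covering [start, end).

    `column` is a hypothetical timestamp column; adjust the literal
    format to whatever the target database expects.
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    predicates = []
    cursor = start
    while cursor < end:
        # The last window is clipped so it never reaches past `end`.
        upper = min(cursor + timedelta(hours=1), end)
        predicates.append(
            "{c} >= '{lo}' AND {c} < '{hi}'".format(
                c=column, lo=cursor.strftime(fmt), hi=upper.strftime(fmt)
            )
        )
        cursor = upper
    return predicates


def read_window(spark, url, table, start, end, properties):
    """Read [start, end) over JDBC, one partition per hourly predicate.

    `spark` is an existing SparkSession; url, table and properties are
    supplied by the caller (hypothetical helper, not a Spark API).
    """
    return spark.read.jdbc(
        url,
        table,
        predicates=hourly_predicates(start, end),
        properties=properties,
    )
```

A driver-side loop could call `read_window` repeatedly, moving `start` forward to the previous `end` after each batch has been written to Kafka. Within a single batch, Spark hands partitions to whichever task slot is free, so passing more windows than there are cores gives roughly the "quicker worker takes the next window" behavior. Tracking progress across restarts is left to the caller in this sketch.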