Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

Michael Armbrust Thu, 29 Dec 2016 17:02:36 -0800

We don't support this yet, but I've opened this JIRA as it sounds generally
useful: https://issues.apache.org/jira/browse/SPARK-19031


In the mean time you could try implementing your own Source, but that is
pretty low level and is not yet a stable API.

On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <yyz1...@gmail.com>
wrote:

> Hi all,
>
> Thanks a lot for your contributions to bring us new technologies.
>
> I don't want to waste your time, so before I write to you, I googled,
> checked stackoverflow and mailing list archive with keywords "streaming"
> and "jdbc". But I was not able to get any solution to my use case. I hope I
> can get some clarification from you.
>
> The use case is quite straightforward, I need to harvest a relational
> database via jdbc, do something with data, and store result into Kafka. I
> am stuck at the first step, and the difficulty is as follows:
>
> 1. The database is too large to ingest with one thread.
> 2. The database is dynamic and time series data comes in constantly.
>
> Then an ideal workflow is that multiple workers process partitions of data
> incrementally according to a time window. For example, the processing
> starts from the earliest data with each batch containing data for one hour.
> If data ingestion speed is faster than data production speed, then
> eventually the entire database will be harvested and those workers will
> start to "tail" the database for new data streams and the processing
> becomes real time.
>
> With Spark SQL I can ingest data from a JDBC source with partitions
> divided by time windows, but how can I dynamically increment the time
> windows during execution? Assume that there are two workers ingesting data
> of 2017-01-01 and 2017-01-02, the one which finishes quicker gets next task
> for 2017-01-03. But I am not able to find out how to increment those values
> during execution.
>
> Then I looked into Structured Streaming. It looks much more promising
> because window operations based on event time are considered during
> streaming, which could be the solution to my use case. However, from
> documentation and code example I did not find anything related to streaming
> data from a growing database. Is there anything I can read to achieve my
> goal?
>
> Any suggestion is highly appreciated. Thank you very much and have a nice
> day.
>
> Best regards,
> Yang
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

Re: [Spark Structured Streaming]: Is it possible to ingest data from a jdbc data source incrementally?

Reply via email to