Responses inline.

On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford <traviscrawf...@gmail.com> wrote:
> Hi Spark gurus,
>
> At Medium we're using Spark for an ETL job that scans DynamoDB tables and
> loads into Redshift. Currently I use a parallel scanner implementation that
> writes files to local disk, then have Spark read them as a DataFrame.
>
> Ideally we could read the DynamoDB table directly as a DataFrame, so I
> started putting together a data source at
> https://github.com/traviscrawford/spark-dynamodb
>
> A few questions:
>
> * What's the best way to incrementally build the RDD[Row] returned by
> "buildScan"? Currently I make an RDD[Row] from each page of results, and
> union them together. Does this approach seem reasonable? Any suggestions
> for a better way?

If the number of pages can be high (e.g. > 100), it is best to avoid using
union. The simpler way is

  val pages = ...
  sc.parallelize(pages, pages.size).flatMap { page => ... }

The above creates one task per page.

Looking at your code, you are relying on Spark's JSON schema inference to read
the JSON data, so you would need a different approach there in order to
parallelize this. Right now you are bringing all the data into the driver and
then sending it back out. (See the sketch at the end of this message.)

> * Currently my stand-alone scanner creates separate threads for each scan
> segment. I could use that same approach and create threads in the Spark
> driver, though ideally each scan segment would run in an executor. Any tips
> on how to get the segment scanners to run on Spark executors?

I'm not too familiar with DynamoDB. Is a segment different from the page above?

> Thanks,
> Travis
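
To make the parallelize idea above concrete, here is a rough, untested sketch
that pushes the scan work onto the executors, one task per parallel-scan
segment, and uses an explicit schema so the JSON can be parsed without
driver-side inference. The scanSegment helper, the table name, and the schema
fields are placeholders, not code from spark-dynamodb:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.types._

  object DynamoScanSketch {

    // Placeholder: run one segment of a DynamoDB parallel scan and return
    // each item serialized as a JSON string. A real implementation would
    // call the AWS SDK scan API with the given segment/totalSegments.
    def scanSegment(table: String, segment: Int, totalSegments: Int): Seq[String] =
      Seq.empty

    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("dynamo-scan-sketch"))
      val sqlContext = new SQLContext(sc)

      val table = "my_table"     // placeholder table name
      val totalSegments = 32     // one Spark task per scan segment

      // One partition per segment, so each segment is scanned on an executor
      // instead of funneling every page through the driver.
      val jsonLines = sc
        .parallelize(0 until totalSegments, totalSegments)
        .flatMap(segment => scanSegment(table, segment, totalSegments))

      // An explicit schema (made-up fields here) avoids a second pass over
      // the data for JSON schema inference.
      val schema = StructType(Seq(
        StructField("id", StringType),
        StructField("createdAt", LongType)))

      val df = sqlContext.read.schema(schema).json(jsonLines)
      df.show()
    }
  }

If the items are already mapped to a known schema, the same flatMap could emit
Row objects directly inside buildScan and skip the JSON round-trip entirely,
but that is a separate design choice.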