Responses inline

On Wed, Apr 13, 2016 at 7:45 AM, Travis Crawford <traviscrawf...@gmail.com>
wrote:

> Hi Spark gurus,
>
> At Medium we're using Spark for an ETL job that scans DynamoDB tables and
> loads into Redshift. Currently I use a parallel scanner implementation that
> writes files to local disk, then have Spark read them as a DataFrame.
>
> Ideally we could read the DynamoDB table directly as a DataFrame, so I
> started putting together a data source at
> https://github.com/traviscrawford/spark-dynamodb
>
> A few questions:
>
> * What's the best way to incrementally build the RDD[Row] returned by
> "buildScan"? Currently I make an RDD[Row] from each page of results, and
> union them together. Does this approach seem reasonable? Any suggestions
> for a better way?
>

If the number of pages can be high (e.g. > 100), it is best to avoid using
union. A simpler way is:

val pages = ...
sc.parallelize(pages, pages.size).flatMap { page =>
  ...
}

The above creates a task per page.
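To make that pattern concrete, here is a rough sketch of what buildScan could
look like. The names `pageDescriptors` and `scanPage` are placeholders for
however your scanner identifies and fetches a page; they are not from your
repo.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row

    // Hypothetical helpers: `pageDescriptors` identifies each page of the
    // table, and `scanPage` fetches one page and converts its items to Rows.
    def buildScan(sc: SparkContext,
                  pageDescriptors: Seq[String],
                  scanPage: String => Seq[Row]): RDD[Row] =
      // One partition (and therefore one task) per page, so each page is
      // fetched and parsed on an executor instead of through the driver.
      sc.parallelize(pageDescriptors, pageDescriptors.size)
        .flatMap(page => scanPage(page))

No unions are involved, so the lineage stays flat regardless of how many
pages there are.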

Looking at your code, you are relying on Spark's JSON schema inference to read
the JSON data. You would need to restructure that part in order to parallelize
the scan: right now you are bringing all the data into the driver and then
sending it out to the executors.
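One way to keep the JSON off the driver, sketched below, is to distribute the
raw JSON strings as an RDD and let sqlContext.read.json parse them on the
executors. `scanSegmentAsJson` is a hypothetical function standing in for
whatever emits each DynamoDB item as a JSON string.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SQLContext}

    // Sketch: `scanSegmentAsJson` runs one scan segment on an executor and
    // yields each item as a JSON string.
    def loadTable(sqlContext: SQLContext,
                  numSegments: Int,
                  scanSegmentAsJson: Int => Iterator[String]): DataFrame = {
      val sc = sqlContext.sparkContext
      // The raw data never passes through the driver; each segment is
      // scanned by its own task.
      val jsonLines: RDD[String] =
        sc.parallelize(0 until numSegments, numSegments)
          .flatMap(segment => scanSegmentAsJson(segment))
      // Schema inference still needs a pass over the data, but it now runs
      // distributed; supplying an explicit schema via .schema(...) would
      // avoid that extra pass entirely.
      sqlContext.read.json(jsonLines)
    }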



>
> * Currently my stand-alone scanner creates separate threads for each scan
> segment. I could use that same approach and create threads in the Spark
> driver, though ideally each scan segment would run in an executor. Any tips
> on how to get the segment scanners to run on Spark executors?
>

I'm not too familiar with DynamoDB. Is a segment different from the pages above?
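If a segment is just DynamoDB's built-in parallel-scan partition, then the same
parallelize pattern should get each segment scanner onto an executor: make one
task per segment number and have each task issue its own Scan request. A rough
sketch, assuming the AWS Java SDK's Segment/TotalSegments parameters, with the
table name and segment count as placeholder values:

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.ScanRequest
    import scala.collection.JavaConverters._

    // Error handling and follow-up pages (LastEvaluatedKey) are omitted.
    val totalSegments = 8  // illustrative value
    val items = sc.parallelize(0 until totalSegments, totalSegments).flatMap { segment =>
      // Each task builds its own client and scans exactly one segment.
      val client = new AmazonDynamoDBClient()
      val request = new ScanRequest()
        .withTableName("my_table")        // placeholder table name
        .withSegment(segment)
        .withTotalSegments(totalSegments)
      client.scan(request).getItems.asScala
    }

That way no scanning threads are needed on the driver at all; the driver only
decides how many segments to use.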



>
> Thanks,
> Travis
>
