On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux <lerfranc...@gmail.com>
wrote:

> 1.       Should I use dataframes to 'pull' the source data? If so, do I do
> a groupby and order by as part of the SQL query?
>
Seems reasonable. If you use Scala, you might want to define a case class
and convert the DataFrame to a Dataset (I find the latter's API easier to
work with), either before or after you group.
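For example, the conversion is a single `as[T]` call once a matching case
class and the implicits are in scope. A minimal sketch — `DriverEvent`, its
fields, and the input path are made up and must match your actual schema:

```scala
// Sketch, assuming Spark 1.6+ with a SQLContext in scope.
// DriverEvent's fields are hypothetical; they must match the source columns.
case class DriverEvent(driverId: Long, rowId: Long, speed: Double)

import sqlContext.implicits._

val df = sqlContext.read.parquet("hdfs:///path/to/events")  // or jdbc/csv/...
val ds = df.as[DriverEvent]  // typed Dataset; map/groupBy now see DriverEvent
```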


> 2.       How do I then split the grouped data (i.e. driver ID key value
> pairs) to then be parallelized for concurrent processing (i.e. ideally the
> number of parallel datasets/grouped data should run at max node cluster
> capacity)? DO I need to do some sort of mappartitioning ?
>
Spark partitions the data. Each partition can be processed in parallel (if
you look at the list of tasks for a stage in the UI, each task is a single
partition).

The problem you could run into is data skew: if you have many records for a
given driver, they will all end up in the same partition. Look out for that
(the UI shows how much data is processed by each task).

The number of partitions for stages which read the data depends on how the
data is stored (watch out for large gzipped files, as they cannot be
split). When data is shuffled (e.g. for a group by) it will be
repartitioned, with the number of partitions determined by the properties
spark.default.parallelism and spark.sql.shuffle.partitions.

See
http://spark.apache.org/docs/latest/configuration.html#execution-behavior
and
http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options
.
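Setting those two properties might look like this (a sketch; the value 48 is
purely illustrative — pick something based on your data volume and total
executor cores):

```scala
// Sketch: the two properties mentioned above.
// spark.default.parallelism must be set before the SparkContext is created;
// spark.sql.shuffle.partitions can be changed at runtime.
val conf = new org.apache.spark.SparkConf()
  .set("spark.default.parallelism", "48")                 // RDD shuffles
val sc = new org.apache.spark.SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.setConf("spark.sql.shuffle.partitions", "48")  // DataFrame/Dataset shuffles
```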


> 3.       Pending (1) & (2) answers: How does each (i.e. grouped data set)
> dataframe or RDD or dataset perform these rules based checks (i.e. backward
> and forward looking checks) ? i.e. how is this achieved in SPARK?
>
Once you've grouped by driver ID, I assume you'll want to use a map
function. If you want to output multiple records (un-grouping after you've
processed them), you can use flatMap.

You might want to verify that your grouped data is in row ID order. In a
distributed system you need to be careful about assuming things arrive in
any particular order; after a shuffle, ordering within a group is not
guaranteed, so sort explicitly if your rules depend on it.
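With the Dataset API that might look something like the sketch below.
`DriverEvent` and `checkRules` are hypothetical stand-ins for your schema
and your rules, and `groupByKey` is the Spark 2.0 name (in 1.6 the typed
version is `ds.groupBy(...)`):

```scala
// Sketch: per-driver, order-sensitive rule checks via flatMapGroups.
val results = ds
  .groupByKey(_.driverId)                  // Spark 2.0 API; ds.groupBy(...) in 1.6
  .flatMapGroups { (driverId, events) =>
    // Within-group order is not guaranteed after the shuffle: sort explicitly.
    // toSeq materializes the whole group in memory, so skewed drivers hurt here.
    val ordered = events.toSeq.sortBy(_.rowId)
    // Backward/forward-looking checks, e.g. over a sliding window of 3 rows:
    ordered.sliding(3).flatMap(window => checkRules(driverId, window))
  }
```

The sliding-window size and `checkRules` signature are assumptions; the point
is that each group's iterator is processed on a single executor, so ordinary
Scala collection code works inside `flatMapGroups`.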
