Re: Streaming - lookup against reference data

2016-09-15 Thread Tom Davis
Thanks Jörn. It sounds like there's nothing obvious I'm missing, which is
encouraging.

I've not used Redis, but it does seem that, for most of my current and
likely future use cases, it would be the best fit (a nice compromise between
scale and ease of setup and access).

Thanks,

Tom


Re: Streaming - lookup against reference data

2016-09-14 Thread Jörn Franke
Hmm, is it just a lookup, and are the values small? I do not think that in
this case Redis needs to be installed on each worker node. Redis has a rather
efficient protocol, so one or a few dedicated Redis nodes will probably be more
than sufficient for your purpose. Just try to reuse connections rather than
establishing a new one for each lookup from the same node.
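
For example, a minimal sketch in Scala with Jedis (the host, port, and key
prefix are made up; any pooled client works similarly). The pool lives once
per executor JVM, and connections are borrowed per partition, not per lookup:

import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.{JedisPool, JedisPoolConfig}

// One pool per executor JVM, created lazily on first use - never per lookup.
object RedisPool {
  lazy val pool = new JedisPool(new JedisPoolConfig(), "redis-host", 6379)
}

def enrich(stream: DStream[(String, String)]): DStream[(String, String, Option[String])] =
  stream.mapPartitions { partition =>
    val jedis = RedisPool.pool.getResource       // one connection per partition
    try {
      // Materialise the lookups before handing the connection back.
      partition.map { case (key, value) =>
        (key, value, Option(jedis.get("ref:" + key)))
      }.toList.iterator
    } finally {
      jedis.close()                              // returns the connection to the pool
    }
  }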

Additionally, Redis has a lot of interesting data structures, such as
HyperLogLogs.
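
For instance (a toy example; the key name is made up), approximate distinct
counts cost almost no memory:

val jedis = RedisPool.pool.getResource               // pool from the sketch above
jedis.pfadd("uniq:users", "alice", "bob", "alice")   // add items to a HyperLogLog
println(jedis.pfcount("uniq:users"))                 // ~2 here; ~0.81% error at scale
jedis.close()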

HBase: here you can design where each part of the reference data set is stored
and partition in Spark accordingly. It depends on the data and can be tricky.

I am a bit skeptical about the other options, especially since you need to
include updated data, which might have side effects.

Nevertheless, you have mentioned all the possible options. I guess for a true
evaluation you have to check your use case, the envisioned future architecture
for other use cases, the required performance, maintainability, etc.


Streaming - lookup against reference data

2016-09-14 Thread Tom Davis
Hi all,

I'm interested in the patterns people use in the wild for lookups against
reference data sets from a Spark Streaming job. The reference dataset will be
updated during the life of the job (although being 30 minutes out of date
wouldn't be an issue, for example).

So far I have come up with a few options, all of which have advantages and
disadvantages:

1. For small reference datasets, distribute the data as an in-memory Map()
from the driver, refreshing it inside the foreachRDD() loop (rough sketch
after this list).

Obviously the limitation here is size.

2. Run a Redis (or similar) cache on each worker node and perform lookups
against it.

There's some complexity to managing this, probably outside of the Spark job.

3. Load the reference data into an RDD, again inside the foreachRDD() loop
on the driver. Perform a join of the reference and stream batch RDDs, perhaps
keeping the reference RDD in memory (sketch after this list).

I suspect that this will scale, but I also suspect there's potential for a
lot of data shuffling across the network, which will slow things down.

4. Similar to the Redis option, but use HBase. It scales well and makes the
data available to other services, but each lookup is a call out over the
network, albeit within the cluster (sketch after this list).
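
To make option 1 concrete, here's a rough sketch of what I mean, assuming a
DStream of key/value pairs; loadReferenceMap() is a placeholder for however
the data actually gets loaded, and the 30-minute interval is just the
staleness bound mentioned above:

import org.apache.spark.broadcast.Broadcast

def loadReferenceMap(): Map[String, String] = ???    // e.g. read a file or a DB

var refBroadcast: Broadcast[Map[String, String]] = null
var lastRefresh = 0L
val refreshMs = 30 * 60 * 1000L                      // 30 minutes of staleness is fine

stream.foreachRDD { rdd =>
  val now = System.currentTimeMillis()
  if (refBroadcast == null || now - lastRefresh > refreshMs) {
    if (refBroadcast != null) refBroadcast.unpersist()    // drop the stale copy
    refBroadcast = rdd.sparkContext.broadcast(loadReferenceMap())
    lastRefresh = now
  }
  val ref = refBroadcast                             // local val keeps the closure small
  rdd.map { case (k, v) => (k, v, ref.value.get(k)) }
     .foreach(println)                               // placeholder for the real output
}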
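
For option 3, a sketch of the join, where co-partitioning both sides should
limit the shuffle to the (small) stream batch; the partition count and path
are made up, and sc is the SparkContext:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val partitioner = new HashPartitioner(64)            // made-up partition count

// Reload this periodically, in the same style as the broadcast sketch above.
val refRdd: RDD[(String, String)] = sc.textFile("hdfs:///ref/data")
  .map { line => val Array(k, v) = line.split('\t'); (k, v) }
  .partitionBy(partitioner)
  .cache()

val joined = stream.transform { batch =>
  // Both sides share the partitioner, so only the batch side gets shuffled.
  batch.partitionBy(partitioner).join(refRdd)
}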
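
And for option 4, something like the following with the HBase 1.x client,
against the same key/value DStream; the table, column family, and qualifier
names are made up, and the connection is shared per executor JVM in the same
way a Redis pool would be:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// One HBase connection per executor JVM; Table handles are cheap, per use.
object HBaseConn {
  lazy val conn: Connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())
}

val enriched = stream.mapPartitions { partition =>
  val table = HBaseConn.conn.getTable(TableName.valueOf("ref_table"))
  val out = partition.map { case (key, value) =>
    val result = table.get(new Get(Bytes.toBytes(key)))    // point get over the network
    val ref = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("v")))
      .map(b => Bytes.toString(b))
    (key, value, ref)
  }.toList                                                 // materialise before closing
  table.close()
  out.iterator
}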

I guess there's no solution that fits all cases, but I'm interested in other
people's experience and whether I've missed anything obvious.

Thanks,

Tom