Thanks for the response, Jörn. To elaborate: I have a large dataset of
userIds, each tagged with one or more properties, e.g.:

user_1    prop1=X
user_2    prop1=Y    prop2=A
user_3    prop2=B


I would like to get the number of distinct users that have a particular
property (or combination of properties). The cardinality of each property is
in the 1000s and will only grow, as will the number of properties. I'm happy
to trade accuracy for performance, so approximate values are fine.
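
Concretely, the query I want is roughly the following sketch, where the RDD
shape, field names and 0.05 relative error are just stand-ins for my real
data:

    // users: RDD[(String, Map[String, String])] of (userId, properties),
    // already processed and cached in memory
    val n = users
      .filter { case (_, props) => props.get("prop1").exists(_ == "X") }
      .map { case (userId, _) => userId }
      .countApproxDistinct(0.05)   // approximate distinct count is fine here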

Spark's performance when doing this via spark-shell, using the
"countApproxDistinct" method on a "JavaRDD", is more than excellent. However,
I've no idea of the best way to run such a query programmatically, the way I
can manually via spark-shell.
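
What I have in mind, as a minimal sketch, is a long-running driver program
that parses and caches the RDD once, and then answers each incoming query
with a filter plus "countApproxDistinct". The input path, parsing, case class
and relative error below are all placeholders for my real setup:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    // One (userId, propName, propValue) triple per tag in the raw data.
    case class Tag(userId: String, prop: String, value: String)

    val sc = new SparkContext(new SparkConf().setAppName("adhoc-distinct-counts"))

    // Parse and cache the dataset once, up front; path/format are illustrative.
    val tags = sc.textFile("hdfs:///path/to/tags")
      .flatMap { line =>
        val fields = line.split("\\s+")        // e.g. "user_2  prop1=Y  prop2=A"
        fields.tail.map { kv =>
          val Array(k, v) = kv.split("=", 2)   // assumes well-formed key=value
          Tag(fields.head, k, v)
        }
      }
      .persist(StorageLevel.MEMORY_ONLY)

    // The ad-hoc query: prop and value are only known at request time, so
    // this would sit behind whatever layer receives the queries.
    def distinctUsers(prop: String, value: String): Long =
      tags.filter(t => t.prop == prop && t.value == value)
        .map(_.userId)
        .countApproxDistinct(relativeSD = 0.05)

Each call to distinctUsers then reuses the cached RDD, so only the filter,
map and approximate count are paid per query. What I can't work out is the
idiomatic way to keep such a driver alive and feed queries to it.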

Hope this clarifies things.


On 14 December 2015 at 17:04, Jörn Franke <jornfra...@gmail.com> wrote:

> Can you elaborate a little bit more on the use case? It looks a little bit
> like an abuse of Spark in general. Interactive queries that are not
> suitable for in-memory batch processing might be better supported by Ignite,
> which has in-memory indexes, a concept of hot/warm/cold data, etc., or by
> Hive on Tez+LLAP.
>
> > On 14 Dec 2015, at 17:19, Krishna Rao <krishnanj...@gmail.com> wrote:
> >
> > Hi all,
> >
> > What's the best way to run ad-hoc queries against a cached RDDs?
> >
> > For example, say I have an RDD that has been processed, and persisted to
> memory-only. I want to be able to run a count (actually
> "countApproxDistinct") after filtering by a value that is unknown at
> compile time (i.e. specified by the query).
> >
> > I've experimented with using (abusing) Spark Streaming, by streaming
> queries and running them against the cached RDD. However, as I say, I don't
> think this is an intended use case of Streaming.
> >
> > Cheers,
> >
> > Krishna
>
