Hi all,

What's the best way to run ad-hoc queries against a cached RDD?

For example, say I have an RDD that has been processed and persisted as
memory-only. I want to filter it by a value that is not known at compile
time (it is specified by the query) and then run a count (actually
"countApproxDistinct") on the result.
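
To make that concrete, here is a minimal sketch of the kind of query I
mean. The Event record type, its fields, and the approxDistinctUsers
helper are made up for illustration; only filter, map, and
countApproxDistinct are actual RDD methods, and I'm assuming the default
relative accuracy of 0.05 is acceptable.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical record type, standing in for whatever the cached RDD holds.
case class Event(key: String, userId: String)

object AdHocQuery {
  // Filter the already-cached RDD by a value supplied at query time,
  // then estimate the number of distinct user ids among the matches.
  // `cached` is assumed to have been persisted (memory-only) beforehand.
  def approxDistinctUsers(cached: RDD[Event],
                          key: String,
                          relativeSD: Double = 0.05): Long =
    cached
      .filter(_.key == key)            // value only known at run time
      .map(_.userId)
      .countApproxDistinct(relativeSD) // approximate distinct count
}
```

Each query would call approxDistinctUsers with a different key against
the same cached RDD.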

I've experimented with using (abusing) Spark Streaming, by streaming the
queries in and running each one against the cached RDD. However, I don't
think this is an intended use case for Streaming.

Cheers,

Krishna
