Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
There are a few different ways to apply approximation algorithms and probabilistic data structures to your Spark data, including Spark's countApproxDistinct() methods, as you pointed out. There's also Twitter Algebird and Redis HyperLogLog (PFCOUNT, PFADD). Here are some examples from my *pipeline
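
As a rough illustration of the HyperLogLog idea named above, here is a from-scratch Python sketch. It is not the Redis, Algebird, or Spark implementation, and it is not one of the examples the message refers to; the class name, the SHA-1 hash, and the default precision `p=12` are all assumptions chosen for readability.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical class, not a library API)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of a SHA-1 digest as the hash value (an assumption;
        # real implementations use faster non-cryptographic hashes).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                        # top p bits pick a register
        rest = h & ((1 << width) - 1)           # remaining bits
        rank = width - rest.bit_length() + 1    # position of the leftmost 1 bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Union of two sketches: element-wise max of the registers.
        if other.p != self.p:
            raise ValueError("precision mismatch")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    sketch = HyperLogLog()
    for i in range(10000):
        sketch.add(f"user_{i}")
    print(round(sketch.count()))  # an estimate close to 10000, not an exact count
```

Adding an item plays the role of PFADD, `count()` the role of PFCOUNT, and `merge()` is the union operation that Redis exposes as PFMERGE. With 4096 registers the typical relative error is on the order of 1 to 2 percent.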

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
Hi all, What's the best way to run ad-hoc queries against a cached RDD? For example, say I have an RDD that has been processed and persisted to memory-only. I want to be able to run a count (actually "countApproxDistinct") after filtering by an, at compile time, unknown (specified by query)

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
Thanks for the response Jörn. So to elaborate, I have a large dataset with userIds, each tagged with a property, e.g.:

user_1 prop1=X
user_2 prop1=Y prop2=A
user_3 prop2=B

I would like to be able to get the number of distinct users that have a particular property (or combination of
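
To make the query shape concrete, here is a minimal plain-Python sketch of "filter by a property combination supplied at runtime, then count distinct users". It uses an in-memory list as a stand-in for the cached RDD. The function names are hypothetical, the three records are one reading of the sample in the message, and "combination" is assumed to mean that every requested property value must match.

```python
from typing import Callable, Dict, Iterable, Tuple

Record = Tuple[str, Dict[str, str]]

# One reading of the sample data in the message (an assumption).
records = [
    ("user_1", {"prop1": "X"}),
    ("user_2", {"prop1": "Y", "prop2": "A"}),
    ("user_3", {"prop2": "B"}),
]


def make_predicate(required: Dict[str, str]) -> Callable[[Dict[str, str]], bool]:
    """Build a filter at runtime from a query such as {"prop1": "Y", "prop2": "A"}."""
    def predicate(props: Dict[str, str]) -> bool:
        # Assumed semantics: every requested property must be present with that value.
        return all(props.get(key) == value for key, value in required.items())
    return predicate


def count_distinct_users(data: Iterable[Record], required: Dict[str, str]) -> int:
    """Exact distinct count of users whose properties satisfy the query."""
    matches = make_predicate(required)
    return len({user for user, props in data if matches(props)})


if __name__ == "__main__":
    print(count_distinct_users(records, {"prop1": "X"}))                # 1
    print(count_distinct_users(records, {"prop1": "Y", "prop2": "A"}))  # 1
```

Against a cached PySpark RDD of the same (user, properties) pairs, the equivalent chain would presumably be `rdd.filter(lambda r: matches(r[1])).map(lambda r: r[0]).countApproxDistinct()`, which trades the exact set for an approximate count.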

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Jörn Franke
Can you elaborate a little bit more on the use case? It looks a little bit like an abuse of Spark in general. Interactive queries that are not suitable for in-memory batch processing might be better supported by Ignite, which has in-memory indexes, a concept of hot, warm, and cold data, etc., or Hive on