On Nov 29, 2017, 12:17pm, Miles Wells wrote:

> Did Scott ever write that blog post he mentioned in the linked post? I am 
> quite interested in reading more about his testing. 


No, I never did. I quit the company a few months later and attempted to
found a TSDB company called Kerf, so I was a little busy with other
things. Kerf didn't work out, and I'm back working with jd again.

The Spark-beating incident was one of those "had a big company deadline"
things. Despite having several Spark bigshots on the team, the Spark
solution not only didn't have the functionality to solve the problem, it
was also obscenely slow, even using Parquet files.

Jd 2.0 had more than enough functionality to solve the problem
(aggregating a 500GB data set into a smaller data set our machine
learning tools could use), and it was fast and memory efficient. It was a
peculiar problem in that the data was relatively modest in row count but
huge in column count (and the aggregate made it even wider). I'm pretty
sure jd could have handled a problem like this at 4x the size on the
server I had access to. Remember: jd is single-threaded; Spark was
running on something like 250 CPUs (6-8 machines).
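
To give a flavor of the kind of aggregation involved, here is a tiny
pure-J sketch of a grouped sum using the key adverb (/.) on made-up
in-memory data; in jd itself this would be a query against the on-disk
columns rather than raw verbs, so treat it only as an illustration of
the idea:

   syms =. ;: 'a b a c b a'     NB. key column (6 rows, 3 distinct keys)
   vals =. 10 20 30 40 50 60    NB. value column
   ~. syms                      NB. distinct keys: a b c
   syms +//. vals               NB. sum of vals grouped by key: 100 70 40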

Spark is not worth it unless you have a huge cluster and can solve the
problem no other way. Maybe it's gotten better since then, but many of
the design decisions can only be described as "bad." It's CPU-, IO-, and
memory-inefficient. Better than Hadoop, at least. A friend of mine who
was an early Google employee needed a map-reduce framework for his new
project. He fiddled with the open source ones, laughed at how bad they
were, wrote his own in a week, and has been using it in production for a
few years now.

J/jd is capable of doing Spark-like things (i.e., sharding the problem
across many machines) with some back-end work. The Kx people have done
this sort of thing. Maybe some day this will happen; APL languages are a
natural fit for parallel compute. There are old papers on doing this,
and I think one of J's ancestors (FP) was specifically designed for it.

In the meantime, I think jd is a great tool for terascale problems that
have some sort of time orientation (and probably for ones without time
orientation, too). It is comparable in speed to K and it costs less.
Plus I know J better, and J comes out of the box with more of the tools
I need (including a large personal library of J-based prediction
algorithms).

I'm not yet grinding particularly big data on my present jd-related
project, but it's reassuring to know that I can when the time comes. 


-SL


----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
