On Nov 29, 2017, 12:17pm, Miles Wells wrote:
> Did Scott ever write that blog post he mentioned in the linked post? I am
> quite interested in reading more about his testing.
No, I never did. I quit the company a few months later and attempted to found a TSDB company called Kerf, so I was a little busy with other things. Kerf didn't work out, and I'm back working with jd again.

The Spark-beating incident was one of those "had a big company deadline" things. The Spark solution not only lacked the functionality to solve the problem, despite having several Spark bigshots on the team; it was also obscenely slow, even using Parquet files. jd 2.0 had more than enough functionality to solve the problem (aggregating a 500GB data set into a smaller data set our machine learning tools could use), and was fast and memory efficient. It was a peculiar problem in that the data was relatively modest in row space, but huge in column space (and it grew even more columns with the aggregate). Pretty sure jd could have handled a problem 4x that size on the server I had access to. Remember: jd is single-threaded; Spark was running on something like 250 CPUs (6-8 machines).

Spark is not worth it unless you have a huge cluster and can solve the problem in no other way. Maybe it's gotten better since then, but a lot of the design decisions can be described in no other way than "bad." It's CPU, IO and memory inefficient. Better than Hadoop at least. A friend of mine who was an early Google employee needed a map-reduce framework for his new project. He fiddled with these open source ones, laughed at how bad they were, wrote his own in a week, and has been using it in production for a few years now.

J/jd is capable of doing Spark-like things (i.e., sharding the problem across many machines) with some back end work. The Kx people have done this sort of thing. Maybe some day this will happen; APL languages are a natural fit for parallel compute. There are old papers on doing this, and I think one of J's ancestors was specifically designed for it (FP).
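For flavor, the group-and-aggregate shape of that problem boils down, in plain J, to the key adverb (/.). This is a toy sketch with made-up data, nothing like the real 500GB pipeline:

```j
NB. toy group-and-sum using J's key adverb /.
NB. keys and vals are invented illustration data
keys =: 0 1 0 1 0
vals =: 10 20 30 40 50
keys +//. vals    NB. sum of vals per distinct key: 90 60
```

jd does the same kind of thing at scale over on-disk columns, which is why a wide aggregate like the one above stays cheap: each column is reduced independently.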
In the meantime, I think jd is a great tool for terabyte-scale problems that have some sort of time orientation (and probably those without, too). It is comparable in speed to K and it costs less. Plus I know J better, and J comes out of the box with more tools I need (including a large personal library of J-based prediction algorithms). I'm not yet grinding particularly big data on my present jd-related project, but it's reassuring to know that I can when the time comes.

-SL

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
