To throw a cat amongst the pigeons... Postgres and some other backends can use R as a built-in language on the server side, e.g. http://www.joeconway.com/plr/. I'm afraid I have not used any of these in a serious project.
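For the curious, a minimal sketch of what Sean's suggestion looks like in practice. This assumes a Postgres database with the PL/R extension installed; the function, table, and column names (`r_median`, `measurements`, `reading`) are purely illustrative.

```sql
-- Requires the PL/R extension (http://www.joeconway.com/plr/) to be
-- installed and enabled in this database.
-- The function body is ordinary R; PL/R exposes SQL arguments as arg1, arg2, ...
CREATE OR REPLACE FUNCTION r_median(float8[]) RETURNS float8 AS
'
  # arg1 arrives as a numeric vector
  median(arg1, na.rm = TRUE)
' LANGUAGE 'plr';

-- Hypothetical usage: R runs inside the server, next to the data.
SELECT sensor_id, r_median(array_agg(reading)) AS median_reading
FROM measurements
GROUP BY sensor_id;
```

The appeal is that the R computation happens where the data lives, so nothing large ever crosses the wire.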
Kind regards,
Sean

On 7 May 2014 15:30, jim holtman <jholt...@gmail.com> wrote:

> The place where I have seen a large difference is when you have to "join"
> a large number of tables to get results. The problem with doing this in R
> is the restriction of how much can fit into memory, and the time that it
> takes to download the data across a data connection if the data is remote
> from the machine that you are running R on. As an aside, we have a number
> of stored procedures on the SQL machine that will call R locally to do
> some of the processing. You have to look at the tradeoff of the time to
> get the data vs. the time needed to aggregate the results. An SQL server
> is much better at doing joins across ten different tables, and it can
> scale that across multiple processors if they are available. The resulting
> table is typically much smaller when it comes time to download it to your
> local R machine. There is a big difference if these tables start to exceed
> the memory that you have on your local R machine.
>
> I would agree that if I can fit the data into memory, there are packages
> (data.table & dplyr) that will make the accessing/aggregation of data
> pretty fast, but if I can do some of that on a SQL machine, the amount of
> data that I have to download and locally process may be a lot smaller.
>
>
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.
>
>
> On Wed, May 7, 2014 at 10:08 AM, Tim Keitt <tke...@utexas.edu> wrote:
>
> > On Wed, May 7, 2014 at 7:08 AM, Hadley Wickham <h.wick...@gmail.com>
> > wrote:
> >
> > > > I think of R and SQL as mostly complementary. The example query is
> > > > simple in SQL, and relatively ugly in R. But also, I would expect it
> > > > to be much faster with an SQL engine; after all, that is what SQL
> > > > engines are designed to do.
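Jim's tradeoff can be sketched with DBI. This uses an in-memory SQLite database purely so the example is self-contained; in practice the connection would point at a remote SQL server, and the table and column names here are made up.

```r
# Push the join and aggregation to the database; download only the summary.
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "orders",    data.frame(id = 1:4, cust = c(1, 1, 2, 2),
                                          amount = c(10, 20, 30, 40)))
dbWriteTable(con, "customers", data.frame(cust = 1:2,
                                          region = c("east", "west")))

# The join across tables runs on the database side; only the small
# aggregated table crosses the connection into R.
res <- dbGetQuery(con, "
  SELECT c.region, SUM(o.amount) AS total
  FROM orders o JOIN customers c ON o.cust = c.cust
  GROUP BY c.region")
dbDisconnect(con)
res
```

The same pattern scales: the wider the join and the sharper the aggregation, the bigger the saving from doing it server-side.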
> > > > So the implied suggestion that it might be faster in R really
> > > > surprises me. (Implied by the original question and also somewhat
> > > > endorsed by some responses.)
> > >
> > > I think your expectation is wrong. SQL engines are not designed to do
> > > fast in-memory analysis. They are designed to be:
> > >
> > > * fast at adding new rows (in an ACID manner)
> > > * fast at working with data on disk
> > >
> > > My expectation (after working on dplyr) is that if you can fit the
> > > data in memory, R should be faster. If it's not faster, it's a sign
> > > that the underlying R code is poorly written.
> > >
> > > The right indexes can make an RDBMS competitive with naive loops on
> > > in-memory data, but I rarely find subsetting to be a bottleneck.
> > > Indexes help a little with counts, but don't help at all for any
> > > other summaries.
> > >
> > > > 1/ Has anyone ever had the experience that an R "query" of a data
> > > > frame is faster than an SQL database for a non-trivial SQL-style
> > > > query on a modestly large dataset? If so, what are the parameters
> > > > that one should think about when considering whether a "pure R"
> > > > solution might be faster?
> > >
> > > If it fits in memory, R _should_ be faster.
> > >
> > > > 2/ Assuming no in 1/, is there any programming language that would
> > > > be preferred to an SQL engine for doing SQL-style queries? (Please
> > > > leave aside the debate over your favourite flavour of SQL engine,
> > > > and the possibility that the database would be better in some
> > > > non-SQL structure.) If yes, I would really appreciate some
> > > > explanation. I think of the fact that general programming languages
> > > > need to do other things besides SQL queries as something like a
> > > > constraint, and there is a theorem that a constrained optimization
> > > > can never be better than an unconstrained one.
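Hadley's in-memory case is worth seeing concretely: the classic SQL-style "GROUP BY plus aggregate" expressed entirely in R with dplyr, on data that fits in memory. The data here are invented for illustration.

```r
# SELECT carrier, COUNT(*), AVG(delay) FROM flights GROUP BY carrier
# ... done in memory with dplyr instead of a database.
library(dplyr)

flights <- data.frame(carrier = c("AA", "AA", "UA", "UA", "UA"),
                      delay   = c(5, 15, 0, 30, 10))

flights %>%
  group_by(carrier) %>%
  summarise(n = n(), mean_delay = mean(delay))
```

Because everything is already in RAM as column vectors, there is no I/O, no transaction machinery, and no wire transfer to pay for, which is the heart of the "if it fits in memory, R should be faster" claim.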
> > > Not a programming language, but I think it's useful to be aware of
> > > columnar databases (e.g. MonetDB). These are column-based (like R),
> > > not row-based (like most RDBMSs), which means that they're slower at
> > > adding new rows, but faster at doing analytic summaries.
> >
> > All very good points. Just remember that the R in RDBMS stands for
> > relational. They facilitate very large and complex joins, as opposed to
> > simple filtering or aggregation. Column stores are not always ideal for
> > complex joins, but they are very good for large scans.
> >
> > THK
> >
> > > Hadley
> > >
> > > --
> > > http://had.co.nz/
> >
> > --
> > http://www.keittlab.org/

_______________________________________________
R-sig-DB mailing list -- R Special Interest Group
R-sig-DB@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-db
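The "column-based (like R)" aside can be made concrete: a data.frame is a list of column vectors, so an analytic summary touches only the columns it needs, each as one contiguous vectorised scan. The data below are synthetic.

```r
# A data.frame stores each column as its own vector, columnar-store style.
df <- data.frame(id = 1:1e6, x = runif(1e6), y = runif(1e6))

# This scans only the x column; id and y are never read.
mean(df$x)

# Per-column summaries, still without any row-by-row access.
vapply(df[c("x", "y")], mean, numeric(1))
```

This is exactly the access pattern a column store optimises for, and why both R and columnar databases are fast at scans and summaries but comparatively slow at appending individual rows.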