Thanks so much Rafael, I think piggyback is exactly what I was looking for. I wonder whether it is possible (or best practice) to include a call to it during the install.packages("MyPackage") process, so that the data is available before the tests run in the R CMD check GitHub Action, and also so that users get the default/most recent dataset downloaded alongside the package. -John
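A minimal sketch of the download John describes, using {piggyback}. Note this runs on first use rather than at install time (install.packages() provides no supported hook for this, and install-time downloads are generally discouraged); the repository name and release tag are placeholders, not a real package:

```r
# Sketch only: "johnclarke/MyPackage" and the tag are hypothetical placeholders.
# Downloads the release assets once, into the per-user cache directory.
get_default_data <- function(tag = "latest") {
  dest <- tools::R_user_dir("MyPackage", which = "cache")
  dir.create(dest, recursive = TRUE, showWarnings = FALSE)
  piggyback::pb_download(
    repo = "johnclarke/MyPackage",  # placeholder repo
    tag  = tag,
    dest = dest
  )
  dest
}
```

In a GitHub Action, calling such a helper in a test setup file (or before R CMD check) would make the csv files available without bundling them in the package tarball.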
On Fri, Feb 14, 2025 at 4:08 PM Rafael H. M. Pereira <rafa.pereira...@gmail.com> wrote:

> Hi John,
>
> There are different alternatives for where to host the data (e.g. OSF, a
> proprietary server, GitHub, etc.). The solution I've been adopting in most
> of my packages is to use a combination of a proprietary server and GitHub.
> The data is first downloaded from our own server, and only if our server
> is offline is the download redirected to GitHub. I do this so our packages
> do not overload GitHub. Of course, this creates some additional work on
> our side to make sure the files on our server are always mirrored on
> GitHub.
>
> A key point to pay attention to when hosting the data on GitHub is to host
> it as an attachment to a *release*. A good way to manage the files and
> releases is the {piggyback} package, by Carl Boettiger et al. at rOpenSci.
> The package documentation is a really great guide on how to host data on
> GitHub, and it has some very convenient functions to create releases and
> to upload and download files. Kudos to them!
> https://docs.ropensci.org/piggyback/
>
> Best,
>
> Rafael Pereira
>
> On Fri, Feb 14, 2025 at 11:55 AM John Clarke <john.cla...@cornerstonenw.com> wrote:
>
>> Hi folks,
>>
>> I've looked around for this particular question, but haven't found a
>> good answer. I have a versioned dataset of about 6 csv files totalling
>> about 15 MB per version. The versions get updated every few years and
>> are used to drive the model, which was written in C++ but is now inside
>> an Rcpp wrapper. Apart from the fact that CRAN does not permit large
>> files, I want a better way for users to access particular versions of
>> the dataset.
>>
>> Usage idea:
>> # The following would hopefully also download the default/most recent
>> # version of the csv files from CRAN (if allowed), GitHub, or some other
>> # repository for academic open source data.
>> install.packages("MyPackage")
>> mypackage = new(MyPackage)
>>
>> Then, if necessary, the user could change the dataset used with
>> something like mypackage.dataset("2.1.0"), which would retrieve new csv
>> files if they haven't already been downloaded and update the data_folder
>> path internally to point to the 2.1.0 directory.
>>
>> Requirements:
>> - The dataset is csv (not an R data object), and the Rcpp MyPackage
>>   expects this format.
>> - It would be nice to properly include citations for the data, as they
>>   will likely be initially released through a journal publication.
>>
>> What is the best practice for this sort of dataset management for a
>> package in R? Is it okay to use GitHub to store and version the data?
>> Or is it preferred to use an R package (ignoring the file size limit)?
>> Or some other open source data hosting? I see https://r-universe.dev/
>> as an option as well. In any case, what is the proper mechanism for
>> retrieving/caching the data?
>>
>> Thanks,
>>
>> -John
>>
>> John Clarke | Senior Technical Advisor |
>> Cornerstone Systems Northwest | john.cla...@cornerstonenw.com

______________________________________________
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
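The version-switching helper sketched in the thread above could look roughly like this. It is a sketch under stated assumptions: the data repository name, the "v"-prefixed tag scheme, and the setter for the internal data_folder path are all hypothetical; caching uses the per-user cache directory from tools::R_user_dir() (R >= 4.0):

```r
# Sketch of mypackage.dataset("2.1.0"): download a given data version once,
# cache it per user, and point the model at that version's directory.
# "user/MyPackageData" and set_data_folder() are hypothetical names.
dataset <- function(version = "latest") {
  cache <- file.path(tools::R_user_dir("MyPackage", which = "cache"), version)
  if (!dir.exists(cache) ||
      length(list.files(cache, pattern = "\\.csv$")) == 0) {
    dir.create(cache, recursive = TRUE, showWarnings = FALSE)
    piggyback::pb_download(
      repo = "user/MyPackageData",  # hypothetical data repository
      tag  = if (version == "latest") "latest" else paste0("v", version),
      dest = cache
    )
  }
  # Hypothetical internal setter that updates the data_folder path:
  # set_data_folder(cache)
  invisible(cache)
}

# Usage: dataset("2.1.0") fetches the v2.1.0 release assets on first call
# and reuses the cached copy afterwards.
```

Tying each package release tag to a data version also gives a natural place to attach the data citation (e.g. in the release notes and the package's CITATION file).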