Re: We need a global decision about R data in binary format, and stick to it.
I demand that Lisandro Damián Nicanor Pérez Meyer may or may not have written... On Monday 05 August 2013 17:48:27 Paul Wise wrote: On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote: What about svg files that are converted into png's and then manually adjusted? I'd say the source is the combination of the SVG files plus the adjusted PNGs. I guess you are thinking of a particular case here? What is the reason for manually adjusting them? Maybe optipng-ed images? Anyway, AFAIR, last time I tried that I did all in the rules file, so nothing to be afraid here. I'd say that the only possible issues here are in losing or re-ordering palette entries and in throwing away colour data for transparent pixels, should you choose that option. In many cases, though, neither optimisation would actually cause problems. -- | _ | Darren Salt, using Debian GNU/Linux (and Android) | ( ) | | X | ASCII Ribbon campaign against HTML e-mail | / \ | http://www.asciiribbon.org/ To see a need and wait to be asked, is to already refuse. -- To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org Archive: http://lists.debian.org/5381bb31c0%lists...@moreofthesa.me.uk
Re: We need a global decision about R data in binary format, and stick to it.
On Monday 05 August 2013 17:48:27 Paul Wise wrote: On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote: What about svg files that are converted into png's and then manually adjusted? I'd say the source is the combination of the SVG files plus the adjusted PNGs. I guess you are thinking of a particular case here? What is the reason for manually adjusting them? Maybe optipng-ed images? Anyway, AFAIR, last time I tried that I did all in the rules file, so nothing to be afraid here. -- Passwords are like underwear. You shouldn’t leave them out where people can see them. You should change them regularly. And you shouldn’t loan them out to strangers. Anonymous Lisandro Damián Nicanor Pérez Meyer http://perezmeyer.com.ar/ http://perezmeyer.blogspot.com/ signature.asc Description: This is a digitally signed message part.
Re: We need a global decision about R data in binary format, and stick to it.
Ian Jackson ijackson at chiark.greenend.org.uk writes:
> Jeremy Stanley writes (Re: We need a global decision about R data in binary format, and stick to it.):
>> interpreted strongly. For example I have a program which relies on a fairly large set of correlative data requiring hours of expensive computation to generate. In the source package I include the original data on which the resulting tables are based and provide a means to regenerate it on the fly at package build time, but disable it by default so that it doesn't chew up build resources
>
> That makes sense, and is IMO a good reason for not doing the complete from-scratch build each time.

I believe that even these should be built from source “each time”… once. Then put them into an arch:all subpackage, no load on the buildds. The package maintainer can then choose to do local test builds using the pregenerated binaries, but for the upload to the archive, they should/must recompile that data.

bye,
//mirabilos
Re: We need a global decision about R data in binary format, and stick to it.
On Mon, Aug 05, 2013 at 05:44:16PM -0700, Don Armstrong wrote:
>> My last question is, given the answers to the previous questions, what do we do with the R packages that are already in the archive and also contain data that is editable as is but do have an original source, who will do it, and what is the timeline in case of inaction.
>
> The package maintainer should handle it; in the case of inaction from upstream, the package maintainer can then either remove the data, split the package, move the package to non-free, or remove the package from Debian entirely. The timeline should be the standard one that is used for all RC bugs.
>
>> In the current situation, that I describe as active bitrotting, we do not apply the same rules to the packages that enter the archive and the packages that are already in, which causes the packages under active development to become obsolete each time new dependencies cannot enter Debian.
>
> We actually do and should apply the same rules. Sometimes violations of the rules are missed for a while, though, and we have to come back and file bugs with severity serious to deal with the problem.

Just to provide the number and names of source packages in main/unstable that contain at least one *.rda file:

$ echo deb http://http.debian.net/debian/ unstable main > sources.list.main_sources_unstable
$ sudo apt-file --architecture source --sources-list sources.list.main_sources_unstable update
$ apt-file --architecture source --sources-list sources.list.main_sources_unstable search .rda | sed 's/: .*$//' | uniq
boot car cluster dichromat effects erm fportfolio gdata gplots gtools
hmisc lattice latticeextra lme4 lmtest mcmcpack mgcv misc3d multcomp nlme
permute r-base r-bioc-biobase r-cran-bayesm r-cran-boolnet r-cran-coda
r-cran-colorspace r-cran-deal r-cran-diagnosismed r-cran-epi r-cran-epicalc
r-cran-epitools r-cran-evd r-cran-genetics r-cran-ggplot2 r-cran-gss
r-cran-mapdata r-cran-maps r-cran-mass r-cran-plotrix r-cran-plyr r-cran-pscl
r-cran-psy r-cran-randomforest r-cran-reshape r-cran-reshape2 r-cran-rocr
r-cran-sn r-cran-sp r-cran-teachingdemos r-cran-timeseries r-cran-vcd
r-cran-vegan r-cran-vgam r-other-bio3d r-zoo raschsampler rggobi rmatrix
robustbase rpart sandwich sm strucchange survival tseries urca

These are 67 packages. I have no numbers on whether the *.rda files were added in a later version than the one accepted by ftpmaster.

>> Currently, my take would be to move packages to non-free. This would also allow us to ship the PDF documentation that we currently delete.
>
> In these cases, we should split the package out into a non-free component and a free component.

I personally would agree with the ftpmaster policy used in the past to accept these packages in main.

> I should note that I'm currently distributing via debian-r.debian.net a few hundred packages which probably have this particular problem too.

... in case you really regard this as a problem.

Kind regards
Andreas.

-- 
http://fam-tille.de
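For anyone who wants to run a comparable check on a single unpacked source tree rather than through apt-file, here is a minimal Python sketch. It assumes that serialized R data can be recognised by file extension alone; the function name and the list of extensions are illustrative choices, not something taken from this thread or from any Debian tool.

```python
import os

# Assumption: these extensions are enough to spot serialized R data files.
R_BINARY_EXTENSIONS = (".rda", ".rdata", ".rds")


def find_r_binary_files(root):
    """Return the sorted relative paths under root that look like R binary data."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively so that "sysdata.RData" also matches.
            if name.lower().endswith(R_BINARY_EXTENSIONS):
                full_path = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full_path, root))
    return sorted(matches)
```

Calling find_r_binary_files on the top directory of an unpacked source package lists the candidate files; deciding whether each one is its own source still needs a human.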
Re: We need a global decision about R data in binary format, and stick to it.
Paul Tagliamonte writes (Re: We need a global decision about R data in binary format, and stick to it.):
> On Mon, Aug 05, 2013 at 09:57:35AM +0900, Charles Plessy wrote:
>> it is the common practice in upstream R packages to store data in binary objects. Those objects can be modified with R, and exported into various formats. The Debian archive is full of them.
>
> This is not unlike a Python pickle. However, even more to the point, with *this* package, that was a *generated data table*. These *generated* values are clearly not the preferred form for modification. I asked the uploader to point to where they came from. I don't think this is unfair.

We need to separate these two issues.

One is the file format question. It doesn't seem to me that there is anything wrong with a binary format as the preferred form for modification, in principle. For a file which is typically edited using R, including by upstream when they want to edit it, there is no problem.

The other is the assertion that this particular case involves a generated data table. If this is the case then the source package needs to contain the source code which generates the table - and, really, it should regenerate the table during the build. (The source might be in the form of another R binary object.)

(Of course there is a third issue: it is probably not the best engineering decision to use a binary save format rather than text source code. But that's not something the Debian maintainer necessarily gets to choose and it's not a reason for an ftpmaster reject.)

>> The question asked by Paul is a recurrent question that comes each time the FTP trainees rotate (basically once per release cycle, because during the Freeze the FTP trainees find other exciting tasks to do, and then do not seem to have much time to process NEW anymore).
>
> This must mean many people who care deeply about this topic see this as an issue.

I don't think this is a helpful response to someone who is raising what they see as a systematic problem.

Paul, would it be possible to update the ftpmaster assistant reference materials to discuss R's binary files?

Ian.
Re: We need a global decision about R data in binary format, and stick to it.
On Mon, Aug 05, 2013 at 02:13:15PM +0100, Ian Jackson wrote:
> We need to separate these two issues.

Aye. IMVHO, this is the same as how we should treat images (I mean, for any data format, not just this one case of a pickled object) - if the image was a photo, clearly the .jpg or .png or whatever we get is the best way to communicate this data, but if the image was generated off an .svg, it should be distributed with it (and even rebuilt at build-time).

> One is the file format question. It doesn't seem to me that there is anything wrong with a binary format as the preferred form for modification, in principle. For a file which is typically edited using R, including by upstream when they want to edit it, there is no problem.

Sure. If this data wasn't collected off some scientific instrument or lovingly hand-made, I strongly believe that we should rebuild such objects at build time, and use those in the binary packages.

> The other is the assertion that this particular case involves a generated data table. If this is the case then the source package needs to contain the source code which generates the table - and, really, it should regenerate the table during the build. (The source might be in the form of another R binary object.)

I completely agree.

> (Of course there is a third issue: it is probably not the best engineering decision to use a binary save format rather than text source code. But that's not something the Debian maintainer necessarily gets to choose and it's not a reason for an ftpmaster reject.)
>
>>> The question asked by Paul is a recurrent question that comes each time the FTP trainees rotate (basically once per release cycle, because during the Freeze the FTP trainees find other exciting tasks to do, and then do not seem to have much time to process NEW anymore).
>>
>> This must mean many people who care deeply about this topic see this as an issue.
>
> I don't think this is a helpful response to someone who is raising what they see as a systematic problem.

I'm sorry, Charles. Ian's right. That was a poor tone.

> Paul, would it be possible to update the ftpmaster assistant reference materials to discuss R's binary files?

I would be happy to document what is and isn't OK with these files. I'll have to seek a bit of consensus from the rest of the ftp-team, but I think treating them as if they were any other data format should be fine.

> Ian.

Thanks, Ian,
Paul

-- 
 .''`.  Paul Tagliamonte paul...@debian.org
: :'  : Proud Debian Developer
`. `'`  4096R / 8F04 9AD8 2C92 066C 7352 D28A 7B58 5B30 807C 2A87
 `-     http://people.debian.org/~paultag
Re: We need a global decision about R data in binary format, and stick to it.
On 5 August 2013 at 15:42, Paul Tagliamonte paul...@debian.org wrote:
> On Mon, Aug 05, 2013 at 02:13:15PM +0100, Ian Jackson wrote:
>> We need to separate these two issues.
>
> Aye. IMVHO, this is the same as how we should treat images (I mean, for any data format, not just this one case of a pickled object) - if the image was a photo, clearly the .jpg or .png or whatever we get is the best way to communicate this data, but if the image was generated off an .svg, it should be distributed with it (and even rebuilt at build-time).

Could we make an exception for images specially crafted to exercise buffer overflows? (I am thinking particularly of libpng and ImageMagick.)

>> One is the file format question. It doesn't seem to me that there is anything wrong with a binary format as the preferred form for modification, in principle. For a file which is typically edited using R, including by upstream when they want to edit it, there is no problem.
>
> Sure. If this data wasn't collected off some scientific instrument or lovingly hand-made, I strongly believe that we should rebuild such objects at build time, and use those in the binary packages.
>
>> The other is the assertion that this particular case involves a generated data table. If this is the case then the source package needs to contain the source code which generates the table - and, really, it should regenerate the table during the build. (The source might be in the form of another R binary object.)
>
> I completely agree.
>
>> (Of course there is a third issue: it is probably not the best engineering decision to use a binary save format rather than text source code. But that's not something the Debian maintainer necessarily gets to choose and it's not a reason for an ftpmaster reject.)
>>
>>>> The question asked by Paul is a recurrent question that comes each time the FTP trainees rotate (basically once per release cycle, because during the Freeze the FTP trainees find other exciting tasks to do, and then do not seem to have much time to process NEW anymore).
>>>
>>> This must mean many people who care deeply about this topic see this as an issue.
>>
>> I don't think this is a helpful response to someone who is raising what they see as a systematic problem.
>
> I'm sorry, Charles. Ian's right. That was a poor tone.
>
>> Paul, would it be possible to update the ftpmaster assistant reference materials to discuss R's binary files?
>
> I would be happy to document what is and isn't OK with these files. I'll have to seek a bit of consensus from the rest of the ftp-team, but I think treating them as if they were any other data format should be fine.
>
>> Ian.
>
> Thanks, Ian,
> Paul
>
> -- 
>  .''`.  Paul Tagliamonte paul...@debian.org
> : :'  : Proud Debian Developer
> `. `'`  4096R / 8F04 9AD8 2C92 066C 7352 D28A 7B58 5B30 807C 2A87
>  `-     http://people.debian.org/~paultag
Re: We need a global decision about R data in binary format, and stick to it.
Bastien ROUCARIES writes (Re: We need a global decision about R data in binary format, and stick to it.):
> On 5 August 2013 at 15:42, Paul Tagliamonte paul...@debian.org wrote:
>> IMVHO, this is the same as how we should treat images (I mean, for any data format, not just this one case of a pickled object) - if the image was a photo, clearly the .jpg or .png or whatever we get is the best way to communicate this data, but if the image was generated off an .svg, it should be distributed with it (and even rebuilt at build-time).
>
> Could we make an exception for images specially crafted to exercise buffer overflows? (I am thinking particularly of libpng and ImageMagick.)

I think this is something of a red herring corner case, and not really related to the question about R binary objects.

If the last thing that happened to the image file was that upstream edited it with a hex editor to introduce a buffer overflow, then the resulting binary file is the preferred form for modification (after all, that's how the last person to do so modified it...)

Ian.
Re: We need a global decision about R data in binary format, and stick to it.
On 2013-08-05, Paul Tagliamonte paul...@debian.org wrote:
> IMVHO, this is the same as how we should treat images (I mean, for any data format, not just this one case of a pickled object) - if the image was a photo, clearly the .jpg or .png or whatever we get is the best way to communicate this data, but if the image was generated off an .svg, it should be distributed with it (and even rebuilt at build-time).

Whattabout svg files that are converted into png's and then manually adjusted?

/Sune
Re: We need a global decision about R data in binary format, and stick to it.
On 2013-08-05 14:13:15 +0100 (+0100), Ian Jackson wrote:
[...]
> The other is the assertion that this particular case involves a generated data table. If this is the case then the source package needs to contain the source code which generates the table - and, really, it should regenerate the table during the build.
[...]

No argument on the first, but the second sets a bad precedent if interpreted strongly. For example I have a program which relies on a fairly large set of correlative data requiring hours of expensive computation to generate. In the source package I include the original data on which the resulting tables are based and provide a means to regenerate it on the fly at package build time, but disable it by default so that it doesn't chew up build resources unnecessarily.

Since I need to generate the correlation data for other (non-Debian) users of the software anyway, I ship the generated files in the source package too and just include them in the binary package (along with instructions and tooling for the end user to be able to build datasets they can use to override the default ones provided).

While my example is Python rather than R, I expect it's representative of situations for many scientific tools. Perhaps some guidance on when this tactic is or is not appropriate would be beneficial.

-- 
{ PGP( 48F9961143495829 ); FINGER( fu...@cthulhu.yuggoth.org ); WWW( http://fungi.yuggoth.org/ ); IRC( fu...@irc.yuggoth.org#ccl ); WHOIS( STANL3-ARIN ); MUD( kin...@katarsis.mudpy.org:6669 ); }
Re: We need a global decision about R data in binary format, and stick to it.
Jeremy Stanley writes (Re: We need a global decision about R data in binary format, and stick to it.):
> No argument on the first, but the second sets a bad precedent if interpreted strongly. For example I have a program which relies on a fairly large set of correlative data requiring hours of expensive computation to generate. In the source package I include the original data on which the resulting tables are based and provide a means to regenerate it on the fly at package build time, but disable it by default so that it doesn't chew up build resources unnecessarily.

That makes sense, and is IMO a good reason for not doing the complete from-scratch build each time.

> Since I need to generate the correlation data for other (non-Debian) users of the software anyway, I ship the generated files in the source package too and just include them in the binary package (along with instructions and tooling for the end user to be able to build datasets they can use to override the default ones provided).
>
> While my example is Python rather than R, I expect it's representative of situations for many scientific tools. Perhaps some guidance on when this tactic is or is not appropriate would be beneficial.

There should IMO be a standard way to request a source package to do from-scratch rebuilds for this kind of thing, for QA purposes.

Ian.
Re: We need a global decision about R data in binary format, and stick to it.
On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote:
> What about svg files that are converted into png's and then manually adjusted?

I'd say the source is the combination of the SVG files plus the adjusted PNGs. I guess you are thinking of a particular case here? What is the reason for manually adjusting them?

-- 
bye,
pabs

http://wiki.debian.org/PaulWise
Re: We need a global decision about R data in binary format, and stick to it.
On 2013-08-05 16:41:13 +0100 (+0100), Ian Jackson wrote:
[...]
> There should IMO be a standard way to request a source package to do from-scratch rebuilds for this kind of thing, for QA purposes.

I absolutely agree. If there were a standard make target or envvar for this purpose I would gladly implement it in my debian/rules.

-- 
{ PGP( 48F9961143495829 ); FINGER( fu...@cthulhu.yuggoth.org ); WWW( http://fungi.yuggoth.org/ ); IRC( fu...@irc.yuggoth.org#ccl ); WHOIS( STANL3-ARIN ); MUD( kin...@katarsis.mudpy.org:6669 ); }
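To make the idea of an opt-in from-scratch rebuild concrete, here is a rough Python sketch of what a build helper called from debian/rules might do. No standard exists for this in the thread, so everything specific is an assumption: the flag name "regen-data" is hypothetical and is simply looked up in the space-separated DEB_BUILD_OPTIONS variable, and provide_table and its arguments are illustrative names.

```python
import os
import shutil

# Hypothetical flag: NOT a standard DEB_BUILD_OPTIONS value, chosen for illustration.
REGEN_FLAG = "regen-data"


def wants_regeneration(environ=os.environ):
    """Return True if the (hypothetical) regeneration flag was requested."""
    options = environ.get("DEB_BUILD_OPTIONS", "").split()
    return REGEN_FLAG in options


def provide_table(prebuilt, output, regenerate, environ=os.environ):
    """Install the data table, rebuilding it from source only on request.

    prebuilt:   path of the generated file shipped in the source package
    output:     path where the build expects the table
    regenerate: callable that writes a freshly computed table to a path
    """
    if wants_regeneration(environ):
        regenerate(output)  # the expensive from-scratch computation
        return "regenerated"
    shutil.copyfile(prebuilt, output)  # cheap default: reuse the shipped file
    return "prebuilt"
```

A QA rebuild would then set the flag in the environment, while an ordinary build would leave it unset and reuse the shipped table.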
Re: We need a global decision about R data in binary format, and stick to it.
]] Ian Jackson

> Bastien ROUCARIES writes (Re: We need a global decision about R data in binary format, and stick to it.):
>> On 5 August 2013 at 15:42, Paul Tagliamonte paul...@debian.org wrote:
>>> IMVHO, this is the same as how we should treat images (I mean, for any data format, not just this one case of a pickled object) - if the image was a photo, clearly the .jpg or .png or whatever we get is the best way to communicate this data, but if the image was generated off an .svg, it should be distributed with it (and even rebuilt at build-time).
>>
>> Could we make an exception for images specially crafted to exercise buffer overflows? (I am thinking particularly of libpng and ImageMagick.)
>
> I think this is something of a red herring corner case, and not really related to the question about R binary objects.

Agreed.

> If the last thing that happened to the image file was that upstream edited it with a hex editor to introduce a buffer overflow, then the resulting binary file is the preferred form for modification (after all, that's how the last person to do so modified it...)

Or more precisely, it's no longer an image that you tend to use for, well, displaying something. It's a test for a buffer overflow that also happens to be an image.

(Saying that just because somebody last edited a file with a hex editor then that's the preferred form for modification leaves a pretty large hole. If I make a change to a blob and change a 2012 to 2013 in a copyright notice, it's obvious that the blob isn't its own source.)

-- 
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are
Re: We need a global decision about R data in binary format, and stick to it.
On Mon, 05 Aug 2013, Ian Jackson wrote:
> The other is the assertion that this particular case involves a generated data table. If this is the case then the source package needs to contain the source code which generates the table - and, really, it should regenerate the table during the build. (The source might be in the form of another R binary object.)

I know of almost no cases where someone actually generated the R binary object directly. In general, you have a data table represented as some kind of text file, and then you do operations on it, which result in an R binary object being created from a collection of text files. Subsequently, you might load the R binary object and modify it within R, but for some modifications, you might want to go back to the original data table.

It's unfortunately common practice for R upstreams to ship the binary object instead of the combination of original tables and R source necessary to generate the actual R binary save data, but this is something that should be changed, and Debian should be working to lead the charge to do this.

In almost all cases, dropping the R binary object(s) does not appreciably change the functionality of the R module; it just means that it is more difficult to use the examples because there is no example data.

-- 
Don Armstrong                      http://www.donarmstrong.com

in Just-
spring when the world is mud-
luscious the little
lame baloonman
whistles far and wee
 -- e.e. cummings [in Just-]
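The workflow described above (text table plus a processing step yielding a binary object) can be sketched in Python, using pickle as the stand-in that Paul compared R's save format to. This is an analogy under stated assumptions, not how any R package does it: the column names and the function name are invented for the example.

```python
import csv
import io
import pickle


def build_binary_table(csv_text):
    """Turn a text table into a serialized (binary) object.

    The text table is the source; the returned bytes are the generated
    artifact that a build step could recreate at any time.
    Assumption: the table has a header row with 'name' and 'value' columns.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # The "operation on the data": convert the value column to numbers.
        row["value"] = float(row["value"])
    return pickle.dumps(rows)
```

Shipping the text table and this step lets anyone regenerate the binary object after correcting a value, which is harder when only the serialized bytes are distributed.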
Re: [Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.
Hi Joerg and Paul,

thank you for your prompt answers, and thanks to everybody for their contributions.

I would like to focus my questions on R binary objects that represent data that was not entirely computer-generated (that is, for which the source code cannot be summarised by a mathematical formula and simple starting values). Note also that a lot of other software, LibreOffice for instance, can store unformatted textual data as a binary object. A binary object therefore does not mean that the content is impractical to retrieve.

My first question is: to what extent do we need to verify that the object can be regenerated?

- The starting point is a source package with an R binary object.

- With this starting point only, it may be impossible to know whether it has a source or not. Has the upstream developer typed the results by hand in an R session, for instance when collecting data from a table in a printed report? Did he collect his data in a file that is not provided in the source package? Or does he need a combination of data and scripts to regenerate the binary object? Unless the answer can be found on the Internet, one has to ask the author directly.

- If we have to ask, how long do we need to wait for the answer, and what is the conclusion in case there is no answer?

My second question is: to what extent do we need the source?

- When the R binary object is a table that has been generated by hand, my understanding is that it does not matter which format Upstream prefers, since it is trivial for anybody to export the R object into his favorite format for modification.

- When the data in the R binary object has been produced by processing another data file, how far back do we need to go? This is an important question, because at the end of the chain of rebuildability, there can be gigabytes of data.

- When the source of the binary object is not strictly necessary for making relevant modifications, can we distribute the package in Debian?

My last question is, given the answers to the previous questions, what do we do with the R packages that are already in the archive and also contain data that is editable as is but do have an original source, who will do it, and what is the timeline in case of inaction?

Also, since the case of pictures has been discussed, here is a parallel between R objects and PNG files.

1) In the PNG file's metadata, there is a field that can indicate if, for instance, it was made by Inkscape. However, in the presence of that field, one cannot conclude whether the SVG source still exists, whether it exists only on the computer of a contributor, or whether the upstream developers decided to discard it.

2) If a program displays an image in PNG format and does not use its SVG source, while one can regret that the source is not available, it does not prevent anyone from editing the PNG, or even replacing it entirely.

3) One could consider scanning the Debian archive for PNG files made with Inkscape with no corresponding SVG file in the source package. Would such packages be non-free? If yes, how long would you wait before removing the package?

While writing this answer, I also read Don's email advocating for Debian to take the lead and change the current practice in the R community, which prefers to distribute data as R binary objects in the source packages. This is laudable, but I expect that it will take time, and it needs people who have roots in both communities.

In the current situation, that I describe as active bitrotting, we do not apply the same rules to the packages that enter the archive and the packages that are already in, which causes the packages under active development to become obsolete each time new dependencies cannot enter Debian. Given the rotten tomatoes that fly in my face because I cannot update the r-cran-ggplot2 package anymore, I do not feel fit for the task of negotiating with the R community to change its traditions.

In any case, I think that we need clear guidelines that help to foresee whether an R package is acceptable in Debian or not, so that we can better decide if we undertake the work at all.

Currently, my take would be to move packages to non-free. This would also allow us to ship the PDF documentation that we currently delete.

Cheers,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan
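As an illustration of point 1) in the PNG parallel above, here is a small Python sketch that reads the textual metadata of a PNG file using only the standard library. It handles uncompressed tEXt chunks only (not zTXt or iTXt) and does not verify CRCs. That a given tool records itself under the "Software" keyword is an assumption to check against real files; the sample value used in the usage note is illustrative.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, body):
    """Build one PNG chunk: length, type, body, CRC over type and body."""
    crc = zlib.crc32(chunk_type + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


def png_text_fields(data):
    """Return the keyword/value pairs stored in the tEXt chunks of a PNG."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    fields = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # A tEXt body is: keyword, one NUL byte, then Latin-1 text.
            key, _, value = body.partition(b"\x00")
            fields[key.decode("latin-1")] = value.decode("latin-1")
        if chunk_type == b"IEND":
            break
        pos += 12 + length  # 4 (length) + 4 (type) + body + 4 (CRC)
    return fields
```

For example, a file assembled as PNG_SIGNATURE + make_chunk(b"tEXt", b"Software\x00www.inkscape.org") + make_chunk(b"IEND", b"") yields a "Software" entry. As the message points out, such a field only hints at how the image was produced; it says nothing about whether the SVG still exists.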
Re: [Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.
On Tue, 06 Aug 2013, Charles Plessy wrote:
> My first question is: to what extent do we need to verify that the object can be regenerated?
> - The starting point is a source package with an R binary object.
> - With this starting point only, it may be impossible to know if it has a source or not. [...] Unless the answer can be found on the Internet, one has to ask the author directly.
> - If we have to ask, how long do we need to wait for the answer, and what is the conclusion in case there is no answer?

We should ask if there is any question. If we get no answer, we should use our best judgment as to the likely case. Non-responsive upstreams also should cause us to question whether we should be distributing the package at all.

> My second question is: to what extent do we need the source?
> - When the R binary object is a table that has been generated by hand, my understanding is that it does not matter which format Upstream prefers, since it is trivial for anybody to export the R object into his favorite format for modification.

The original table in any form is source, then. But if there are any subsequent alterations to the table, we should distribute those subsequent alterations. In many cases, you take the original raw data, and then alter it. If the code to do that exists, we should take the original raw data, and do the alterations. [This should really be SOP for all modules in R, because to do otherwise means that it is very difficult to reproduce your alterations in the event of wrong data or new data.]

> - When the data in the R binary object has been produced by processing another data file, to what point do we need to go backwards? This is an important question, because at the end of the chain of rebuildability, there can be gigabytes of data.

This is a far more difficult case, but if this data exists and can be digitally distributed, Debian should have it and distribute it. Perhaps not in the source package, but almost certainly in a data package somewhere.

[And honestly, there are very few interesting R packages which we can actually distribute where this is really the case. I can't think of any we currently distribute, and the main ones I can think of involve databases of sequences for microarrays, and there you actually want the complete data anyway.]

> - When the source of the binary object is not strictly necessary for making relevant modifications, can we distribute the package in Debian?

If the source isn't strictly necessary, we should remove the binary object, and distribute the package.

> My last question is, given the answers to the previous questions, what do we do with the R packages that are already in the archive and also contain data that is editable as is but do have an original source, who will do it, and what is the timeline in case of inaction?

The package maintainer should handle it; in the case of inaction from upstream, the package maintainer can then either remove the data, split the package, move the package to non-free, or remove the package from Debian entirely. The timeline should be the standard one that is used for all RC bugs.

> In the current situation, which I describe as active bitrotting, we do not apply the same rules to the packages that enter the archive and the packages that are already in, which causes the packages under active development to become obsolete each time new dependencies cannot enter Debian.

We actually do and should apply the same rules. Sometimes violations of the rules are missed for a while, though, and we have to come back and file bugs with severity serious to deal with the problem.

> Currently, my take would be to move packages to non-free. This would also allow us to ship the PDF documentation that we currently delete.

In these cases, we should split the package out into a non-free component and a free component. I should note that I'm currently distributing via debian-r.debian.net a few hundred packages which probably have this particular problem too.
-- 
Don Armstrong                      http://www.donarmstrong.com

in Just-
spring when the world is mud-
luscious the little
lame baloonman
whistles far and wee
 -- e.e. cummings [in Just-]
We need a global decision about R data in binary format, and stick to it.
On Mon, Aug 05, 2013 at 12:37:17AM +, Paul Richards Tagliamonte wrote:
> Hi maintainer,
>
> sysdata.rda appears to be in your source, which is a dataset compressed into pickled R objects. Can you assure me of one of two things:
>
> 1. that this data is *not* used anywhere in the binary packages (and is not shipped) and *can* be rebuilt from *just* the contents of the package and that it is *not* shipped.
> 2. that you rebuild this at build-time, and that is included.
>
> I see two sysdata files that are getting installed. If these are coming from this binary file, please respond asking for a REJECT and re-upload this package fixing the situation

Dear Paul and everybody,

it is the common practice in upstream R packages to store data in binary objects. Those objects can be modified with R, and exported into various formats. The Debian archive is full of them.

The question asked by Paul is a recurrent question that comes up each time the FTP trainees rotate (basically once per release cycle, because during the Freeze the FTP trainees find other exciting tasks to do, and then do not seem to have much time to process NEW anymore).

The problem is that there is too strong a mismatch between the R modules currently in the Debian archive and the criteria for introducing new packages. As a consequence, the work on packages that are actively developed stops, and Debian slowly retains only the packages that nobody uses anymore, and that therefore do not pick up extra dependencies that have to go through the NEW queue. This is active bitrotting at its worst.

I would like to have a global decision about R packages in Debian, not only about the new ones, and then document this decision and stick to it. But I warn that it may have the consequence of moving most of them to non-free, even though the data in binary format is freely modifiable or exportable to text format with R, which is Free software that we distribute.
Have a nice day,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan
Re: We need a global decision about R data in binary format, and stick to it.
On Mon, Aug 05, 2013 at 09:57:35AM +0900, Charles Plessy wrote:
> Dear Paul and everybody,
>
> it is the common practice in upstream R packages to store data in binary objects. Those objects can be modified with R, and exported into various formats. The Debian archive is full of them.

This is not unlike a Python pickle. However, even more to the point, with *this* package, that was a *generated data table*. These *generated* values are clearly not the preferred form of modification. I asked the uploader to point to where they came from. I don't think this is unfair. Surely you can see this.

> The question asked by Paul is a recurrent question that comes up each time the FTP trainees rotate (basically once per release cycle, because during the Freeze the FTP trainees find other exciting tasks to do, and then do not seem to have much time to process NEW anymore).

This must mean many people who care deeply about this topic see this as an issue.

Cheers,
 Paul

-- 
 .''`.  Paul Tagliamonte paul...@debian.org
: :'  : Proud Debian Developer
`. `'`  4096R / 8F04 9AD8 2C92 066C 7352 D28A 7B58 5B30 807C 2A87
 `-     http://people.debian.org/~paultag