Re: We need a global decision about R data in binary format, and stick to it.

2013-08-16 Thread Darren Salt
I demand that Lisandro Damián Nicanor Pérez Meyer may or may not have
written...

 On Monday 05 August 2013 17:48:27 Paul Wise wrote:
 On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote:
 What about svg files that are converted into png's and then manually
 adjusted?
 I'd say the source is the combination of the SVG files plus the adjusted
 PNGs.

 I guess you are thinking of a particular case here? What is the reason
 for manually adjusting them?

 Maybe optipng-ed images? Anyway, AFAIR, last time I tried that I did all in
 the rules file, so nothing to be afraid here.

I'd say that the only possible issues here are in losing or re-ordering
palette entries and in throwing away colour data for transparent pixels,
should you choose that option. In many cases, though, neither optimisation
would actually cause problems.

-- 
|  _  | Darren Salt, using Debian GNU/Linux (and Android)
| ( ) |
|  X  | ASCII Ribbon campaign against HTML e-mail
| / \ | http://www.asciiribbon.org/

To see a need and wait to be asked, is to already refuse.


--
To UNSUBSCRIBE, email to debian-devel-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
Archive: http://lists.debian.org/5381bb31c0%lists...@moreofthesa.me.uk



Re: We need a global decision about R data in binary format, and stick to it.

2013-08-13 Thread Lisandro Damián Nicanor Pérez Meyer
On Monday 05 August 2013 17:48:27 Paul Wise wrote:
 On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote:
  What about svg files that are converted into png's and then manually
  adjusted?
 
 I'd say the source is the combination of the SVG files plus the adjusted
 PNGs.
 
 I guess you are thinking of a particular case here? What is the reason
 for manually adjusting them?

Maybe optipng-ed images? Anyway, AFAIR, last time I tried that I did all in 
the rules file, so nothing to be afraid here.

-- 
Passwords are like underwear. You shouldn’t leave them out where people can
see them. You should change them regularly. And you shouldn’t loan them out
to strangers.
  Anonymous

Lisandro Damián Nicanor Pérez Meyer
http://perezmeyer.com.ar/
http://perezmeyer.blogspot.com/




Re: We need a global decision about R data in binary format, and stick to it.

2013-08-06 Thread Thorsten Glaser
Ian Jackson ijackson at chiark.greenend.org.uk writes:

 Jeremy Stanley writes (Re: We need a global decision about R data in
binary format, and stick to it.):

  interpreted strongly. For example I have a program which relies on a
  fairly large set of correlative data requiring hours of expensive
  computation to generate. In the source package I include the
  original data on which the resulting tables are based and provide a
  means to regenerate it on the fly at package build time, but disable
  it by default so that it doesn't chew up build resources

 That makes sense, and is IMO a good reason for not doing the complete
 from-scratch build each time.

I believe that even these should be built from source “each time”… once.
Then put them into an arch:all subpackage, no load on the buildds.

The package maintainer can then choose to do local test builds using
the pregenerated binaries, but for the upload to the archive, they
should/must recompile that data.

bye,
//mirabilos





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-06 Thread Andreas Tille
On Mon, Aug 05, 2013 at 05:44:16PM -0700, Don Armstrong wrote:
  
  My last question is, given the answers to the previous questions, what
  do we do with the R packages that are already in the archive and also
  contain data that is editable as is but do have an original source,
  who will do it, and what is the timeline in case of inaction.
 
 The package maintainer should handle it; in the case of inaction from
 upstream, the package maintainer can then either remove the data, split
 the package, move the package to non-free, or remove the package from
 Debian entirely. The timeline should be the standard one that is used
 for all RC bugs.
 
  In the current situation, that I describe as active bitrotting, we
  do not apply the same rules to the packages that enter the archive and
  the packages that are already in, which cause the packages under
  active development to become obsolete each time new dependencies can
  not enter in Debian.
 
 We actually do and should apply the same rules. Sometimes violations of
 the rules are missed for a while, though, and we have to come back and
 file bugs with severity serious to deal with the problem.

Just to provide the number and names of source packages in main/unstable
that are containing at least one *.rda file:

$ echo deb http://http.debian.net/debian/ unstable main > sources.list.main_sources_unstable
$ sudo apt-file --architecture source --sources-list sources.list.main_sources_unstable update
$ apt-file --architecture source --sources-list sources.list.main_sources_unstable search .rda | sed 's/: .*$//' | uniq
boot
car
cluster
dichromat
effects
erm
fportfolio
gdata
gplots
gtools
hmisc
lattice
latticeextra
lme4
lmtest
mcmcpack
mgcv
misc3d
multcomp
nlme
permute
r-base
r-bioc-biobase
r-cran-bayesm
r-cran-boolnet
r-cran-coda
r-cran-colorspace
r-cran-deal
r-cran-diagnosismed
r-cran-epi
r-cran-epicalc
r-cran-epitools
r-cran-evd
r-cran-genetics
r-cran-ggplot2
r-cran-gss
r-cran-mapdata
r-cran-maps
r-cran-mass
r-cran-plotrix
r-cran-plyr
r-cran-pscl
r-cran-psy
r-cran-randomforest
r-cran-reshape
r-cran-reshape2
r-cran-rocr
r-cran-sn
r-cran-sp
r-cran-teachingdemos
r-cran-timeseries
r-cran-vcd
r-cran-vegan
r-cran-vgam
r-other-bio3d
r-zoo
raschsampler
rggobi
rmatrix
robustbase
rpart
sandwich
sm
strucchange
survival
tseries
urca


These are 67 packages.  I have no numbers whether the *.rda files were
added in a later version than the one accepted by ftpmaster.
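
The count can be double-checked with a small filter. This is only a sketch: it assumes the "package: path" output format that the sed expression above relies on, and the sample lines and file paths below are made up for illustration. It uses sort -u instead of a bare uniq, so duplicates are dropped even when they are not adjacent.

```shell
#!/bin/sh
# Count distinct package names in "package: path" lines read from stdin.
# sort -u removes duplicates even when they are not adjacent, which a
# bare uniq would miss.
count_packages() {
    sed 's/: .*$//' | sort -u | wc -l | tr -d ' '
}

# Illustrative input only (hypothetical paths); real input would come
# from the apt-file search command.
printf '%s\n' \
    'boot: boot/data/acme.rda' \
    'car: car/data/Adler.rda' \
    'boot: boot/data/aids.rda' | count_packages
```

With the three sample lines this prints 2; piping the real apt-file search output into count_packages would give the package count directly.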
 
  Currently, my take would be to move packages to non-free. This would
  also allow us to ship the PDF documentation that we currently delete.
 
 In these cases, we should split the package out into a non-free
 component and a free component.

I personally would agree with the ftpmaster policy used in the past to
accept these packages in main.
 
 I should note that I'm currently distributing via debian-r.debian.net a
 few hundred packages which probably have this particular problem too.

... in case you really regard this as problem.

Kind regards

Andreas. 

-- 
http://fam-tille.de





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Ian Jackson
Paul Tagliamonte writes (Re: We need a global decision about R data in binary 
format, and stick to it.):
 On Mon, Aug 05, 2013 at 09:57:35AM +0900, Charles Plessy wrote:
  it is the common practice in upstream R packages to store data in binary
  objects.  Those objects can be modified with R, and exported into various
  formats.  The Debian archive is full of them.
 
 This is not unlike a Python pickle.
 
 However, even more to the point, with *this* package, that was a
 *generated data table*. These *generated* values are clearly not preferred
 form of modification. I asked the uploader to point to where they came
 from. I don't think this is unfair.

We need to separate these two issues.

One is the file format question.  It doesn't seem to me that there is
anything wrong with a binary format as the preferred form for
modification, in principle.  For a file which is typically edited
using R, including by upstream when they want to edit it, then there
is no problem.

The other is the assertion that this particular case involves a
generated data table.  If this is the case then the source package
needs to contain the source code which generates the table - and,
really, it should regenerate the table during the build.  (The source
might be in the form of another R binary object.)

(Of course there is a third issue: it is probably not the best
engineering decision to use a binary save format rather than text
source code.  But that's not something the Debian maintainer
necessarily gets to choose and it's not a reason for an ftpmaster
reject.)

  The question asked by Paul is a recurrent question that comes each
  time the FTP trainees rotate (basically once per release cycle,
  because during the Freeze the FTP trainees find other exciting
  tasks to do, and then do not seem to have much time to process NEW
  anymore).
 
 This must mean many people who care deeply about this topic see this as an
 issue.

I don't think this is a helpful response to someone who is raising
what they see as a systematic problem.

Paul, would it be possible to update the ftpmaster assistant reference
materials to discuss R's binary files ?

Ian.





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Paul Tagliamonte
On Mon, Aug 05, 2013 at 02:13:15PM +0100, Ian Jackson wrote:
 We need to separate these two issues.

Aye.

IMVHO, this is the same as how we should treat images (I mean, for any
data format, not just this one case of a pickled object) - if the image
was a photo, clearly the .jpg or .png or whatever we get is the best way
to communicate this data, but if the image was generated off an .svg,
it should be distributed with it (and even rebuilt at build-time).

 One is the file format question.  It doesn't seem to me that there is
 anything wrong with a binary format as the preferred form for
 modification, in principle.  For a file which is typically edited
 using R, including by upstream when they want to edit it, then there
 is no problem.

Sure. If this data wasn't collected off some scientific
instrument or lovingly hand-made, I strongly believe that we should
rebuild such objects at build time, and use those in the binary
packages.

 The other is the assertion that this particular case involves a
 generated data table.  If this is the case then the source package
 needs to contain the source code which generates the table - and,
 really, it should regenerate the table during the build.  (The source
 might be in the form of another R binary object.)

I completely agree.

 (Of course there is a third issue: it is probably not the best
 engineering decision to use a binary save format rather than text
 source code.  But that's not something the Debian maintainer
 necessarily gets to choose and it's not a reason for an ftpmaster
 reject.)
 
   The question asked by Paul is a recurrent question that comes each
   time the FTP trainees rotate (basically once per release cycle,
   because during the Freeze the FTP trainees find other exciting
   tasks to do, and then do not seem to have much time to process NEW
   anymore).
  
  This must mean many people who care deeply about this topic see this as an
  issue.
 
 I don't think this is a helpful response to someone who is raising
 what they see as a systematic problem.

I'm sorry, Charles. Ian's right. That was a poor tone.

 
 Paul, would it be possible to update the ftpmaster assistant reference
 materials to discuss R's binary files ?

I would be happy to document what is and isn't OK with these files. I'll
have to seek a bit of consensus from the rest of the ftp-team, but I
think treating them as if they were any other data format should be
fine.

 
 Ian.

Thanks, Ian,
  Paul




-- 
 .''`.  Paul Tagliamonte paul...@debian.org
: :'  : Proud Debian Developer
`. `'`  4096R / 8F04 9AD8 2C92 066C 7352  D28A 7B58 5B30 807C 2A87
 `- http://people.debian.org/~paultag




Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Bastien ROUCARIES
Le 5 août 2013 15:42, Paul Tagliamonte paul...@debian.org a écrit :

 On Mon, Aug 05, 2013 at 02:13:15PM +0100, Ian Jackson wrote:
  We need to separate these two issues.

 Aye.

 IMVHO, this is the same as how we should treat images (I mean, for any
 data format, not just this one case of a pickled object) - if the image
 was a photo, clearly the .jpg or .png or whatever we get is the best way
 to communicate this data, but if the image was generated off an .svg,
 it should be distributed with it (and even rebuilt at build-time).

Could we make an exception for specially crafted images meant to exercise
buffer overflows? (I am thinking particularly of libpng and ImageMagick.)

  One is the file format question.  It doesn't seem to me that there is
  anything wrong with a binary format as the preferred form for
  modification, in principle.  For a file which is typically edited
  using R, including by upstream when they want to edit it, then there
  is no problem.

 Sure. If this data wasn't collected off some scientific
 instrument or lovingly hand-made, I strongly believe that we should
 rebuild such objects at build time, and use those in the binary
 packages.

  The other is the assertion that this particular case involves a
  generated data table.  If this is the case then the source package
  needs to contain the source code which generates the table - and,
  really, it should regenerate the table during the build.  (The source
  might be in the form of another R binary object.)

 I completely agree.

  (Of course there is a third issue: it is probably not the best
  engineering decision to use a binary save format rather than text
  source code.  But that's not something the Debian maintainer
  necessarily gets to choose and it's not a reason for an ftpmaster
  reject.)
 
    The question asked by Paul is a recurrent question that comes each
    time the FTP trainees rotate (basically once per release cycle,
    because during the Freeze the FTP trainees find other exciting
    tasks to do, and then do not seem to have much time to process NEW
    anymore).
  
   This must mean many people who care deeply about this topic see this as an
   issue.
 
  I don't think this is a helpful response to someone who is raising
  what they see as a systematic problem.

 I'm sorry, Charles. Ian's right. That was a poor tone.

 
  Paul, would it be possible to update the ftpmaster assistant reference
  materials to discuss R's binary files ?

 I would be happy to document what is and isn't OK with these files. I'll
 have to seek a bit of consensus from the rest of the ftp-team, but I
 think treating them as if they were any other data format should be
 fine.

 
  Ian.

 Thanks, Ian,
   Paul




 --
  .''`.  Paul Tagliamonte paul...@debian.org
 : :'  : Proud Debian Developer
 `. `'`  4096R / 8F04 9AD8 2C92 066C 7352  D28A 7B58 5B30 807C 2A87
  `- http://people.debian.org/~paultag


Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Ian Jackson
Bastien ROUCARIES writes (Re: We need a global decision about R data in binary 
format, and stick to it.):
 Le 5 août 2013 15:42, Paul Tagliamonte paul...@debian.org a écrit :
  IMVHO, this is the same as how we should treat images (I mean, for any
  data format, not just this one case of a pickled object) - if the image
  was a photo, clearly the .jpg or .png or whatever we get is the best way
  to communicate this data, but if the image was generated off an .svg,
  it should be distributed with it (and even rebuilt at build-time).
 
 Could we make an exception for specially crafted images meant to exercise
 buffer overflows? (I am thinking particularly of libpng and ImageMagick.)

I think this is something of a red herring corner case, and not really
related to the question about R binary objects.

If the last thing that happened to the image file was that upstream
edited it with a hex editor to introduce a buffer overflow, then the
resulting binary file is the preferred form for modification (after
all, that's how the last person to do so modified it...)

Ian.





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Sune Vuorela
On 2013-08-05, Paul Tagliamonte paul...@debian.org wrote:
 IMVHO, this is the same as how we should treat images (I mean, for any
 data format, not just this one case of a pickled object) - if the image
 was a photo, clearly the .jpg or .png or whatever we get is the best way
 to communicate this data, but if the image was generated off an .svg,
 it should be distributed with it (and even rebuilt at build-time).

What about svg files that are converted into png's and then manually
adjusted?

/Sune





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Jeremy Stanley
On 2013-08-05 14:13:15 +0100 (+0100), Ian Jackson wrote:
[...]
 The other is the assertion that this particular case involves a
 generated data table. If this is the case then the source package
 needs to contain the source code which generates the table - and,
 really, it should regenerate the table during the build.
[...]

No argument on the first, but the second sets a bad precedent if
interpreted strongly. For example I have a program which relies on a
fairly large set of correlative data requiring hours of expensive
computation to generate. In the source package I include the
original data on which the resulting tables are based and provide a
means to regenerate it on the fly at package build time, but disable
it by default so that it doesn't chew up build resources
unnecessarily.

Since I need to generate the correlation data for other (non-Debian)
users of the software anyway, I ship the generated files in the
source package too and just include them in the binary package
(along with instructions and tooling for the end user to be able to
build datasets they can use to override the default ones provided).
While my example is Python rather than R, I expect it's
representative of situations for many scientific tools. Perhaps some
guidance on when this tactic is or is not appropriate would be
beneficial.
-- 
{ PGP( 48F9961143495829 ); FINGER( fu...@cthulhu.yuggoth.org );
WWW( http://fungi.yuggoth.org/ ); IRC( fu...@irc.yuggoth.org#ccl );
WHOIS( STANL3-ARIN ); MUD( kin...@katarsis.mudpy.org:6669 ); }





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Ian Jackson
Jeremy Stanley writes (Re: We need a global decision about R data in binary 
format, and stick to it.):
 No argument on the first, but the second sets a bad precedent if
 interpreted strongly. For example I have a program which relies on a
 fairly large set of correlative data requiring hours of expensive
 computation to generate. In the source package I include the
 original data on which the resulting tables are based and provide a
 means to regenerate it on the fly at package build time, but disable
 it by default so that it doesn't chew up build resources
 unnecessarily.

That makes sense, and is IMO a good reason for not doing the complete
from-scratch build each time.

 Since I need to generate the correlation data for other (non-Debian)
 users of the software anyway, I ship the generated files in the
 source package too and just include them in the binary package
 (along with instructions and tooling for the end user to be able to
 build datasets they can use to override the default ones provided).
 While my example is Python rather than R, I expect it's
 representative of situations for many scientific tools. Perhaps some
 guidance on when this tactic is or is not appropriate would be
 beneficial.

There should IMO be a standard way to request a source package to do
from-scratch rebuilds for this kind of thing, for QA purposes.

Ian.





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Paul Wise
On Mon, Aug 5, 2013 at 4:28 PM, Sune Vuorela wrote:

 What about svg files that are converted into png's and then manually
 adjusted?

I'd say the source is the combination of the SVG files plus the adjusted PNGs.

I guess you are thinking of a particular case here? What is the reason
for manually adjusting them?

-- 
bye,
pabs

http://wiki.debian.org/PaulWise





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Jeremy Stanley
On 2013-08-05 16:41:13 +0100 (+0100), Ian Jackson wrote:
[...]
 There should IMO be a standard way to request a source package to do
 from-scratch rebuilds for this kind of thing, for QA purposes.

I absolutely agree. If there were a standard make target or envvar
for this purpose I would gladly implement it in my debian/rules.
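
As a rough illustration of what such a switch could look like: DEB_BUILD_OPTIONS is the existing space-separated variable described in Debian Policy, but the "regen-data" token below is purely hypothetical, since no such standard exists (which is the point of this subthread). A helper script called from debian/rules might check it like this:

```shell
#!/bin/sh
# Succeeds if the hypothetical "regen-data" token is present in the
# space-separated DEB_BUILD_OPTIONS variable. The token name is an
# assumption made for this sketch, not an existing convention.
want_regen() {
    case " ${DEB_BUILD_OPTIONS:-} " in
        *" regen-data "*) return 0 ;;
        *) return 1 ;;
    esac
}

if want_regen; then
    echo "regenerating data tables from the original data"
else
    echo "using the pregenerated tables shipped in the source package"
fi
```

Padding the variable with spaces on both sides keeps the match exact, so a token such as "noregen-data" would not trigger a rebuild by accident.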
-- 
{ PGP( 48F9961143495829 ); FINGER( fu...@cthulhu.yuggoth.org );
WWW( http://fungi.yuggoth.org/ ); IRC( fu...@irc.yuggoth.org#ccl );
WHOIS( STANL3-ARIN ); MUD( kin...@katarsis.mudpy.org:6669 ); }





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Tollef Fog Heen
]] Ian Jackson 

 Bastien ROUCARIES writes (Re: We need a global decision about R data in 
 binary format, and stick to it.):
  Le 5 août 2013 15:42, Paul Tagliamonte paul...@debian.org a écrit :
   IMVHO, this is the same as how we should treat images (I mean, for any
   data format, not just this one case of a pickled object) - if the image
   was a photo, clearly the .jpg or .png or whatever we get is the best way
   to communicate this data, but if the image was generated off an .svg,
   it should be distributed with it (and even rebuilt at build-time).
  
  Could we make an exception for specially crafted images meant to exercise
  buffer overflows? (I am thinking particularly of libpng and ImageMagick.)
 
 I think this is something of a red herring corner case, and not really
 related to the question about R binary objects.

Agreed.

 If the last thing that happened to the image file was that upstream
 edited it with a hex editor to introduce a buffer overflow, then the
 resulting binary file is the preferred form for modification (after
 all, that's how the last person to do so modified it...)

Or more precisely, it's no longer an image that you tend to use for,
well, displaying something.  It's a test for a buffer overflow that also
happens to be an image.  (Saying that just because somebody last edited
a file with a hex editor then that's the preferred form for modification
leaves a pretty large hole.  If I make a change to a blob and change a
2012 to 2013 in a copyright notice, it's obvious that the blob isn't its
own source.)

-- 
Tollef Fog Heen
UNIX is user friendly, it's just picky about who its friends are





Re: We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Don Armstrong
On Mon, 05 Aug 2013, Ian Jackson wrote:
 The other is the assertion that this particular case involves a
 generated data table. If this is the case then the source package
 needs to contain the source code which generates the table - and,
 really, it should regenerate the table during the build. (The source
 might be in the form of another R binary object.)

I know of almost no cases where someone actually generated the R binary
object directly.

In general, you have a data table represented as some kind of text file,
and then you do operations on it, which result in a R binary object
being created from a collection of text files. Subsequently, you might
load the R binary object and modify it within R, but for some
modifications, you might want to go back to the original data table.

It's unfortunately common practice for R upstreams to ship the binary
object instead of the combination of original tables and R source
necessary to generate the actual R binary save data, but this is
something that should be changed, and Debian should be working to lead
the charge to do this.

In almost all cases, dropping the R binary object(s) does not appreciably
change the functionality of the R module; it just means that it is more
difficult to use the examples because there is no example data.

-- 
Don Armstrong  http://www.donarmstrong.com

in Just-
spring  when the world is mud-
luscious the little lame baloonman 

whistles   far   and wee 
 -- e.e. cummings [in Just-]





Re: [Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Charles Plessy
Hi Joerg and Paul,

thank you for your prompt answers, and thanks for everybody's contributions.

I would like to focus my questions on R binary objects that represent data that
was not entirely computer-generated (that is, for which the source code cannot
be summarised by a mathematical formula and simple starting values).  Note also
that many other programs, LibreOffice for instance, can store unformatted
textual data as a binary object.  Therefore, a binary object does not mean
that the content is impractical to retrieve.

My first question is: to what extent do we need to verify that the object can
be regenerated?

  - The starting point is a source package with an R binary object.
  - With this starting point only, it may be impossible to know if it has a
    source or not.  Has the upstream developer typed the results by hand
    in an R session, for instance when collecting data from a table in a
    printed report?  Did he collect his data in a file that is not provided
    in the source package?  Or does he need a combination of data and scripts
    to regenerate the binary object?  Unless the answer can be found on the
    Internet, one has to ask the author directly.
  - If we have to ask, how long do we need to wait for the answer, and what
    is the conclusion in case there is no answer?

My second question is: to what extent do we need the source?

  - When the R binary object is a table that has been generated by hand,
    my understanding is that it does not matter which format Upstream
    prefers, since it is trivial for anybody to export the R object into
    his favorite format for modification.
  - When the data in the R binary object has been produced by processing
    another data file, how far back do we need to go?  This
    is an important question, because at the end of the chain of
    rebuildability, there can be gigabytes of data.
  - When the source of the binary object is not strictly necessary for
    making relevant modifications, can we distribute the package in Debian?

My last question is, given the answers to the previous questions: what do we do
with the R packages that are already in the archive and also contain data that
is editable as is but does have an original source?  Who will do it, and what
is the timeline in case of inaction?

Also, since the case of pictures has been discussed, here is a parallel
between R objects and PNG files.

1) In the PNG file's metadata, there is a field that can indicate if, for
instance, it was made by Inkscape.  However, the presence of that field does
not tell us whether the SVG source still exists, or whether it exists on the
computer of a contributor but the upstream developers decided to discard it.

2) If a program displays an image in PNG format and does not use its SVG
source, while one can regret that the source is not available, it does not
prevent from editing the PNG, or even replacing it entirely.

3) One could consider to scan the Debian archive for PNG files made with
Inkscape with no corresponding SVG file in the source package.  Would such
packages be non-Free ?  If yes, how long would you wait before removing the
package ?
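
To give an idea of the scale of point 3, such a scan could be approximated per source tree. The sketch below is only a naming heuristic under my own assumptions: it reports PNG files that have no SVG with the same base name next to them, and it does not read the PNG metadata, so it cannot tell which images were actually made with Inkscape.

```shell
#!/bin/sh
# Print every .png under the given directory that has no .svg with the
# same base name next to it. This is a naming heuristic only: it does
# not inspect PNG metadata, so it cannot tell which images came from
# Inkscape.
pngs_without_svg() {
    find "$1" -type f -name '*.png' | while IFS= read -r png; do
        [ -f "${png%.png}.svg" ] || printf '%s\n' "$png"
    done
}

# Scan the directory given as first argument, or the current one.
pngs_without_svg "${1:-.}"
```

The same idea would carry over to R binary objects, for instance by listing .rda files that have no script or text table beside them, with the same caveat that a missing neighbour file does not prove that a source ever existed.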

While writing this answer, I also read Don's email advocating for Debian to
take the lead and change the current practice in the R community, which prefers
to distribute data as R binary objects in the source packages.  This is
laudable, but I expect that it will take time, and it needs people who have
roots in both communities.

In the current situation, which I describe as active bitrotting, we do not
apply the same rules to the packages that enter the archive and the packages
that are already in it, which causes the packages under active development to
become obsolete each time new dependencies cannot enter Debian.  Given the
rotten tomatoes that fly in my face because I can no longer update the
r-cran-ggplot2 package, I do not feel fit for the task of negotiating with the
R community to change its traditions.

In any case, I think that we need clear guidelines that help to foresee whether
an R package is acceptable or not in Debian, so that we can better decide
whether we undertake the work at all.

Currently, my take would be to move packages to non-free.  This would also
allow us to ship the PDF documentation that we currently delete.

Cheers,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan





Re: [Debian-med-packaging] We need a global decision about R data in binary format, and stick to it.

2013-08-05 Thread Don Armstrong
On Tue, 06 Aug 2013, Charles Plessy wrote:
 My first question is: to what extent do we need to verify that the
 object can be regenerated.
 
   - The starting point is a source package with a R binary object.
   - With this starting point only, it may be impossible to know if it
   has a source or not. 
[...]
   Unless the answer can be found on the Internet, one has to ask the
   author directly.
   - If we have to ask, how long do we need to wait for the answer, and
   what is the conclusion in case there is no answer.

We should ask if there is any question. If we get no answer, we should
use our best judgment as to the likely case. Non-responsive upstreams
also should cause us to question whether we should be distributing the
package at all.
 
 My second question is: to what extent do we need the source.
 
   - When the R binary object is a table that has been generated by
   hand, my understanding is that it does not matter which format
   Upstream prefers, since it is trivial for anybody to export the R
   object into his favorite format for modification.

The original table in any form is source, then. But if there are any
subsequent alterations to the table, we should distribute those
subsequent alterations. In many cases, you take the original raw data,
and then alter it. If the code to do that exists, we should take the
original raw data, and do the alterations. [This should really be SOP
for all modules in R, because to do otherwise means that it is very
difficult to reproduce your alterations in the event of wrong data or
new data.]

   - When the data in the R binary object has been produced by
   processing another data file, to what point do we need to go
   backwards? This is an important question, because at the end of the
   chain of rebuildability, there can be gigabytes of data.

This is a far more difficult case, but if this data exists and can be
digitally distributed Debian should have it and distribute it. Perhaps
not in the source package, but almost certainly in a data package
somewhere. [And honestly, there are very few interesting R packages
which we can actually distribute where this is really the case. I can't
think of any we currently distribute, and the main ones I can think of
involve databases of sequences for microarrays, and there you actually
want the complete data anyway.]

   - When the source of the binary object is not strictly necessary for
   making relevant modifications, can we distribute the package in
   Debian?

If the source isn't strictly necessary, we should remove the binary
object, and distribute the package.
 
 My last question is, given the answers to the previous questions, what
 do we do with the R packages that are already in the archive and also
 contain data that is editable as is but do have an original source,
 who will do it, and what is the timeline in case of inaction.

The package maintainer should handle it; in the case of inaction from
upstream, the package maintainer can then either remove the data, split
the package, move the package to non-free, or remove the package from
Debian entirely. The timeline should be the standard one that is used
for all RC bugs.

 In the current situation, which I describe as active bitrotting, we
 do not apply the same rules to the packages that enter the archive and
 the packages that are already in, which causes the packages under
 active development to become obsolete each time new dependencies
 cannot enter Debian.

We actually do and should apply the same rules. Sometimes violations of
the rules are missed for a while, though, and we have to come back and
file bugs with severity serious to deal with the problem.

 Currently, my take would be to move packages to non-free. This would
 also allow us to ship the PDF documentation that we currently delete.

In these cases, we should split the package out into a non-free
component and a free component.

I should note that I'm currently distributing via debian-r.debian.net a
few hundred packages which probably have this particular problem too.

-- 
Don Armstrong  http://www.donarmstrong.com

in Just-
spring  when the world is mud-
luscious the little lame balloonman 

whistles   far   and wee 
 -- e.e. cummings [in Just-]


Archive: http://lists.debian.org/20130806004416.gf14...@rzlab.ucr.edu



We need a global decision about R data in binary format, and stick to it.

2013-08-04 Thread Charles Plessy
On Mon, Aug 05, 2013 at 12:37:17AM +, Paul Richards Tagliamonte wrote:
 Hi maintainer,
 
 sysdata.rda appears to be in your source, which is a dataset compressed
 into pickled R objects.
 
 Can you assure me of one of two things:
 
   1. that this data is *not* used anywhere in the binary packages
  (and is not shipped) and *can* be rebuilt from *just* the contents
  of the package and that it is *not* shipped.
 
   2. that you rebuild this at build-time, and that is included.
 
 I see two sysdata files that are getting installed.
 
 If these are coming from this binary file, please respond asking
 for a REJECT and re-upload this package fixing the situation

Dear Paul and everybody,

it is the common practice in upstream R packages to store data in binary
objects.  Those objects can be modified with R, and exported into various
formats.  The Debian archive is full of them.

The question asked by Paul is a recurrent question that comes each time the FTP
trainees rotate (basically once per release cycle, because during the Freeze
the FTP trainees find other exciting tasks to do, and then do not seem to have
much time to process NEW anymore).

The problem is that there is too strong a mismatch between the R modules
currently in the Debian archive and the criteria for introducing new packages.
As a consequence, the work on packages that are actively developed stops, and
Debian slowly retains only the packages that nobody uses anymore, and that
therefore do not pick up extra dependencies that have to go through the NEW
queue.  This is active bitrotting at its worst.

I would like to have a global decision about R packages in Debian, not only
about the new ones, and then document this decision and stick to it.  But I
warn that it may have the consequence of moving most of them to non-free,
even though the data in binary format is freely modifiable or exportable to
text format with R, which is Free software that we distribute.

Have a nice day,

-- 
Charles Plessy
Debian Med packaging team,
http://www.debian.org/devel/debian-med
Tsurumi, Kanagawa, Japan


Archive: http://lists.debian.org/20130805005735.ge22...@falafel.plessy.net



Re: We need a global decision about R data in binary format, and stick to it.

2013-08-04 Thread Paul Tagliamonte
On Mon, Aug 05, 2013 at 09:57:35AM +0900, Charles Plessy wrote:
 Dear Paul and everybody,
 
 it is the common practice in upstream R packages to store data in binary
 objects.  Those objects can be modified with R, and exported into various
 formats.  The Debian archive is full of them.

This is not unlike a Python pickle.

However, even more to the point, with *this* package, that was a
*generated data table*. These *generated* values are clearly not the
preferred form of modification. I asked the uploader to point to where they
came from. I don't think this is unfair.

Surely you can see this.

 The question asked by Paul is a recurrent question that comes each time the
 FTP trainees rotate (basically once per release cycle, because during the
 Freeze the FTP trainees find other exciting tasks to do, and then do not
 seem to have much time to process NEW anymore).

This must mean many people who care deeply about this topic see this as an
issue.

Cheers,
  Paul

-- 
 .''`.  Paul Tagliamonte paul...@debian.org
: :'  : Proud Debian Developer
`. `'`  4096R / 8F04 9AD8 2C92 066C 7352  D28A 7B58 5B30 807C 2A87
 `- http://people.debian.org/~paultag

