Hello Marc,

On 21/11/15 12:53, Marc Le Bihan wrote:
> Its really boring to debate about the source of each test file,
> especially when they come from public organizations or gouvernments
> and are downloadable freely by everyone, but I understand that you
> have legacy quarrels that forces you to take care of everything, even
> if it causes your project no to benefit from as much informations
> sources than others. I want to see the day where someone will attack
> any apache project because it has a file inside his test resources...
> Really for what checkings are you loosing time ?!

Freely downloadable does not mean compatible with the Apache license. For example, the EPSG database is freely downloadable, but we have not yet received permission to include it in SIS (https://issues.apache.org/jira/browse/LEGAL-183). Note that the license that you found for the Shapefiles has a clause ("la réutilisation est toutefois subordonnée au respect de l'intégrité de l'information et des données", roughly "reuse is however subject to respecting the integrity of the information and data") which is similar to an EPSG clause against which the Apache legal team has raised objections.

In my experience with LEGAL-183, if the user cannot freely modify the data, it is not compatible with the Apache license. It may nevertheless be included in SIS, but we need to ask the legal team to grant us an exception. This is what I'm trying to do with LEGAL-183. But we can obviously ask for such an exception only for data that is important enough.

Another example: the "datum shift grid" for transforming coordinates from the old French system to the new French system is freely downloadable from the French mapping agency (http://www.ign.fr). Nevertheless, it is not included in the Debian distribution because of its redistribution conditions (which are very similar to the EPSG conditions). Data or software licensed under the GPL is another well-known example of freely downloadable things that we cannot include in SIS.

> 1) Many files that come from open data have a large amout of real case
> data, and among this data, you have a lot of interresting cases. For
> example, DEPARTEMENT.SHP shapefile had a the Finistere Departement
> inside, a feature created with a three-part polygon.
> Useful to challenge some displays or calculations. The size of the
> file was only 3 MB. An update or a pull return it in 0.1 seconds for
> anyone having an ADSL connection.
> The one who will want to do these testings will have to create himself
> a Shapefile, I think.

Thanks. In addition to Shapefile, other modules like GeoTIFF or NetCDF could also have big test files. The total size of test data grows very quickly in geospatial libraries. Testing interesting cases like this three-part polygon is important, but it can be done just as well with a Shapefile trimmed to contain only the interesting cases. This is what I did with other kinds of test data (e.g. NetCDF) in GeoAPI and SIS. If a 3 MB test file is committed for each interesting case in every module, we will have problems.

By coincidence, a discussion started today on another Apache mailing list about removing a 103 MB file committed accidentally, which is causing them issues with GitHub. Another project took the opportunity to request the removal of a 20 MB binary file from their repository too. They raised (among others) the same concern that I did: the cost imposed on anyone who clones the project history.

It is okay to have some big test files, but we can make them optional and keep them outside the main repository. We even have an SVN directory for that (although it is not yet used or part of any SIS download):

http://svn.apache.org/repos/asf/sis/data/

So we could start a separate thread about how to handle big files. I'm not against them. I just suggest that we 1) make sure that we are allowed by Apache rules to copy them, 2) find the right place for them and 3) favour data files that are likely to be used in more than one test.

(Note: if we decide to bring back the DEPARTEMENT.SHP shapefile in the above-cited "data" directory, we need to make sure to use "svn copy" in order not to impose the file weight on the Apache server twice. I would volunteer to do this operation if this is what people want.)

> The allowed duration of unit tests is 0 second x 100 tests = 10
> seconds. Only in-memory tests, nothing else. If the test uses : any
> file, any external resource, any building that takes time, it has no
> more to classified as an unit test. Else, as you did, you attempt to
> discard tests one way or another because you are feeling (and you are
> right) that they took too much time for the only mvn clean install
> that you just want to do.

I'm not yet too concerned about build time. This is something that can easily be revisited in the future, for example using Maven profiles. I was rather concerned about the size of committed files, because committing them is (in principle) irreversible: those files will stay in the history and be part of every Git clone even after we delete them.

    Martin
