That's a lot of responses, thanks for the interest and the suggestions!
> Are there other languages or software communities that do something like
> this? It would be nice not to have to invent this wheel. Eventually a PEP
> and an implementation should be presented, but first the idea needs to be
> explored more.

To my knowledge, R is the only language that implements such a feature.
Package developers add a CITATION file to their package, containing the
citation text in whatever format they prefer. A specialized built-in
citation() function can be called from the REPL and returns a citation
for R itself, including a BibTeX entry for LaTeX users. When citation()
is called on a package instead (e.g. citation("ggplot2")), it returns the
contents of CITATION for that package or, failing that, uses the package
metadata to build a sane citation. Given that most work with R is done
within a REPL and packages are installed and loaded with commands such as
install.packages("ggplot2") and library("ggplot2"), this approach makes
sense in that context. It did not, however, feel terribly Pythonic to me.

As for a PEP and a reference implementation, I will gladly take care of
them if the idea gets enough traction. There already seems to be a PEP
draft, though, as well as an attempt at an implementation by one of the
AstroPy/AstroML maintainers, using a __citation__ field and a citation()
function to unpack it: https://github.com/adrn/CitationPEP

There also seem to be some packages in the community using __bibtex__
rather than __citation__ to store BibTeX entries, but I have not yet found
any large project implementing it or any PEP draft associated with it.

> The software sustainability institute in the UK have written several blog
> posts advocating the use of CITATION files containing this sort of metadata:
> https://software.ac.uk/blog/2017-12-12-standard-format-citation-files

Yes, that's the R approach I presented above.
It is viable, especially if hooked to something accessible from the REPL
directly, such as a __cite__ or __citation__ attribute/method for modules.
I would, however, advocate for a more structured approach: perhaps JSON or
BibTeX that would get parsed and converted to a suitable citation format
by __cite__, if it were implemented as a method.

> A github code search for __citation__ also gets 127 hits that mostly seem
> to be research software that are using this attribute more or less as
> suggested here:
> https://github.com/search?q=__citation__&type=Code

Most of them are from the AstroPy universe or from the CitationPEP draft
I've referenced above.

> This is indeed a serious problem. I suspect python-ideas isn't the
> best venue for addressing it though – there's nothing here that needs
> changes to the Python interpreter itself (I think), and the people who
> understand this problem the best and who are most affected by it,
> mostly aren't here.

There has been localized discussion popping up among the maintainers of
the large scientific packages, and some attempts to solve the problem at
the local level. Until now these seemed to be winding down due to the lack
of a large-scale citation mechanism, and a discussion about what is
concretely doable at the scale of the language is likely to help finalize
them.

As for the list, reserving __citation__/__cite__ for packages at the same
level as __version__ is reserved now, and adding a citation()/cite()
function to the standard library, seemed like large enough modifications
to warrant seeking buy-in from the maintainers and the community at large.
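
A minimal sketch of how a reserved __citation__ field and a cite() helper
could fit together. Everything in it is an assumption made for
illustration: the flat dict schema, the to_bibtex() and cite() names, and
the fakepkg module are hypothetical, and nothing like this exists in the
standard library today.

```python
import sys
import types


def to_bibtex(key, entry):
    """Render a flat citation dict as a BibTeX entry (hypothetical schema)."""
    # The "type" key selects the BibTeX entry type; all other keys are fields.
    fields = {k: v for k, v in entry.items() if k != "type"}
    lines = [f"@{entry.get('type', 'misc')}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)


def cite(module=None):
    """Return BibTeX for one module, or for every imported module that
    defines a (hypothetical) __citation__ dict when called without one."""
    if module is not None:
        modules = [module]
    else:
        # Copy the values: sys.modules can change while we iterate.
        modules = list(sys.modules.values())
    entries = []
    for mod in modules:
        citation = getattr(mod, "__citation__", None)
        if isinstance(citation, dict):
            entries.append(to_bibtex(mod.__name__, citation))
    return "\n\n".join(entries)


# Stand-in for a third-party package that opted in to the convention.
fakepkg = types.ModuleType("fakepkg")
fakepkg.__citation__ = {
    "type": "article",
    "author": "Doe, Jane",
    "title": "An Example Package",
    "year": "2018",
}
sys.modules["fakepkg"] = fakepkg

print(cite(fakepkg))
```

Calling cite() without an argument walks sys.modules, so it only reports
packages that have actually been imported and that opted in by defining
the field; packages without it are silently skipped.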
> You'll want to check out the duecredit project:
> https://github.com/duecredit/duecredit
> One of the things they've thought about is the ability to track
> citation information at a more fine-grained way than per-package – for
> example, there might be a paper that should be cited by anyone who
> calls a particular method (or even passes a specific argument to some
> specific method, when that turns on some fancy algorithm).

duecredit looks amazing - I will definitely check it out. The idea was,
however, to bring the barrier for adoption and usage as low as possible.
In my experience, the vast majority of Python users in academic
environments who aren't citing packages properly are beginners. As such,
they are unlikely to search for third-party libraries beyond those they
have found and used to solve their specific problem. These are people who
have just assembled a pipeline based on widely used libraries and need to
generate a citation list for it, to pass on to the colleagues responsible
for assembling and submitting the paper.

> I'd actually like to see a more general solution that isn't restricted
> to any one language, because multi-language analysis pipelines are
> very common. For example, we could standardize a convention where if a
> certain environment variable is set, then the software writes out
> citation information to a certain location, and then implement
> libraries that do this in multiple languages. Of course, that's a
> "dynamic" solution that requires running the software -- which is
> probably necessary if you want to do fine-grained citations, but it
> might be useful to also have static metadata, e.g. as part of the
> package metadata that goes into sdists, wheels, and on PyPI. That
> would be a discussion for the distutils-sig mailing list, which
> manages that metadata.

Thanks for the reference to the distutils-sig list. I will talk to them if
the idea gets traction here.

I am not entirely convinced by the multi-language pipelines argument.
In bioinformatics, the heavy lifting is often done by a single package
(for instance bowtie for RNA-seq alignment) and the output is piped to a
custom script, mostly in R or Python. The citation for the library doing
the heavy lifting is often well known and widely used; the issues arise in
the custom scripts, which import and use libraries that should be cited
without citing them.

> One challenge in standardizing this kind of thing is choosing a
> standard way to represent citation information. Maybe CSL-JSON?
> There's a lot of complexity as you dig into this, though of course one
> shouldn't let the perfect be the enemy of the good...

CSL-JSON represented as a dict to be supplied to the setup file is
definitely one way of doing it. I was, however, thinking more about the
BibTeX format, given that CSL-JSON is more closely affiliated with
Mendeley.

> Why does this have to be a dunder method? In general, application code
> shouldn't be calling dunders directly, they're reserved for Python.

I was under the impression that dunders are sometimes used to store
relevant information that would not be of use to most users, such as
__version__, and sometimes to better control the execution flow (for
instance the if __name__ == "__main__" idiom).

> I think your description of what this method should do is not
> really coherent. On the one hand, you have __citation__() be a method
> that you call (how?) but on the other hand you have it being a data
> field __citation__ that you scan.

My initial idea was to have a __cite__ method embedded in the import
mechanism. It would parse data from the package configuration and, when
called on a package, return the citation that the developers want to see
associated with the current package version, in the format the user needs
(for instance, numpy.__cite__('bibtex') would return a citation for the
current numpy version in BibTeX format).
If called on the script itself, __cite__('bibtex') would iterate through
all the imported modules and retrieve their citations one by one, at least
for those modules that have an associated citation.

After reading the feedback in this thread, I believe that a reserved
__citation__ field that pulls its data from the setup script, plus a
cite() script in the standard library, would be a better approach. In the
end, I believe the best would be to implement both of them and see which
one feels more Pythonic.

> I do think you have identified an important feature, but I think this is
> a *tool*, not a *language feature*. My spur of the moment thought is:
>
> - we could have a script (a third party script? or in the std lib?)
>   which the user calls, giving the name of their module or package as
>   argument
>
>   e.g. "python -m cite myapplication.py"
>
> - this script knows how to analyse myapplication.py for a list of
>   dependencies, perhaps filtering out standard library packages;
>
> - it interrogates myapplication, and each dependency, for a citation;
>
> - this might involve reserving a standard __citation__ data field
>   in each module, or a __citation__.xml file in the package, or
>   some other protocol;
>
> - or perhaps the cite script knows how to generate the appropriate
>   citation itself, from any of the standard formatted data fields
>   found in many common modules, like __author__, __version__ etc.
>
> - either way, the script would generate a list of packages and
>   modules used by myapplication, plus citations for them.

Yes, that's the idea! The biggest reason for me to bring the discussion to
this list is to check whether it would be acceptable to reserve the
__citation__ data field in each module and to include the cite() script in
the standard library.

> Presumably you would need to be able to specify which citation style to
> use.
Yes, but to avoid building a configurable citation engine for the
thousands of formats there are in the wild, it would support only a couple
of standard interchange formats, such as BibTeX or EndNote xref - both are
text formats that are simple to use. I was thinking about the approach
taken by Google Scholar from that perspective.

> > What does Python core team think about addition and long-term maintenance
> > of such a feature to the import and setup mechanisms?
>
> What does this have to do with either import or setup?

The implementation I was thinking about would have required reserving the
__citation__/__cite__ dunder, or implementing a function that would be
injected into installed packages. For setup, I was thinking about adding a
citation field to the distutils setup. I was not really aware of the
distutils-sig discussion list, which would be more appropriate in that
regard.

> A long time ago, I added a feature request for a page in the
> documentation to show how to cite Python in various formats:
> https://bugs.python.org/issue26597
> I don't believe there has been any progress on this. (I certainly don't
> know the right way to cite software.) Perhaps this can be merged with
> your idea.

That's a good point. Unfortunately, I have not thought about how to cite
code that does not have an associated publication. From what I see on
Google Scholar, as of now people cite the Python language reference manual
if they want to cite Python itself in a scientific publication. GvR didn't
seem interested in citations for Python and, from what I understand,
neither is the vast majority of non-scientific package developers, given
that citations are not essential for their career advancement.

> Should Python have a standard sys.__citation__ field that provides the
> relevant detail in some format-independent, machine-readable object like
> a named tuple? Then this hypothetical cite.py tool could read the tuple
> and format it according to any citation style.
The idea for Python itself seems good! However, rather than using a named
tuple, I was thinking about using a dict consistent with CSL-JSON or
BibTeX. Writing a citation-generating engine that would be consistent with
hundreds, if not thousands, of journal-specific formats is a bit out of
the scope of the proposal for now. Most of the time people just want
something their citation/bibliography manager can ingest, and they
generate the citation from there in their Word/LaTeX documents. The
BibTeX/EndNote export formats are perfect for that task in my experience.

> just thought that it might be worth pointing out that this should
> actually work both ways i.e. if a specific package, module or function
> is inspired by or directly implements the methods included in a specific
> publication then any __citation__ entries within it should also cite
> that/those or allow references to them to be recovered.
> The general principle is if you are expecting to be cited you also have
> to cite.

The general convention is to cite the top-level publication. While some
methods definitely deserve a citation of their own (such as the Sobel
filter in scikit-image), packages link to the relevant publication in the
documentation of those methods and would normally cite it in their master
publication. That's definitely an idea to look at, but I don't see a
straightforward way of implementing it so far.

> I think this is a fine idea, but could be achieved by convention, like
> __version__, rather than by fiat.
> And it’s certainly not a language feature.
> So Nathaniel’s right — the thing to do now is work out the convention,
> and then advocate for it.

This already seems to be an idea floating in the air - AstroPy is inching
towards that implementation. The idea here is to modify the language to
make citing as straightforward as possible and to create a universal
mechanism for it.

Best,

Andrei Kucharavy
Post-Doc @ Joel S. Bader Lab
Johns Hopkins University, Baltimore, USA.

On Thu, Jun 28, 2018 at 11:48 AM Chris Barker - NOAA Federal via
Python-ideas <python-ideas@python.org> wrote:

> I think this is a fine idea, but could be achieved by convention, like
> __version__, rather than by fiat.
>
> And it’s certainly not a language feature.
>
> So Nathaniel’s right — the thing to do now is work out the convention,
> and then advocate for it.
>
> -CHB
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> https://mail.python.org/mailman/listinfo/python-ideas
> Code of Conduct: http://python.org/psf/codeofconduct/