control: tags -1 +wontfix

Hi Enrico,

> It would have been an entirely different story if the datasets that nltk
> needs were also packaged in Debian, so that it could have worked out of
> the box.

I totally understand your preference, and I also prefer libraries that
work out of the box without network access. However, recent advances in
computational linguistics (or, say, natural language processing) involve
more and more machine learning, including deep learning, i.e. deep neural
networks.

NLTK's data tarball includes various datasets (corpora) and pre-trained
models, and I guess some of them are highly problematic in terms of
copyright. The licensing of pre-trained neural nets is quite involved,
and there is still no clear conclusion on it.
(Topic: https://lwn.net/Articles/760142/)

To make nltk-data DFSG-compatible, we may need someone with NLP
experience[1] to review all the contents, including the pre-trained
models. Models trained on non-DFSG datasets will have to be removed as
well. I could do this myself, but the workload does not seem worthwhile.

I also use Spacy[2] in my research work, and it needs pre-trained blobs
too. Anyone who wants to package spacy-models will definitely encounter
similar licensing problems.

> I am extremely reluctant to run unreviewed code that downloads random
> data from the internet in some unspecified way, and does unspecified
> things with it, to the point that I decided to give up using the library
> altogether.

If you trust Archlinux's signing keys, download this:

https://www.archlinux.org/packages/community/any/nltk-data/

That's the best suggestion I have (and it is what I actually do myself).

> CC: debian-devel

Since this bug is related to the pre-trained neural net problem.

For NEW software, these keywords usually imply a high probability of
"non-DFSG blobs inside": computational linguistics, natural language
processing, computer vision, XXX (e.g. machine, deep, reinforcement,
supervised, unsupervised) learning, artificial intelligence.

My current attitude is to avoid packaging any related data package.

[1] Experience really speeds up the copyright review.
[2] This might have been packaged and uploaded by Andreas.

