Hello, Jay, Ionut and Pieter!

This mail is not intended to be rude or harsh; I just want to bring up
real problems with using content from RSS files as the basis for file
naming. If you can come up with a stable, sane and secure scheme for
creating human-readable file names for all possible RSS feeds, please
tell me :)

On Wed, 2007-10-31 at 11:03 +0000, Jay Bradley wrote:
> I was wondering why gpodder stores the downloads in crazily named
> directories? I realise that it is partly to ensure unique directories
> so there are no clashes but it means that it is impossible to browse
> through the podcast files manually. I know I can sync to a filesystem
> so I do this for my mp3 player but I also normally use a soft link to
> the podcast downloads directory for my mythtv installation as well.
> Currently I'm changing the device directory and syncing to my mp3
> player and changing the device directory again to a separate directory
> for mythtv. If the directory names were human readable then it would
> save me a lot of hassle.

I see you have read the mailing list and are aware of the alternatives
(MP3 Player sync). Anyway, this topic has been discussed several times
on this list, so I guess it's time for a FAQ on the gPodder website.. ;)

First of all, here are some relevant postings related to the topic.
Please read through them to get an overview of what has been proposed
and discussed already:

https://lists.berlios.de/pipermail/gpodder-devel/2006-November/000283.html
https://lists.berlios.de/pipermail/gpodder-devel/2007-June/000723.html
https://lists.berlios.de/pipermail/gpodder-devel/2007-July/000756.html

Script that tries to solve that problem:

http://lists.berlios.de/pipermail/gpodder-devel/2007-August/000911.html

I'm going to describe the problem you mention a bit further...

Basically, it's hard to create human-readable names because of the
nature of RSS feeds. It's like with HTML: if browsers rejected
non-standard HTML, all documents on the web would adhere to the
standards, but thanks to such "useful" features as quirks mode, browsers
try to fix the shortcomings of bad markup in the parser code. But the
problem with RSS feeds doesn't lie in bad markup.
Most of the time, fields are not set (no <title> element in <item>),
fields have an empty value (<title> exists, but is empty) or fields are
used in very stupid ways (just recently, we had a feed where <title>
contained a description of the episode, a very long string).

There are two options here:

 a) reject any feeds that have no title, have a too long title or have
    some other weird properties that are not usual RSS practice
 b) accept all feeds and try to make the best of "what we have"

gPodder tries to go the "b)" route and so we have to be prepared to
accept feeds without <title>.

As you can read from the November 2006 post above (I think one of the
initial thoughts about hashed filenames), hashing feed and episode URLs
always gives us strings that have some sane and stable properties:

 1.) (high probability of) uniqueness
 2.) sane length (even fixed, but at least not empty or too long)
 3.) sane alphabet (hexadecimal, i.e. only the characters 0-9 and a-f)

So, for every given URL (and _every_ feed has a URL), we have a sane
"ID" that we can use to identify that feed.

When depending on human-readable strings (i.e. title, etc..) we run into
several problems:

 i.)   what is the directory name of feeds with "<title></title>"??
 ii.)  what is the directory name of feed A with title "radio x podcast"
       when there already is a feed B with title "radio x podcast"?
 iii.) what is the directory name of a feed with a loooong title?
 iv.)  what is the directory name of a feed with Chinese or Japanese
       characters as title (off the top of my head, imagine e.g.
       "ウェブ") when using FAT32 as the file system?

We might be able to create a unique filename for a podcast episode from
the basename of its URL, but is there always a unique basename for the
podcast feed? It might be "index.xml" or "podcast.rss".

> I never understand why some programs add a layer of complexity which
> removes the user one step from their files. I believe programs should
> be as transparent as possible to allow people to do what they like
> with the data produced by that program.

gPodder is transparent in that the user doesn't have to care about the
directory layout, as the user can use the gPodder GUI to browse and
listen to feeds - all feed information is displayed in the GUI.

You can always determine feed and episode info for given hashes:

 -> Hash (md5) the URLs in ~/.config/gpodder/channels.opml
 -> MD5 of URL = directory name of feed
 -> Open the file "index.xml" in the feed download directory
 -> Hash (md5) the URLs in that file
 -> MD5 of URL + extension of basename of URL = filename of episode

In pseudo-code, this is something like:

  opml_file = $HOME + '/.config/gpodder/channels.opml'
  ( ... feed_url is to be obtained from opml_file ... )
  feed_directory = gpodder_download_dir + '/' + md5sum( feed_url )
  feed_index = feed_directory + '/index.xml'
  ( ... episode_url is to be obtained from feed_index ... )
  extension = file_extension_of( basename( episode_url ) )
  episode_name = feed_directory + '/' + md5sum( episode_url ) + extension

> Using the non-human readable directory and filenames stops users from
> accessing their files except through one program (gpodder) which is a
> shame.

You can use the above method to find more information (metadata) for the
files than you can with human-readable directories, including the title
and description of episodes.

> I've looked through the source code but cannot find where the
> directory names are set. I'm an okay programmer so could do this
> myself if someone could point me in the right direction. I'd do it to
> just my local copy if this wasn't something anyone else would be
> interested in.

Please, by all means try to do it. If it works for all RSS feeds, I
would be very happy to merge it into gPodder, as it would be a better
solution than what we have now. But because of the reasons I mentioned
above, I am very skeptical whether this is possible at all.

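For reference, the hash-based lookup described earlier can be sketched
in Python. This is only an illustration under a few assumptions, not
the code gPodder actually runs: the helper names (md5_hex,
feed_directory, episode_filename) and the download_dir parameter are
made up for this example, URLs are assumed to be hashed as UTF-8
strings, and the extension is taken from the path part of the episode
URL (ignoring any query string).

```python
import hashlib
import os.path
from urllib.parse import urlparse


def md5_hex(text):
    """Return the hexadecimal MD5 digest of a string (UTF-8 encoded)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def feed_directory(download_dir, feed_url):
    """Directory of a feed: download dir + MD5 of the feed URL."""
    return os.path.join(download_dir, md5_hex(feed_url))


def episode_filename(download_dir, feed_url, episode_url):
    """File of an episode: feed dir + MD5 of episode URL + extension."""
    # Assumption: the extension comes from the URL path only.
    basename = os.path.basename(urlparse(episode_url).path)
    extension = os.path.splitext(basename)[1]  # e.g. ".mp3", may be ""
    return os.path.join(feed_directory(download_dir, feed_url),
                        md5_hex(episode_url) + extension)
```

Given a feed URL from channels.opml and an episode URL from that feed's
index.xml, episode_filename() should point at the downloaded file, as
long as the assumptions above match what your gPodder version does.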
The directory name for channels is determined by the "get_filename"
function of the class "podcastChannel" in "src/gpodder/libpodcasts.py".
The attributes that _should_ be available when this function is called
are "url", "title" and "description" (i.e. "self.title").

The filename for an episode is determined by the "local_filename"
function of the class "podcastItem" in "src/gpodder/libpodcasts.py".
Only the "url" attribute is guaranteed to be available; for all other
properties, the best possible value is extracted from the RSS feed. You
can expect the "title" value to be somewhat identifying, but not unique.
You also have to be aware that the "title" value _could_ be very long
(think of a description field value that has been misplaced).

> I realise there may be some other reason why the names are non-human
> readable so if I've missed it then please could someone let me know.

Apart from the practical reasons I mentioned above, there is no other
reason why hashes were chosen. It was a simple and straightforward
solution to a problem for which we have not yet found a better one. It
would be quite cool if you could come up with something friendlier :)

If you want, please send the modifications you make to make gPodder's
directory structure human-readable. It will be a nice-to-have patch for
interested people :)

Thanks and Good Luck!
Thomas

_______________________________________________
gpodder-devel mailing list
gpodder-devel@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/gpodder-devel