Re: [gpodder-devel] Non-human readable directory and file names

Thomas Perl Wed, 31 Oct 2007 06:06:42 -0800

Hello, Jay, Ionut and Pieter!

This mail is not intended to be rude or harsh; I just want to bring up
the real problems with using content from RSS files as the basis for
file naming.
If you can come up with a stable, sane and secure scheme for creating
human-readable file names for all possible RSS feeds, please tell me :)

On Wed, 2007-10-31 at 11:03 +0000, Jay Bradley wrote:
> I was wondering why gpodder stores the downloads in crazily named 
> directories? I realise that it is partly to ensure unique directories
> so there are no clashes but it means that it is impossible to browse 
> through the podcast files manually. I know I can sync to a filesystem
> so I do this for my mp3 player but I also normally use a soft link to
> the podcast downloads directory for my mythtv installation as well. 
> Currently I'm changing the device directory and syncing to my mp3
> player and changing the device directory again to a separate directory
> for mythtv. If the directory names were human readable then it would
> save me a lot of hassle.

I see you have read the mailing list and are aware of the alternatives
(MP3 Player sync).

Anyway, this topic has been discussed several times on this list, so I
guess it's time for a FAQ on the gPodder website.. ;)

First of all, here are some relevant postings related to the topic.
Please read through them to get an overview of what has been proposed
and discussed already:

https://lists.berlios.de/pipermail/gpodder-devel/2006-November/000283.html
https://lists.berlios.de/pipermail/gpodder-devel/2007-June/000723.html
https://lists.berlios.de/pipermail/gpodder-devel/2007-July/000756.html

Script that tries to solve that problem:

http://lists.berlios.de/pipermail/gpodder-devel/2007-August/000911.html 

I'm going to describe the problem you mention a bit further...

Basically, it's hard to create human-readable names because of the
nature of RSS feeds. It's similar to HTML: if browsers rejected
non-standard HTML, all documents on the web would adhere to the
standards. Instead, thanks to such "useful" features as quirks mode,
browsers try to work around bad markup in the parser code.

But the problem with RSS feeds doesn't lie in bad markup. Most of the
time, fields are not set (no <title> element in <item>), fields have
an empty value (<title> exists, but is empty) or fields are misused
(just recently, we had a feed where <title> contained a description of
the episode, a very long string).

There are two options here:

 a) reject any feeds that have no title, have a too long title or have 
    some other weird properties that are not usual RSS practice
 b) accept all feeds and try to make the best of "what we have"

gPodder tries to take the "b)" route, so we have to be prepared to
accept feeds without <title>. As you can read in the November 2006
post above (I think one of the initial thoughts about hashed filenames),
hashing feed and episode URLs always gives us strings that have some
sane and stable properties:

 1.) (high probability of) uniqueness
 2.) sane length (even fixed, but at least not empty or too long)
 3.) sane alphabet (hexadecimal, i.e. only the characters 0-9 and a-f)
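
To illustrate these three properties, here is a minimal Python sketch.
It is not gPodder's actual code; the helper name "url_id" and the
example URLs are made up for illustration, and encoding the URL as
UTF-8 before hashing is an assumption.

```python
import hashlib


def url_id(url):
    """Return the hexadecimal MD5 digest of a URL (hypothetical helper)."""
    # Assumption: the URL is encoded as UTF-8 before hashing.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Made-up example URLs: one short, one absurdly long.
short_url = "http://example.org/feed.xml"
long_url = "http://example.org/" + "x" * 500 + "/feed.xml"

for url in (short_url, long_url):
    digest = url_id(url)
    # The digest is always 32 characters from 0-9 and a-f,
    # no matter how long or strange the input URL is.
    print(len(digest), digest)
```

Whatever the input looks like, the result has the same length and the
same small alphabet, which is what makes it safe as a directory name.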

So, for every given URL (and _every_ feed has a URL), we have a sane
"ID" that we can use to identify that feed.

When depending on human-readable strings (i.e. title, etc..) we run into
several problems:

 i.) what is the directory name of feeds with "<title></title>"??
 ii.) what is the directory name of feed A with title "radio x podcast" 
      when there already is a feed B with title "radio x podcast"?
 iii.) what is the directory name of a feed with a loooong title?
 iv.) what is the directory name of a feed with non-Latin characters as
      title (off the top of my head, e.g. "ウェブ") when using FAT32 as
      the file system?
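
To make problems i.) to iv.) concrete, here is a small Python sketch of
a deliberately naive title-to-directory-name function. The function
"naive_dirname" is hypothetical and only serves to show where such an
approach breaks; it is not something gPodder does.

```python
import re


def naive_dirname(title):
    """Naive approach: keep only ASCII letters, digits, '-' and '_'."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", title).strip("_")


# i.) an empty title gives an empty directory name
print(repr(naive_dirname("")))

# ii.) two different feeds with the same title get the same name
feed_a = naive_dirname("radio x podcast")
feed_b = naive_dirname("radio x podcast")
print(feed_a == feed_b)

# iii.) a description misplaced in <title> gives a very long name,
#       longer than the roughly 255 characters many file systems
#       allow for a single name
print(len(naive_dirname("word " * 200)))

# iv.) a title without any ASCII characters leaves nothing behind
print(repr(naive_dirname("ウェブ")))
```

Each of these cases needs its own special handling (fallback names,
counters, truncation, transliteration), and every such rule has to
stay stable across gPodder versions so existing downloads are found.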

We might be able to create a unique filename for a podcast episode from
the basename of its URL, but is there always a unique basename of the
podcast feed? It might be "index.xml" or "podcast.rss".
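
A quick Python sketch of the basename problem, using two made-up feed
URLs (the helper "url_basename" is hypothetical):

```python
import posixpath
from urllib.parse import urlparse


def url_basename(url):
    """Return the last path component of a URL, ignoring query strings."""
    return posixpath.basename(urlparse(url).path)


# Made-up feed URLs for two unrelated podcasts.
feeds = [
    "http://example.org/show-a/index.xml",
    "http://example.net/show-b/index.xml",
]

# Both feeds end up with the same basename, so the basename alone
# cannot be used as a unique directory name.
print([url_basename(url) for url in feeds])
```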

> I never understand why some programs add a layer of complexity which 
> removes the user one step from their files. I believe programs should
> be as transparent as possible to allow people to do what they like
> with the data produced by that program.

gPodder is transparent in that the user doesn't have to care about the
directory layout, as the user can use the gPodder GUI to browse and
listen to feeds - all feed information is displayed in the GUI.

You can always determine feed and episode info for given hashes:

 -> Hash (md5) the URLs in ~/.config/gpodder/channels.opml
 -> MD5 of URL = directory name of feed

 -> Open the file "index.xml" in the feed download directory
 -> Hash (md5) the URLs in that file
 -> MD5 of URL + extension of basename of URL = filename of episode

In pseudo-code, this is something like:

opml_file = $HOME + '/.config/gpodder/channels.opml'

( ... feed_url is to be obtained from opml_file ... )
feed_directory = gpodder_download_dir + '/' + md5sum( feed_url )
feed_index = feed_directory + '/index.xml'

( ... episode_url is to be obtained from feed_index ... )
extension = file_extension_of( basename( episode_url ) )
episode_name = feed_directory + '/' + md5sum ( episode_url ) + extension
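
The pseudo-code above could be sketched in Python roughly as follows.
This is only a sketch under assumptions, not gPodder's own code: it
assumes that feeds appear in channels.opml as <outline> elements with
an "xmlUrl" attribute, that "index.xml" is the saved RSS feed with one
<enclosure url="..."> per episode, that URLs are hashed as UTF-8, and
that the extension includes the leading dot. All function names are
made up, and the download directory is passed in by the caller.

```python
import hashlib
import os
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def md5sum(text):
    """Hex MD5 digest of a string (assumed to be hashed as UTF-8)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def feed_urls(opml_file):
    """Yield feed URLs, assuming <outline xmlUrl="..."> entries."""
    for outline in ET.parse(opml_file).iter("outline"):
        url = outline.get("xmlUrl")
        if url:
            yield url


def feed_directory(download_dir, feed_url):
    """MD5 of the feed URL = directory name of the feed."""
    return os.path.join(download_dir, md5sum(feed_url))


def episode_urls(feed_index):
    """Yield episode URLs, assuming <enclosure url="..."> entries."""
    for enclosure in ET.parse(feed_index).iter("enclosure"):
        url = enclosure.get("url")
        if url:
            yield url


def episode_filename(feed_dir, episode_url):
    """MD5 of the episode URL + extension of its basename."""
    basename = posixpath.basename(urlparse(episode_url).path)
    extension = posixpath.splitext(basename)[1]  # includes the dot
    return os.path.join(feed_dir, md5sum(episode_url) + extension)
```

To map your downloads back to feeds, you would loop over
feed_urls(opml_file), compute feed_directory() for each URL, and then
loop over episode_urls() of the "index.xml" inside that directory.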

> Using the non-human readable directory and filenames stops users from
> accessing their files except through one program (gpodder) which is a
> shame.

You can use the above method to find more information (metadata) for the
files than you can with human readable directories, including the title
and description of episodes.

> I've looked through the source code but cannot find where the
> directory names are set. I'm an okay programmer so could do this
> myself if someone could point me in the right direction. I'd do it to
> just my local copy if this wasn't something anyone else would be
> interested in.

Please, by all means try to do it. If it works for all RSS feeds, I
would be very happy to merge it into gPodder, as it would be a better
solution than what we have now. But for the reasons I mentioned above,
I am very skeptical whether this is possible at all.

The directory name for channels is determined by the "get_filename"
function of the class "podcastChannel" in "src/gpodder/libpodcasts.py".
The attributes that _should_ be available when this function is called
are "url", "title" and "description" (i.e. "self.title").

The filename for an episode is determined by the "local_filename"
function of the class "podcastItem" in "src/gpodder/libpodcasts.py".
Only the "url" attribute is guaranteed to be available. For all other
properties, the best possible value is extracted from the RSS feed, so
you can expect the "title" value to be somewhat identifying, but not
unique. You also have to be aware that the "title" value _could_ be very
long (think of a description field value that has been misplaced).

> I realise there may be some other reason why the names are non-human 
> readable so if I've missed it then please could someone let me know.

Apart from the practical reasons I mentioned above, there is no real
reason why the hashes are chosen. It was a simple and straightforward
solution to a problem for which we have not yet found a better solution.

It would be quite cool if you could come up with something friendlier :)

If you want, please send the modifications you make to make gPodder's
directory structure human-readable. It will be a nice-to-have patch for
interested people :)

Thanks and Good Luck!
Thomas
_______________________________________________
gpodder-devel mailing list
gpodder-devel@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/gpodder-devel
