The above notwithstanding, I've spent a few days hacking
video_parser.py in order to get my library recognised. It's an
assortment of anime, TV and movies from various sources, and
**probably** typical of most users' content. I've gone from the 1.05
version recognising 10-20% of the content and adding it to the
movie/TV show libraries (with 80% uncategorised) to ~90% being
categorised, with most of the remaining problematic content being so
for understandable reasons. I'd like to submit a patch at some point (I
need to tidy my code, and I wouldn't mind knocking a few of the outlying
cases on the head first) - is it acceptable to submit patches to this
list rather than going the bzr route?
Hi Lee,
I'm answering only this part of your email as I see other people have
already addressed most of your points in other emails.
I briefly worked on the media scanner myself a few weeks ago, here at
Fluendo. We recognized that the media scanner could be improved and
started to do some work on it. However, we quickly realized that any
change to the media scanner carries a very high risk of introducing
regressions, i.e. improving the recognition of some files but failing
to recognize files that were recognized before.
Because of that, we decided to put all work on the media scanner on
hold until we can write a comprehensive battery of unit tests to check
the scanner for regressions.
I think there's some time scheduled to create these unit tests somewhere
in the coming few months, but at the moment I'm not sure exactly when.
It would actually help if you could send me (in private, if you prefer
it like that) a list of the file names that you tested the media
scanner on. It will help us build the unit tests when we eventually get
to doing that.
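For instance, such a battery could be table-driven: a map from file names to the (show, season, episode) triple the scanner is expected to produce. Below is a minimal sketch; the `recognize()` helper and the file names are invented stand-ins for illustration, not the real moovida scanner:

```python
import re

# Hypothetical expectations: file name -> (show, season, episode).
EXPECTED = {
    "Day.Break.S01E03.HDTV.XviD-XOR.avi": ("Day Break", 1, 3),
    "Dexter - 01x06 (HDTV-LOL) Return to Sender.avi": ("Dexter", 1, 6),
}

# A stand-in recognizer covering just the two naming schemes above.
_PATTERNS = [
    re.compile(r"^(.+?)[\s._-]+s(\d+)\s*e(\d+)", re.I),
    re.compile(r"^(.+?)[\s._-]+(\d+)x(\d+)", re.I),
]

def recognize(filename):
    # Strip the extension and normalize dots/underscores to spaces.
    name = re.sub(r"[._]", " ", filename.rsplit(".", 1)[0])
    for rx in _PATTERNS:
        m = rx.search(name)
        if m:
            show, season, episode = m.groups()
            return show.strip(" -"), int(season), int(episode)
    return None

def test_recognizer():
    for filename, expected in EXPECTED.items():
        assert recognize(filename) == expected, filename
```

Whenever a change to the scanner breaks one of the expected triples, the table immediately shows which naming scheme regressed, which is exactly the safety net we are missing right now.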
That said, we already have another community member (in CC) who has
done some work on the media scanner as well [1][2].
It would be interesting to try to merge the stuff that I did, the stuff
that he did and the stuff that you did, and then see if any of you can
contribute the unit tests as well.
After that we will be more than happy to get that stuff reviewed and
committed.
Clearly it's not exactly a trivial effort we're talking about here, so
obviously I understand if you want to just wait for us to do it.
Still, you can find the code from the other contributor in the links at
the bottom, and my code is attached as well, just in case you want to
do something with it or look at it for ideas.
Speaking of which, if you look at my code you will see it's basically
stand-alone. I ripped the scanner out of the moovida code so I could
test it more easily (it's all very crude, of course, as it's just a
work in progress that got shelved temporarily).
But that got me (and other people here) thinking about actually
shipping the media scanner as a totally separate Python library, so
that it can be used not only by moovida but by other projects as well
(besides the obvious advantage of being much more easily testable, etc.).
Mind you, we didn't go much further with this than brainstorming some
ideas (e.g. pluggable web-based helpers to search IMDb, TMDb or
similar sites, pluggable extra filters and recognizers, etc.).
So some more discussion about this may be helpful for everyone
interested in improving the media scanner, and maybe someone will want
to pick up the idea and run with it for a while.
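To make that brainstorming slightly more concrete, a stand-alone scanner library could expose recognizers and filters as simple pluggable callables. The sketch below is purely illustrative; `MediaScanner`, `sxxeyy` and `no_year` are hypothetical names, not existing moovida code:

```python
import re

class MediaScanner:
    """Pluggable scanner sketch: recognizers propose a match and
    filters may veto it. All names here are hypothetical."""

    def __init__(self):
        self.recognizers = []  # callables: filename -> dict or None
        self.filters = []      # callables: (filename, match) -> bool

    def add_recognizer(self, func):
        self.recognizers.append(func)

    def add_filter(self, func):
        self.filters.append(func)

    def scan(self, filename):
        for recognize in self.recognizers:
            match = recognize(filename)
            if match is not None and all(f(filename, match)
                                         for f in self.filters):
                return match
        return None

def sxxeyy(filename):
    # Recognizer plug-in for the S01E03 style of names.
    m = re.search(r"s(\d+)\s*e(\d+)", filename, re.I)
    if m is None:
        return None
    return {"season": int(m.group(1)), "episode": int(m.group(2))}

def no_year(filename, match):
    # Filter plug-in: veto season/episode pairs that concatenate to a
    # plausible year (e.g. "1998" read as season 19, episode 98).
    return match["season"] * 100 + match["episode"] < 1900

scanner = MediaScanner()
scanner.add_recognizer(sxxeyy)
scanner.add_filter(no_year)
```

A real library would of course need richer match metadata and priorities between recognizers; the point is only that plug-ins stay decoupled from the traversal logic, so a web-based helper or an extra filter can be dropped in without touching the core.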
To me a well designed and reusable media scanner library seems like an
interesting "summer" project. Hell, if we were participating in Google's
Summer of Code this year I would have put it out as a student project
for sure. But maybe even without that there's someone interested in
looking at it (when they're not spending their time at the beach ;)).
Cheers,
--
Ugo
[1]
https://www.moovida.com/quality/review/request/%[email protected]%3e
[2] https://code.launchpad.net/~mattbrown/elisa/bugfixes
import os
import re


class Test:
    # These match patterns similar to S05E06 and several variations of the
    # same theme (ex: S03-E01 or Season3, Episode1, or [s3]_[e1] etc.)
    # Defined separately to keep regular expressions more readable (I hope).
    _parens_se_pat = r"[\(\[]?\s*(?:s|se|season)\s*([0-9]+)\s*[\]\)]?"
    _parens_se_pat_noncap = r"[\(\[]?\s*(?:s|se|season)\s*(?:[0-9]+)\s*[\]\)]?"
    _parens_ep_pat = r"[\(\[]?\s*(?:e|ep|episode)\s*([0-9]+)\s*[\]\)]?"

    # These patterns cover the cases where all the episode information can be
    # found in the file name, without going to look into the path.
    # Note that the patterns below don't cover cases like
    # lost[s3][e2].avi where we don't have spaces between the title and
    # the (possibly parenthesized) se/ep list.
    # It's a pretty dumb naming scheme but we may want to cover it anyway later.
    _tvshow_patterns = [
        # ex: lost.[s3]_[e5].hdtv-lol.avi
        # ex: Chuck - Season 1, Episode 04 [WS DVDRip Xvid].avi
        # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
        r"^(.+)\s+" + _parens_se_pat + r"\s*" + _parens_ep_pat
        + r"\s*(.*)$",
        # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
        r"^(.+)\s*([0-9]+)x([0-9]+)\s*(.*)$",
        # ex: lost.305.hdtv-lol.avi
        # Note that this is assuming at least 3 digits, of which
        # the last 2 are considered the episode number and the
        # others the season number. However sometimes this is
        # just a sequence number (e.g. animes).
        # FIXME: one way to fix this is to examine the files in
        # the same dir and see if they fit the sequence number
        # pattern or the se/ep pattern. MythTV code has some
        # implementation of this we can borrow or learn from.
        r"^(.+)\s+([0-9]+)([0-9][0-9])\s+(.*)$",
    ]
    _tvshow_regexes = [re.compile(pattern, re.I)
                       for pattern in _tvshow_patterns]

    # These patterns cover the same cases as _tvshow_patterns, but they
    # don't try to capture the show and season name because this information
    # will be retrieved from the path. However, they do try to match the
    # season name as part of the pattern, if it is available, to decrease
    # the likelihood of false matches.
    _tvshow_path_patterns = [
        # ex: lost.[s3]_[e5].hdtv-lol.avi
        # ex: The IT Crowd - Season 1, Episode 04 [WS DVDRip Xvid].avi
        # ex: Day.Break.S01E03.HDTV.XviD-XOR.avi
        r".*(?:" + _parens_se_pat_noncap + r"[-_\s\.,]?)?"
        + _parens_ep_pat + r"(.*)$",
        # ex: Dexter - 01x06 (HDTV-LOL) Return to Sender.avi
        r".*[0-9]+x([0-9]+)(.*)$",
        # ex: lost.305.hdtv-lol.avi
        # Please note that in this case the number almost surely is not
        # in "SEE" format but is just a sequential episode number, so
        # we match it entirely.
        # To avoid false matches with basically anything with a number
        # in the title, we force the number to be at the start of the
        # filename.
        r"([0-9]+)\s+(.*)$",
    ]
    _base_path_pat = (r"(.*)" + os.sep
                      + r"(?:season|se|s)\s*([0-9]+)" + os.sep)
    _tvshow_path_regexes = [re.compile(_base_path_pat + pattern, re.I)
                            for pattern in _tvshow_path_patterns]

    _prematch_discard = re.compile(r"[-_\.,]")
    def __init__(self):
        self._postmatch_filters = [self.post_match_year]
        # Per-instance result lists (class-level lists would be shared
        # between all instances).
        self._ok = []
        self._fail = []
    def print_result(self, m, f):
        if len(m.groups()) > 4:
            return "TOO_MANY_GROUPS|%s|%s" % (m.groups(), f)
        show_name, season_nb, episode_nb, remain = m.groups()
        se = int(season_nb, 10)
        ep = int(episode_nb, 10)
        return "%s|%s|%s|%s|%s" % (show_name, se, ep, remain, f)
    def descend(self, dir, output=True, level=0):
        pad = ' ' * level
        for f in os.listdir(dir):
            p = os.path.join(dir, f)
            if os.path.isdir(p):
                if output:
                    print pad + "[%s]" % f
                self.descend(p, output, level + 1)
            else:
                m = self.process(p, f)
                if m is None:
                    self._fail.append(p)
                    if output:
                        print pad + "FAIL: %s" % f
                else:
                    self._ok.append((p, m))
                    if output:
                        print pad + self.print_result(m, f)
    def process(self, dir, fname):
        # First try to find all the episode information in the file
        # name itself...
        text = self.clean_video_name_prematch(self.path_tail(dir, 1))
        for rx in self._tvshow_regexes:
            matches = rx.search(text)
            if matches is not None \
                    and self.apply_postmatch_filters(dir, matches):
                return matches
        # ...then fall back to the show/season directories in the path.
        text = self.clean_video_name_prematch(self.path_tail(dir, 3))
        for rx in self._tvshow_path_regexes:
            matches = rx.search(text)
            if matches is not None \
                    and self.apply_postmatch_filters(dir, matches):
                return matches
        return None
    def path_tail(self, filepath, components):
        """
        Return the rightmost N components of filepath, joined as a valid
        path. For example path_tail('/videos/tv/lost/season1/01-pilot.avi',
        3) will return 'lost/season1/01-pilot.avi'.
        """
        return os.sep.join(filepath.split(os.sep)[-components:])
    def clean_video_name_prematch(self, video_name):
        """
        Remove the file extension, then replace a set of characters often
        used in downloaded filenames in place of spaces (such as . and _)
        with spaces. Also replace common separators (such as - and ,) with
        spaces.
        """
        extpos = video_name.rfind(os.extsep)
        if extpos != -1:
            video_name = video_name[:extpos]
        return self._prematch_discard.sub(" ", video_name)
    def apply_postmatch_filters(self, path, matches):
        """
        Apply all the post-match filters and return True only if every
        filter returns True (i.e. all filters approve the match).
        """
        for filt in self._postmatch_filters:
            if not filt(path, matches):
                return False
        return True
    def post_match_year(self, path, match):
        """
        Check the result of regexps that match serial numbers. When those
        regexps are applied to movie titles containing a year, they wrongly
        categorize the movie as a TV show: for example, 1998 can be read as
        season 19, episode 98.
        This filter doesn't allow anything concatenating to 1900 or larger
        to be considered a TV show, as TV shows rarely have more than 18
        seasons, while most digitally available movies are dated after the
        year 1900.
        """
        if match is None or len(match.groups()) != 4:
            print "!!! NO MATCH or WRONG MATCH !!!"
            return False
        _, season, episode, _ = match.groups()
        year = int(season + episode, 10)
        return year < 1900
t = Test()
t.descend('/home/uriboni/tmp/video/film', False)

print "----------------- SUCCESS: -----------------------"
for filepath, matches in t._ok:
    print t.print_result(matches, filepath)

print "----------------- FAILURE: ------------------------"
for fail in t._fail:
    print fail

print "----------------- STATS: -------------------------"
print "OK:", len(t._ok)
print "FAIL:", len(t._fail)
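For anyone reading the attached script, the ambiguity that the FIXME comment describes can be isolated in a few lines (a standalone sketch, not part of the scanner itself):

```python
def see_split(digits):
    # Read a run of 3+ digits as <season><two-digit episode>, as the
    # third file name pattern in the script does ("lost.305" -> 3x05).
    return int(digits[:-2], 10), int(digits[-2:], 10)

def looks_like_year(season, episode):
    # The post_match_year filter in spirit: reject pairs whose digits
    # concatenate to 1900 or more, since those are usually years.
    return season * 100 + episode >= 1900
```

A number like 305 splits cleanly into season 3, episode 5, while 2000 splits into season 20, episode 0 and is vetoed as a probable year; purely sequential anime numbering is the case neither rule can settle without looking at sibling files.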