adding metadata to documents via web scraping

2008-05-29 Thread D Baser
Hi,
I'm trying to add metadata to local files, here films, by indexing
appropriate web content, here text from the imdb site.

Thus, for a start I set up an external filter (internal would be nicer of
course for adding specific properties such as director, actor, title, year,
rating, etc.):

<filter>
  <mimetype>video/x-msvideo</mimetype>
  <extension>.avi</extension>
  <command>beagleFilterMovies.pl</command>
  <arguments>%s</arguments>
</filter>

This external filter calls a Perl script that retrieves the matching web page
from a film site and returns its content as plain text. The filename, e.g.
Indiana_Jones_4.avi, is used in a Google "I'm Feeling Lucky" query
(see script below).
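
To make the query step concrete, here is a minimal sketch of how a file name
could be turned into such a query URL. The helper name build_query_url is
hypothetical, the basename() call assumes that a full path may be passed in,
and "&btnI" is assumed to be the parameter that triggers the I'm Feeling Lucky
redirect.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: turn a film file name into an "I'm Feeling Lucky"
# query URL restricted to imdb title pages.
sub build_query_url {
    my ($path) = @_;
    my $name = basename($path);       # drop any directory part (assumption: a path may be passed)
    $name =~ s/\.avi$//i;             # drop the extension
    $name =~ s/[^a-zA-Z0-9-]+/+/g;    # each run of other characters becomes one "+"
    # "&btnI" is assumed to be the I'm Feeling Lucky parameter
    return "http://www.google.com/search?q=$name+site%3Awww.imdb.com%2Ftitle&btnI";
}

# Print the URL for a file name given on the command line, if any.
print build_query_url($ARGV[0]), "\n" if @ARGV;
```

Running it with Indiana_Jones_4.avi as the argument prints the URL, which can
then be opened by hand to check that it really lands on the expected film page.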

Somehow I do not get any results back when I afterwards search in beagle for,
say, harrison ford.
Any idea why that doesn't work? The script does get called, as I can see in my
beagle-test.log.

Is content indexing perhaps disabled for videos?

Cheers, d. baser


Perl script beagleFilterMovies.pl:

#!/usr/bin/perl
$s = $ARGV[0];

# log each call to confirm that beagle invokes the filter
`echo "beagle found file $s" >> beagle-test.log`;

# clean filename to use in query
$s =~ s/\.avi$//ig;
$s =~ s/[^a-zA-Z0-9-]/+/g;

# get html of film page
$c = `lynx -source 'http://www.google.com/search?q=$s+site%3Awww.imdb.com%2Ftitle&btnI'`;

# strip html tags
$c =~ s/<script.*?>.*?<\/script>/ /gis;
$c =~ s/<style.*?>.*?<\/style>/ /gis;
$c =~ s/<[^>]+>/ /g;
# strip html entities such as &nbsp; or &#38;
$c =~ s/&[a-z#0-9]+;/ /gi;

print $c;
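
The tag-stripping step can also be tried on a sample string, without going
through beagle or lynx. The following is only a sketch: the helper name
strip_html and the sample markup are made up for illustration, and regexes like
these are a rough approximation rather than a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reduce an HTML page to plain text for indexing.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<script\b[^>]*>.*?<\/script>/ /gis;   # drop scripts including their body
    $html =~ s/<style\b[^>]*>.*?<\/style>/ /gis;     # drop stylesheets including their body
    $html =~ s/<[^>]+>/ /g;                          # drop all remaining tags
    $html =~ s/&[a-z#0-9]+;/ /gi;                    # replace entities such as &nbsp; by a space
    $html =~ s/\s+/ /g;                              # collapse whitespace
    $html =~ s/^\s+|\s+$//g;                         # trim both ends
    return $html;
}

# Made-up sample page, just to exercise the helper.
my $sample = '<html><style>b{}</style><b>Harrison&nbsp;Ford</b><script>x()</script></html>';
print strip_html($sample), "\n";
```

If the plain text printed here looks right but beagle still finds nothing, the
problem is more likely on the indexing side than in the script.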
___
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers


Re: adding metadata to documents via web scraping

2008-05-29 Thread Debajyoti Bera
> Updated my perl script (had a copy/paste error in the url somehow), now it
> seems to work.

Nice.

> Unfortunately the Desktop Search doesn't show snippets for videos -- see
> attached screenshot: harrison ford is found for Indy.avi but it doesn't
> show where.

Yeah, that's a bug :-(
http://bugzilla.gnome.org/show_bug.cgi?id=371152

- dBera

-- 
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE / Mandriva / Inspiron-1100