Hi guys, sorry for some basic questions, but I've been considering using Nepomuk for a project that gathers metadata from web pages.
FWIW, an initial cut at the project was done using Java and XQuery: it spidered the web pages and downloaded them, then created RDF/XML that drove a web-based UI for searching. I'll spare you the details, but some parts of that worked well and some not so well. I'm trying to understand whether something like Nepomuk would bring more "off the shelf" help for what we did, yet be open enough for us to enhance the generated metadata where needed.

My questions:

1. Is the Mandriva Linux distro the one most likely to get me the latest/greatest goodies for using Nepomuk KDE? The initial effort happens to be on Ubuntu, but from reading the archives, just installing kubuntu-desktop won't get me all that Mandriva has, correct? (Hope it's not a dumb question btw; old-time Solaris guy here, still a little green with Linux.)

2. I assume HTML files are indexed? But is anything more than basic metadata (file size, etc.) extracted? In particular, the project requires triples that refer to the other resources linked to by an HTML page, i.e. the URIs of images the page uses, other web pages it links to, Flash files it embeds, etc. all have to wind up as triples in the metadata.

3. To add to the fun, the project also wants entities that are more conceptual. E.g. if the HTML pages represent a book broken down into volumes, sections, chapters, etc., the metadata must include the names of the volume, section, chapter, etc. that the HTML page belongs to. This is more in the realm of "entity extraction"/"NLP" kind of stuff. Are there examples of something like that around, where an app customizes the metadata being extracted? Does this mean I'd be using "Scribo" for its NLP extraction features? Or customizing how the "Strigi" indexing works? Are such things part of the current Mandriva distro, or are they only in the playground?

4. Barring anything real specific for #3, do I understand correctly that Virtuoso is now the default/preferred triple store? And that I should be able to write software that adds/updates triples in Virtuoso directly if I choose to? Or that hits a SPARQL endpoint, e.g. to display the info using a custom web app? (Part of this question relates to the project using Java, btw, so solutions that avoid the C++ API are a better fit for my work-mates, though I wouldn't mind :).

5. Not a show stopper, just curious: is Sesame still supported as an alternative backend store?

Apologies for all the newbie questions. But so far Nepomuk looks like the bee's knees btw. :)

Darren

_______________________________________________
Nepomuk mailing list
[email protected]
https://mail.kde.org/mailman/listinfo/nepomuk
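
To make question 2 concrete, here is a rough sketch of the kind of link triples meant there, written in Java since that is what the project uses. It is only an illustration under stated assumptions: the class name `LinkTriples`, the predicate URI `http://example.org/vocab#linksTo`, and the example page URIs are all made up for this sketch and are not part of Nepomuk, Strigi, or any real ontology. It also uses a naive regex where a real crawler would use a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkTriples {
    // Hypothetical predicate URI, for illustration only; a real setup would
    // pick a property from whatever ontology the triple store uses.
    static final String LINKS_TO = "http://example.org/vocab#linksTo";

    // Naive match for href="..." and src="..." attributes. This is an
    // assumption-laden shortcut; a proper HTML parser would be more robust.
    static final Pattern LINK_ATTR = Pattern.compile(
            "\\b(?:href|src)\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /** Returns one N-Triples line per linked resource found in the HTML. */
    static List<String> extract(String pageUri, String html) {
        URI base = URI.create(pageUri);
        List<String> triples = new ArrayList<>();
        Matcher m = LINK_ATTR.matcher(html);
        while (m.find()) {
            // Resolve relative references against the page's own URI.
            // Note: resolve() throws IllegalArgumentException on malformed input.
            URI target = base.resolve(m.group(1));
            triples.add("<" + base + "> <" + LINKS_TO + "> <" + target + "> .");
        }
        return triples;
    }

    public static void main(String[] args) {
        String html = "<html><body><img src=\"img/cover.png\">"
                + "<a href=\"chapter2.html\">Next</a></body></html>";
        for (String t : extract("http://example.org/book/chapter1.html", html)) {
            System.out.println(t);
        }
    }
}
```

For the sample page this prints two lines, one linking `chapter1.html` to `http://example.org/book/img/cover.png` and one linking it to `http://example.org/book/chapter2.html`. Whether Strigi can produce triples like these on its own, or whether they would have to be added by custom code along these lines, is exactly what question 2 is asking.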
