Some follow-ups on Vishesh's answers:

On 01/02/2011 07:54 PM, Darren Cruse wrote:
> Hi guys, sorry for some basic questions, but I've been considering use
> of Nepomuk for a project gathering metadata from web pages.
>
> FWIW, an initial cut at the project was done using Java and XQuery that
> spidered the web pages and downloaded them prior to creating RDF/XML
> that drove a web-based UI for searching. I'll spare the details, but
> some parts of that worked well and some not so well, and I was trying
> to understand whether something like Nepomuk would bring more "off the
> shelf" help for doing what we'd done, yet be open enough for us to
> enhance the generated metadata where needed.
>
> My questions:
>
> 1. Is the Mandriva Linux distro the one most likely to get me the
> latest/greatest goodies for using Nepomuk KDE?
>
> The initial effort happens to be on Ubuntu, but reading the archives,
> just installing kubuntu-desktop won't get me all that Mandriva has,
> correct?
>
> (Hope that's not a dumb question, btw; old-time Solaris guy here, still
> a little green with Linux.)
>
> 2. I assume HTML files are indexed? But is anything more than basic
> metadata (file size, etc.) extracted?
>
> In particular, the project requires that triples are created that
> refer to the other resources linked to by an HTML page. I.e. the URIs
> of images the page may use, other web pages it links to, Flash files
> it might embed, etc. have to wind up as triples in the metadata.
This is not done currently but should be fairly easy. I even think it
could fit into the Strigi plugin. All one has to do is gather all the
links and resolve them. If they are absolute: simple. If they are
relative: check whether the file exists and link to the corresponding
Nepomuk resource (in the Strigi plugin this simply means using the
local file URL).

> 3. To add to the fun, the project also wants entities that are more
> conceptual. E.g. if the HTML pages represent a book broken down into
> volumes and sections and chapters etc., the metadata must include the
> names of the volume, section, chapter, etc. that the HTML page refers
> to. I.e. this is more in the realm of "entity extraction"/"NLP" kind
> of stuff.
>
> Are there examples of something like that around, where an app would
> customize the metadata being extracted?
>
> Does this mean I'm using "Scribo" for its NLP extraction features?
> Or that I'm customizing how the "Strigi" indexing works?
>
> Are such things part of the current Mandriva distro, or are these
> only in the playground?

Nothing has been done in this direction. But as long as you have the
chapter/section information, storing it in Nepomuk is simple. The only
"hard" part is checking whether the book or chapter already exists,
but I think even that could be done easily by linking the actual files
to the book and chapter resources properly. Of course, you would need
an ontology to describe books; I did not check whether something like
that exists yet.

> 4. Barring anything real specific for #3, do I understand that
> Virtuoso is now the default/preferred triple store?
>
> And that I should be able to write software that adds/updates triples
> in Virtuoso directly if I choose to?
>
> Or that hits a SPARQL endpoint, e.g. to display the info using a
> custom web app?
>
> (Part of this question also relates to the project using Java, btw, so
> solutions that avoid the C++ API are a better fit for my work-mates,
> though I wouldn't mind. :)
>
> 5.
> Not a show stopper but just curious: is Sesame still supported as
> an alternative backend store?

There is only Virtuoso and nothing else. Sesame was a nightmare.

Cheers,
Sebastian

> Apologies for all the newbie questions.
>
> But so far Nepomuk looks like the bee's knees, btw. :)
>
> Darren
> _______________________________________________
> Nepomuk mailing list
> [email protected]
> https://mail.kde.org/mailman/listinfo/nepomuk
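For readers who want to try the link-gathering step Sebastian describes (resolve absolute links as-is, resolve relative ones against the page's own URI before linking them to Nepomuk resources), a minimal Java sketch follows. The class and method names are hypothetical, not part of any Nepomuk or Strigi API:

```java
import java.net.URI;

public class LinkResolver {
    // Resolve a link found in a page against the page's own URI.
    // Absolute links pass through unchanged; relative links are
    // resolved against the base, as described above.
    static String resolveLink(String pageUri, String link) {
        return URI.create(pageUri).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Absolute link: returned as-is.
        System.out.println(resolveLink("http://example.org/book/ch1.html",
                                       "http://example.org/img/fig1.png"));
        // Relative link: resolved against the page URI,
        // giving http://example.org/book/fig2.png
        System.out.println(resolveLink("http://example.org/book/ch1.html",
                                       "fig2.png"));
    }
}
```

Each resolved URI could then be stored as the object of a triple on the page's resource, per question 2.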

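On question 4: querying a SPARQL endpoint over plain HTTP needs no C++ bindings at all, which may suit a Java team. A minimal sketch of building the GET URL for such a query, assuming Virtuoso's usual default endpoint of http://localhost:8890/sparql (verify against your installation):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SparqlQueryUrl {
    // Build the GET URL for a SPARQL query against an HTTP endpoint.
    // The query string must be percent-encoded before it goes into
    // the "query" parameter.
    static String queryUrl(String endpoint, String query)
            throws UnsupportedEncodingException {
        return endpoint + "?query=" + URLEncoder.encode(query, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        String q = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10";
        System.out.println(queryUrl("http://localhost:8890/sparql", q));
        // The resulting URL can be fetched with any HTTP client
        // (java.net.HttpURLConnection, curl, ...) to retrieve the
        // query results, e.g. as SPARQL XML or JSON.
    }
}
```

A custom web app, as asked about in question 4, could issue such requests directly instead of going through the Nepomuk C++ API.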