Mark Diggory wrote:
> Hi Stefano,
>
> On Jul 17, 2007, at 8:27 PM, Stefano Mazzocchi wrote:
>
>> Mark Diggory wrote:
>>> Hello Simile,
>>>
>>> I'm hunting for any resources on SPARQL/RDF-driven reporting
>>> engines. We're reviewing possible solutions for reporting on top of
>>> [EMAIL PROTECTED], and given we're very bent on getting RDF usage more
>>> mainstream, we are interested in something very flexible that would
>>> allow various sources to be queried and return result sets that can
>>> be processed in something like JSP/Velocity/Java/whatever to produce
>>> canned reports we design against the following types of data sources:
>>>
>>> Apache/Log4j logs
>>> DSpace relational databases (PostgreSQL)
>>> DSpace object model (Java)
>>> Metadata, Policy and History RDF triple stores (Sesame/Java/SPARQL/Longwell)
>>>
>>> I've been exploring some of the Simile tools, especially "Referee",
>>> with the initial interest of getting Apache log data into a
>>> triple store and available for generating reports against.
>>
>> That's not what Referee is about, btw. Referee is not a way to
>> transform Apache logs into RDF; it's a way to mine referrer logs out
>> of Apache logs, find out who links to you, and provide a little
>> metadata about that. *That* metadata is then dumped out as RDF, not
>> the logs.
>
> Yes, I understand what the functionality in Referee is and "is not".
> I've fired it up in Eclipse, tested its output against our Apache
> logs and reviewed the codebase. Still, your work shows how to
> efficiently parse the logs and "mine" them for content, which is my
> interest.
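The parse-and-mine stage being discussed can be sketched roughly like this (a minimal Python illustration, not Referee's actual code; the regex assumes Apache's "combined" log format, and the function name and the simile.mit.edu filter are invented for the example):

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Regex for Apache's "combined" log format; an illustrative layout,
# not Referee's actual parser.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def referrer_domains(lines):
    """Count external referrer domains: a 'who links to us' sketch."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed lines
        ref = m.group("referrer")
        if ref and ref != "-":
            domain = urlparse(ref).netloc
            # ignore self-referrals from our own site (illustrative filter)
            if domain and not domain.endswith("simile.mit.edu"):
                counts[domain] += 1
    return counts
```

Referee then goes further: it dereferences those referrer URLs (with heavy caching) to gather metadata about the linking pages, and only that metadata is emitted as RDF.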
Sure, but there is nothing RDF-related in that stage: log parsing is
done with regexps, and the 'mining' is done by heavily cached HTTP
lookups.

> I was also considering that the RDF one would generate would have a
> different "spo" structure, more appropriate for combining it with
> other data about the web server's resources (i.e. joining the log
> information with the Community/Collection/Item/Bitstream database
> content rather than the ).

I know what you're thinking: if we can make the logs 'mixable', then I
can get the raw logs from various web servers and the metadata about
the URLs these logs indicate, RDFize them, and then pour them together
in a sort of 'liquid' form so that I can query over them later,
combined.

I thought about this too and, yes, I consider that an appealing
scenario, on paper. What scared me enough not to try it, though, was
the sheer magnitude of the data: each Apache log event needs around 10
statements to be modeled even in the simplest form. simile.mit.edu,
alone, generates around 300k * 10 = 3M statements a day, which is
something like 1B statements a year at the current rate. And that is
just a single server! (I'm not even counting static.simile.mit.edu,
which serves our JS libraries.)

Having the entire logs in there (and the entire SVN history, the
entire JIRA history, the entire email archives, etc., as we started to
collect at http://simile.mit.edu/data/) would also allow us to do very
interesting operations on the data... for example, one could mine the
hot 'URL paths', or one could infer who links to particular resources
(Referee-like) and so on... but the truth is, packages like awstats or
analog already do all that for you, and they certainly don't require
RDF to do it.

>>> While tools for processing Apache logs and gathering statistics do
>>> exist, it might be of greater interest to get such data into a
>>> common reporting framework with other data sources.
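To make the magnitude estimate concrete, here is a rough sketch of RDFizing one parsed log event into N-Triples (the http://example.org/log# vocabulary and the particular field set are invented for illustration; no standard log ontology is implied):

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def event_to_ntriples(event_id, fields):
    """Turn one parsed log event into N-Triples (illustrative vocabulary)."""
    subj = "<http://example.org/event/%s>" % event_id
    # one rdf:type statement plus one statement per recorded field
    triples = ["%s %s <http://example.org/log#Event> ." % (subj, RDF_TYPE)]
    for prop, value in fields.items():
        triples.append('%s <http://example.org/log#%s> "%s" .'
                       % (subj, prop, value))
    return triples

fields = {
    "host": "18.7.22.83", "time": "17/Jul/2007:20:27:00 -0400",
    "method": "GET", "path": "/exhibit/", "protocol": "HTTP/1.1",
    "status": "200", "bytes": "1234",
    "referrer": "http://example.com/blog", "agent": "Mozilla/5.0",
}
print(len(event_to_ntriples("e1", fields)))  # prints 10

# ~10 statements per event * ~300k events/day ~= 3M statements/day,
# i.e. on the order of a billion statements a year for one server.
```

Even this flat modeling, with no reification or provenance statements, already reaches those volumes.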
>>> My initial ideas were originally based on relational tools like
>>> Crystal Reports/Jasper Reports. I'm seeking any information on
>>> hybrid/common solutions that might span sources of various
>>> formats/protocols, and one platform of interest to me at the moment
>>> is BIRT:
>>>
>>> http://www.eclipse.org/birt/phoenix/intro/
>>>
>>> The logic being that RDF/SPARQL data sources could be
>>> created/adapted to the framework, which already has its own report
>>> generation tooling. One could go directly from Apache logs into
>>> BIRT, but that would be much less RDF-centric, and we would still
>>> need to explore access to our triple stores used in our Policy and
>>> History subsystems.
>>>
>>> Any recommendations or suggestions would be received with much
>>> gratitude.
>>
>> Without knowing what your reporting requirements are, there is not
>> much I can recommend. Even the definition of reporting is a little
>> fuzzy, I'm afraid.
>
> I was being purposefully vague to see what would come back from the
> community at large (tools/research I may not be privy to at this
> time). But more specifically, I mean the typical SQL-centric report
> generated against an SQL database, like those one would get from SQL
> Server, Pentaho, BIRT, Crystal Reports, etc.
>
>> I've seen BIRT, which seems to me one of those things that appeal to
>> management more than to developers, as I never understood, really,
>> the difference between a web report and a web page generated out of
>> one or more database queries... I guess reporting tools appeal to
>> those who can't create a database-driven web page on their own.
>
> Think outside the box. I suppose it'd be great if we lived in a world
> where children were taught programming languages in grade school and
> everyone could be a software developer, but that's not the reality.
> It's unrealistic to expect folks to do their jobs efficiently and
> productively if they have to invent everything from scratch.
> We need a solution for users of our systems to generate reports that
> does not require a software engineer for such a remedial task...

I'm perfectly aware that the world is not composed of computer
programmers, or I wouldn't be working for this project, which tries
exactly to solve the problem of how pervasive that assumption is
around the semweb.

On the other hand, it has been shown repeatedly that painting tools or
IDEs onto things doesn't remove the need for a certain mental model to
exist in the user's mindset: Access will never be more popular than
Excel, no matter how usable the Access UI is. My point is that you can
hide the complexity of syntaxes with GUI tools, but you can't hide the
need for a certain type of person, with a specific mental model, to
solve a particular problem.

> BIRT is just an open-source example of the type of tool that might
> fill a need I'd like to make available. It's not a perfect fit (just
> as your Referee code isn't). But this is an exploration of available
> tools I'm currently completing, and its need is "two-fold":
>
> 1.) I'd like not to have to "program" reports for my users (our
> Operations team) but give them tools to easily do it without a
> Computer Science degree. Sometimes, if you give users a "blank slate"
> and require them to customize, compile and deploy an application,
> it's too much, and they get overwhelmed (and with good reason).

I certainly did not advocate that you give your average users vi and a
compiler and tell them to make a report. I'm advocating that, just
like you build DSpace so that people can use it without having to
write all the code themselves, it's perfectly conceivable to write
some reporting software that allows a minimal configurability for
users. Such configurability could be:

1) which queries to run
2) which XSLT stylesheet to use on the result set

Sure, you need a computer science guy even for that stuff (a junior
now, not a senior)...
but if you think that painting a GUI over it would make that
requirement go away, I think you're setting yourself up for failure.

>> But really, what is a reporting engine for you? Nothing prevents
>> you, right now, from taking your data, RDFizing it, dumping it into
>> a triple store of your choice and then running SPARQL queries on
>> top, obtaining an XML representation and XSLT-transforming it to
>> anything you want.
>
> 2.) I'd rather not reinvent a "wheel" I have to bug-fix, maintain and
> document (and which doesn't align with existing best practices and
> tools already out there).

Fair enough; that I understand.

>> Since I doubt your reports need to be 'fast' in being generated,
>> this can work just fine for you and can integrate well into DSpace's
>> future Cocoon-based XML-pipelined frontend.
>
> Nor do they change dramatically over time... But rendering/generation
> is not such a big issue and can be done anywhere via any technology
> we choose... the capability/functionality to join disparate sources
> of data is the more important requirement.

Sure, but this influences the entire workflow, as current report tools
are pretty much all based on relational technology or their own
internal models. AFAIK, there is no report tool that uses RDF as its
internal data model.

>> But if what you're looking for is an IDE to graphically construct
>> your SPARQL query, or a visual drag/drop interface to construct your
>> report as a portal output, no, I haven't seen anything like that,
>> nor would I hold my breath for it.
>
> That's too bad... It would make for a very powerful tool.

Possibly. But I think the research on the UI interaction, alone, would
very likely need an entire project.

-- 
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit . edu

-------------------------------------------------------------------
_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
