Mark Diggory wrote:
> Hi Stefano,
>
> On Jul 17, 2007, at 8:27 PM, Stefano Mazzocchi wrote:
>
>> Mark Diggory wrote:
>>> Hello Simile,
>>>
>>> I'm hunting for any resources on SPARQL/RDF-driven reporting
>>> engines. We're reviewing possible solutions for reporting on top of
>>> [EMAIL PROTECTED], and given we're very bent on getting RDF usage more
>>> mainstream, we are interested in something very flexible that would
>>> allow various sources to be queried and return result sets that can
>>> be processed in something like JSP/Velocity/Java/whatever to produce
>>> canned reports we design against the following types of data sources:
>>>
>>> Apache/Log4j logs
>>> DSpace relational databases (PostgreSQL)
>>> DSpace object model (Java)
>>> Metadata, Policy and History RDF triple stores (Sesame/Java/SPARQL/Longwell)
>>>
>>> I've been exploring some of the Simile tools, especially "Referee",
>>> with the initial interest of getting Apache log data into a
>>> triple store and available for generating reports against.
>>
>> That's not what Referee is about, btw. Referee is not a way to
>> transform Apache logs into RDF; it's a way to mine referrer logs out
>> of Apache logs, find out who links to you, and provide a little
>> metadata about that. *That* metadata is then dumped out as RDF, not
>> the logs.
>
> Yes, I understand what the functionality in Referee is and "is not".
> I've fired it up in Eclipse, tested its output against our Apache
> logs and reviewed the codebase. Still, your work shows how to
> efficiently parse the logs and "mine" them for content, which is my
> interest.
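The parse-and-mine stage being discussed can be sketched roughly like this (a minimal Python illustration, not Referee's actual code; the regex assumes Apache's "combined" log format, and the function name and the simile.mit.edu filter are invented for the example):

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Regex for Apache's "combined" log format; an illustrative layout,
# not Referee's actual parser.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def referrer_domains(lines):
    """Count external referrer domains: a 'who links to us' sketch."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip malformed lines
        ref = m.group("referrer")
        if ref and ref != "-":
            domain = urlparse(ref).netloc
            # ignore self-referrals from our own site (illustrative filter)
            if domain and not domain.endswith("simile.mit.edu"):
                counts[domain] += 1
    return counts
```

Referee then goes further: it dereferences those referrer URLs (with heavy caching) to gather metadata about the linking pages, and only that metadata is emitted as RDF.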
Sure, but there is nothing RDF-related in that stage: log parsing is
done with regexps, and the 'mining' is done by heavily cached HTTP
lookups.

> I was also considering that the RDF one would generate would have a
> different "spo" structure, more appropriate for combining it with
> other data about the web server's resources (i.e. joining the log
> information with the Community/Collection/Item/Bitstream database
> content rather than the ).

I know what you're thinking: if we can make the logs 'mixable', then I
can get the raw logs from various web servers and the metadata about
the URLs these logs indicate, RDFize them, and then pour them together
in a sort of 'liquid' form so that I can query over them later,
combined.

I thought about this too and, yes, I consider that an appealing
scenario, on paper. What scared me enough not to try it, though, was
the sheer magnitude of the data: each Apache log event needs around 10
statements to be modeled even in the simplest form. simile.mit.edu,
alone, generates around 300k * 10 = 3M statements a day, which is
something like 1B statements a year at the current rate. And that is
just a single server! (I'm not even counting static.simile.mit.edu,
which serves our JS libraries.)

Having the entire logs in there (and the entire SVN history, the
entire JIRA history, the entire email archives, etc., as we started to
collect at http://simile.mit.edu/data/) would also allow us to do very
interesting operations on the data... for example, one could mine the
hot 'URL paths', or one could infer who links to particular resources
(Referee-like) and so on... but the truth is, packages like awstats or
analog already do all that for you, and they certainly don't require
RDF to do it.

>>> While tools for processing Apache logs and gathering statistics do
>>> exist, it might be of greater interest to get such data into a
>>> common reporting framework with other data sources.
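To make the magnitude estimate concrete, here is a rough sketch of RDFizing one parsed log event into N-Triples (the http://example.org/log# vocabulary and the particular field set are invented for illustration; no standard log ontology is implied):

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def event_to_ntriples(event_id, fields):
    """Turn one parsed log event into N-Triples (illustrative vocabulary)."""
    subj = "<http://example.org/event/%s>" % event_id
    # one rdf:type statement plus one statement per recorded field
    triples = ["%s %s <http://example.org/log#Event> ." % (subj, RDF_TYPE)]
    for prop, value in fields.items():
        triples.append('%s <http://example.org/log#%s> "%s" .'
                       % (subj, prop, value))
    return triples

fields = {
    "host": "18.7.22.83", "time": "17/Jul/2007:20:27:00 -0400",
    "method": "GET", "path": "/exhibit/", "protocol": "HTTP/1.1",
    "status": "200", "bytes": "1234",
    "referrer": "http://example.com/blog", "agent": "Mozilla/5.0",
}
print(len(event_to_ntriples("e1", fields)))  # prints 10

# ~10 statements per event * ~300k events/day ~= 3M statements/day,
# i.e. on the order of a billion statements a year for one server.
```

Even this flat modeling, with no reification or provenance statements, already reaches those volumes.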
>>> My initial ideas were originally based on relational tools like
>>> Crystal Reports/Jasper Reports. I'm seeking any information on
>>> hybrid/common solutions that might span sources of various
>>> formats/protocols, and one platform of interest to me at the moment
>>> is BIRT:
>>>
>>> http://www.eclipse.org/birt/phoenix/intro/
>>>
>>> The logic being that RDF/SPARQL data sources could be
>>> created/adapted to the framework, which already has its own report
>>> generation tooling. One could go directly from Apache logs into
>>> BIRT, but that would be much less RDF-centric, and we would still
>>> need to explore access to our triple stores used in our Policy and
>>> History subsystems.
>>>
>>> Any recommendations or suggestions would be received with much
>>> gratitude.
>>
>> Without knowing what your reporting requirements are, there is not
>> much I can recommend. Even the definition of reporting is a little
>> fuzzy, I'm afraid.
>
> I was being purposefully vague to see what would come back from the
> community at large (tools/research I may not be privy to at this
> time). But more specifically, I mean the typical SQL-centric report
> generated against an SQL database, like those one would get from SQL
> Server, Pentaho, BIRT, Crystal Reports, etc.
>
>> I've seen BIRT, which seems to me one of those things that appeal to
>> management more than to developers, as I never understood, really,
>> the difference between a web report and a web page generated out of
>> one or more database queries... I guess reporting tools appeal to
>> those who can't create a database-driven web page on their own.
>
> Think outside the box. I suppose it'd be great if we lived in a world
> where children were taught programming languages in grade school and
> everyone could be a software developer, but that's not the reality.
> It's unrealistic to expect folks to do their jobs efficiently and
> productively if they have to invent everything from scratch.
> We need a solution for users of our systems to generate reports that
> does not require a software engineer for such a remedial task...

I'm perfectly aware that the world is not composed of computer
programmers, or I wouldn't be working for this project, which tries
exactly to solve the problem of how pervasive that assumption is
around the semweb.

On the other hand, it has been shown repeatedly that painting tools or
IDEs onto things doesn't remove the need for a certain mental model to
exist in the user's mindset: Access will never be more popular than
Excel, no matter how usable the Access UI is. My point is that you can
hide the complexity of syntaxes with GUI tools, but you can't hide the
need for a certain type of person, with a specific mental model, to
solve a particular problem.

> BIRT is just an open-source example of the type of tool that might
> fill a need I'd like to make available. It's not a perfect fit (just
> as your Referee code isn't). But this is an exploration of available
> tools I'm currently completing, and its need is "two-fold":
>
> 1.) I'd like not to have to "program" reports for my users (our
> Operations team) but give them tools to easily do it without a
> Computer Science degree. Sometimes, if you give users a "blank slate"
> and require them to customize, compile and deploy an application,
> it's too much, and they get overwhelmed (and with good reason).

I certainly did not advocate that you give your average users vi and a
compiler and tell them to make a report. I'm advocating that, just
like you build DSpace so that people can use it without having to
write all the code themselves, it's perfectly conceivable to write
some reporting software that allows a minimal configurability for
users. Such configurability could be:

1) which queries to run
2) which XSLT stylesheet to use on the result set

Sure, you need a computer science guy even for that stuff (a junior
now, not a senior)...
but if you think that painting a GUI over it would make that
requirement go away, I think you're setting yourself up for failure.

>> But really, what is a reporting engine for you? Nothing prevents
>> you, right now, from taking your data, RDFizing it, dumping it into
>> a triple store of your choice and then running SPARQL queries on
>> top, obtaining an XML representation and XSLT-transforming it to
>> anything you want.
>
> 2.) I'd rather not reinvent a "wheel" I have to bug-fix, maintain and
> document (and which doesn't align with existing best practices and
> tools already out there).

Fair enough; that I understand.

>> Since I doubt your reports need to be 'fast' in being generated,
>> this can work just fine for you and can integrate well into DSpace's
>> future Cocoon-based XML-pipelined frontend.
>
> Nor do they change dramatically over time... But rendering/generation
> is not such a big issue and can be done anywhere via any technology
> we choose... the capability/functionality to join disparate sources
> of data is the more important requirement.

Sure, but this influences the entire workflow, as current report tools
are pretty much all based on relational technology or their own
internal models. AFAIK, there is no report tool that uses RDF as its
internal data model.

>> But if what you're looking for is an IDE to graphically construct
>> your SPARQL query, or a visual drag/drop interface to construct your
>> report as a portal output, no, I haven't seen anything like that,
>> nor would I hold my breath for it.
>
> That's too bad... It would make for a very powerful tool.

Possibly. But I think the research on the UI interaction, alone, would
very likely need an entire project.

-- 
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit . edu

-------------------------------------------------------------------
_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general
