Hi all,
just to clarify: so far I haven't contacted the authors.
Right now I'm trying to reproduce the experiments but it looks like I'd
need some more details:
* did they use the Jena in-memory engine or was it TDB?
* did they increase the Java heap space? When using the Jena CLI,
JVM_ARGS should probably be set - maybe I'm wrong, but it looks like for
3.0.1 the default value is hard-coded to -Xmx1024M
So far I have tried different versions of Apache Jena (3.0.1, 3.1.1,
3.4.0) but could not reproduce any of the reported errors. However, I
haven't used the larger BTC dataset (~100G unzipped) yet. I'm using the
mentioned Polish DBpedia dump, but even here I'm a bit lost as I couldn't
figure out which files they loaded to get the 1.3 million triples (even
the dataset with mapping-based properties alone already comprises
~3 million triples).
The type of query they reported to fail with an OOM exception was
SELECT ?o WHERE {A B* ?o.} LIMIT 100
with A and B being valid URIs in the dataset. Thus, I used
SELECT ?o {<http://dbpedia.org/resource/Nissan_Almera>
<http://dbpedia.org/ontology/successor>* ?o } LIMIT 100
and it works as expected
╔═════════════════════════════════════════════╗
║ o                                           ║
╠═════════════════════════════════════════════╣
║ <http://dbpedia.org/resource/Nissan_Almera> ║
║ <http://dbpedia.org/resource/Nissan_Tiida>  ║
╚═════════════════════════════════════════════╝
Note that dbr:Nissan_Almera has a dbo:successor relation to itself -
something that I would expect to be a corner case that could force the
problem.
@Andy can you think of a special case that would lead to this weird bug
and return the subject resource 100 times? I can see that you changed
the data structure which keeps track of the visited nodes to a set, but
even with a list the containment check would be done by an equality
check on the Node object.
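
To make the reasoning above concrete, here is a minimal Python sketch of
how I would expect "A B* ?o" to be evaluated with a visited set. This is
not Jena's actual code: the function name eval_star, the prefixed names
and the toy triples are all made up for illustration. It only shows
that an equality-based containment check should stop both the self-loop
case and the case where the predicate does not occur in the data.

```python
from collections import deque


def eval_star(triples, start, predicate, limit=None):
    """Evaluate `start predicate* ?o` over an iterable of (s, p, o) triples.

    Hypothetical helper for illustration only, not Jena's implementation.
    """
    # Index outgoing edges for the requested predicate only.
    out = {}
    for s, p, o in triples:
        if p == predicate:
            out.setdefault(s, []).append(o)

    visited = {start}   # the zero-length path always matches the start node
    results = [start]
    queue = deque([start])
    while queue and (limit is None or len(results) < limit):
        node = queue.popleft()
        for nxt in out.get(node, []):
            if nxt not in visited:   # equality-based containment check
                visited.add(nxt)
                results.append(nxt)
                queue.append(nxt)
    return results if limit is None else results[:limit]


# Toy data (assumption): mirrors the self-loop on Nissan_Almera noted above.
TOY_TRIPLES = [
    ("dbr:Nissan_Almera", "dbo:successor", "dbr:Nissan_Almera"),
    ("dbr:Nissan_Almera", "dbo:successor", "dbr:Nissan_Tiida"),
]

print(eval_star(TOY_TRIPLES, "dbr:Nissan_Almera", "dbo:successor", limit=100))
print(eval_star(TOY_TRIPLES, "dbr:Nissan_Almera", "dbo:doesNotExist", limit=100))
```

With a check like this, the self-loop yields each node once, and a
predicate that is absent from the data yields only the start node once,
rather than 100 copies of it.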
I also tried the case with the inverse operator
SELECT ?o1 WHERE { ?o1 ^P1 S1 . }
and it did return a non-empty result for me - as expected.
Either something forces Jena to fail on the BTC dataset or I'm doing
something wrong (which cannot be ruled out for sure :D )
In general, it would just be interesting to know whether those bugs
still occur or have been fixed by recent code changes.
Cheers,
Lorenz
On 19.10.2017 11:08, Marco Neumann wrote:
> did you try to contact Daniel Janke, Adrian Skubella or Steffen Staab
> to get a response?
>
> the findings seem to be based on work that has been published online as
> part of a bachelor’s thesis by Adrian Skubella.
>
> https://west.uni-koblenz.de/sites/default/files/studying/theses-files/bachelorarbeit-adrian-skubella-benchmarks-for-sparql-property-paths.pdf
>
>
>
> On Thu, Oct 19, 2017 at 10:54 AM, Lorenz B. <[email protected]>
> wrote:
>> For me this is really bad practice. It also looks like they did the
>> benchmark more than one year ago. Otherwise due to JENA-1195 this error
>> wouldn't occur anymore. And submission deadline was August 6th, 2017.
>> Their experiments contain 8 queries, rerunning those shouldn't take ages...
>>
>> I'm currently trying to reproduce the results of the paper, but the
>> whole experimental setup remains unclear. I'm wondering if they used
>> just the Jena CLI or TDB. The same holds for RDF4J. I'm puzzled because
>> the runtimes in the eval section are quite small, but even loading the
>> data of their benchmark takes much more time. So maybe they used the
>> RDF4J server.
>>
>> The worst thing is that they didn't contact any of the developers. Or
>> did they talk to somebody here and then Andy created the ticket
>> JENA-1195? Also for the other queries that failed, I would expect to see
>> tickets on Apache JIRA or at least a hint on the Jena mailing list...
>>
>> @Andy I'm also wondering whether JENA-1317 addresses the problem with
>> the empty result of benchmark query containing an inverse property path.
>>
>>
>> On 18.10.2017 17:03, [email protected] wrote:
>>> As you know, Andy, I'm going to ISWC this year-- shall I buttonhole
>>> them and give them our POV? :grin:
>>>
>>> In all seriousness, from what I can tell the results amount to "Using
>>> older versions of our comparands and without contacting the projects
>>> in question we couldn't find a store that implements every property
>>> path feature correctly and some fail entirely."
>>>
>>> I'm not really sure how useful that information is...? But I am ready
>>> to do a benchmarking paper for next year. Seems like it's a lot easier
>>> than I thought!
>>>
>>>
>>> ajs6f
>>>
>>>
>>> Andy Seaborne wrote on 10/17/17 9:28 AM:
>>>> Hi Lorenz,
>>>>
>>>> Looks like JENA-1195 which is fixed. Does that look like it?
>>>>
>>>> I think it is a shame when papers focus on bugs rather than discussing
>>>> and even fixing them. Bugs aren't research.
>>>>
>>>> Path evaluation could be improved to stream in more cases (that's why
>>>> LIMIT didn't help), but 1195 explains the slowness and memory.
>>>>
>>>> Andy
>>>>
>>>> On 17/10/17 07:58, Lorenz B. wrote:
>>>>> Hi,
>>>>>
>>>>> I just walked through the papers for the upcoming ISWC conference and
>>>>> found a paper about benchmarking of SPARQL property paths [1] .
>>>>>
>>>>> Not sure if this is relevant, but it looks like Jena has some issues
>>>>> with different types of queries using the property path. For example,
>>>>>
>>>>> SELECT ?o WHERE {A B* ?o.} LIMIT 100
>>>>>
>>>>> led to an OOM error on non-cyclic data. Here is the relevant part of
>>>>> the paper:
>>>>>
>>>>>> While benchmarking Virtuoso, RDF4J and Allegrograph no errors or
>>>>>> exceptions have occurred. During the benchmark process of Jena an
>>>>>> OutOfMemoryError has been thrown whenever a query with the * operator
>>>>>> was used. In order to identify the cause of the error, the amount of
>>>>>> results the query should return has been limited to 100. The results
>>>>>> that have been returned by a query of the form SELECT ?o WHERE {A B*
>>>>>> ?o.} LIMIT 100 where A and B are valid IRIs, consisted of 100 times A.
>>>>>> Due to this fact it is presumable that the query containing the *
>>>>>> operator returns A recursively until the main memory was full. To
>>>>>> ensure that this behaviour is not caused by cycles in the dataset a
>>>>>> query of the same form but with a predicate IRI that did not exist in
>>>>>> the dataset was executed. This query still returned 100 times A. This
>>>>>> indicates, that the * operator is not implemented correctly.
>>>>> In addition, the experiments showed that:
>>>>>> Due to the problems with the * operator the queries 4, 7 and 8 could
>>>>>> not be processed. Additionally query 3, 5, and 6 returned no results
>>>>>> after 1 hour and thus, were aborted. Query 1 returned an empty and
>>>>>> thus, incomplete result set. Only for query 2 a valid result was
>>>>>> returned. Due to the lack of comparable results, Jena has been omitted
>>>>>> in the comparison of triple stores.
>>>>> In the discussion section, they summarize the overall performance of
>>>>> Jena by
>>>>>
>>>>>> Jena could not return results for any query in under 1 hour besides
>>>>>> query 2. Furthermore, the * operator could not be evaluated at all and
>>>>>> the inverse operator returned empty result sets.
>>>>> It looks like they used version 3.0.1, so maybe this doesn't hold
>>>>> anymore for all of the queries. If not, it could be interesting to
>>>>> improve performance and/or completeness.
>>>>>
>>>>> I hope I didn't miss some open JIRA ticket, but in general I just
>>>>> wanted to highlight the presence of a published benchmark for this
>>>>> kind of query.
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Lorenz
>>>>>
>>>>> [1] http://ceur-ws.org/Vol-1932/paper-04.pdf
>>>>>