lalewis1 opened a new issue, #3535:
URL: https://github.com/apache/jena/issues/3535

   ### Version
   
   5.5.0
   
   ### What happened?
   
   ## issue description
   
   Some queries ignore the arq:queryTimeout value and run until completion.
   This can cause compounding performance problems if multiple such queries
   are run.
   
   This has been an issue for us as we have some large datasets (300 GB+) where
   these erroneous queries can take hours to complete. If a few of them run at
   the same time, the CPUs are consumed by these tasks for multiple hours,
   making the system unresponsive or seriously degrading performance during
   this time.
   
   I've done my best to narrow down the simplest forms of the queries that
   cause the problem, but I suspect there are more cases I have yet to identify.
   
   ## Reproducing it
   
   Below is a server I set up specifically to test this issue. I have found the
   same problem on a few machines of different sizes and across datasets of
   300 GB, 5 GB, and 300 MB.
   
   I have also tested using podman and docker with a dockerized jena-fuseki-server.
   
   And I have tested without the text index and the issue is still there.
   
   ### VM
   
   Azure VM
   Linux (RHEL 9.4)
   Standard D8ls v6 (8 vCPUs, 16 GiB memory)
   
   128 GB SSD (5000 max IOPS) with XFS filesystem mounted at /etc/fuseki
   
   ### fuseki
   
   jena-fuseki-server 5.5.0
   run as a systemd service using [this unit file](https://github.com/apache/jena/blob/main/jena-fuseki2/apache-jena-fuseki/fuseki.service)
   Java 21 openjdk
   nginx as reverse_proxy
   
   ### data
   
   300 MB of N-Triples data
   
   ```bash
   #!/bin/bash
   
   # adapted from https://github.com/qlever-dev/qlever-control/blob/main/src/qlever/Qleverfiles/Qleverfile.freebase
   
   # results in a dataset of ~ 300mb
   
   rm /etc/fuseki/rdf/olympics.nt
   wget -nc "https://github.com/wallscope/olympics-rdf/raw/master/data/olympics-nt-nodup.zip" \
     -O /etc/fuseki/rdf/olympics.zip
   unzip -qo /etc/fuseki/rdf/olympics.zip -d /etc/fuseki/rdf
   rm /etc/fuseki/rdf/olympics.zip
   ```
   
   ### loading
   
   tdb2.tdbloader and jena.textindexer were used to create a text-indexed TDB2
   dataset, specifically with this command:
   
   ```bash
   #!/bin/bash
   
   rm -rf /etc/fuseki/databases/ds
   time podman run \
     -v "/etc/fuseki/rdf:/rdf:z" \
     -v "/etc/fuseki/databases:/etc/fuseki/databases:z" \
     -v "/etc/fuseki/configuration/ds.ttl:/config.ttl:z" \
     -e "TDB2_MODE=parallel" \
     -e "TEXT=true" \
     --name tdb \
     --rm \
     "ghcr.io/kurrawong/tdb2-generation:master"
   ```
   
   ### assembler
   
   ```turtle
   PREFIX :        <#>
   PREFIX fuseki:  <http://jena.apache.org/fuseki#>
   PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX rdfs:    <http://www.w3.org/2000/01/rdf-schema#>
   PREFIX ja:      <http://jena.hpl.hp.com/2005/11/Assembler#>
   PREFIX text:    <http://jena.apache.org/text#>
   PREFIX tdb2:    <http://jena.apache.org/2016/tdb#>
   
   [] rdf:type fuseki:Server ;
      fuseki:services (
        :service
      ) .
   
   :service rdf:type fuseki:Service ;
       fuseki:name "ds" ;
       fuseki:endpoint [
           fuseki:operation fuseki:query ;
           fuseki:name "sparql" ;
       ] ;
       fuseki:endpoint [
           fuseki:operation fuseki:query ;
           fuseki:name "query"
       ] ;
       fuseki:endpoint [
           fuseki:operation fuseki:update ;
           fuseki:name "update"
       ] ;
       fuseki:endpoint [
           fuseki:operation fuseki:gsp-r ;
           fuseki:name "get"
       ] ;
       fuseki:endpoint [
           fuseki:operation fuseki:gsp-rw ;
           fuseki:name "data"
       ] ;
       fuseki:endpoint [
           fuseki:operation fuseki:patch ;
           fuseki:name "patch"
       ] ;
       fuseki:dataset :text_dataset ;
       .
   
   :text_dataset rdf:type     text:TextDataset ;
       text:dataset   :dataset_tdb2 ;
       text:index     :indexLucene ;
       .
   
   :indexLucene a text:TextIndexLucene ;
       text:directory "/etc/fuseki/databases/ds/lucene" ;
       text:entityMap :entMap ;
       .
   
   <#entMap> a text:EntityMap ;
       text:entityField      "uri" ;
       text:defaultField     "rdfs-label" ;
       text:uidField         "uid" ;
       text:map (
            [ text:field "rdfs-label" ; text:predicate rdfs:label ]
            )
       .
   
   :dataset_tdb2 rdf:type  tdb2:DatasetTDB ;
       tdb2:location "/etc/fuseki/databases/ds" ;
       ja:context [ ja:cxtName "arq:queryTimeout" ; ja:cxtValue "1000" ] ;
       .
   ```
   
   ### queries
   
   ```sparql
   # runs for a very long time and eventually returns 200. ignores 1 sec timeout
   PREFIX text: <http://jena.apache.org/text#>
   select *
   where {
     ?s ?p ?o .
     ?s text:query "totallynotfindingthisstring" .
     ?s ?p ?o .
   }
   limit 1
   ```
   
   ```sparql
   # times out after 1 second as expected, unless run after the text query above,
   # in which case it will run until finished.
   SELECT *
   WHERE {
     ?s ?p ?o
   }
   ```
   
   ```sparql
   # runs for about 6s and then causes a 503. ignores 1s timeout.
   SELECT *
   WHERE {
     ?s ?p ?o
     {
       SELECT *
       WHERE {
         ?s ?p ?o
       }
       limit 1
     }
     ?s ?p ?o
   }
   limit 1
   ```
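   
   To compare the observed run time of each query against the configured
   1000 ms timeout, something like the following can be used. This is a minimal
   sketch, not part of Jena or Fuseki: `time_query` is a hypothetical helper,
   and the endpoint URL in the usage note is an assumption (the setup above
   fronts Fuseki with nginx, so the real URL will differ).
   
   ```shell
   #!/bin/bash
   
   # Hypothetical helper (not part of Jena or Fuseki): POST a SPARQL query and
   # print the HTTP status code followed by the total request time in seconds.
   #   $1 = query endpoint URL (assumed; adjust to the actual deployment)
   #   $2 = SPARQL query string
   time_query() {
     local endpoint="$1"
     local query="$2"
     # -o /dev/null discards the result body; -w prints status and timing.
     curl -s -o /dev/null \
       -w '%{http_code} %{time_total}\n' \
       -H 'Accept: application/sparql-results+json' \
       --data-urlencode "query=${query}" \
       "$endpoint"
   }
   ```
   
   For example, `time_query "http://localhost:3030/ds/sparql" 'SELECT * WHERE { ?s ?p ?o }'`
   should report a total time close to one second if the timeout is honoured.
   A much larger value for the queries above indicates the timeout was ignored.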
   
   ### Relevant output and stacktrace
   
   ```shell
   
   ```
   
   ### Are you interested in making a pull request?
   
   None


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

