Re: fuseki backup process / policy - similar capabilities to autopostgresqlbackup ?

2022-08-30 Thread Andy Seaborne




On 30/08/2022 12:17, Eugen Stan wrote:

Hi Andy,

Thanks for the feedback.
I think we are in agreement.
Nice touch with cleanup on server startup :).

Should I raise a JIRA issue for the server side bits?


Yes please, or a github issue (we use both)

https://github.com/apache/jena/issues

(The codebase already has some "safe write" code in IOX.safeWrite)

Andy



I will set up the backup script as a separate git repo.

Thanks,
Eugen

On 30.08.2022 13:02, Andy Seaborne wrote:

Hi Eugen,

Yes, the backup should be written then atomically moved (i.e. same 
directory). Cleanup would then be "delete" by pattern in the server 
startup script.
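
A minimal sketch of such a "delete by pattern" step, assuming unfinished backups carry the `.INCOMPLETE` suffix proposed in this thread and that all backups sit in one directory. The class and method names are hypothetical and not part of Fuseki.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Fuseki: intended to run once before the server starts.
final class IncompleteBackupCleanup {

    // Deletes every regular file in backupDir whose name matches the glob
    // and returns the paths that were removed.
    static List<Path> deleteByPattern(Path backupDir, String glob) throws IOException {
        List<Path> matches = new ArrayList<>();
        // Collect first, then delete, so the directory listing is not modified mid-iteration.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(backupDir, glob)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    matches.add(p);
                }
            }
        }
        for (Path p : matches) {
            Files.delete(p);
        }
        return matches;
    }
}
```

A startup wrapper could call `deleteByPattern(backupDir, "*.INCOMPLETE")` before launching the server, since nothing can be writing a backup at that point.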


As to putting a process script around the functionality, it is an 
external script which needs access to the server file area (to know 
the state of backups). The file system state is the definitive state - 
not the jobs (that's a UI feature).


This would make a good independent project or contribution, or a
published example as a starting point, because the requirements will
depend on the deployment environment and it seems unlikely to me that
there is a one-size-fits-all solution.


Fuseki should make sure it has the right behaviours (like atomic write).

 Andy

autopostgresqlbackup itself is GPL.

On 29/08/2022 11:20, Eugen Stan wrote:

Hello,

We are using fuseki and we would like to implement a backup policy 
similar in capabilities to what [autopostgresqlbackup] has to offer.


Are there any existing solutions out there that can do all / part of 
these?


We would like to take:
* daily backups for a week
* weekly backups - 1 per week, last 4 weeks
* monthly backups - 1/ month, last 6 months
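
One way to read the policy above, as a sketch only: keep every backup from the last 7 days, the newest backup in each of the last 4 rolling 7-day windows, and the newest backup in each of the last 6 calendar months. That interpretation, and the names `BackupRetention` and `toKeep`, are assumptions for illustration; autopostgresqlbackup may define its rotation differently.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.ChronoUnit;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical retention helper for an external backup script; not part of Fuseki.
final class BackupRetention {

    // Returns the backup dates to keep; every other date may be deleted.
    static TreeSet<LocalDate> toKeep(Collection<LocalDate> backups, LocalDate today) {
        TreeSet<LocalDate> keep = new TreeSet<>();
        Map<Long, LocalDate> newestPerWeek = new HashMap<>();
        Map<YearMonth, LocalDate> newestPerMonth = new HashMap<>();
        YearMonth thisMonth = YearMonth.from(today);

        for (LocalDate d : backups) {
            if (d.isAfter(today)) continue;               // ignore dates in the future
            long ageDays = ChronoUnit.DAYS.between(d, today);
            if (ageDays < 7) keep.add(d);                 // daily: the last 7 days
            if (ageDays < 28)                             // weekly: 4 rolling 7-day windows
                newestPerWeek.merge(ageDays / 7, d, BackupRetention::newer);
            long ageMonths = ChronoUnit.MONTHS.between(YearMonth.from(d), thisMonth);
            if (ageMonths < 6)                            // monthly: the last 6 calendar months
                newestPerMonth.merge(YearMonth.from(d), d, BackupRetention::newer);
        }
        keep.addAll(newestPerWeek.values());
        keep.addAll(newestPerMonth.values());
        return keep;
    }

    private static LocalDate newer(LocalDate a, LocalDate b) {
        return a.isAfter(b) ? a : b;
    }
}
```

The script would parse the dates out of the backup file names, call `toKeep`, and delete the files whose dates are not in the returned set.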


I believe this could be scripted via the HTTP API + directory
access.


The backup api in [fuseki-server-protocol] can trigger a backup and 
can also list existing backups.
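
A minimal sketch of the two calls using the JDK HTTP client. The endpoint paths `POST /$/backup/{name}` and `GET /$/backups-list` are taken from my reading of the [fuseki-server-protocol] page and should be checked against the server version in use. The server URL `http://localhost:3030` and the dataset name `ds` are placeholders, and the class name is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical request builders for the Fuseki admin protocol.
final class FusekiBackupRequests {

    // Request that asks the server to start a backup of the named dataset.
    static HttpRequest triggerBackup(String serverBase, String dataset) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/$/backup/" + dataset))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Request that asks the server for the list of backup files it knows about.
    static HttpRequest listBackups(String serverBase) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/$/backups-list"))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Only prints the requests; nothing is sent.
        HttpRequest trigger = triggerBackup("http://localhost:3030", "ds");
        HttpRequest list = listBackups("http://localhost:3030");
        System.out.println(trigger.method() + " " + trigger.uri());
        System.out.println(list.method() + " " + list.uri());
    }
}
```

Sending is then `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. Depending on the deployment, the admin endpoints may be restricted to localhost or require authentication.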


Unfortunately in the current implementation, backup is not consistent.
In case of a server crash during backup, the file will remain there 
incomplete.
Also, since tasks are stored in memory and cleaned (periodically / on 
restart) there is no way to know for sure if the backup was 
successful or not.


I have encountered the above quite often in some workloads.

The inconsistency could be solved by writing the backup to a temporary
file name and, on completion, renaming it to the final file name.

Rename is usually an atomic operation on POSIX file systems.

The /backup-list API can list all backups or split backups into complete /
incomplete. IMO for now, it can list all of them.


The in progress backup could be stored alongside the other backups 
with a file marker like: dataset_date.nq.gz.INCOMPLETE .

Once it's done it can be renamed to dataset_date.nq.gz .
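
The write-then-rename idea can be sketched with `java.nio.file` as below. This is an illustration under assumptions, not Fuseki's backup code and not the existing `IOX.safeWrite`: the class `AtomicBackupWrite`, the `Writer` callback and the `.INCOMPLETE` suffix handling are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: write under a marker name, then rename into place.
final class AtomicBackupWrite {

    // Callback that produces the backup content.
    interface Writer {
        void writeTo(OutputStream out) throws IOException;
    }

    // Writes to "<name>.INCOMPLETE" next to finalFile, then atomically renames it.
    // If the process dies while writing, only the .INCOMPLETE file is left behind.
    static Path write(Path finalFile, Writer body) throws IOException {
        // Same directory as the final file, so the rename stays on one file system.
        Path tmp = finalFile.resolveSibling(finalFile.getFileName() + ".INCOMPLETE");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            body.writeTo(out);
        }
        // Throws AtomicMoveNotSupportedException if the file system cannot do it atomically.
        return Files.move(tmp, finalFile, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The sketch does not force the data to disk before the rename; a stricter version might do that for protection against power loss rather than just a process crash.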

Cleanup might be handled externally. In case of a crash, the file
will remain INCOMPLETE until the system removes it, once a specific
amount of time (1-2 days) has passed since the backup was started.


WDYT?


[autopostgresqlbackup] https://github.com/k0lter/autopostgresqlbackup
[fuseki-server-protocol] 
https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html 




Thanks,



Re: fuseki backup process / policy - similar capabilities to autopostgresqlbackup ?

2022-08-30 Thread Eugen Stan

Hi Andy,

Thanks for the feedback.
I think we are in agreement.
Nice touch with cleanup on server startup :).

Should I raise a JIRA issue for the server side bits?

I will set up the backup script as a separate git repo.

Thanks,
Eugen

--
Eugen Stan

+40770 941 271  / https://www.netdava.com



Re: fuseki backup process / policy - similar capabilities to autopostgresqlbackup ?

2022-08-30 Thread Andy Seaborne

Hi Eugen,

Yes, the backup should be written then atomically moved (i.e. same 
directory). Cleanup would then be "delete" by pattern in the server 
startup script.


As to putting a process script around the functionality, it is an 
external script which needs access to the server file area (to know the 
state of backups). The file system state is the definitive state - not 
the jobs (that's a UI feature).


This would make a good independent project or contribution, or a published
example as a starting point, because the requirements will depend on
the deployment environment and it seems unlikely to me that there is a
one-size-fits-all solution.


Fuseki should make sure it has the right behaviours (like atomic write).

Andy

autopostgresqlbackup itself is GPL.



Re: TDB2 bulk loader - multiple files into different graph per file

2022-08-30 Thread Andy Seaborne




On 29/08/2022 18:58, Andy Seaborne wrote:



On 29/08/2022 10:24, Lorenz Buehmann wrote:
...

We checked code and the Apache Commons Compress docs, a colleague 
spotted the hint at 
https://commons.apache.org/proper/commons-compress/examples.html#Buffering 
:


The stream classes all wrap around streams provided by the calling 
code and they work on them directly without any additional buffering. 
On the other hand most of them will benefit from buffering so it is 
highly recommended that users wrap their stream in 
Buffered(In|Out)putStreams before using the Commons Compress API.

We were curious about this statement, checked the
org.apache.jena.atlas.io.IO class and added one line in openFileEx:


in = new BufferedInputStream(in);

which wraps the file stream before it's passed to the decompressor streams.


Ran the parsing again:


riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena 
4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file 
stream in IO class)


Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors 
: 31 warnings



What do you think?


Yes.

IO.ensureBuffered.

It buffers if not already buffered and if not a ByteArrayInputStream.
It also makes all buffering findable in the IDE.
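
The behaviour described here can be sketched in a few lines. This is a reconstruction from the description only, not Jena's actual source; the class name `Buffering` is hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

// Sketch of an "ensure buffered" helper as described in the thread; not Jena's code.
final class Buffering {

    // Returns the stream unchanged if it is already buffered or is backed by
    // an in-memory byte array; otherwise wraps it in a BufferedInputStream.
    static InputStream ensureBuffered(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream) {
            return in;
        }
        return new BufferedInputStream(in);
    }
}
```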

RIOT buffers (128K char buffer) so calls down to chars-UTF8-bytes are in
chunks.  Your observation indicates BZip2CompressorInputStream is
not exploiting read(byte[] dest) calls ... yep - its loop calls
the internal one-byte "read0".
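
To illustrate why a consumer that reads one byte at a time benefits from a buffer underneath it, here is a small stand-alone sketch. It does not use Commons Compress; `drainByteByByte` merely stands in for a decoder with a one-byte inner read, and all names are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class SingleByteReadDemo {

    // Counts every read call that reaches the wrapped stream.
    static final class CountingStream extends FilterInputStream {
        long calls = 0;
        CountingStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            calls++;
            return super.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Consumes the stream one byte per call and returns the number of bytes read.
    static long drainByteByByte(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) n++;
        return n;
    }

    // Calls that reach the source when the consumer sits directly on it.
    static long callsUnbuffered(byte[] data) throws IOException {
        CountingStream source = new CountingStream(new ByteArrayInputStream(data));
        drainByteByByte(source);
        return source.calls;
    }

    // Calls that reach the source when a BufferedInputStream sits in between.
    static long callsBuffered(byte[] data, int bufferSize) throws IOException {
        CountingStream source = new CountingStream(new ByteArrayInputStream(data));
        drainByteByByte(new BufferedInputStream(source, bufferSize));
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20];
        System.out.println("unbuffered calls: " + callsUnbuffered(data));
        System.out.println("buffered calls:   " + callsBuffered(data, 8192));
    }
}
```

With a file stream as the source, each of those unbuffered calls would go down to the operating system, which is where the cost comes from.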


GZIPInputStream has a default 512 byte buffer - maybe a bigger one there 
will help a bit.
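
For reference, the internal buffer size is a constructor argument on `java.util.zip.GZIPInputStream`. The sketch below only shows how a larger buffer could be requested; whether it helps would need measuring on real data. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipBufferSize {

    // Compresses data in memory; used here only to produce test input.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Decompresses using an explicit internal buffer size (in bytes)
    // instead of the constructor default.
    static byte[] gunzip(byte[] compressed, int bufferSize) throws IOException {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed), bufferSize)) {
            return gz.readAllBytes();
        }
    }
}
```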


A quick test on BSBM-25 million...

Adding buffering in gzip caused a 0.1% slow-down. (Data from SSD)

Andy



SnappyCompressorInputStream has a 32k buffer.

So it is bz2 needing IO.ensureBuffered, the others may benefit - or may 
go slower.


     Andy




On 28.08.22 14:22, Andy Seaborne wrote:




If you are relying on Jena to do the bz2 decompress, then it is 
using Commons Compress.


gz is done (via Commons Compress) in native code. I use gz and if I 
get a bz2 file, I decompress it with OS tools.


Could you try an experiment please?

Run on the same hardware as the loader was run:

riot --time --count river_planet-latest.osm.pbf.ttl
riot --time --count river_planet-latest.osm.pbf.ttl.bz2

    Andy

gz vs plain: NVMe m2 SSD : Dell XPS 13 9310

riot --time --count .../BSBM/bsbm-25m.nt.gz
Triples = 24,997,044
118.02 sec : 24,997,044 Triples : 211,808.84 per second

riot --time --count .../BSBM/bsbm-25m.nt
Triples = 24,997,044
109.97 sec : 24,997,044 Triples : 227,314.05 per second


News - New W3C working groups in the RDF area.

2022-08-30 Thread Andy Seaborne

There are two new RDF-related working groups starting up:

RDF Star Working Group:
  Home page:
  https://www.w3.org/groups/wg/rdf-star
  Announcement:
  https://lists.w3.org/Archives/Public/public-new-work/2022Aug/0004.html

RDF Dataset Canonicalization and Hash Working Group:
  Home page:
  https://www.w3.org/groups/wg/rch
  Announcement:
  https://lists.w3.org/Archives/Public/public-new-work/2022Jul/0004.html