date:20230617

Re: Why DatasetGraphInMemory?

2023-06-17 Thread Arne Bernhardt

Hi Andy,

in the meantime, I've been using my bulk update tests (e.g. 1 in 1M graph
updating about 200,000 triples over and over again) to observe if it slows
down and how memory is freed up when I call CG between iterations and
measure memory. RoaringBitmaps seem to be quite GC friendly. There was no
increase in memory usage. It even seems to be more GC friendly than
GraphMem in the same scenario.
So I decided to create https://github.com/apache/jena/issues/1912.

   Arne

Am Sa., 17. Juni 2023 um 17:33 Uhr schrieb Andy Seaborne :

>
>
> On 12/06/2023 21:36, Arne Bernhardt wrote:
> > Hi Andy
> >
> > you mentioned RoaringBitmaps. I took the time to experiment with them.
> > They are really amazing. The performance of #add, #remove and #contains
> is
> > comparable to Java HashSet. RoaringBitmaps are much faster at iterating
> > over values and they perform bit operations even between two quite large
> > bitmaps like a charm. RoaringBitmaps also need less memory than a
> > JavaHasSet. (even less than an optimized integer hash set based on the
> > concepts in HashCommon)
> > A first graph implementation was easy to create. (albeit with a little
> help
> > from ChatGPT, as I had no idea how to use RoaringBitmaps yet).
> > One only needs an indexed set of all triples and three maps indexed by
> > subject, predicate and object and bitmaps as values.
> > Each bitmap contains all indices of the triples with the corresponding
> node.
> > To find SPO --> use the set with all triples.
> > To find S__, _P_, or __O --> lookup the bitmap in the corresponding map
> and
> > iterate over all indices mapping to triples via the indexed set.
> > To find SP_, S_O, or _PO --> lookup the two bitmaps for both given nodes,
> > perform an "and" operation with both bitmaps and again iterate over the
> > resulting indices mapping to triples via the indexed set.
> > Especially the query of _PO is incredibly fast compared to GraphMem or
> > similarly structured graphs.
> > Just for fun, I replaced the bitmaps with two sets of integers and
> > simulated the "and" operation by iterating over the smallest set and
> > checking the entries in the larger set using #contains --> it is 10-100
> > times slower than the "and" operation of RoaringBitmaps.
> > Now I really understand the hype around RoaringBitmaps. It seems
> absolutely
> > justified to me.
> > Smaller graphs with RoaringBitmaps need about twice as much memory for
> the
> > indexing structures (triples excluded) as GraphMem.
> > (The additional memory requirement is not only due to the bitmaps, but
> also
> > to the additional indexed set of triples).
> > For larger graphs (> 500k and above), this gap begins to close. At 1M
> > triples, the variant with roaring bitmaps wins the advantage with 88MB
> > compared to 106MB with GraphMem.
> > After loading all the triples from bsbm-25m.nt.gz and two JVM warmup
> > iterations, it only took about 18 seconds to add them to the new graph,
> and
> > this graph only required an additional 1941 MB of memory.
> >
> > I'm not sure how RoaringBitmaps handles permanent updates. I have tried
> > many #add and #remove calls on larger graphs and they seem to work well.
> > But there are two methods that caught my attention:
> > *
> >
> https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#runOptimize()
> > *
> >
> https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#trim()
> > I have no idea when it would be a good time to use them.
> > Removing and adding triples from a graph of size x in y iterations and
> > measuring the impact on memory and performance could be one way to find
> > potential problems.
> > Do you have a scenario in mind that I could use to test if I ever need
> one
> > of these methods?
>
> Just from reading the javadoc - #runOptimize() might be useful for a
> load-and-readonly graph - do a lot of loading work and switch to the
> more efficient. It depends no how much space it saves. My instinct is
> that the saving for the overall graph may not be that great because the
> RDF terms take up a log of the space at scale so savings on the the
> bitmaps might, overall, not be significant.
>
> >
> > Arne
> >
> > Andy Seaborne  schrieb am Mo., 22. Mai 2023, 16:52:
> >
> >>
> >>
> >> On 20/05/2023 17:18, Arne Bernhardt wrote:
> >>> Hi Andy,
> >>> thank you, that was very helpful to get the whole picture.
> >>>
> >>> Some time ago, I told you that at my workplace we implemented an
> >> in-memory
> >>> SPARQL-Server based on a Delta
> >>> <
> >>
> https://jena.apache.org/documentation/javadoc/jena/org.apache.jena.core/org/apache/jena/graph/compose/Delta.html
> >>>
> >>> .
> >>> We started a few years ago, before RDF-patch
> >>> , based on the
> >> "difference
> >>> model"
> >>> <
> https://lists.w3.org/Archives/Public/www-rdf-interest/2001Mar/0216.html
> >>> ,
> >>> that has become part of the

Re: Why DatasetGraphInMemory?

2023-06-17 Thread Andy Seaborne

On 12/06/2023 21:36, Arne Bernhardt wrote:

Hi Andy

you mentioned RoaringBitmaps. I took the time to experiment with them.
They are really amazing. The performance of #add, #remove and #contains is
comparable to Java HashSet. RoaringBitmaps are much faster at iterating
over values and they perform bit operations even between two quite large
bitmaps like a charm. RoaringBitmaps also need less memory than a
JavaHasSet. (even less than an optimized integer hash set based on the
concepts in HashCommon)
A first graph implementation was easy to create. (albeit with a little help
from ChatGPT, as I had no idea how to use RoaringBitmaps yet).
One only needs an indexed set of all triples and three maps indexed by
subject, predicate and object and bitmaps as values.
Each bitmap contains all indices of the triples with the corresponding node.
To find SPO --> use the set with all triples.
To find S__, _P_, or __O --> lookup the bitmap in the corresponding map and
iterate over all indices mapping to triples via the indexed set.
To find SP_, S_O, or _PO --> lookup the two bitmaps for both given nodes,
perform an "and" operation with both bitmaps and again iterate over the
resulting indices mapping to triples via the indexed set.
Especially the query of _PO is incredibly fast compared to GraphMem or
similarly structured graphs.
Just for fun, I replaced the bitmaps with two sets of integers and
simulated the "and" operation by iterating over the smallest set and
checking the entries in the larger set using #contains --> it is 10-100
times slower than the "and" operation of RoaringBitmaps.
Now I really understand the hype around RoaringBitmaps. It seems absolutely
justified to me.
Smaller graphs with RoaringBitmaps need about twice as much memory for the
indexing structures (triples excluded) as GraphMem.
(The additional memory requirement is not only due to the bitmaps, but also
to the additional indexed set of triples).
For larger graphs (> 500k and above), this gap begins to close. At 1M
triples, the variant with roaring bitmaps wins the advantage with 88MB
compared to 106MB with GraphMem.
After loading all the triples from bsbm-25m.nt.gz and two JVM warmup
iterations, it only took about 18 seconds to add them to the new graph, and
this graph only required an additional 1941 MB of memory.

I'm not sure how RoaringBitmaps handles permanent updates. I have tried
many #add and #remove calls on larger graphs and they seem to work well.
But there are two methods that caught my attention:
*
https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#runOptimize()
*
https://javadoc.io/doc/org.roaringbitmap/RoaringBitmap/latest/org/roaringbitmap/RoaringBitmap.html#trim()
I have no idea when it would be a good time to use them.
Removing and adding triples from a graph of size x in y iterations and
measuring the impact on memory and performance could be one way to find
potential problems.
Do you have a scenario in mind that I could use to test if I ever need one
of these methods?

Just from reading the javadoc - #runOptimize() might be useful for a
load-and-readonly graph - do a lot of loading work and switch to the
more efficient. It depends no how much space it saves. My instinct is
that the saving for the overall graph may not be that great because the
RDF terms take up a log of the space at scale so savings on the the
bitmaps might, overall, not be significant.

Arne

Andy Seaborne schrieb am Mo., 22. Mai 2023, 16:52:

On 20/05/2023 17:18, Arne Bernhardt wrote:

Hi Andy,
thank you, that was very helpful to get the whole picture.

Some time ago, I told you that at my workplace we implemented an

in-memory

SPARQL-Server based on a Delta
<

https://jena.apache.org/documentation/javadoc/jena/org.apache.jena.core/org/apache/jena/graph/compose/Delta.html

.
We started a few years ago, before RDF-patch
, based on the

"difference

model"

pattern. All transactions are recorded as an event with a list of triples
added and a list of triples removed.
The events are stored in an RDBMS (Oracle or PostgreSQL). For query
execution we need the relevant data to fit into memory but all data and
versions are also persisted.
To be able to store and load graphs very fast, we use RDF Thrift with LZ4
compression and store them in blobs.
All queries are executed on projected datasets for the requested version
(any previous version) of the data and the requested named graphs.
Thanks to the versioning, we fully support MR+SW. We even support

multiple

writers, with a git-like branching and merging approach and optimistic
locking.

How does that work for RDF?

Is at the unit of an "entity"

[GitHub] [jena-site] afs commented on pull request #160: Document script allow list

2023-06-17 Thread via GitHub



afs commented on PR #160:
URL: https://github.com/apache/jena-site/pull/160#issuecomment-1595776449

   This will be merged on release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@jena.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[reboot required] unattended-upgrades result for jena-vm.apache.org: SUCCESS

2023-06-17 Thread root

Unattended upgrade result: All upgrades installed

Warning: A reboot is required to complete this upgrade, or a previous one.

Packages that were upgraded:
 libx11-6 libx11-data libx11-dev libx11-xcb1 linux-generic
 linux-headers-generic linux-image-generic linux-libc-dev

Packages that were auto-removed:
 linux-modules-extra-5.4.0-149-generic linux-modules-5.4.0-149-generic
 linux-image-5.4.0-149-generic linux-headers-5.4.0-149
 linux-headers-5.4.0-149-generic

Package installation log:
Log started: 2023-06-17  06:07:17
apt-listchanges: Reading changelogs...
apt-listchanges: Reading changelogs...
Preparing to unpack .../linux-libc-dev_5.4.0-152.169_amd64.deb ...
Unpacking linux-libc-dev:amd64 (5.4.0-152.169) over (5.4.0-150.167) ...
Setting up linux-libc-dev:amd64 (5.4.0-152.169) ...
Log ended: 2023-06-17  06:07:21

Log started: 2023-06-17  06:07:21
apt-listchanges: Reading changelogs...
apt-listchanges: Reading changelogs...
Selecting previously unselected package linux-modules-5.4.0-152-generic.
Preparing to unpack 
.../0-linux-modules-5.4.0-152-generic_5.4.0-152.169_amd64.deb ...
Unpacking linux-modules-5.4.0-152-generic (5.4.0-152.169) ...
Selecting previously unselected package linux-image-5.4.0-152-generic.
Preparing to unpack .../1-linux-image-5.4.0-152-generic_5.4.0-152.169_amd64.deb 
...
Unpacking linux-image-5.4.0-152-generic (5.4.0-152.169) ...
Selecting previously unselected package linux-modules-extra-5.4.0-152-generic.
Preparing to unpack 
.../2-linux-modules-extra-5.4.0-152-generic_5.4.0-152.169_amd64.deb ...
Unpacking linux-modules-extra-5.4.0-152-generic (5.4.0-152.169) ...
Preparing to unpack .../3-linux-generic_5.4.0.152.149_amd64.deb ...
Unpacking linux-generic (5.4.0.152.149) over (5.4.0.150.148) ...
Preparing to unpack .../4-linux-image-generic_5.4.0.152.149_amd64.deb ...
Unpacking linux-image-generic (5.4.0.152.149) over (5.4.0.150.148) ...
Selecting previously unselected package linux-headers-5.4.0-152.
Preparing to unpack .../5-linux-headers-5.4.0-152_5.4.0-152.169_all.deb ...
Unpacking linux-headers-5.4.0-152 (5.4.0-152.169) ...
Selecting previously unselected package linux-headers-5.4.0-152-generic.
Preparing to unpack 
.../6-linux-headers-5.4.0-152-generic_5.4.0-152.169_amd64.deb ...
Unpacking linux-headers-5.4.0-152-generic (5.4.0-152.169) ...
Preparing to unpack .../7-linux-headers-generic_5.4.0.152.149_amd64.deb ...
Unpacking linux-headers-generic (5.4.0.152.149) over (5.4.0.150.148) ...
Setting up linux-headers-5.4.0-152 (5.4.0-152.169) ...
Setting up linux-headers-5.4.0-152-generic (5.4.0-152.169) ...
Setting up linux-modules-5.4.0-152-generic (5.4.0-152.169) ...
Setting up linux-image-5.4.0-152-generic (5.4.0-152.169) ...
I: /boot/vmlinuz.old is now a symlink to vmlinuz-5.4.0-150-generic
I: /boot/initrd.img.old is now a symlink to initrd.img-5.4.0-150-generic
I: /boot/vmlinuz is now a symlink to vmlinuz-5.4.0-152-generic
I: /boot/initrd.img is now a symlink to initrd.img-5.4.0-152-generic
Setting up linux-headers-generic (5.4.0.152.149) ...
Setting up linux-modules-extra-5.4.0-152-generic (5.4.0-152.169) ...
Setting up linux-image-generic (5.4.0.152.149) ...
Setting up linux-generic (5.4.0.152.149) ...
Processing triggers for linux-image-5.4.0-152-generic (5.4.0-152.169) ...
/etc/kernel/postinst.d/initramfs-tools:
update-initramfs: Generating /boot/initrd.img-5.4.0-152-generic
/etc/kernel/postinst.d/zz-update-grub:
Sourcing file `/etc/default/grub'
Sourcing file `/etc/default/grub.d/init-select.cfg'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.4.0-152-generic
Found initrd image: /boot/initrd.img-5.4.0-152-generic
Found linux image: /boot/vmlinuz-5.4.0-150-generic
Found initrd image: /boot/initrd.img-5.4.0-150-generic
Found linux image: /boot/vmlinuz-5.4.0-149-generic
Found initrd image: /boot/initrd.img-5.4.0-149-generic
done
Log ended: 2023-06-17  06:08:37

Log started: 2023-06-17  06:08:38
apt-listchanges: Reading changelogs...
apt-listchanges: Reading changelogs...
Preparing to unpack .../libx11-dev_2%3a1.6.9-2ubuntu1.5_amd64.deb ...
Unpacking libx11-dev:amd64 (2:1.6.9-2ubuntu1.5) over (2:1.6.9-2ubuntu1.2) ...
Preparing to unpack .../libx11-6_2%3a1.6.9-2ubuntu1.5_amd64.deb ...
Unpacking libx11-6:amd64 (2:1.6.9-2ubuntu1.5) over (2:1.6.9-2ubuntu1.2) ...
Setting up libx11-6:amd64 (2:1.6.9-2ubuntu1.5) ...
Setting up libx11-dev:amd64 (2:1.6.9-2ubuntu1.5) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
Log ended: 2023-06-17  06:08:44

Log started: 2023-06-17  06:08:44
apt-listchanges: Reading changelogs...
apt-listchanges: Reading changelogs...
Preparing to unpack .../libx11-xcb1_2%3a1.6.9-2ubuntu1.5_amd64.deb ...
Unpacking libx11-xcb1:amd64 (2:1.6.9-2ubuntu1.5) over (2:1.6.9-2ubuntu1.2) ...
Setting up libx11-xcb1:amd64 (2:1.6.9-2ubuntu1.5) ...
Processing triggers for libc-bin (2.31-0ubuntu9.9) ...
Log ended: 2023-06-17  06:08:47

Log started: 2023-06-17  06:08:47
apt-listchanges: Reading changelogs...
apt-listchanges: Reading

Re: Why DatasetGraphInMemory?

Re: Why DatasetGraphInMemory?

[GitHub] [jena-site] afs commented on pull request #160: Document script allow list

[reboot required] unattended-upgrades result for jena-vm.apache.org: SUCCESS

4 matches

Site Navigation

Mail list logo

Footer information