This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/jena-site.git
The following commit(s) were added to refs/heads/asf-staging by this push:
new 2d3f4f5f5 Staged site from tdb-faqs
(c375f3c36c1a85ff3317845e9ae2f87e4c2e7259)
2d3f4f5f5 is described below
commit 2d3f4f5f58afbe0544d5411378a996ad1acc49b7
Author: jenkins <[email protected]>
AuthorDate: Tue Feb 11 09:51:09 2025 +0000
Staged site from tdb-faqs (c375f3c36c1a85ff3317845e9ae2f87e4c2e7259)
---
content/documentation/tdb/faqs.html | 40 ++++++++++++++++++++++++-------------
content/index.json | 2 +-
content/sitemap.xml | 4 ++--
3 files changed, 29 insertions(+), 17 deletions(-)
diff --git a/content/documentation/tdb/faqs.html
b/content/documentation/tdb/faqs.html
index 39fe69def..ab8044615 100644
--- a/content/documentation/tdb/faqs.html
+++ b/content/documentation/tdb/faqs.html
@@ -334,29 +334,41 @@ journal can be flushed by closing any datasets and
releasing the TDB resources.<
}
</code></pre>
<h3 id="input-vs-database-size">Why is the database much larger on disk than
my input data?</h3>
-<p>TDB2 uses copy-on-write data structures. This means that each new write
transaction takes copies of any data blocks it
-modifies during the transaction and writes new copies of those blocks with the
required modifications. The old blocks
-are not automatically removed as they might still be referenced by ongoing
read transactions. Depending on how you’ve
-loaded your data into TDB2 - how many transactions were used, how large each
transaction was, input data characteristics
-etc. - this can lead to much larger database disk size than your original
input data size.</p>
+<p>Firstly, TDB2 uses copy-on-write data structures. This means that each new
write transaction takes copies of any data
+blocks it modifies during the transaction and writes new copies of those
blocks with the required modifications. The
+old blocks are not automatically removed as they might still be referenced by
ongoing read transactions. Depending on
+how you’ve loaded your data into TDB2 - how many transactions were used,
how large each transaction was, whether named
+graphs are used, input data characteristics etc. - this can lead to much
larger database disk size than your original
+input data size.</p>
+<p>Secondly, both TDB and TDB2 use <a
href="https://en.wikipedia.org/wiki/Sparse_file">sparse files</a>
+for their on-disk storage. Depending on the file system and operating system
you are using, and the tools you use to
+inspect the files, you may see larger sizes reported than are actually being consumed,
e.g.</p>
+<pre tabindex="0"><code>$ ls -lh SPOG.idn
+-rw-r--r-- 1 user group 8.0M 23 Sep 15:23 SPOG.idn
+$ du -h SPOG.idn
+6.1M SPOG.idn
+</code></pre><p>In the above example, on a small toy dataset, we can see that
<code>ls</code> reports a file size of <code>8.0M</code> while <code>du</code>
reports a
+file size of <code>6.1M</code>. Since a database comprises many files,
the total logical size and total physical size may be
+quite different.</p>
<p>You can run a <a href="../tdb2/tdb2_admin.md#compaction">Compaction</a>
operation on your database to have TDB2 prune the data
-structures to only preserve the current data blocks. Compactions require
exclusive write access to the database i.e. no
-other read/write transactions may occur while a compaction is running. Thus,
compactions should generally be run
+structures to only preserve the current data blocks. Compactions require
exclusive write access to the database, i.e.
+no other read/write transactions may occur while a compaction is running.
Thus, compactions should generally be run
offline, or at quiet times if exposing your database to multiple applications
per <a href="#multi-jvm">Can I share a TDB dataset between
multiple applications?</a>.</p>
+<p><strong>NB</strong> If you loaded your data using one of the TDB bulk
loaders, e.g. <a href="#tdbloader-vs-tdbloader2"><code>tdbloader2</code></a> or
+<a href="#tdb-xloader"><code>xloader</code></a>, then these already generate a
(near) maximally compacted database and compaction will offer
+little or no benefit.</p>
<p>Please note that compaction creates a new <code>Data-NNNN</code> directory
per <a href="../tdb2/tdb2_admin.md#tdb2-directory-layout">TDB2 Directory
Layout</a> into which it writes the compacted copy of the database. The old
directory won’t be automatically removed unless the compaction operation
was explicitly configured to do so. Therefore,
the immediate effect of a compaction may actually be more disk space usage
until the old data directory can be removed.
If the database was already maximally compacted then there will be no
difference in size between the old and new data
directories.</p>
-<p>We would recommend that you consider running a compaction after an initial
bulk data load, although some bulk loading
-methods may already generate a maximally compacted database e.g. <a
href="#tdbloader-vs-tdbloader2"><code>tdbloader2</code></a>. Also, if
-your database has ongoing updates over time we would also recommend that you
consider running a compaction periodically
-e.g. once a day/week etc. We cannot provide exact recommendations here as to
the frequency of compactions you should run
-as how much disk size inflation you experience will vary depending on many
factors - size and frequency of write
-transactions, data characteristics, etc. - and you will need to determine a
suitable schedule based on your use case for
-database.</p>
+<p>If your database has ongoing updates over time, particularly spread across
many separate transactions, we would
+recommend that you consider running a compaction periodically, e.g. once a
day or once a week. We cannot recommend an exact
+compaction frequency here, because how
much disk size inflation you experience
+will vary depending on many factors - size and frequency of write
transactions, data characteristics, etc. - and you
+will need to determine a suitable schedule based on your use case for
the database.</p>
<p>Note also that if running on Windows then it won’t be possible to
delete the old data directory due to an OS limitation, see
<a href="#windows-dataset-delete">Why can’t I delete a dataset (MS
Windows/64 bit)?</a>.</p>
<h3 id="ssd">Should I use a SSD?</h3>
diff --git a/content/index.json b/content/index.json
index c93b11beb..83dfdbfa0 100644
--- a/content/index.json
+++ b/content/index.json
@@ -1 +1 @@
-[{"categories":null,"contents":"This page is historical \u0026ldquo;for
information only\u0026rdquo; - there is no Apache release of Eyeball and the
code has not been updated for Jena3.\nThe original source code is available. So
you\u0026rsquo;ve got Eyeball installed and you\u0026rsquo;ve run it on one of
your files, and Eyeball doesn\u0026rsquo;t like it. You\u0026rsquo;re not sure
why, or what to do about it. Here\u0026rsquo;s what\u0026rsquo;s going
on.\nEyeball inspects your model a [...]
\ No newline at end of file
+[{"categories":null,"contents":"This page is historical \u0026ldquo;for
information only\u0026rdquo; - there is no Apache release of Eyeball and the
code has not been updated for Jena3.\nThe original source code is available. So
you\u0026rsquo;ve got Eyeball installed and you\u0026rsquo;ve run it on one of
your files, and Eyeball doesn\u0026rsquo;t like it. You\u0026rsquo;re not sure
why, or what to do about it. Here\u0026rsquo;s what\u0026rsquo;s going
on.\nEyeball inspects your model a [...]
\ No newline at end of file
diff --git a/content/sitemap.xml b/content/sitemap.xml
index 541ea71af..816a856ea 100644
--- a/content/sitemap.xml
+++ b/content/sitemap.xml
@@ -209,7 +209,7 @@
<lastmod>2023-04-09T15:11:22+02:00</lastmod>
</url><url>
<loc>https://jena.apache.org/documentation.html</loc>
- <lastmod>2025-02-10T11:47:30+00:00</lastmod>
+ <lastmod>2025-02-11T09:38:10+00:00</lastmod>
</url><url>
<loc>https://jena.apache.org/download.html</loc>
<lastmod>2025-01-21T15:04:14+00:00</lastmod>
@@ -625,7 +625,7 @@
<lastmod>2024-03-28T22:35:37+01:00</lastmod>
</url><url>
<loc>https://jena.apache.org/documentation/tdb/faqs.html</loc>
- <lastmod>2025-02-10T11:47:30+00:00</lastmod>
+ <lastmod>2025-02-11T09:38:10+00:00</lastmod>
</url><url>
<loc>https://jena.apache.org/documentation/tdb/java_api.html</loc>
<lastmod>2024-03-28T22:35:37+01:00</lastmod>
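
The ls vs du comparison added to faqs.html above can be repeated across a whole
database directory. The following is only a minimal sketch, not part of the
commit and not a Jena tool: the function name size_report is hypothetical, and
it assumes a POSIX system where os.stat exposes st_blocks in 512-byte units
(the field is not available on Windows, where the sketch reports 0 physical
bytes).

```python
import os
import sys


def size_report(directory):
    """Return (logical_bytes, physical_bytes) summed over all files under directory.

    logical_bytes  is the apparent size, as shown by `ls -l`.
    physical_bytes is the space actually allocated, as shown by `du`.
    """
    logical = 0
    physical = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            st = os.stat(os.path.join(root, name))
            logical += st.st_size
            # Assumption: POSIX semantics, st_blocks counts 512-byte units.
            # Where the field is missing (e.g. Windows) this contributes 0.
            physical += getattr(st, "st_blocks", 0) * 512
    return logical, physical


if __name__ == "__main__" and len(sys.argv) > 1:
    logical, physical = size_report(sys.argv[1])
    print(f"logical:  {logical} bytes")
    print(f"physical: {physical} bytes")
```

Run it against a database location (for instance a Data-NNNN directory) by
passing the directory as the first argument. A large gap between the two
totals suggests sparse files rather than data that compaction could reclaim.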