Re: [PR] New input vs database size FAQ entry [jena-site]

via GitHub Tue, 11 Feb 2025 01:41:32 -0800


rvesse commented on code in PR #205:
URL: https://github.com/apache/jena-site/pull/205#discussion_r1950520378



##########
source/documentation/tdb/faqs.md:
##########
@@ -159,78 +145,139 @@ Fuseki the journal will be flushed to disk. When using 
the [TDB Java API](java_a
          TDBFactory.release(dataset);
       }
 
-<a name="ssd"></a>
-## Should I use a SSD?
+### Why is the database much larger on disk than my input data? 
{#input-vs-database-size}
 
-Yes if you are able to
+TDB2 uses copy-on-write data structures.  This means that each new write 
transaction takes copies of any data blocks it
+modifies during the transaction and writes new copies of those blocks with the 
required modifications.  The old blocks
+are not automatically removed as they might still be referenced by ongoing 
read transactions.  Depending on how you've
+loaded your data into TDB2 - how many transactions were used, how large each 
transaction was, input data characteristics
+etc. - this can lead to much larger database disk size than your original 
input data size.
 
-Using a SSD boost performance in a number of ways.  Firstly bulk loads, 
inserts and deletions will be faster i.e. operations that modify the 
-database and have to be flushed to disk at some point due to faster IO.  
Secondly TDB will start faster because the files can be mapped into
-memory faster.
+You can run a [Compaction](../tdb2/tdb2_admin.md#compaction) operation on your 
database to have TDB2 prune the data
+structures to only preserve the current data blocks.  Compactions require 
exclusive write access to the database i.e. no
+other read/write transactions may occur while a compaction is running.  Thus, 
compactions should generally be run
+offline, or at quiet times if exposing your database to multiple applications 
per [Can I share a TDB dataset between
+multiple applications?](#multi-jvm).
 
-SSDs will make the most difference when performing bulk loads since the 
on-disk database format for TDB is entirely portable and may be
-safely copied between systems (provided there is no process accessing the 
database at the time).  Therefore even if you can't run your production
-system with a SSD you can always perform your bulk load on a SSD equipped 
system first and then move the database to your production system.
+Please note that compaction creates a new `Data-NNNN` directory per [TDB2 
Directory
+Layout](../tdb2/tdb2_admin.md#tdb2-directory-layout) into which it writes the 
compacted copy of the database.  The old
+directory won't be automatically removed unless the compaction operation was 
explicitly configured to do so. Therefore,
+the immediate effect of a compaction may actually be more disk space usage 
until the old data directory can be removed.
+If the database was already maximally compacted then there will be no 
difference in size between the old and new data
+directories.
 
-<a name="lock-exception"></a>
-## Why do I get the exception *Can't open database at location /path/to/db as 
it is already locked by the process with PID 1234* when trying to open a TDB 
database?
+We would recommend that you consider running a compaction after an initial 
bulk data load, although some bulk loading

Review Comment:
   In the new commit I made the following clarifications:
   
   - Noted that bulk loading generates (near) maximally compacted databases so 
compaction is unnecessary for those
   - Added use of named graphs into list of factors that affect database size
   - Added notes on the use of sparse files, which can make the logical and 
physical database sizes differ depending on how users inspect the 
databases
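
To illustrate the sparse-file point in the comment above, here is a minimal sketch (not part of the PR) that compares the logical and physical size of a directory tree. It assumes a POSIX system, where `st_blocks` is reported in 512-byte units; the helper names `logical_and_physical_size` and `directory_sizes` are hypothetical and are not part of Jena or TDB2.

```python
import os


def logical_and_physical_size(path):
    """Return (logical_bytes, physical_bytes) for a single file.

    Logical size is the length the file reports (st_size).
    Physical size is the space actually allocated on disk, taken from
    st_blocks, which POSIX reports in 512-byte units. Where st_blocks
    is unavailable (e.g. Windows), fall back to the logical size.
    """
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)
    physical = blocks * 512 if blocks is not None else st.st_size
    return st.st_size, physical


def directory_sizes(root):
    """Sum logical and physical sizes of all regular files under root."""
    logical_total = 0
    physical_total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                continue  # skip symlinks so nothing is counted twice
            logical, physical = logical_and_physical_size(full)
            logical_total += logical
            physical_total += physical
    return logical_total, physical_total
```

Calling `directory_sizes("/path/to/db")` (a placeholder path) on a database directory returns both totals. If the first is much larger than the second, the files are probably sparse, and tools that report logical size will overstate the disk space actually in use.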



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
