rvesse commented on code in PR #205:
URL: https://github.com/apache/jena-site/pull/205#discussion_r1950520378
##########
source/documentation/tdb/faqs.md:
##########
@@ -159,78 +145,139 @@ Fuseki the journal will be flushed to disk. When using the [TDB Java API](java_a
     TDBFactory.release(dataset);
 }

-<a name="ssd"></a>
-## Should I use a SSD?
+### Why is the database much larger on disk than my input data? {#input-vs-database-size}

-Yes if you are able to
+TDB2 uses copy-on-write data structures. This means that each new write transaction takes copies of any data blocks it
+modifies during the transaction and writes new copies of those blocks with the required modifications. The old blocks
+are not automatically removed as they might still be referenced by ongoing read transactions. Depending on how you've
+loaded your data into TDB2 - how many transactions were used, how large each transaction was, input data characteristics
+etc. - this can lead to much larger database disk size than your original input data size.

-Using a SSD boost performance in a number of ways. Firstly bulk loads, inserts and deletions will be faster i.e. operations that modify the
-database and have to be flushed to disk at some point due to faster IO. Secondly TDB will start faster because the files can be mapped into
-memory faster.
+You can run a [Compaction](../tdb2/tdb2_admin.md#compaction) operation on your database to have TDB2 prune the data
+structures to only preserve the current data blocks. Compactions require exclusive write access to the database i.e. no
+other read/write transactions may occur while a compaction is running. Thus, compactions should generally be run
+offline, or at quiet times if exposing your database to multiple applications per [Can I share a TDB dataset between
+multiple applications?](#multi-jvm).

-SSDs will make the most difference when performing bulk loads since the on-disk database format for TDB is entirely portable and may be
-safely copied between systems (provided there is no process accessing the database at the time). Therefore even if you can't run your production
-system with a SSD you can always perform your bulk load on a SSD equipped system first and then move the database to your production system.
+Please note that compaction creates a new `Data-NNNN` directory per [TDB2 Directory
+Layout](../tdb2/tdb2_admin.md#tdb2-directory-layout) into which it writes the compacted copy of the database. The old
+directory won't be automatically removed unless the compaction operation was explicitly configured to do so. Therefore,
+the immediate effect of a compaction may actually be more disk space usage until the old data directory can be removed.
+If the database was already maximally compacted then there will be no difference in size between the old and new data
+directories.

-<a name="lock-exception"></a>
-## Why do I get the exception *Can't open database at location /path/to/db as it is already locked by the process with PID 1234* when trying to open a TDB database?
+We would recommend that you consider running a compaction after an initial bulk data load, although some bulk loading

Review Comment:
In the new commit I made the following clarifications:

- Noted that bulk loading generates (near) maximally compacted databases, so compaction is unnecessary for those
- Added use of named graphs to the list of factors that affect database size
- Also added notes about the use of sparse files, which can cause the logical vs physical database size to differ depending on how users inspect their databases
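
To make the `Data-NNNN` point in the diff concrete, here is a minimal sketch for comparing the old and new data directories after a compaction. It is only an illustration: `DataDirSizes` and its methods are hypothetical names, not part of Jena, and it uses nothing beyond the Java standard library. It reports logical file sizes, so where sparse files are in use the figures may be larger than the physical disk usage that a tool such as `du` shows.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of the Jena API.
class DataDirSizes {

    // Sums the logical sizes (in bytes) of all regular files under dir.
    // Logical size can exceed physical disk usage when files are sparse.
    static long logicalSize(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    // Maps each Data-* subdirectory name of dbRoot to its total logical size.
    // Assumes dbRoot is the top-level directory of a TDB2 database.
    static Map<String, Long> sizesByDataDir(Path dbRoot) throws IOException {
        Map<String, Long> sizes = new TreeMap<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(dbRoot, "Data-*")) {
            for (Path dir : dirs) {
                if (Files.isDirectory(dir)) {
                    sizes.put(dir.getFileName().toString(), logicalSize(dir));
                }
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // The database location is taken from the first argument,
        // defaulting to the current directory.
        Path dbRoot = Paths.get(args.length > 0 ? args[0] : ".");
        sizesByDataDir(dbRoot).forEach(
            (name, bytes) -> System.out.printf("%s\t%d bytes%n", name, bytes));
    }
}
```

Run against a database that has just been compacted without deleting the old directory, this would list both the previous and the new `Data-NNNN` directory, which shows why total disk usage can rise until the old one is removed.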