[
https://issues.apache.org/jira/browse/JENA-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182752#comment-14182752
]
Andy Seaborne commented on JENA-804:
------------------------------------
There are several areas where space usage can only grow. Not all of them are
factors in this report, but they are listed here for completeness:
*Node Tables* Slots are reused but not GC'ed. BNodes and randomly generated
URIs will not be reused.
*Memory mapped files* The files are allocated in 8M chunks and some OSes report
these sparse files in different ways. It can look like large jumps occur when
in fact it's incremental growth. Most noticeable with small databases and
growing ones. FAQ.
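As a rough illustration of the chunking effect, the sketch below takes the sizes from the table in the quoted report and splits each increase between listed rows into whole 8M chunks plus a remainder. It assumes "8M" means 8 MiB (8388608 bytes); the helper name `split_growth` is made up for this example. It is only arithmetic on the reported numbers, not a measurement of TDB internals.

```python
CHUNK = 8 * 1024 * 1024  # assumption: "8M" means 8 MiB

# (label, size in bytes) copied from the table in the quoted report
sizes = [
    ("Start", 203239424),
    ("Reindex 5", 262066176),
    ("Reindex 6", 262086656),
    ("Reindex 10", 312500224),
    ("Reindex 11", 312520704),
    ("Reindex 12", 312541184),
    ("Reindex 13", 312586240),
    ("Reindex 14", 320995328),
    ("Reindex 15", 346181632),
    ("Reindex 16", 346198538),
    ("Reindex 17", 362999808),
    ("Reindex 18", 363020288),
    ("Reindex 19", 363040768),
    ("Reindex 20", 363061248),
    ("Reindex 21", 363081728),
    ("Reindex 22", 371490816),
    ("Reindex 23", 396677120),
]


def split_growth(delta, chunk=CHUNK):
    """Return (whole_chunks, remainder_bytes) for a size increase."""
    return divmod(delta, chunk)


# Growth between consecutive listed rows (some reindex runs are not listed)
rows = []
previous = sizes[0][1]
for label, size in sizes[1:]:
    chunks, rest = split_growth(size - previous)
    rows.append((label, chunks, rest))
    previous = size

for label, chunks, rest in rows:
    print(f"{label:<11} {chunks} x 8 MiB + {rest} bytes")
```

Several of the larger jumps (Reindex 14, 15, 22 and 23) come out as a whole number of 8 MiB chunks plus 20480 bytes, which is at least consistent with chunked allocation on top of small incremental growth.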
*tdbloader2* (and to some extent tdbloader1) tdbloader2 creates maximally
packed threaded B+Trees. As data is added, the blocks become fragmented,
tending towards a more normal distribution of block packing sizes for B+Trees
and the indexes expand.
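The packing effect can be sketched with a toy model of leaf blocks only. This is not TDB's B+Tree code: the block capacity, key range, seed and function names are arbitrary assumptions chosen for illustration. It bulk-loads sorted keys into fully packed leaves, then applies random inserts with a split-in-half policy and compares the average fill.

```python
import bisect
import random


def bulk_load(keys, capacity):
    """Pack sorted keys into leaves that are as full as possible."""
    keys = sorted(keys)
    return [keys[i:i + capacity] for i in range(0, len(keys), capacity)]


def insert(leaves, key, capacity):
    """Insert one key, splitting the target leaf in half if it overflows."""
    firsts = [leaf[0] for leaf in leaves]  # simple linear scan, fine for a toy
    idx = max(bisect.bisect_right(firsts, key) - 1, 0)
    leaf = leaves[idx]
    bisect.insort(leaf, key)
    if len(leaf) > capacity:
        mid = len(leaf) // 2
        leaves[idx:idx + 1] = [leaf[:mid], leaf[mid:]]


def average_fill(leaves, capacity):
    """Fraction of available slots that hold a key."""
    return sum(len(leaf) for leaf in leaves) / (len(leaves) * capacity)


rng = random.Random(42)
capacity = 50  # hypothetical keys per block

# Bulk load: every leaf is full
leaves = bulk_load(rng.sample(range(1_000_000), 5_000), capacity)
packed_blocks = len(leaves)
packed_fill = average_fill(leaves, capacity)

# Incremental updates: random inserts force splits and leave blocks part-full
for _ in range(5_000):
    insert(leaves, rng.randrange(1_000_000), capacity)

fragmented_blocks = len(leaves)
fragmented_fill = average_fill(leaves, capacity)

print(f"packed:     {packed_blocks} blocks, fill {packed_fill:.2f}")
print(f"fragmented: {fragmented_blocks} blocks, fill {fragmented_fill:.2f}")
```

In this model the key count doubles but the block count more than doubles, because split blocks are left partly empty. That is the sense in which an index built by a bulk loader expands once ordinary updates start.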
*Free Block Management* Blocks are not recycled across restarts.
*Transactions* Currently, recycling freed blocks across transactions does not
work properly. A transaction is like the single JVM case.
Things that can be done fall into three classes:
# requiring a change of the on-disk file format (data reload)
# a one way version change (no data reload, can't simply downgrade)
# in-JVM changes only
_These are just some ideas for the transactions/restart cases, not a
comprehensive list._
*Disk format changes 1* A comprehensive solution is to use MVCC for the
indexes. This change has various other advantages as well, including speeding
up transactions because there is no persistent copy (no data write-read via the
write-ahead journal); it avoids certain burstiness in performance in the
presence of writes.
*Disk format changes 2* Free list management on-disk. Depending on scope and how
this is done, it can blur into just a one way version upgrade.
*Version changes* The journal could be made more intelligent about freed
blocks. Even without needing data reload, there might be something that can be
done to manage some/all freed blocks across transactions and restart.
*In-JVM* Getting better cross-transaction block reuse, possibly with a
variation of the above.
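To make the free-block point concrete, here is a minimal toy model, not TDB's actual block manager. The class `ToyBlockFile`, its methods and the `persist_free_list` flag are hypothetical names for this sketch. It compares file growth when the free list lives only in memory with growth when the free list survives a restart.

```python
class ToyBlockFile:
    """Toy block file with a free list (illustrative only, not TDB code)."""

    def __init__(self, persist_free_list):
        self.persist_free_list = persist_free_list
        self.num_blocks = 0  # blocks ever allocated, i.e. file size in blocks
        self.free = []       # ids of freed blocks available for reuse

    def allocate(self):
        """Reuse a freed block if one is known, otherwise grow the file."""
        if self.free:
            return self.free.pop()
        self.num_blocks += 1
        return self.num_blocks - 1

    def release(self, block_id):
        """Mark a block as free; the file itself never shrinks."""
        self.free.append(block_id)

    def restart(self):
        """Simulate a restart: an in-memory-only free list is forgotten."""
        if not self.persist_free_list:
            self.free = []


def run_cycles(block_file, cycles, blocks_per_cycle, restart_each_cycle):
    """Repeatedly add then delete the same amount of data."""
    for _ in range(cycles):
        ids = [block_file.allocate() for _ in range(blocks_per_cycle)]
        for block_id in ids:
            block_file.release(block_id)
        if restart_each_cycle:
            block_file.restart()
    return block_file.num_blocks


same_jvm = run_cycles(ToyBlockFile(False), 5, 100, restart_each_cycle=False)
with_restarts = run_cycles(ToyBlockFile(False), 5, 100, restart_each_cycle=True)
on_disk_list = run_cycles(ToyBlockFile(True), 5, 100, restart_each_cycle=True)

print(same_jvm, with_restarts, on_disk_list)
```

In the toy, losing the free list at each restart makes the file grow by the full working set every cycle, while keeping it holds the file at its initial size. Whether the real fix is an on-disk free list, a smarter journal, or better in-JVM reuse is the open question above.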
> Jena is not reusing already allocated space on the file system which results
> in large amounts of disk space reserved by Jena files
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: JENA-804
> URL: https://issues.apache.org/jira/browse/JENA-804
> Project: Apache Jena
> Issue Type: Bug
> Components: Jena
> Affects Versions: Jena 2.11.2, TDB 1.0.2
> Environment: Windows 7, IBM JRE 1.7, Tomcat 7.0.54
> Reporter: Keith Wells
> Attachments: out.txt, test-tdb-size.sh
>
>
> We have a product based on Jena TDB where we insert quads to Jena TDB along
> with the deletion of quads. We understand the performance over space
> architectural decision to not clean up deleted nodeids from the indexes. But
> the disk usage suggests that Jena TDB is not reusing space that it had
> allocated previously. Based on this comment there
> appears to be something that is not correct on file space utilization,
> http://mail-archives.apache.org/mod_mbox/jena-users/201310.mbox/%3cce7d7929.2a707%[email protected]%3E:
> "The indexes won't shrink - TDB never gives disk space back to the OS - but
> disk space is reused when reallocated within the same JVM.".
> In this scenario on the same JVM with NO server stops or starts, we add 27765
> graphs to IndexTdb and immediately remove them, repeating this process
> several times.
> {noformat}
>              MB      Bytes  Diff (Bytes)
> Start       193  203239424
>
> Reindex 5   249  262066176      58826752
> Reindex 6   249  262086656         20480
> Reindex 10  298  312500224      50413568
> Reindex 11  298  312520704         20480
> Reindex 12  298  312541184         20480
> Reindex 13  298  312586240         45056
> Reindex 14  306  320995328       8409088
> Reindex 15  330  346181632      25186304
> Reindex 16  330  346198538         16906
> Reindex 17  346  362999808      16801270
> Reindex 18  346  363020288         20480
> Reindex 19  346  363040768         20480
> Reindex 20  346  363061248         20480
> Reindex 21  346  363081728         20480
> Reindex 22  354  371490816       8409088
> Reindex 23  378  396677120      25186304
>
> End         193  203239424
> {noformat}
> The system starts with 193MB of data allocated by indexTdb. A reindex
> consists of a remove followed by an add of these graphs. As you can see from
> the data there is a dramatic increase in the size of indexTdb on the disk
> after repeatedly removing and adding graphs. After Reindex 23, there is 378
> MB of disk space used. If Jena TDB reused allocated space there would be no
> need to allocate more space other than what is used by deleted node ids
> (unless nodeid storage is eating all of this space?). Jena does not appear
> to be reusing the allocated disk space. At the very end of this scenario, we
> exported the nquads and reloaded them to show the original disk space was
> 193MB back to where it started.
> We believe Jena TDB is not reusing the space allocated by the TDB file system
> within the same JVM.