Vlastik,

The 470,000 citation items were added to our repository several years ago. With the exception of the ingest of these large citation collections, our repository growth is relatively modest each year.

After ingesting one large collection with 50,000 bitstreams, we discovered that we needed to disable the checksum checker process. We have discussed resuming that process, but it has not been a priority for us.

Aside from the checksum checker, we observed that it took longer to rebuild our discovery index as our item count increased.

Our instance runs on a single server. After ingesting the large collections, we increased the RAM on our server. We increased RAM again when we migrated from DSpace 5 to DSpace 6. We run Tomcat with -Xmx8g. We run each of our command line tasks with 2-3g of RAM depending on the task. It takes 3-4 hours to rebuild our discovery index.

Other than the RAM allocation, we have stuck with most of the recommended configuration settings. See https://wiki.duraspace.org/display/DSDOC6x/Performance+Tuning+DSpace.

Terry

On Fri, Aug 23, 2019 at 2:10 PM Vlastimil Krejčíř <[email protected]> wrote:

> Thank you Terry. How fast does your DSpace grow? How many items per month
> or year? Do you do clustering / load balancing? What kind of hardware do
> you need to run it? I would be grateful if you could share that information.
>
> Vlastik
>
> On 8/23/19 6:28 PM, Terry Brady wrote:
> > Here are some details about DigitalGeorgetown.
> >
> >   * Total items: 546,000
> >   * Public items: 397,000
> >   * Citation only items: ~470,000
> >
> > As we tested and migrated to DSpace 6x, we did encounter a few
> > performance issues. We have contributed patches to DSpace 6x releases
> > (and to the future DSpace 6.4 release) to help resolve these issues.
> >
> > We preserve our assets in the APTrust (Academic Preservation Trust)
> > service, so we do not run the DSpace checksum checker on our DSpace
> > instance.
> >
> > Terry
> >
> > On Fri, Aug 23, 2019 at 7:48 AM Tim Donohue <[email protected]> wrote:
> >
> >     Hello Vlastimil,
> >
> >     Unfortunately, the size of DSpace sites is very difficult to track
> >     overall (it relies entirely on self reporting).
> >
> >     I know there are very large sites out there... a few that come to
> >     mind are U of Cambridge (https://www.repository.cam.ac.uk), and
> >     Georgetown University (https://repository.library.georgetown.edu/).
> >     I cannot claim to know exactly how large the sites are though, as
> >     each of these sites may have access restricted content (which is
> >     not even visible on the web). However, in terms of public content
> >     alone each has 250-350 thousand items.
> >
> >     I also admit that I don't know whether there are larger sites out
> >     there. But, maybe institutions on this mailing list will
> >     self-report if they have more than 400 thousand items. (I know I'd
> >     love to hear which sites have >400K items!)
> >
> >     I think Mark Wood gave a thorough answer regarding the number of
> >     items possible in a DSpace. Technically, the biggest limitation is
> >     the amount of server space & memory available (as larger sites need
> >     more of each). For each release we attempt to make DSpace as
> >     performant (and memory lean) as we can, and as memory issues are
> >     reported we resolve them as bugs in a new release. For example, for
> >     the upcoming DSpace 7 release (which is still under active
> >     development) we are running more detailed performance testing as
> >     detailed here:
> >     https://wiki.duraspace.org/display/DSPACE/DSpace+7+Performance+Testing
> >     At this time, that performance testing is more geared towards
> >     minimizing CPU load and memory overall (which will also help in
> >     scaling).
> >
> >     Tim
> >
> >     ------------------------------------------------------------------------
> >     *From:* [email protected] <[email protected]> on behalf of
> >     Vlastimil Krejčíř <[email protected]>
> >     *Sent:* Friday, August 23, 2019 5:57 AM
> >     *To:* DSpace Community <[email protected]>
> >     *Subject:* [dspace-community] Scalability of DSpace
> >
> >     Hi all,
> >
> >     back in April 2013 I asked the community about DSpace
> >     scalability, see:
> >
> >     http://dspace.2283337.n4.nabble.com/DSpace-scalability-tens-of-hundreds-TBs-tt4662988.html#a4663047
> >
> >     Now, in 2019, it is time to ask the same question :-).
> >
> >     How much data / how many items can DSpace handle? The DSpace system
> >     at Cambridge University (https://www.repository.cam.ac.uk/) was
> >     reported as the largest then. I can see it stores about 245
> >     thousand items nowadays.
> >
> >     Does anyone else have a bigger one? Is there new information on
> >     scalability since 2013?
> >
> >     Regards,
> >
> >     Vlastik Krejčíř
> >
> >     --
> >     ----------------------------------------------------------------------------
> >     Vlastimil Krejčíř
> >     Library and Information Centre, Institute of Computer Science
> >     Masaryk University, Brno, Czech Republic
> >     Email: krejcir (at) ics (dot) muni (dot) cz
> >     Phone: +420 549 49 3872
> >     OpenPGP key: https://kic-internal.ics.muni.cz/~krejvl/pgp/
> >     Fingerprint: 7800 64B2 6E20 645B 56AF C303 34CB 1495 C641 11B9
> >     ----------------------------------------------------------------------------
> >
> >     --
> >     All messages to this mailing list should adhere to the DuraSpace
> >     Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
> >     ---
> >     You received this message because you are subscribed to the Google
> >     Groups "DSpace Community" group.
> >     To unsubscribe from this group and stop receiving emails from it,
> >     send an email to [email protected].
> >     To view this discussion on the web visit
> >     https://groups.google.com/d/msgid/dspace-community/a37b7af1-59eb-4a7e-b302-196cadbed7a0%40googlegroups.com.
> >
> > --
> > Terry Brady
> > Applications Programmer Analyst
> > Georgetown University Library Information Technology
> > https://github.com/terrywbrady/info
> > 425-298-5498 (Seattle, WA)

--
Terry Brady
Applications Programmer Analyst
Georgetown University Library Information Technology
https://github.com/terrywbrady/info
425-298-5498 (Seattle, WA)

--
All messages to this mailing list should adhere to the DuraSpace Code of Conduct: https://duraspace.org/about/policies/code-of-conduct/
---
You received this message because you are subscribed to the Google Groups "DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-community/CAMp2YEz0ZDbqJix2EjnNQCpX8CV0Q5%2BKkGesOfuDAp6PdFb_AQ%40mail.gmail.com.
