Cool thread, just picking up on it now! Here are a few examples of large DSpace instances, in terms of item count:
https://repository.globethics.net/discover
Based on the Atmire Open Repository platform <http://www.openrepository.com/> (DSpace 5); the result of a migration from Fedora to DSpace.
665,000+ items

https://archives.lib.state.ma.us/discover
Based on the Atmire DSpace Express platform <https://www.atmire.com/dspace-express> (DSpace 6), where we basically offer DSpace without customizations.
717,000+ items

https://bibliotecadigital.trt7.jus.br/xmlui/discover
A Brazilian court / institute of the justice system.
911,000+ items

http://dspace.nplg.gov.ge/simple-search?query=
National Parliamentary Library of Georgia.
307,000+ items

The size of the database and of the Solr discovery index are indeed challenging. But in terms of scaling and performance, we generally find it a bigger challenge to keep response times down for repositories that are subject to a lot of traffic. So if your repository has 100k items but is very frequently visited, you may have a bigger challenge on your hands than a repository with half a million items that sees less usage.

cheers,
Bram

Bram Luyten
250-B Lucius Gordon Drive, Suite 3A, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
atmire.com

On Sat, 24 Aug 2019 at 00:54, Terry Brady <[email protected]> wrote:

> Vlastik,
>
> The 470,000 citation items were added to our repository several years
> ago. With the exception of the ingest of these large citation
> collections, our repository growth is relatively modest each year.
>
> After ingesting one large collection with 50,000 bitstreams, we
> discovered that we needed to disable the checksum checker process. We
> have discussed resuming that process, but it has not been a priority
> for us. Aside from the checksum checker, we observed that it took
> longer to rebuild our discovery index as our item count increased.
>
> Our instance runs on a single server.
> After ingesting the large collections, we increased the RAM on our
> server. We increased RAM further when we migrated from DSpace 5 to
> DSpace 6. We run Tomcat with -Xmx8g. We run each of our command-line
> tasks with 2-3g of RAM, depending on the task. It takes 3-4 hours to
> rebuild our discovery index. Other than the RAM allocation, we have
> stuck with most of the recommended configuration settings. See
> https://wiki.duraspace.org/display/DSDOC6x/Performance+Tuning+DSpace.
>
> Terry
>
> On Fri, Aug 23, 2019 at 2:10 PM Vlastimil Krejčíř <[email protected]> wrote:
>
>> Thank you, Terry. How quickly does your DSpace grow? How many items
>> per month or year? Do you do clustering / load balancing? What kind
>> of hardware do you need to run it? I would be grateful if you could
>> share that information.
>>
>> Vlastik
>>
>> On 8/23/19 6:28 PM, Terry Brady wrote:
>> > Here are some details about DigitalGeorgetown.
>> >
>> >  * Total items: 546,000
>> >  * Public items: 397,000
>> >  * Citation-only items: ~470,000
>> >
>> > As we tested and migrated to DSpace 6.x, we did encounter a few
>> > performance issues. We have contributed patches to DSpace 6.x
>> > releases (and to the future DSpace 6.4 release) to help resolve
>> > these issues.
>> >
>> > We preserve our assets in the APTrust (Academic Preservation Trust)
>> > service, so we do not run the DSpace checksum checker on our DSpace
>> > instance.
>> >
>> > Terry
>> >
>> > On Fri, Aug 23, 2019 at 7:48 AM Tim Donohue <[email protected]> wrote:
>> >
>> > Hello Vlastimil,
>> >
>> > Unfortunately, the size of DSpace sites is very difficult to track
>> > overall (it relies entirely on self-reporting).
>> >
>> > I know there are very large sites out there... a few that come to
>> > mind are the University of Cambridge (https://www.repository.cam.ac.uk/)
>> > and Georgetown University (https://repository.library.georgetown.edu/).
>> > I cannot claim to know exactly how large these sites are, though,
>> > as each of them may have access-restricted content (which is not
>> > even visible on the web). However, in terms of public content
>> > alone, each has 250-350 thousand items.
>> >
>> > I also admit that I don't know whether there are larger sites out
>> > there. But maybe institutions on this mailing list will self-report
>> > if they have more than 400 thousand items. (I know I'd love to hear
>> > which sites have >400K items!)
>> >
>> > I think Mark Wood gave a thorough answer regarding the number of
>> > items possible in a DSpace instance. Technically, the biggest
>> > limitation is the amount of server space & memory available (as
>> > larger sites need more of each). For each release we attempt to
>> > make DSpace as performant (and memory-lean) as we can, and as
>> > memory issues are reported we resolve them as bugs in a new
>> > release. For example, for the upcoming DSpace 7 release (which is
>> > still under active development) we are running more detailed
>> > performance testing, as described here:
>> > https://wiki.duraspace.org/display/DSPACE/DSpace+7+Performance+Testing
>> > At this time, that performance testing is more geared towards
>> > minimizing CPU load and memory overall (which will also help in
>> > scaling).
>> >
>> > Tim
>> >
>> > ------------------------------------------------------------------------
>> > From: [email protected] on behalf of
>> > Vlastimil Krejčíř <[email protected]>
>> > Sent: Friday, August 23, 2019 5:57 AM
>> > To: DSpace Community <[email protected]>
>> > Subject: [dspace-community] Scalability of DSpace
>> >
>> > Hi all,
>> >
>> > Back in April 2013 I asked the community about DSpace scalability, see:
>> > http://dspace.2283337.n4.nabble.com/DSpace-scalability-tens-of-hundreds-TBs-tt4662988.html#a4663047
>> >
>> > Now, in 2019, it is time to ask the same question :-).
>> >
>> > How much data / how many items can DSpace handle? The DSpace system
>> > at Cambridge University (https://www.repository.cam.ac.uk/) was
>> > reported as the largest back then. I can see it stores about 245
>> > thousand items nowadays.
>> >
>> > Does anyone else have a bigger one? Is there any new information on
>> > scalability since 2013?
>> >
>> > Regards,
>> >
>> > Vlastik Krejčíř
>> >
>> > ----------------------------------------------------------------------------
>> > Vlastimil Krejčíř
>> > Library and Information Centre, Institute of Computer Science
>> > Masaryk University, Brno, Czech Republic
>> > Email: krejcir (at) ics (dot) muni (dot) cz
>> > Phone: +420 549 49 3872
>> > OpenPGP key: https://kic-internal.ics.muni.cz/~krejvl/pgp/
>> > Fingerprint: 7800 64B2 6E20 645B 56AF C303 34CB 1495 C641 11B9
>> > ----------------------------------------------------------------------------
>> >
>> > --
>> > All messages to this mailing list should adhere to the DuraSpace Code
>> > of Conduct: https://duraspace.org/about/policies/code-of-conduct/
>> > ---
>> > You received this message because you are subscribed to the Google
>> > Groups "DSpace Community" group.
>> > To unsubscribe from this group and stop receiving emails from it,
>> > send an email to [email protected].
>> > To view this discussion on the web visit
>> > https://groups.google.com/d/msgid/dspace-community/a37b7af1-59eb-4a7e-b302-196cadbed7a0%40googlegroups.com.
>> >
>> > --
>> > Terry Brady
>> > Applications Programmer Analyst
>> > Georgetown University Library Information Technology
>> > https://github.com/terrywbrady/info
>> > 425-298-5498 (Seattle, WA)
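[Editor's note: for readers who want to try the tuning Terry describes, the settings map roughly onto a standard DSpace 6.x install as sketched below. This is a hedged sketch, not Georgetown's actual configuration: the `[dspace]` install path and the `setenv.sh` location are assumptions, and the heap sizes are simply the values quoted in the thread.]

```shell
# Tomcat heap for the DSpace webapps (typically set in Tomcat's bin/setenv.sh).
# -Xmx8g is the value Terry reports using after the DSpace 5 -> 6 migration.
export JAVA_OPTS="$JAVA_OPTS -Xmx8g -Dfile.encoding=UTF-8"

# Command-line tasks run in their own JVM and get a smaller heap (2-3g
# depending on the task, per the thread):
export JAVA_OPTS="-Xmx3g -Dfile.encoding=UTF-8"

# Full rebuild of the Solr discovery index (the step that took Terry's
# instance 3-4 hours at ~546,000 items):
[dspace]/bin/dspace index-discovery -b

# The checksum checker Terry disabled is normally run from cron; -l loops
# through every bitstream once:
# [dspace]/bin/dspace checker -l
```

The general tuning recommendations these values deviate from (or follow) are collected on the wiki page Terry links: https://wiki.duraspace.org/display/DSDOC6x/Performance+Tuning+DSpace.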
