Cool thread, just picking up on it now! Here are a few examples of large
DSpace instances, in terms of item count:

https://repository.globethics.net/discover
Based on the Atmire Open Repository platform
<http://www.openrepository.com/> (DSpace 5)
Result of a migration from Fedora to DSpace.
665,000+ items

https://archives.lib.state.ma.us/discover
Based on the Atmire DSpace Express platform
<https://www.atmire.com/dspace-express> (DSpace 6), where we basically
offer DSpace without customizations
717,000+ items

https://bibliotecadigital.trt7.jus.br/xmlui/discover
A court in the Brazilian justice system
911,000+ items

http://dspace.nplg.gov.ge/simple-search?query=
The National Parliamentary Library of Georgia
307,000+ items

The size of the database and of the Solr Discovery index are indeed
challenging. But scaling- and performance-wise, we generally find it a
bigger challenge to keep response times down for repositories that
receive a lot of traffic.

So if your repository has 100k items but is very frequently visited, you
may have a bigger challenge on your hands than a repository with half a
million items that sees less usage.
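
A quick way to tell which situation you are in is to load test a typical
page and watch response times as concurrency grows. A minimal sketch with
ApacheBench (the URL and the numbers are illustrative, not taken from any
of the sites above):

  # 1,000 requests, 10 at a time, against a Discovery page:
  ab -n 1000 -c 10 "https://your-repo.example.org/discover"

The mean time per request and the slowest-request tail in the output are
the figures to watch; item count alone says little about either.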

cheers,

Bram

Bram Luyten
250-B Lucius Gordon Drive, Suite 3A, West Henrietta, NY 14586
Gaston Geenslaan 14, 3001 Leuven, Belgium
atmire.com


On Sat, 24 Aug 2019 at 00:54, Terry Brady <[email protected]>
wrote:

> Vlastik,
>
> The 470,000 citation items were added to our repository several years
> ago.  With the exception of the ingest of these large citation collections,
> our repository growth is relatively modest each year.
>
> After ingesting one large collection with 50,000 bitstreams, we discovered
> that we needed to disable the checksum checker process.  We have discussed
> resuming that process, but it has not been a priority for us.  Aside from
> the checksum checker, we observed that it took longer to rebuild our
> discovery index as our item count increased.
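>
> For context, the checksum checker is normally scheduled via cron, so
> disabling it amounts to commenting out that entry. A sketch of such an
> entry (the schedule is illustrative; [dspace] is the install directory):
>
>   # Weekly checksum verification, commented out to suspend it:
>   # 0 4 * * 0  [dspace]/bin/dspace checker -l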
>
> Our instance runs on a single server.  After ingesting the large
> collections, we increased the RAM on our server, and increased it again
> when we migrated from DSpace 5 to DSpace 6.  We run Tomcat with -Xmx8g,
> and each of our command-line tasks with 2-3g of RAM depending on the
> task.  It takes 3-4 hours to rebuild our Discovery index.  Other
> than the RAM allocation, we have stuck with most of the recommended
> configuration settings.  See
> https://wiki.duraspace.org/display/DSDOC6x/Performance+Tuning+DSpace.
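>
> For reference, a minimal sketch of where those allocations live (paths
> and values are illustrative, mirroring the figures above):
>
>   # Tomcat heap, e.g. in $CATALINA_HOME/bin/setenv.sh:
>   export JAVA_OPTS="$JAVA_OPTS -Xmx8g"
>
>   # The command-line launcher picks up JAVA_OPTS as well, e.g. a full
>   # Discovery rebuild with a 3g heap:
>   JAVA_OPTS="-Xmx3g" [dspace]/bin/dspace index-discovery -b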
>
> Terry
>
> On Fri, Aug 23, 2019 at 2:10 PM Vlastimil Krejčíř <[email protected]>
> wrote:
>
>> Thank you, Terry. How fast does your DSpace instance grow? How many
>> items per month or year? Do you do clustering / load balancing? What
>> kind of hardware do you need to run it? I would be grateful if you
>> could share that information.
>>
>> Vlastik
>>
>> On 8/23/19 6:28 PM, Terry Brady wrote:
>> > Here are some details about DigitalGeorgetown.
>> >
>> >   * Total items: 546,000
>> >   * Public items: 397,000
>> >   * Citation only items: ~470,000
>> >
>> > As we tested and migrated to DSpace 6.x, we encountered a few
>> > performance issues.  We have contributed patches to DSpace 6.x releases
>> > (and to the future DSpace 6.4 release) to help resolve them.
>> >
>> > We preserve our assets in the APTrust (Academic Preservation Trust)
>> > service, so we do not run the DSpace checksum checker on our DSpace
>> > instance.
>> >
>> > Terry
>> >
>> > On Fri, Aug 23, 2019 at 7:48 AM Tim Donohue <[email protected]> wrote:
>> >
>> >     Hello Vlastimil,
>> >
>> >     Unfortunately, the size of DSpace sites is very difficult to track
>> >     overall (it relies entirely on self-reporting).
>> >
>> >     I know there are very large sites out there... a few that come to
>> >     mind are the University of Cambridge
>> >     (https://www.repository.cam.ac.uk/) and Georgetown University
>> >     (https://repository.library.georgetown.edu/).  I cannot claim to
>> >     know exactly how large those sites are, though, as each may have
>> >     access-restricted content (which is not even visible on the web).
>> >     However, in terms of public content alone, each has 250-350
>> >     thousand items.
>> >
>> >     I also admit that I don't know whether there are larger sites out
>> >     there.  But maybe institutions on this mailing list will
>> >     self-report if they have more than 400 thousand items. (I know I'd
>> >     love to hear which sites have >400K items!)
>> >
>> >     I think Mark Wood gave a thorough answer regarding the number of
>> >     items possible in a DSpace instance.  Technically, the biggest
>> >     limitation is the amount of server space & memory available (as
>> >     larger sites need more of each).  For each release we attempt to
>> >     make DSpace as performant (and memory-lean) as we can, and as
>> >     memory issues are reported we resolve them as bugs in a new
>> >     release.  For example, for the upcoming DSpace 7 release (which is
>> >     still under active development) we are running more detailed
>> >     performance testing, as described here:
>> >     https://wiki.duraspace.org/display/DSPACE/DSpace+7+Performance+Testing
>> >     At this time, that performance testing is geared more towards
>> >     minimizing CPU load and overall memory use (which will also help
>> >     with scaling).
>> >
>> >     Tim
>> >
>> >
>>  ------------------------------------------------------------------------
>> >     *From:* [email protected] on behalf of
>> >     Vlastimil Krejčíř <[email protected]>
>> >     *Sent:* Friday, August 23, 2019 5:57 AM
>> >     *To:* DSpace Community <[email protected]>
>> >     *Subject:* [dspace-community] Scalability of DSpace
>> >
>> >     Hi all,
>> >
>> >     back in April 2013, I asked the community about DSpace
>> >     scalability; see:
>> >
>> >     http://dspace.2283337.n4.nabble.com/DSpace-scalability-tens-of-hundreds-TBs-tt4662988.html#a4663047
>> >
>> >     Now, in 2019, it is time to ask the same question :-).
>> >
>> >     How much data / how many items can DSpace handle?  The DSpace
>> >     system at Cambridge University (https://www.repository.cam.ac.uk/)
>> >     was reported as the largest back then; I can see it stores about
>> >     245 thousand items nowadays.
>> >
>> >     Does anyone else have a bigger one?  Is there any new information
>> >     on scalability since 2013?
>> >
>> >     Regards,
>> >
>> >     Vlastik Krejčíř
>> >
>> >     --
>> >
>>  ----------------------------------------------------------------------------
>> >     Vlastimil Krejčíř
>> >     Library and Information Centre, Institute of Computer Science
>> >     Masaryk University, Brno, Czech Republic
>> >     Email: krejcir (at) ics (dot) muni (dot) cz
>> >     Phone: +420 549 49 3872
>> >     OpenPGP key: https://kic-internal.ics.muni.cz/~krejvl/pgp/
>> >     Fingerprint: 7800 64B2 6E20 645B 56AF  C303 34CB 1495 C641 11B9
>> >
>>  ----------------------------------------------------------------------------
>> >
>> > --
>> > Terry Brady
>> > Applications Programmer Analyst
>> > Georgetown University Library Information Technology
>> > https://github.com/terrywbrady/info
>> > 425-298-5498 (Seattle, WA)
>> >
>
>
> --
> Terry Brady
> Applications Programmer Analyst
> Georgetown University Library Information Technology
> https://github.com/terrywbrady/info
> 425-298-5498 (Seattle, WA)
>
