Hi Tim, I had a chance to come back to this issue recently. Adding relationships was definitely the most time consuming part of the Simple Archive Format import. After some logging and analysis I found two problem areas with respect to importing relationships:
1. getEntityType() in ItemImportServiceImpl -> was making a DB call when the relevant information was already present on the Item. This took about 1s per call, but it is made several times per relationship added, which really added up as we have many relationships per item we are importing.

2. update() in DSpaceObjectServiceImpl -> this was also called several times per Item; in some cases it took only 200ms, but in others up to 3s. Since I don't believe you can assign a Place in the Simple Archive Format import, I don't think this was doing anything of value, so it was just commented out for our purposes.

These two tweaks took our import time from just over 1 minute per Item to just over 20s, though even 20s per Item is still far from ideal as we are looking at importing roughly 100,000 items. Without these import tweaks that is roughly 70 days of 24x7 import processing. With these tweaks it should be more like 24 days, but still a LONG time.

**Adding server/DB resources improved the times above by about 40%, so our current run time is estimated at 14 days.

It looks like 40% of the time is related to SOLR indexing at the end of the import process. Might it be more efficient to run the "index-discovery" command manually at the end of the entire migration process -- or would that likely take the same time regardless?

Steve

On Wednesday, April 12, 2023 at 1:27:17 PM UTC-4 Tim Donohue wrote:

> Hi all,
>
> There were some major performance improvements to batch importing added in
> the 7.5 release. But, they seem to have been more specific to CSV-based
> imports. We were not aware of performance issues with SAF (Simple Archive
> Format) imports (and I'm not seeing any bug tickets that are obviously
> related to that). That said, it'd be worth re-testing on 7.5 if possible,
> simply because sometimes a performance fix for one feature may also fix
> performance issues of another.
>
> I'd also recommend in this scenario providing more details about what
> exactly you are seeing, so that DSpace developers can try to reproduce the
> problem on our end (which makes it easier to find a quick solution). So,
> if you know it occurs even in small batches, it'd be good to have a sample
> batch or sample commands where you are seeing the problem (and whether it
> occurs just from the UI or also from the commandline). Please feel free
> to create a ticket for this performance issue in
> https://github.com/DSpace/DSpace/issues and share what you've found.
>
> Thanks,
>
> Tim
>
> On Monday, April 10, 2023 at 3:11:29 PM UTC-5 sk60 wrote:
>
>> We're experiencing similar issues with the Simple Archive Format imports
>> in DSpace 7. None of my imports on our test server have processed (even
>> though Processes show that the imports are complete). I have tried with
>> batches as large as 100 items and as small as 6 items.
>>
>> Shannon
>>
>> --
>> Shannon Kipphut-Smith
>> Scholarly Communications Liaison
>> Fondren Library, Rice University
>> [email protected]||(713)348-3989
>> Schedule a meeting or consultation: https://calendly.com/scholcomm
>> she|her
>>
>> On Wed, Apr 5, 2023 at 12:23 PM Stephen Brush <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We are looking at importing a large number of items as part of our
>>> launch (~200,000). Imports seemed to be slow from the start but were at
>>> least tolerable. As the size of the repository has grown import time has
>>> grown significantly along with it. I've tried different "batch" sizes to
>>> see if that had an impact but the pattern still seems to be the same.
>>>
>>> Currently importing 100 items is taking well over 1 hour. I should
>>> mention that the resources involved could be scaled up further -- but I
>>> assume they should be sufficient for the tasks this involves (exception
>>> maybe SOLR as that's less familiar to me).
>>> Based on how fast SOLR indexes items using the "index-discovery"
>>> command I can't see it being so slow here.
>>>
>>> Is this a known or common problem? Is there anything others have done to
>>> speed this up?
>>>
>>> To be clear in this instance I am referring to Item Import via Simple
>>> Archive Format -- though I've noticed similar behaviour with the CSV import
>>> capabilities via the UI.
>>>
>>> We are on v7.3 currently.
>>>
>>> Thanks,
>>>
>>> Steve
>>>
>>> --
>>> All messages to this mailing list should adhere to the Code of Conduct:
>>> https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "DSpace Community" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/dspace-community/b190489d-bad0-4d22-9282-b9c610121cd1n%40googlegroups.com.
>>>
>>

--
All messages to this mailing list should adhere to the Code of Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
---
You received this message because you are subscribed to the Google Groups "DSpace Community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-community/6230e51f-7be3-42eb-b330-4c8791a1b668n%40googlegroups.com.
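
For anyone wanting to sanity-check the run-time estimates in the latest message of this thread, here is a rough back-of-the-envelope sketch in Python. It assumes a strictly sequential import running 24x7 and uses the rounded per-Item figures quoted above (60s and 20s, where the thread says "just over" each), so it lands slightly below the 70 / 24 / 14 day figures. The function and variable names are made up for illustration and are not part of DSpace.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def import_days(items: int, seconds_per_item: float) -> float:
    """Days of continuous (24x7) processing for a purely sequential import."""
    return items * seconds_per_item / SECONDS_PER_DAY


ITEMS = 100_000  # "roughly 100,000 items"

# Rounded-down inputs: the thread reports "just over" 1 minute and 20s.
baseline_days = import_days(ITEMS, 60)  # before the two code tweaks
tweaked_days = import_days(ITEMS, 20)   # after the two code tweaks

# "Adding server/DB resources improved these times by about 40%"
scaled_days = tweaked_days * (1 - 0.40)

# "40% of time is related to SOLR indexing": the share of the remaining
# run time spent indexing, if that percentage applies to the whole run
# (an assumption).
indexing_days = scaled_days * 0.40

print(f"baseline:        {baseline_days:.1f} days")
print(f"with tweaks:     {tweaked_days:.1f} days")
print(f"more resources:  {scaled_days:.1f} days")
print(f"of which SOLR:   {indexing_days:.1f} days")
```

The last figure is only the portion of the estimate attributed to indexing. Whether running "index-discovery" once at the end would recover any of it is exactly the open question above, and this arithmetic does not answer it.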
