One of the two fixes would be a real solution. The other was just a band-aid, but I will take a closer look to see if it can be done more elegantly with no side effects.
Will send in a PR at some point.

Steve

On Friday, April 21, 2023 at 10:58:36 AM UTC-4 Tim Donohue wrote:

> Hi Steve,
>
> I'm not able to easily answer these questions as I don't have the full
> context (nor am I the expert on all the code in DSpace...I don't write much
> code these days. I'm more of a technical coordinator).
>
> However, it sounds like you may have found a performance bug and *maybe* a
> solution? If so, could you please create a ticket to describe the
> performance issues you see and possibly even send us a Pull Request with
> the fixes you noted can speed things up? That way I can pass this along to
> other developers who *can help answer the questions*. As you might expect
> with any open source project, things get better when people contribute
> fixes they've found. So, if you can find time to send us more information
> via GitHub, it may help immensely... and it might be that you've stumbled
> on an undiscovered issue with this import process.
>
> Here's where to submit a ticket (and PR):
> https://github.com/DSpace/dspace/issues
>
> Thanks in advance... if you aren't able to send a PR, just creating a
> ticket & linking us to the "two tweaks" you made to speed things up might
> be enough to get started.
>
> Tim
>
> On Wednesday, April 19, 2023 at 11:57:57 AM UTC-5 [email protected]
> wrote:
>
>> Hi Tim,
>>
>> I had a chance to come back to this issue recently. Adding relationships
>> was definitely the most time consuming part of the Simple Archive Format
>> import. After some logging and analysis I found two problem areas with
>> respect to importing relationships:
>>
>> 1. getEntityType() in ItemImportServiceImpl -> was making a DB call
>> when the relevant information was already present on the Item. This was
>> about 1s per call, but is made several times per relationship added and
>> really added up as we have many relationships per item we are importing
>> 2. update() in DSpaceObjectServiceImpl -> this was also called
>> several times per Item and in some cases was only 200ms but in others was
>> up to 3s. Since I don't believe you can assign a Place in the Simple
>> Archive Format import I don't think this was doing anything of value so
>> was just commented out for our purposes.
>>
>> These two tweaks took our import time from just over 1 minute per Item to
>> just over 20s, though even 20s per Item is still far from ideal as we are
>> looking at importing roughly 100,000 items. Without these import tweaks
>> that is roughly 70 days of 24x7 import processing. With these tweaks it
>> should be more like 24 days, but still a LONG time.
>>
>> **Adding server/DB resources improved these times above by about 40% so
>> our current run time is estimated at 14 days.
>>
>> It looks like 40% of time is related to SOLR indexing at the end of the
>> import process. Might it be more efficient to run the "index-discovery"
>> command manually at the end of the entire migration process -- or would
>> that likely take the same time regardless?
>>
>> Steve
>>
>> On Wednesday, April 12, 2023 at 1:27:17 PM UTC-4 Tim Donohue wrote:
>>
>>> Hi all,
>>>
>>> There were some major performance improvements to batch importing added
>>> in the 7.5 release. But, they seem to have been more specific to CSV-based
>>> imports. We were not aware of performance issues with SAF (Simple Archive
>>> Format) imports (and I'm not seeing any bug tickets that are obviously
>>> related to that). That said, it would be worth re-testing on 7.5 if
>>> possible, simply because sometimes a performance fix for one feature may
>>> also fix performance issues of another.
>>>
>>> I'd also recommend in this scenario providing more details about what
>>> exactly you are seeing, so that DSpace developers can try to reproduce the
>>> problem on our end (which makes it easier to find a quick solution). So,
>>> if you know it occurs even in small batches, it'd be good to have a sample
>>> batch or sample commands where you are seeing the problem (and whether it
>>> occurs just from the UI or also from the commandline). Please feel free
>>> to create a ticket for this performance issue in
>>> https://github.com/DSpace/DSpace/issues and share what you've found.
>>>
>>> Thanks,
>>>
>>> Tim
>>>
>>> On Monday, April 10, 2023 at 3:11:29 PM UTC-5 sk60 wrote:
>>>
>>>> We're experiencing similar issues with the Simple Archive Format
>>>> imports in DSpace 7. None of my imports on our test server have processed
>>>> (even though Processes show that the imports are complete). I have tried
>>>> with batches as large as 100 items and as small as 6 items.
>>>>
>>>> Shannon
>>>>
>>>> --
>>>> Shannon Kipphut-Smith
>>>> Scholarly Communications Liaison
>>>> Fondren Library, Rice University
>>>> [email protected]||(713)348-3989
>>>> Schedule a meeting or consultation: https://calendly.com/scholcomm
>>>> she|her
>>>>
>>>> On Wed, Apr 5, 2023 at 12:23 PM Stephen Brush <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are looking at importing a large number of items as part of our
>>>>> launch (~200,000). Imports seemed to be slow from the start but were at
>>>>> least tolerable. As the size of the repository has grown import time has
>>>>> grown significantly along with it. I've tried different "batch" sizes to
>>>>> see if that had an impact but the pattern still seems to be the same.
>>>>>
>>>>> Currently importing 100 items is taking well over 1 hour. I should
>>>>> mention that the resources involved could be scaled up further -- but I
>>>>> assume they should be sufficient for the tasks this involves (except
>>>>> maybe SOLR as that's less familiar to me). Based on how fast SOLR indexes
>>>>> items using the "index-discovery" command I can't see it being so slow
>>>>> here.
>>>>>
>>>>> Is this a known or common problem? Is there anything others have done
>>>>> to speed this up?
>>>>>
>>>>> To be clear in this instance I am referring to Item Import via Simple
>>>>> Archive Format -- though I've noticed similar behaviour with the CSV
>>>>> import capabilities via the UI.
>>>>>
>>>>> We are on v7.3 currently.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve
>>>>>
>>>>> --
>>>>> All messages to this mailing list should adhere to the Code of
>>>>> Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "DSpace Community" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/dspace-community/b190489d-bad0-4d22-9282-b9c610121cd1n%40googlegroups.com
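For anyone following the numbers in this thread, the import-time estimates (just over 1 minute per Item before the two tweaks, just over 20s after, roughly 100,000 items, and about 40% faster again with more server/DB resources) can be checked with a quick calculation. This is only a back-of-the-envelope sketch in Python, assuming items are imported strictly one after another and rounding the per-Item times down to 60s and 20s. The function and variable names are made up for illustration and are not part of DSpace.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def import_days(items: int, seconds_per_item: float) -> float:
    """Days of continuous (24x7) processing for a strictly sequential import."""
    return items * seconds_per_item / SECONDS_PER_DAY


ITEMS = 100_000  # approximate item count quoted in the thread

before = import_days(ITEMS, 60)  # "just over 1 minute per Item", rounded to 60s
after = import_days(ITEMS, 20)   # "just over 20s" per Item, rounded to 20s
scaled = after * (1 - 0.40)      # "about 40%" improvement from extra resources

print(f"before tweaks:        {before:.1f} days")   # 69.4
print(f"after tweaks:         {after:.1f} days")    # 23.1
print(f"with extra resources: {scaled:.1f} days")   # 13.9
```

The results (about 69, 23 and 14 days) line up with the "roughly 70", "24" and "14" days mentioned in the thread. The small differences come from the per-Item times being "just over" the rounded values used here.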
