Hi Steve,

I'm not able to easily answer these questions, as I don't have the full context. Nor am I the expert on all the code in DSpace: I don't write much code these days and am more of a technical coordinator.

However, it sounds like you may have found a performance bug and *maybe* a solution? If so, could you please create a ticket describing the performance issues you see, and possibly even send us a Pull Request with the fixes you noted can speed things up? That way I can pass this along to other developers who *can help answer the questions*.

As you might expect with any open source project, things get better when people contribute fixes they've found. So, if you can find time to send us more information via GitHub, it may help immensely... and it might be that you've stumbled on an undiscovered issue with this import process.

Here's where to submit a ticket (and PR): https://github.com/DSpace/dspace/issues

Thanks in advance... if you aren't able to send a PR, just creating a ticket & linking us to the "two tweaks" you made to speed things up might be enough to get started.

Tim

On Wednesday, April 19, 2023 at 11:57:57 AM UTC-5 [email protected] wrote:

> Hi Tim,
>
> I had a chance to come back to this issue recently. Adding relationships was definitely the most time-consuming part of the Simple Archive Format import. After some logging and analysis I found two problem areas with respect to importing relationships:
>
> 1. getEntityType() in ItemImportServiceImpl -> was making a DB call when the relevant information was already present on the Item. This was about 1s per call, but it is made several times per relationship added, and it really added up as we have many relationships per item we are importing.
> 2. update() in DSpaceObjectServiceImpl -> this was also called several times per Item, and in some cases took only 200ms but in others up to 3s. Since I don't believe you can assign a Place in the Simple Archive Format import, I don't think this was doing anything of value, so it was just commented out for our purposes.
>
> These two tweaks took our import time from just over 1 minute per Item to just over 20s, though even 20s per Item is still far from ideal as we are looking at importing roughly 100,000 items. Without these import tweaks that is roughly 70 days of 24x7 import processing. With these tweaks it should be more like 24 days, but still a LONG time.
>
> **Adding server/DB resources improved these times above by about 40% so our current run time is estimated at 14 days.
>
> It looks like 40% of time is related to SOLR indexing at the end of the import process. Might it be more efficient to run the "index-discovery" command manually at the end of the entire migration process -- or would that likely take the same time regardless?
>
> Steve
>
> On Wednesday, April 12, 2023 at 1:27:17 PM UTC-4 Tim Donohue wrote:
>
>> Hi all,
>>
>> There were some major performance improvements to batch importing added in the 7.5 release. But, they seem to have been more specific to CSV-based imports. We were not aware of performance issues with SAF (Simple Archive Format) imports (and I'm not seeing any bug tickets that are obviously related to that). That said, it'd be worth re-testing on 7.5 if possible, simply because sometimes a performance fix for one feature may also fix performance issues of another.
>>
>> I'd also recommend in this scenario providing more details about what exactly you are seeing, so that DSpace developers can try to reproduce the problem on our end (which makes it easier to find a quick solution). So, if you know it occurs even in small batches, it'd be good to have a sample batch or sample commands where you are seeing the problem (and whether it occurs just from the UI or also from the commandline). Please feel free to create a ticket for this performance issue in https://github.com/DSpace/DSpace/issues and share what you've found.
>>
>> Thanks,
>>
>> Tim
>>
>> On Monday, April 10, 2023 at 3:11:29 PM UTC-5 sk60 wrote:
>>
>>> We're experiencing similar issues with the Simple Archive Format imports in DSpace 7. None of my imports on our test server have processed (even though Processes show that the imports are complete). I have tried with batches as large as 100 items and as small as 6 items.
>>>
>>> Shannon
>>>
>>> --
>>> Shannon Kipphut-Smith
>>> Scholarly Communications Liaison
>>> Fondren Library, Rice University
>>> [email protected] | (713)348-3989
>>> Schedule a meeting or consultation: https://calendly.com/scholcomm
>>> she|her
>>>
>>> On Wed, Apr 5, 2023 at 12:23 PM Stephen Brush <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> We are looking at importing a large number of items as part of our launch (~200,000). Imports seemed to be slow from the start but were at least tolerable. As the size of the repository has grown, import time has grown significantly along with it. I've tried different "batch" sizes to see if that had an impact, but the pattern still seems to be the same.
>>>>
>>>> Currently importing 100 items is taking well over 1 hour. I should mention that the resources involved could be scaled up further -- but I assume they should be sufficient for the tasks this involves (except maybe SOLR, as that's less familiar to me). Based on how fast SOLR indexes items using the "index-discovery" command, I can't see it being so slow here.
>>>>
>>>> Is this a known or common problem? Is there anything others have done to speed this up?
>>>>
>>>> To be clear, in this instance I am referring to Item Import via Simple Archive Format -- though I've noticed similar behaviour with the CSV import capabilities via the UI.
>>>>
>>>> We are on v7.3 currently.
>>>>
>>>> Thanks,
>>>>
>>>> Steve
>>>>
>>>> --
>>>> All messages to this mailing list should adhere to the Code of Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx
>>>> ---
>>>> You received this message because you are subscribed to the Google Groups "DSpace Community" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/dspace-community/b190489d-bad0-4d22-9282-b9c610121cd1n%40googlegroups.com.
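
The run-time estimates quoted in this thread (roughly 70 days, then 24 days, then 14 days for 100,000 items) can be reproduced with simple arithmetic. The sketch below is only a back-of-the-envelope check: it assumes a strictly sequential import with a constant per-item cost, and the helper name `import_days` is made up for illustration rather than taken from DSpace. The thread's figures come out slightly higher because the per-item times were "just over" 60s and 20s.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def import_days(item_count: int, seconds_per_item: float) -> float:
    """Days of continuous (24x7) processing for a sequential import.

    Hypothetical helper: assumes every item costs the same and that
    nothing runs in parallel.
    """
    return item_count * seconds_per_item / SECONDS_PER_DAY


items = 100_000

before = import_days(items, 60)  # "just over 1 minute per Item"
after = import_days(items, 20)   # "just over 20s" with the two tweaks
scaled = after * (1 - 0.40)      # ~40% gain from added server/DB resources

# Share of the remaining run reported as SOLR indexing (about 40%).
# This is only an upper bound on what deferring indexing could save,
# since a later full reindex has a cost of its own.
indexing_share = scaled * 0.40

print(f"before tweaks:  {before:.1f} days")
print(f"after tweaks:   {after:.1f} days")
print(f"more resources: {scaled:.1f} days")
print(f"indexing share: {indexing_share:.1f} days")
```

Changing `items` or the per-item seconds shows how sensitive the total is to per-item cost: at this scale, every second saved per item removes a little over a day from the run.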
