One of the two fixes would be a real solution. The other was just a band-aid, 
but I will take a closer look to see if it can be done more elegantly without 
affecting anything else.

Will send in a PR at some point.

Steve

On Friday, April 21, 2023 at 10:58:36 AM UTC-4 Tim Donohue wrote:

> Hi Steve,
>
> I'm not able to easily answer these questions as I don't have the full 
> context (nor am I the expert on all the code in DSpace...I don't write much 
> code these days. I'm more of a technical coordinator).
>
> However, it sounds like you may have found a performance bug and *maybe* a 
> solution?  If so, could you please create a ticket to describe the 
> performance issues you see and possibly even send us a Pull Request with 
> the fixes you noted can speed things up?  That way I can pass this along to 
> other developers who *can help answer the questions*.   As you might expect 
> with any open source project, things get better when people contribute 
> fixes they've found.  So, if you can find time to send us more information 
> via GitHub, it may help immensely... and it might be that you've stumbled 
> on an undiscovered issue with this import process.
>
> Here's where to submit a ticket (and PR): 
> https://github.com/DSpace/DSpace/issues
>
> Thanks in advance... if you aren't able to send a PR, just creating a 
> ticket & linking us to the "two tweaks" you made to speed things up might 
> be enough to get started.
>
> Tim
>
> On Wednesday, April 19, 2023 at 11:57:57 AM UTC-5 [email protected] 
> wrote:
>
>> Hi Tim,
>>
>> I had a chance to come back to this issue recently. Adding relationships 
>> was definitely the most time consuming part of the Simple Archive Format 
>> import. After some logging and analysis I found two problem areas with 
>> respect to importing relationships:
>>
>>    1. getEntityType() in ItemImportServiceImpl -> was making a DB call 
>>    when the relevant information was already present on the Item. This was 
>>    about 1s per call, but it is made several times per relationship added, 
>>    which really added up as we have many relationships per item we are 
>>    importing.
>>    2. update() in DSpaceObjectServiceImpl -> this was also called 
>>    several times per Item and in some cases took only 200ms but in others 
>>    up to 3s. Since I don't believe you can assign a Place in the Simple 
>>    Archive Format import, I don't think this was doing anything of value, 
>>    so it was just commented out for our purposes.
>>
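The idea behind the first tweak above (use what is already loaded on the Item, and only fall back to the database when it is missing) can be sketched roughly as follows. This is an illustration in Python, not the DSpace Java code: the `Item` class, `get_entity_type`, `db_lookup`, and the `ENTITY_TYPE_FIELD` name are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Dict, Optional

# Assumed metadata field name, used here only for illustration.
ENTITY_TYPE_FIELD = "dspace.entity.type"


class Item:
    """Hypothetical stand-in for an item whose metadata is already in memory."""

    def __init__(self, item_id: str, metadata: Dict[str, str]) -> None:
        self.item_id = item_id
        self.metadata = metadata


def get_entity_type(
    item: Item, db_lookup: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Return the entity type, avoiding the slow lookup when possible."""
    # Prefer the value that is already present on the item.
    value = item.metadata.get(ENTITY_TYPE_FIELD)
    if value is not None:
        return value
    # Fall back to the (slow, hypothetical) database lookup only if needed.
    return db_lookup(item.item_id)
```

With a lookup costing about 1s and several calls per relationship, skipping it whenever the value is already on the item removes that cost from the common path.
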
>> These two tweaks took our import time from just over 1 minute per Item to 
>> just over 20s, though even 20s per Item is still far from ideal as we are 
>> looking at importing roughly 100,000 items. Without these import tweaks 
>> that is roughly 70 days of 24x7 import processing. With these tweaks it 
>> should be more like 24 days, but still a LONG time. 
>>
>> Note: adding server/DB resources improved the times above by about 40%, so 
>> our current run time is estimated at 14 days. 
>>
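The estimates above can be checked with some quick arithmetic. The sketch below assumes a strictly sequential import at exactly 60s and 20s per item (the message says "just over" each figure, which is why its numbers round up slightly), and treats the 40% improvement as a flat reduction; the function name `import_days` is only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def import_days(items: int, seconds_per_item: float) -> float:
    """Days of continuous (24x7) processing for a sequential import."""
    return items * seconds_per_item / SECONDS_PER_DAY


ITEMS = 100_000

before = import_days(ITEMS, 60)  # without the two tweaks
after = import_days(ITEMS, 20)   # with the two tweaks
scaled = after * (1 - 0.40)      # after adding server/DB resources

print(f"before tweaks:  {before:.1f} days")
print(f"after tweaks:   {after:.1f} days")
print(f"more resources: {scaled:.1f} days")
```
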
>> It looks like 40% of time is related to SOLR indexing at the end of the 
>> import process. Might it be more efficient to run the "index-discovery" 
>> command manually at the end of the entire migration process -- or would 
>> that likely take the same time regardless?
>>
>> Steve
>> On Wednesday, April 12, 2023 at 1:27:17 PM UTC-4 Tim Donohue wrote:
>>
>>> Hi all,
>>>
>>> There were some major performance improvements to batch importing added 
>>> in the 7.5 release.  But, they seem to have been more specific to CSV-based 
>>> imports.  We were not aware of performance issues with SAF (Simple Archive 
>>> Format) imports (and I'm not seeing any bug tickets that are obviously 
>>> related to that).  That said, it would be worth re-testing on 7.5 if possible, 
>>> simply because sometimes a performance fix for one feature may also fix 
>>> performance issues of another.
>>>
>>> I'd also recommend in this scenario providing more details about what 
>>> exactly you are seeing, so that DSpace developers can try to reproduce the 
>>> problem on our end (which makes it easier to find a quick solution).  So, 
>>> if you know it occurs even in small batches, it'd be good to have a sample 
>>> batch or sample commands where you are seeing the problem (and whether it 
>>> occurs just from the UI or also from the commandline).   Please feel free 
>>> to create a ticket for this performance issue in 
>>> https://github.com/DSpace/DSpace/issues and share what you've found.
>>>
>>> Thanks,
>>>
>>> Tim
>>> On Monday, April 10, 2023 at 3:11:29 PM UTC-5 sk60 wrote:
>>>
>>>> We're experiencing similar issues with the Simple Archive Format 
>>>> imports in DSpace 7. None of my imports on our test server have processed 
>>>> (even though Processes show that the imports are complete). I have tried 
>>>> with batches as large as 100 items and as small as 6 items.
>>>>
>>>> Shannon
>>>>
>>>> --
>>>> Shannon Kipphut-Smith
>>>> Scholarly Communications Liaison
>>>> Fondren Library, Rice University
>>>> [email protected] | (713) 348-3989
>>>> Schedule a meeting or consultation: https://calendly.com/scholcomm
>>>> she|her
>>>>
>>>> On Wed, Apr 5, 2023 at 12:23 PM Stephen Brush <[email protected]> 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are looking at importing a large number of items as part of our 
>>>>> launch (~200,000). Imports seemed to be slow from the start but were at 
>>>>> least tolerable. As the size of the repository has grown import time has 
>>>>> grown significantly along with it. I've tried different "batch" sizes to 
>>>>> see if that had an impact but the pattern still seems to be the same. 
>>>>>
>>>>> Currently importing 100 items is taking well over 1 hour. I should 
>>>>> mention that the resources involved could be scaled up further -- but I 
>>>>> assume they should be sufficient for the tasks this involves (except 
>>>>> maybe SOLR as that's less familiar to me). Based on how fast SOLR indexes 
>>>>> items using the "index-discovery" command I can't see it being so slow 
>>>>> here.
>>>>>
>>>>> Is this a known or common problem? Is there anything others have done 
>>>>> to speed this up?
>>>>>
>>>>> To be clear, in this instance I am referring to Item Import via Simple 
>>>>> Archive Format -- though I've noticed similar behaviour with the CSV 
>>>>> import capabilities via the UI.
>>>>>
>>>>> We are on v7.3 currently.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Steve
>>>>>
>>>>> -- 
>>>>> All messages to this mailing list should adhere to the Code of 
>>>>> Conduct: https://www.lyrasis.org/about/Pages/Code-of-Conduct.aspx 
>>>>> --- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "DSpace Community" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/dspace-community/b190489d-bad0-4d22-9282-b9c610121cd1n%40googlegroups.com
>>>>>
>>>>

