Hi Alan,

I can understand the frustration. That said, you may also want to consider
whether this harvesting could be achieved in a different way.

Do you need to use the REST API for this type of major harvesting?  Could
some of it be done via OAI-PMH (which has caching built in)?
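
As a rough illustration of what an OAI-PMH harvest loop might look like, here
is a minimal Python sketch. It assumes the OAI endpoint is reachable at
something like https://example.org/oai/request and that the "oai_dc" metadata
prefix is enabled; both are assumptions on my part, so adjust for your setup.
The function names are hypothetical. It only collects record identifiers, but
the paging via resumptionToken is the part that matters.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def build_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords URL.

    When resuming, resumptionToken is sent without metadataPrefix,
    as the protocol requires.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    has_token = token_el is not None and token_el.text
    return ids, (token_el.text.strip() if has_token else None)


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")


def harvest(base_url, fetch=http_get):
    """Yield record identifiers page by page until no token is returned."""
    token = None
    while True:
        ids, token = parse_page(fetch(build_url(base_url, token)))
        yield from ids
        if not token:
            break
```

Something like "for identifier in harvest(base_url): ..." would then walk the
whole repository one page at a time. The fetch argument is only there so the
loop can be tried out against canned XML before pointing it at a live server.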

Do you need all the information you are retrieving via those multiple
"expand" fields? Is there a way to leave them out and/or only include them
when you are down to an individual item (i.e. at the /items/[item_id]
endpoint)?  I realize this may require more requests, but you might want to
analyze which is faster -- one request pulling back a lot of data vs many
requests pulling back smaller amounts of data.
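
To make that comparison concrete, here is a minimal Python sketch of how the
two strategies could be timed side by side. It assumes the REST API lives
under a base URL like https://example.org/rest, that the /items listing
returns a JSON array, and that each entry carries an "id" field; treat all of
those as assumptions to verify against your instance. The helper names are
hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request


def items_url(base, limit, offset, expand=()):
    """URL for one page of /items, optionally with expand fields."""
    params = {"limit": limit, "offset": offset}
    if expand:
        params["expand"] = ",".join(expand)
    return f"{base}/items?" + urllib.parse.urlencode(params, safe=",")


def item_url(base, item_id, expand):
    """URL for a single item with expand fields."""
    query = urllib.parse.urlencode({"expand": ",".join(expand)}, safe=",")
    return f"{base}/items/{item_id}?{query}"


def fetch_json(url):
    """GET a URL and decode the JSON response."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def one_big_request(base, limit, offset, expand, fetch=fetch_json):
    """Strategy A: a single listing request carrying every expand."""
    return fetch(items_url(base, limit, offset, expand))


def many_small_requests(base, limit, offset, expand, fetch=fetch_json):
    """Strategy B: a bare listing, then one expanded request per item."""
    listing = fetch(items_url(base, limit, offset))
    return [fetch(item_url(base, item["id"], expand)) for item in listing]


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Running timed(one_big_request, base, 100, 0, expand) and
timed(many_small_requests, base, 100, 0, expand) a few times each against the
same offset should show which approach is cheaper on your data. The second
strategy issues limit + 1 requests per page, so per-request overhead matters.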

I admit, I don't have a good solution to speeding this up in DSpace 5.x.
I'm worried the main issue here could be how the REST API code is written.
Maybe you could track down what queries are being run (per request) and see
if you can optimize their performance. But, maybe there's someone else on
this list with better ideas.
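
One way to see which queries run per request is to set
log_min_duration_statement = 0 in postgresql.conf temporarily, issue a single
REST request, and then summarize the log. Below is a minimal Python sketch of
the summarizing step. It assumes log lines of the usual form
"duration: 1.234 ms  statement: ..." or "duration: 1.234 ms  execute <name>: ...";
your log_line_prefix and log file location will differ, and the function
names are hypothetical.

```python
import re
from collections import defaultdict

# Matches the duration and SQL text on a PostgreSQL duration log line.
# Continuation lines of multi-line statements are ignored in this sketch.
DURATION_RE = re.compile(
    r"duration: ([\d.]+) ms\s+(?:statement|execute [^:]*): (.*)"
)


def normalize(sql):
    """Replace literals with '?' so similar queries group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numbers
    return re.sub(r"\s+", " ", sql).strip()


def summarize(lines):
    """Return (query, count, total_ms) tuples, slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = DURATION_RE.search(line)
        if not match:
            continue
        entry = totals[normalize(match.group(2))]
        entry[0] += 1
        entry[1] += float(match.group(1))
    rows = [(sql, count, total) for sql, (count, total) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Feeding it an open log file, e.g. summarize(open(path)), gives a list where a
query that appears hundreds of times for a single REST request would stand
out. That kind of pattern would point at the loop-per-item behavior rather
than at any one slow query.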

- Tim

On Wed, Oct 17, 2018 at 10:13 AM Alan Orth <alan.o...@gmail.com> wrote:

> Dear Tim,
>
> Thanks. It's good to know I'm not alone. It's a hard pill to swallow that
> we'll be stuck with these performance issues for at least one more year
> (currently on DSpace 5.x, only god knows when we'll get around to upgrading
> to 6.x or 7.x). In the meantime our repository will of course grow from
> 75,000 items to 100,000 or more!
>
> I tried to reduce the limit from 100 to 20 and the request does take one
> third the time to complete, but then I need to make five times as many
> requests to get the same number of records. Ouch! I guess we'll just have
> to deal with this for now... Is there any way this could be fixed by
> beefing up the database somehow? Increasing buffers, adding indexes,
> upgrading PostgreSQL, etc.? I have spare resources on the server and I want
> to use them!
>
> Cheers,
>
> On Wed, Oct 17, 2018 at 5:20 PM Tim Donohue <tdono...@duraspace.org>
> wrote:
>
>> Hi Alan,
>>
>> I suspect you are seeing slower performance with "expand" specified
>> simply because of how that "expand" parameter works.  By default, the REST
>> API calls return minimal information (to keep requests quick).  But, if you
>> require much more detailed information, the "expand" parameter is available
>> to tell the REST API "I really need more information here".  Simply put,
>> when you ask for more information, requests will take longer (obviously).
>>
>> That said, the way in which "expand" is currently implemented is NOT
>> ideal.  When you tell the DSpace 5.x or 6.x REST API that you want
>> "expand=metadata" (i.e. give me all the metadata), it literally loops
>> through all metadata fields, checks whether each one is flagged as
>> "isHidden", and adds the non-hidden ones one by one to the response:
>> https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace-rest/src/main/java/org/dspace/rest/common/Item.java#L71
>>
>> The same thing happens when you say "expand=bitstreams" (i.e. give me all
>> the bitstreams)... it literally loops through all bundles, finding all
>> accessible bitstreams, and adds them one by one to the response:
>> https://github.com/DSpace/DSpace/blob/dspace-6_x/dspace-rest/src/main/java/org/dspace/rest/common/Item.java#L121
>>
>>
>> So, as you can see, if you are including a lot of "expand" options in
>> your request, this will quickly slow things down...*unless* you decrease
>> your paging options (e.g. use a lower "limit" of 20 or similar).
>>
>> As a sidenote, the way in which our REST API handles such requests is
>> changing drastically in the DSpace 7 REST API.  In the development of DSpace 7,
>> we quickly realized that the DSpace 5.x / 6.x REST API has several areas
>> where major performance issues present themselves. This is why we are
>> deprecating this old 5.x-6.x REST API in the DSpace 7 release (it will be
>> dropped entirely in DSpace 8). DSpace 7 will be providing a brand new,
>> optimized, fully-featured REST API as a replacement. I know this doesn't
>> solve your immediate issues, but I just wanted to assure you that you are
>> not alone in finding these performance & usability issues with the current
>> REST API.
>>
>> - Tim
>>
>> On Tue, Oct 16, 2018 at 4:31 PM Alan Orth <alan.o...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> If I use several expands while iterating over the results of the REST
>>> API's /items endpoint the request takes about ten times longer than without
>>> the expands. In my unscientific benchmarks the performance is consistently
>>> poor on both our production and development DSpace instances. A few runs on
>>> each server, with and without expands:
>>>
>>> $ time curl -s 'https://production.example.com/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0' > /dev/null
>>> ...
>>> 0.35s user 0.06s system 1% cpu 25.133 total
>>> 0.31s user 0.04s system 1% cpu 25.223 total
>>> 0.27s user 0.06s system 1% cpu 27.858 total
>>>
>>> $ time curl -q 'https://production.example.com/rest/items?limit=100&offset=0' > /dev/null
>>> 0.03s user 0.01s system 1% cpu 3.085 total
>>> 0.03s user 0.01s system 1% cpu 2.800 total
>>> 0.03s user 0.02s system 1% cpu 3.008 total
>>>
>>> $ time curl -s 'https://development.example.com/rest/items?expand=metadata,bitstreams,parentCommunityList&limit=100&offset=0' > /dev/null
>>> ...
>>> 0.22s user 0.03s system 1% cpu 17.248 total
>>> 0.23s user 0.02s system 1% cpu 16.856 total
>>> 0.23s user 0.04s system 1% cpu 16.460 total
>>>
>>> $ time curl -s 'https://development.example.com/rest/items?limit=100&offset=0' > /dev/null
>>> 0.04s user 0.01s system 1% cpu 3.542 total
>>> 0.02s user 0.02s system 1% cpu 3.565 total
>>> 0.01s user 0.02s system 0% cpu 3.480 total
>>>
>>> These systems are both running Ubuntu 16.04, PostgreSQL 9.5, Java 8 (one
>>> Oracle, one OpenJDK), and DSpace 5.8 with lots of RAM, SSDs, and four or
>>> more CPUs each. Lots of people are harvesting us and it takes forever to
>>> iterate over our 75,000 items. Not to mention, if more than a few of them
>>> harvest concurrently we start returning HTTP 500 errors!
>>>
>>> Where is the bottleneck in the REST API? How can I profile this? Is this
>>> something that can be improved with a query cache or database indexes in
>>> PostgreSQL?
>>>
>>> Thanks!
>>> --
>>> Alan Orth
>>> alan.o...@gmail.com
>>> https://picturingjordan.com
>>> https://englishbulgaria.net
>>> https://mjanja.ch
>>> "In heaven all the interesting people are missing." ―Friedrich Nietzsche
>>>
>>> --
>>> All messages to this mailing list should adhere to the DuraSpace Code of
>>> Conduct: https://duraspace.org/about/policies/code-of-conduct/
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "DSpace Technical Support" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to dspace-tech+unsubscr...@googlegroups.com.
>>> To post to this group, send email to dspace-tech@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/dspace-tech.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>> --
>> Tim Donohue
>> Technical Lead for DSpace & DSpaceDirect
>> DuraSpace.org | DSpace.org | DSpaceDirect.org
>>
>