Re: [Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks

2013-01-30 Thread Lighton Phiri
Could you create a wiki page afterwards?
Sure, I'll put up the wiki page as soon as I am done.

If you're doing a comparison, could you also include AIPs?
Not a problem. I've taken note of this. Are you interested in specific
metrics? I ask because my interest in these benchmarks is purely
response times relative to scale.
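
For what it's worth, measuring response time relative to scale could be sketched roughly as below. This is only a minimal sketch: `time_batches` and `fake_ingest` are hypothetical names, and `fake_ingest` is a stand-in for whatever actually triggers an ingestion run (nothing here calls DSpace).

```python
import time


def time_batches(ingest, sizes):
    """Time a caller-supplied ingest(n) callable once per batch size.

    Returns a list of (n, elapsed_seconds, seconds_per_item) tuples.
    """
    results = []
    for n in sizes:
        start = time.perf_counter()
        ingest(n)
        elapsed = time.perf_counter() - start
        per_item = elapsed / n if n else 0.0
        results.append((n, elapsed, per_item))
    return results


def fake_ingest(n):
    # Hypothetical stand-in for a real ingestion run; it only burns CPU.
    total = 0
    for i in range(n):
        total += i
    return total


if __name__ == "__main__":
    for n, secs, per_item in time_batches(fake_ingest, [1000, 10000, 100000]):
        print(f"{n:>8} items  {secs:.4f} s  {per_item * 1e6:.2f} us/item")
```

Reporting the per-item time next to the total makes it easier to see whether cost per item stays flat or grows as the batch size grows.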

Lighton Phiri
http://lightonphiri.org

On 29/01/2013 13:26, helix84 wrote:

Hi Lighton,

I don't know of any such work done as of yet. I definitely would be
interested in your results/findings and surely others would be, too.
Could you create a wiki page afterwards?

On Tue, Jan 29, 2013 at 12:09 PM, Lighton Phiri lighton.ph...@gmail.com wrote:

1. I am guessing the Batch Metadata Editing [1] tool is appropriate in
this context as opposed to using Simple Archive Format [2]; Yes?

I think that's up to the implementation of BME. Theoretically, it
should be faster because it has to open fewer files. I think
determining whether this is true will be one of your results.

I'm not sure if it is, but the CSV implementation should be streaming,
i.e. it should not need to keep the whole CSV in memory at once, but
process it line by line instead. That will have the largest impact on
memory usage.

2. Are the experiments conducted to come up with the recommended line limit
size of 1k specified here [2] documented anywhere, and are there factors
I should consider that could have a bearing on this limit? In other
words, is the 1k size a standard of some sort?

The limit is there only for the web UIs, not the command line. The reason
is that you don't want to upload a huge CSV file using a browser (and be
frustrated when it fails because you lose your connection in the middle).
Also, the UIs display a diff of changes, giving you a chance to review and
confirm/cancel them. This could produce a huge HTML page, which is
clearly not desirable. The command line doesn't have any such limitations
apart from practical ones, like memory consumption.

3. Are there any other documented scalability performance benchmarks
other than this [3]?

Not that I know of, I'd just use Google as you did.

I'd also appreciate any other comments/suggestions.

If you're doing a comparison, could you also include AIPs?


Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

--
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_d2d_jan
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette


[Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks

2013-01-29 Thread Lighton Phiri
I am conducting a couple of performance benchmarks and am currently working
towards ingesting ~2 million metadata-only items into a DSpace 3.0
instance. One of the things I intend to do is benchmark the ingestion
process, and I was hoping someone could help with the following queries:

1. I am guessing the Batch Metadata Editing [1] tool is appropriate in 
this context as opposed to using Simple Archive Format [2]; Yes?
2. Are the experiments conducted to come up with the recommended line limit
size of 1k specified here [2] documented anywhere, and are there factors
I should consider that could have a bearing on this limit? In other
words, is the 1k size a standard of some sort?
3. Are there any other documented scalability performance benchmarks 
other than this [3]?

I'd also appreciate any other comments/suggestions.
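
To make question 1 concrete, a test CSV for metadata-only items could be generated along these lines. This is a sketch based on my reading of [1], where a "+" in the id column marks a new item and "||" separates multiple values; the function name, the two metadata columns, and the example handle are assumptions for illustration only.

```python
import csv


def write_bme_csv(path, collection_handle, records):
    """Write (title, authors) records as rows for new metadata-only items.

    `records` can be any iterable, including a generator, so the full
    data set never has to sit in memory. Returns the number of rows written.
    """
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        # Header: id and collection columns, then the metadata fields.
        writer.writerow(["id", "collection", "dc.title", "dc.contributor.author"])
        for title, authors in records:
            # "+" in the id column asks the importer to create a new item.
            writer.writerow(["+", collection_handle, title, "||".join(authors)])
            count += 1
    return count
```

Calling something like `write_bme_csv("items.csv", "123456789/2", records)` (the handle is made up) would give a file that, as I understand [1], can be passed to `[dspace]/bin/dspace metadata-import -f items.csv`.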

[1] 
https://wiki.duraspace.org/display/DSDOC3x/Batch+Metadata+Editing#BatchMetadataEditing-AddingMetadata-OnlyItems
[2] 
https://wiki.duraspace.org/display/DSDOC3x/Importing+and+Exporting+Items+via+Simple+Archive+Format
[3] http://archive.nlm.nih.gov/pubs/ceb2008/2008016.pdf

-- 
Lighton Phiri
http://lightonphiri.org


--
Master Visual Studio, SharePoint, SQL, ASP.NET, C# 2012, HTML5, CSS,
MVC, Windows 8 Apps, JavaScript and much more. Keep your skills current
with LearnDevNow - 3,200 step-by-step video tutorials by Microsoft
MVPs and experts. ON SALE this month only -- learn more at:
http://p.sf.net/sfu/learnnow-d2d


Re: [Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks

2013-01-29 Thread helix84
Hi Lighton,

I don't know of any such work done as of yet. I definitely would be
interested in your results/findings and surely others would be, too.
Could you create a wiki page afterwards?

On Tue, Jan 29, 2013 at 12:09 PM, Lighton Phiri lighton.ph...@gmail.com wrote:
 1. I am guessing the Batch Metadata Editing [1] tool is appropriate in
 this context as opposed to using Simple Archive Format [2]; Yes?

I think that's up to the implementation of BME. Theoretically, it
should be faster because it has to open fewer files. I think
determining whether this is true will be one of your results.

I'm not sure if it is, but the CSV implementation should be streaming,
i.e. it should not need to keep the whole CSV in memory at once, but
process it line by line instead. That will have the largest impact on
memory usage.
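
To illustrate the difference between streaming and non-streaming CSV handling, here is a small Python sketch. It is not how DSpace's (Java) importer is written, which I have not checked; both function names are hypothetical.

```python
import csv


def process_streaming(path, handle_row):
    """Read a CSV one row at a time; only the current row is held in memory."""
    rows = 0
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            handle_row(row)
            rows += 1
    return rows


def load_all(path):
    """Non-streaming variant: materialises every row in a list first."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

With the first function memory use stays roughly constant regardless of file size, while with the second it grows with the number of rows, which is the effect that would matter at the ~2 million item scale.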

 2. Are the experiments conducted to come up with the recommended line limit
 size of 1k specified here [2] documented anywhere, and are there factors
 I should consider that could have a bearing on this limit? In other
 words, is the 1k size a standard of some sort?

The limit is there only for the web UIs, not the command line. The reason
is that you don't want to upload a huge CSV file using a browser (and be
frustrated when it fails because you lose your connection in the middle).
Also, the UIs display a diff of changes, giving you a chance to review and
confirm/cancel them. This could produce a huge HTML page, which is
clearly not desirable. The command line doesn't have any such limitations
apart from practical ones, like memory consumption.
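
If a large CSV did have to go through a web UI, one option would be to split it into smaller files first. A rough sketch follows; `split_csv` is a hypothetical helper, and the default of 1000 data rows per file is an assumption taken from the 1k figure in the question (I have not checked whether the header line counts towards the limit).

```python
import csv


def split_csv(path, rows_per_file=1000, prefix="chunk"):
    """Split a CSV into files of at most rows_per_file data rows each.

    Every output file repeats the header row. Returns the list of
    file names written, in order.
    """
    written = []
    out = None
    try:
        with open(path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return written  # empty input, nothing to split
            writer = None
            for i, row in enumerate(reader):
                if i % rows_per_file == 0:
                    # Start a new chunk file and repeat the header in it.
                    if out is not None:
                        out.close()
                    name = f"{prefix}_{i // rows_per_file:04d}.csv"
                    out = open(name, "w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    written.append(name)
                writer.writerow(row)
    finally:
        if out is not None:
            out.close()
    return written
```

The input is read row by row, so the splitter itself does not need the whole file in memory.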

 3. Are there any other documented scalability performance benchmarks
 other than this [3]?

Not that I know of, I'd just use Google as you did.

 I'd also appreciate any other comments/suggestions.

If you're doing a comparison, could you also include AIPs?


Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
