Re: [Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks
> Could you create a wiki page afterwards?

Sure, I'll put up the wiki page as soon as I am done.

> If you're doing a comparison, could you also include AIPs?

Not a problem. I've taken note of this. Are you interested in specific metrics? I ask because my interest in these benchmarks is purely in response times relative to scale.

Lighton Phiri
http://lightonphiri.org

On 29/01/2013 13:26, helix84 wrote:
> Hi Lighton,
>
> I don't know of any such work done as of yet. I definitely would be interested in your results/findings and surely others would be, too. Could you create a wiki page afterwards?
>
> On Tue, Jan 29, 2013 at 12:09 PM, Lighton Phiri lighton.ph...@gmail.com wrote:
>> 1. I am guessing the Batch Metadata Editing [1] tool is appropriate in this context as opposed to using Simple Archive Format [2]; Yes?
>
> I think that's up to the implementation of BME. Theoretically, it should be faster because it opens fewer files. I think determining whether this is true will be one of your results. I'm not sure if it is, but the CSV implementation should be streaming, i.e. processing the CSV line by line instead of keeping the whole file in memory at once. That will have the largest impact on memory usage.
>
>> 2. Are the experiments conducted to come up with the recommended line limit of 1k specified here [2] documented anywhere, and are there factors I should consider that could have a bearing on this limit? In other words, is the 1k size a standard of some sort?
>
> The limit is there only for the web UIs, not the command line. The reason is that you don't want to upload a huge CSV file using a browser (and be frustrated when it fails because you lose the connection in the middle). Also, the UIs display a diff of the changes, giving you a chance to review and confirm/cancel them. This could produce a huge HTML page, which is clearly not desirable. The command line doesn't have any such limitations apart from practical ones, like memory consumed.
>
>> 3. Are there any other documented scalability performance benchmarks other than this [3]?
>
> Not that I know of, I'd just use Google as you did.
>
>> I'd also appreciate any other comments/suggestions.
>
> If you're doing a comparison, could you also include AIPs?
>
> Regards,
> ~~helix84
>
> Compulsory reading: DSpace Mailing List Etiquette
> https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette

___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
List Etiquette: https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
[Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks
I am conducting a couple of performance benchmarks and am currently working towards ingesting ~2 million metadata-only items into a DSpace 3.0 instance. One of the things I intend to do is benchmark the ingestion process, and I was hoping someone could help with the following queries:

1. I am guessing the Batch Metadata Editing [1] tool is appropriate in this context as opposed to using Simple Archive Format [2]; Yes?

2. Are the experiments conducted to come up with the recommended line limit of 1k specified here [2] documented anywhere, and are there factors I should consider that could have a bearing on this limit? In other words, is the 1k size a standard of some sort?

3. Are there any other documented scalability performance benchmarks other than this [3]?

I'd also appreciate any other comments/suggestions.

[1] https://wiki.duraspace.org/display/DSDOC3x/Batch+Metadata+Editing#BatchMetadataEditing-AddingMetadata-OnlyItems
[2] https://wiki.duraspace.org/display/DSDOC3x/Importing+and+Exporting+Items+via+Simple+Archive+Format
[3] http://archive.nlm.nih.gov/pubs/ceb2008/2008016.pdf

--
Lighton Phiri
http://lightonphiri.org
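For anyone wanting to reproduce a test like this, here is a minimal Python sketch of generating synthetic metadata-only CSV input in fixed-size chunks, so that ingestion time can be recorded per chunk as the repository grows. The column layout (`id`, `collection`, `dc.title`, with `+` in the `id` column marking a new item) is my reading of the Batch Metadata Editing page [1] and should be checked against it. The function name `write_chunks`, the file naming, the titles, and the collection handle `123456789/2` are all made up for illustration.

```python
import csv
from pathlib import Path


def write_chunks(out_dir, total_items, chunk_size, collection_handle):
    """Write synthetic metadata-only items as CSV files of at most chunk_size rows.

    Returns the list of file paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, total_items, chunk_size):
        path = out_dir / f"items_{start:08d}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            writer = csv.writer(fh)
            # Assumed header layout; verify against the BME documentation [1].
            writer.writerow(["id", "collection", "dc.title"])
            stop = min(start + chunk_size, total_items)
            for n in range(start, stop):
                # "+" in the id column is assumed to mean "create a new item".
                writer.writerow(["+", collection_handle, f"Synthetic item {n}"])
        paths.append(path)
    return paths
```

Calling `write_chunks("bme_input", 2_000_000, 1000, "123456789/2")` would produce 2,000 files of 1,000 items each. Each file could then be passed to the command-line importer in turn and timed separately, which gives response time as a function of how many items are already in the repository.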
Re: [Dspace-tech] DSpace 3.0 Ingestion Performance Benchmarks
Hi Lighton,

I don't know of any such work done as of yet. I definitely would be interested in your results/findings and surely others would be, too. Could you create a wiki page afterwards?

On Tue, Jan 29, 2013 at 12:09 PM, Lighton Phiri lighton.ph...@gmail.com wrote:
> 1. I am guessing the Batch Metadata Editing [1] tool is appropriate in this context as opposed to using Simple Archive Format [2]; Yes?

I think that's up to the implementation of BME. Theoretically, it should be faster because it opens fewer files. I think determining whether this is true will be one of your results. I'm not sure if it is, but the CSV implementation should be streaming, i.e. processing the CSV line by line instead of keeping the whole file in memory at once. That will have the largest impact on memory usage.

> 2. Are the experiments conducted to come up with the recommended line limit of 1k specified here [2] documented anywhere, and are there factors I should consider that could have a bearing on this limit? In other words, is the 1k size a standard of some sort?

The limit is there only for the web UIs, not the command line. The reason is that you don't want to upload a huge CSV file using a browser (and be frustrated when it fails because you lose the connection in the middle). Also, the UIs display a diff of the changes, giving you a chance to review and confirm/cancel them. This could produce a huge HTML page, which is clearly not desirable. The command line doesn't have any such limitations apart from practical ones, like memory consumed.

> 3. Are there any other documented scalability performance benchmarks other than this [3]?

Not that I know of, I'd just use Google as you did.

> I'd also appreciate any other comments/suggestions.

If you're doing a comparison, could you also include AIPs?

Regards,
~~helix84

Compulsory reading: DSpace Mailing List Etiquette
https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
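To make the streaming point concrete, here is a minimal Python sketch of reading a CSV row by row and handing it on in small batches, so that memory use depends on the batch size rather than on the size of the file. This is not DSpace's implementation (DSpace is written in Java, and whether its importer actually streams is exactly the open question above); the function name `stream_rows` and the default batch size of 1000 are assumptions chosen for illustration.

```python
import csv
import io


def stream_rows(handle, batch_size=1000):
    """Yield lists of at most batch_size rows read lazily from an open CSV file.

    Only the current batch is held in memory; the file is never read whole.
    """
    reader = csv.DictReader(handle)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly shorter, batch.
        yield batch


# Small in-memory example standing in for a large file on disk.
sample = io.StringIO("id,dc.title\n+,A\n+,B\n+,C\n")
batch_sizes = [len(b) for b in stream_rows(sample, batch_size=2)]
```

A non-streaming reader would instead call something like `list(csv.DictReader(handle))` up front, which is where memory would become the practical limit for a file with millions of rows.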