Hi All,

Just to bring this thread back to the original question of how we use Solr to store statistics (and also authority info, for that matter).
Personally, I agree that having statistics & authority information stored *solely* in Solr is dangerous. As mentioned, Solr is primarily meant as an index/cache. If your Solr index of this information were to get corrupted or messed up in any way, there's no way to rebuild it from another source (as you can with the Discovery search/browse index).

However, I honestly feel that Solr's ability to "dump" its index contents to CSV could provide a reasonable solution to this problem, for the following reasons:

1. CSV is a good format for this sort of information. The files are small, and they can be read by many different products (so there's an opportunity to translate them and import them into a different statistics engine).
2. CSVs can also be read by Excel, which means there's the potential to create reports from statistical information in Excel (though this may require some manipulation of the columns/data).
3. These same CSV files could be used to restore your Solr index if it gets corrupted, or if you need to do a full reindex.
4. Honestly, trying to store statistical information in a non-textual format (like a DB) seems like it'd be rather space consuming.

In DSpace 5, we already have a basic version of a backup to CSV for statistics:
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-BackuporExportSOLRrecordstointermediateformat

So, can we simply enhance that backup process so that it stores all the information we are capturing in our Statistics & Authority indexes, and build in a corresponding re-index (re-import) script?

NOTE: Some of you probably realize this, but everything I've said above has essentially been done by Andrea Schweer in:
* https://jira.duraspace.org/browse/DS-2486
* https://github.com/DSpace/DSpace/pull/894/

So, my main point here is that I feel a standard backup & restore process to/from CSV files may be a good enough solution to this Solr question.
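To make the dump/restore idea a bit more concrete, here is a rough Python sketch of a CSV round trip against a Solr core. It is only an illustration, not the DSpace export/import tooling and not what DS-2486 implements. The function names, the example base URL, and the page size are all made up for this example. It assumes the core answers `/select` with `wt=csv` and accepts CSV on `/update` with an `application/csv` content type, which is worth checking against your Solr version. It also ignores multi-valued fields, which need extra separator/split parameters on both sides.

```python
"""Sketch only: dump a Solr core to CSV and load it back (hypothetical helpers)."""
import urllib.parse
import urllib.request


def build_export_url(solr_base, core, start=0, rows=10000):
    """URL asking Solr's CSV response writer (wt=csv) for one page of documents."""
    query = urllib.parse.urlencode({
        "q": "*:*",      # match every document in the core
        "fl": "*",       # return all stored fields
        "wt": "csv",     # CSV response writer
        "start": start,
        "rows": rows,
    })
    return f"{solr_base.rstrip('/')}/{core}/select?{query}"


def build_import_url(solr_base, core):
    """URL for posting CSV back to the core; commits after loading."""
    # Assumption: skipping _version_ avoids version conflicts when the
    # exported documents are re-added to an empty or rebuilt core.
    query = urllib.parse.urlencode({"commit": "true", "skip": "_version_"})
    return f"{solr_base.rstrip('/')}/{core}/update?{query}"


def export_page(solr_base, core, path, start=0, rows=10000):
    """Write one page of the core's documents to a CSV file."""
    url = build_export_url(solr_base, core, start, rows)
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with open(path, "wb") as out:
        out.write(data)


def import_csv(solr_base, core, path):
    """POST a previously exported CSV file back into the core."""
    with open(path, "rb") as src:
        body = src.read()
    request = urllib.request.Request(
        build_import_url(solr_base, core),
        data=body,
        headers={"Content-Type": "application/csv; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return resp.status
```

With a Solr instance at a hypothetical `http://localhost:8080/solr`, `export_page("http://localhost:8080/solr", "statistics", "statistics-0.csv")` would write the first page, and `import_csv(...)` with the same arguments would load it back. A real backup would loop over pages (or split the export by date range) rather than rely on a single request.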
We just need to better document that CSV backup as *highly recommended* for anyone who ever wants to be able to restore or reindex their statistics/authority info.

- Tim

On 3/11/2015 3:11 PM, Mark H. Wood wrote:
> Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> that we should step back and take a long look at how we are using the
> Solr 'statistics' core.
>
> Solr seems designed for use as a cache. That's how the other cores
> are used: they can be refreshed from data in the database and the
> assetstore. But the statistics core is treated as durable storage, a
> sink (perhaps the only one) for event data. If you don't keep your
> 'dspace.log's forever, there may be NO WAY to recover statistical
> records in the event of disaster or a schema change. At the very
> least it can require some fancy footwork if stats are to survive an
> upgrade.
>
> The Solr maintainers have basically said "don't do that":
>
> https://wiki.apache.org/solr/HowToReindex#Using_Solr_as_a_Data_Source
>
> I think we need to give some more thought to how we can readily
> preserve usage records over DSpace upgrades and system failures.
>
> I should admit here that I am skeptical of using Solr as the
> statistics store *at all*, however well it works most of the time.
> But it is not my purpose in this note to advocate for something
> different.
>
> _______________________________________________
> Dspace-devel mailing list
> Dspace-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-devel