Re: [Dspace-devel] We need to think a bit more about how we use the 'statistics' Solr core

Tim Donohue Wed, 25 Mar 2015 07:09:13 -0700

Hi All,

Just to bring this thread back to the original question of how we use 
Solr to store statistics (and also authority info for that matter).

Personally, I agree that having statistics & authority information 
stored *solely* in Solr is dangerous. As mentioned, Solr is primarily 
meant as an index/cache. If your Solr index of this information were to 
get corrupted or messed up in any way, there's no other source to 
re-index it from (unlike Discovery search/browse, which can be rebuilt).

However, I honestly feel that Solr's ability to "dump" its index 
contents to CSV could provide a reasonable solution to this problem, for 
the following reasons:

1. CSV is a good format for this sort of information. The files are 
small, and they can be read by many different products (so there's an 
opportunity to translate them and import into a different statistics 
engine).
2. CSVs also can be read by Excel, which means there's the potential to 
create reports from statistical information in Excel (however, this 
obviously may require some manipulation of the columns/data)
3. These same CSV files could be used to restore your Solr index if it 
gets corrupted, or if you need to do a full reindex.
4. Honestly, storing statistical information in a non-textual 
format (like a DB) seems like it would be rather space-consuming.
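
To make the dump/restore idea above concrete, here is a minimal sketch in Python. It only builds the HTTP requests and does not send anything. It relies on Solr's CSV response writer (`wt=csv`) for the dump and on Solr's update handler accepting CSV for the restore. The base URL, the `time` field used for the date filter, and the function names are assumptions for illustration; this is not the DSpace export tool or the code in the pull request.

```python
from urllib.parse import urlencode
from urllib.request import Request


def export_url(base_url, core, start, end, rows=10000, offset=0):
    """Build a /select URL that returns one page of documents as CSV.

    Assumption: the core has a date field named 'time' to filter on.
    The range includes 'start' and excludes 'end'.
    """
    params = {
        "q": "*:*",
        "fq": f"time:[{start} TO {end}}}",  # '}' makes the upper bound exclusive
        "wt": "csv",                        # Solr's CSV response writer
        "rows": rows,
        "start": offset,
        "sort": "time asc",                 # keep paging order predictable
    }
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


def restore_request(base_url, core, csv_bytes):
    """Build (but do not send) a POST that loads a CSV dump back into a core."""
    return Request(
        f"{base_url.rstrip('/')}/{core}/update?commit=true",
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical host and date range, for illustration only.
    print(export_url("http://localhost:8080/solr", "statistics",
                     "2015-03-01T00:00:00Z", "2015-04-01T00:00:00Z"))
```

Only stored fields come back in a CSV dump, so a restore is only as complete as what the schema stores. That is one reason a supported backup script would need to be checked against everything we capture.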

In DSpace 5, we obviously already have a basic version of a backup to 
CSV for statistics:
https://wiki.duraspace.org/display/DSDOC5x/SOLR+Statistics+Maintenance#SOLRStatisticsMaintenance-BackuporExportSOLRrecordstointermediateformat

So, can we simply enhance that backup process so that it stores all the 
information we are capturing in our Statistics & Authority indexes, and 
build in a corresponding re-index (re-import) script?

NOTE: Some of you probably realize this, but everything I've said above 
has essentially been done by Andrea Schweer in:
* https://jira.duraspace.org/browse/DS-2486
* https://github.com/DSpace/DSpace/pull/894/

So, my main point here is that I feel a standard backup & restore 
process to/from CSV files may be a good enough solution to this Solr 
question. We just need to better document that as a *highly recommended* 
backup if you ever want to be able to restore or reindex your 
statistics/authority info.

- Tim

On 3/11/2015 3:11 PM, Mark H. Wood wrote:
> Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> that we should step back and take a long look at how we are using the
> Solr 'statistics' core.
>
> Solr seems designed for use as a cache.  That's how the other cores
> are used:  they can be refreshed from data in the database and the
> assetstore.  But the statistics core is treated as durable storage, a
> sink (perhaps the only one) for event data.  If you don't keep your
> 'dspace.log's forever, there may be NO WAY to recover statistical
> records in the event of disaster or a schema change.  At the very
> least it can require some fancy footwork if stat.s are to survive an
> upgrade.
>
> The Solr maintainers have basically said "don't do that":
>
>    https://wiki.apache.org/solr/HowToReindex#Using_Solr_as_a_Data_Source
>
> I think we need to give some more thought to how we can readily
> preserve usage records over DSpace upgrades and system failures.
>
> I should admit here that I am skeptical of using Solr as the
> statistics store *at all*, however well it works most of the time.
> But it is not my purpose in this note to advocate for something
> different.
>
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for all
> things parallel software development, from weekly thought leadership blogs to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
>
>
>
> _______________________________________________
> Dspace-devel mailing list
> Dspace-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-devel
>
