Re: [Dspace-tech] alternative to solr statistics

2011-10-18 Thread Jesús Martín García

Hi Peter,

First of all, thanks for the answers

We've already worked with Google Analytics, and it seems to work pretty 
well... but you can't control how the statistics are computed, so it's not an 
option for us.


On the other hand, we've already studied the ElasticSearch option, but it's 
built on Lucene, so the search methods and RAM consumption are 
similar... We are now studying the Sphinx option, which is much more 
efficient and faster than Solr.


Finally, I'll give the 1.6.2 StatisticsAddOn option a try; I thought it 
hadn't been updated for newer versions of DSpace.


Regards,

Jesús

On 10/18/2011 12:15 AM, Peter Dietz wrote:

Hi Jesús,

We've run into SOLR statistics performance problems as well. You've 
posted that you have a very large Solr index, and unfortunately Solr 
performance degrades as the index grows. We don't allow 
non-administrators to view statistics for a collection/community/item 
on production because it slows the system down too much. However, when 
we need to provide a report, we copy the SOLR index to another 
computer, such as a workstation, and view the statistics locally. A 
local computer with a lot of memory will run Solr fine; a busy 
server, however, does not run SOLR well on top of everything else.


If you want to be able to present reports on your production system, 
I'm thinking the only thing you can throw at the problem is resources. 
Perhaps add an additional server just to host SOLR, similar to how 
you might have an additional server just to host MySQL or PostgreSQL. 
My co-worker and I were wondering about the idea of switching out the 
dspace-stats implementation for a different engine, such as removing 
Solr and using something beefier such as ElasticSearch, however we 
haven't implemented anything.
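
For what it's worth, the statistics logger can already be pointed at a 
separate Solr instance through dspace.cfg, so a dedicated stats box 
shouldn't need any code changes. A sketch, with a placeholder host name:

solr.log.server = http://stats-host.example.edu:8080/solr/statistics

As far as I can tell the same URL serves both the event writes and the 
report queries, so both loads would move off the main DSpace server.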


As some others have mentioned, you might be able to figure out 
how to get Google Analytics to track all of the hits to your items, 
communities, collections, and bitstreams. In that case, you could then 
query the Google Analytics API for this information.


Finally, something to anonymize the Solr statistics information 
would be a good thing. We currently have an IP address for every visitor 
to every resource for every single request. Assuming we had a good 
grip on robots, I think we could aggregate this to just record the 
number of hits to a given resource per hour. After aggregating and 
pruning, you might end up with a much smaller Solr database: instead 
of tens of millions of records, perhaps just hundreds of thousands. I 
think one should consult the COUNTER project before altering your 
statistics, though.
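
To sketch the pruning half of that idea (untested; "time" and "isBot" are 
the stock fields in the statistics core, and the URL is just an example), 
a SolrJ pass could drop robot hits and expired raw events once their 
totals have been recorded elsewhere:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class StatsPruner {
    public static void main(String[] args) throws Exception {
        // Example URL only -- point this at your statistics core.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8080/solr/statistics");

        // Drop hits already flagged as robots (cf. stats-util -m / -f).
        solr.deleteByQuery("isBot:true");

        // Drop raw events older than the retention window. Only do this
        // after their per-resource totals have been aggregated elsewhere.
        solr.deleteByQuery("time:[* TO NOW-6MONTHS]");

        solr.commit();
        solr.optimize();  // reclaim the space freed by the deletes
    }
}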




Peter Dietz



2011/10/17 Richard Rodgers rrodg...@mit.edu:

Hi Jesús:

A lot of statistics work has been done for DSpace over time, but
each project focuses on a different set of requirements:
does the data need to appear in the UI, does it offer real-time
availability (just to name two of the strengths of the SOLR-based
system)?

One example of an alternative is
https://wiki.duraspace.org/display/DSPACE/StatisticsAddOn, though
I don't know whether it has been
maintained against versions newer than DSpace 1.6.2.

We run an entirely off-line, monthly reporting system using a
database designed to accommodate a set of internal administrative
requirements (statistics are delivered as a spreadsheet), but that
might not fulfill your requirements.

The tech list archives and the wiki are a good place to start, but
you could also post your use case(s) to the list, and see
if any existing work better meets your needs.

Hope this helps,

Richard R


On Oct 17, 2011, at 6:00 AM, Jesús Martín García wrote:


Hi!

I've been wondering if there is some kind of alternative to Solr
statistics, due to the high RAM load on our system (514 million
records), which is not easy to scale and is very, very slow.
So... has anyone done some work on an alternative?

Thanks in advance,

Regards,

Jesús

-- 
...

  __
/   /   Jesús Martín García
C E / S / C A   Tècnic de Projectes
  /__ / Centre de Serveis Científics i Acadèmics de Catalunya

Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
T. 93 551 6213 · F. 93 205 6979 · jmar...@cesca.cat
...




Re: [Dspace-tech] Regenerate monthly reports

2011-10-18 Thread Alan Orth
Brian,

Thanks for the vote of confidence.  I'll give this a try tonight when
our servers aren't as busy.

Another question: nothing gets modified in the database
when this happens, so I shouldn't need to restart Tomcat, right?

Thanks,

Alan

On Mon, Oct 17, 2011 at 5:34 PM, Brian Freels-Stendel bfre...@unm.edu wrote:
 Hello,

 I've done this a few times, and it's never been a problem for me.  I make a 
 backup of all of the .dat files in the log directory and the entire reports 
 directory before deleting the .dat and .html files, just in case.

 B--

 On 10/17/2011 at 1:14 AM, in message
 CAKKdN4Wu-eKf6ff29ruVOCC1EUsQgxevVA6GCPgjPi=bqut...@mail.gmail.com, Alan 
 Orth
 alan.o...@gmail.com wrote:
 Hi,

 Never heard back on this, so I'm re-sending:  We had some bad metadata
 and didn't realize for a few weeks that our stats scripts were
 choking.  Now we have a gap in our monthly stats (08/2011, 10/2011...
 but no 09/2011!)

 Is clearing the stats and rebuilding from scratch feasible?  All the
 historical log files are there...

 Thanks!

 Alan

 On Tue, Oct 11, 2011 at 5:21 PM, Alan Orth alan.o...@gmail.com wrote:
 Hey,

 We noticed recently that our monthly stats hadn't run for the month of
 September.  As it turns out, a batch import had imported some items with
 malformed `dc.date.accessioned` date fields, which was causing the stats
 scripts to die.  We finally tracked down all the items with these bad
 dates[1], and now the scripts are running successfully, but it seems the
 month of September has gone missing (we have 08/2011 and 10/2011)!

 My attempts to fix this are here: http://pastebin.com/9EDX8Vhx

 I'm curious, would starting over from `dspace stat-initial` and `dspace
 stat-report-initial` remedy this?  All the log files are there, and as far
 as I know the stat scripts process .log -> .dat -> .html (nothing in the
 database or anything). Is there any danger in doing this (other than being
 expensive for the CPU/disk)?

 Thanks,

 [1] DSpace-tech thread:
 http://www.mail-archive.com/dspace-tech@lists.sourceforge.net/msg15295.html

 --
 Alan Orth
 alan.o...@gmail.com
 http://alaninkenya.org
 I have always wished for my computer to be as easy to use as my telephone;
 my wish has come true because I can no longer figure out how to use my
 telephone. -Bjarne Stroustrup, inventor of C++








-- 
Alan Orth
alan.o...@gmail.com
http://alaninkenya.org
http://mjanja.co.ke
In heaven all the interesting people are missing. -Friedrich Nietzsche



Re: [Dspace-tech] alternative to solr statistics

2011-10-18 Thread Федор Краснов
Hello,

I watched the discussion about Solr statistics and want to make a small
contribution.
We have the task of producing DSpace statistics reports in the COUNTER standard.
We decided to use AWStats (http://awstats.sourceforge.net) and to develop a
Perl add-on to AWStats for COUNTER compliance.
Has anyone already gone this route? Are there any pitfalls?

Regards
Fedor.




[Dspace-tech] Thoughts about statistics (was: alternative to solr statistics)

2011-10-18 Thread Mark H. Wood
This points out a problem that I think we (and many other contemporary
projects) have all over the place:  our application is expected to grow
steadily and without limit, yet we assume over and over again that
the problem is small and bounded.

There is no way around it:  if your repository is large and busy,
sooner or later you will be disappointed by the performance of ad-hoc
queries no matter how many resources you throw at them.

One answer to this is to depend less on ad-hoc queries.  Do you have
some usual questions to be answered over and over?  Do you really
need up-to-the-second answers?  Would it be good enough to run
periodic reports and accumulate them?  Some other machine with SPSS or
R or whatever can grind cases all night, if need be, and leave your
monthly abstract waiting in your inbox the next day.  (I want to find
the time to extend DSpace to facilitate this.)  If the periodic
abstractions are saved in raw form before rendering, they become cheap
inputs to longer-range reports.  There are *far* more efficient
methods than those presently provided for extracting information from
vast quantities of data.

Once periodic statistical products are available, they can be simply
fetched over and over again and slotted into DSpace pages to provide
tolerably up-to-date views of activity quickly and cheaply.  We just
don't do that yet.

Once periodic statistical products are available, we don't have to
keep twenty years of event data in Solr; we can purge old cases to
dead storage and combine precalculated summaries with live statistics
over only the latest events to keep the numbers fresh without having
responsiveness suffer more and more over time.  We just don't do that
yet.

Once we have a well-designed way to get cases out of DSpace for use
with other tools, we can produce as many streams as we wish, selected
any way that makes sense.  We can cheaply provide custom-tailored data
products to individual contributors and other consumers for their own
analysis.  We just don't do that yet.

There's still an important place for ad-hoc query, but how often would
something less expensive do just as well?  ALL cases are historical;
they're not going to change.  We only need to recalculate when we
change our view of the cases.

-- 
Mark H. Wood, Lead System Programmer   mw...@iupui.edu
Asking whether markets are efficient is like asking whether people are smart.




Re: [Dspace-tech] Regenerate monthly reports

2011-10-18 Thread Brian Freels-Stendel
Hi Alan,

That's right, Tomcat won't need to be restarted.  The stats scripts do access 
the database to get the titles of items, but they don't alter anything, so there 
should be no side effects.  I'd say good luck, but you won't need it, big smile.
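
In case it's useful, the sequence I'd follow is roughly this (paths assume 
the standard [dspace] layout; adjust to your install):

cp [dspace]/log/*.dat /somewhere/safe/
cp -r [dspace]/reports /somewhere/safe/
rm [dspace]/log/*.dat
rm [dspace]/reports/*.html
[dspace]/bin/dspace stat-initial
[dspace]/bin/dspace stat-report-initial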

B--

 On 10/18/2011 at 1:11 AM, in message
CAKKdN4VFfghmNyeB6pK8TC2JxcOGLTQnp=pjc0ugacmz7le...@mail.gmail.com, Alan Orth
alan.o...@gmail.com wrote:
 [quoted text trimmed]


Re: [Dspace-tech] Thoughts about statistics

2011-10-18 Thread Tim Donohue
+1 to what Mark Wood says.

An additional (parallel) thought -- when I was at U of Illinois, we ran 
into similar scalability issues with one of the older statistics 
add-ons we were using (the one initially built by U of Rochester that 
stored stats in the DSpace Database).  The way we got around it was the 
following:

We made a distinct decision to aggregate our data and actively purge older 
event data. This resulted in an *immediate* increase in scalability. 
To better explain...

Essentially this older U of Rochester stats engine worked similarly to the 
new Solr Statistics engine, except that it used the DB instead of Solr. 
So, it tracked each statistical event, including IP address, what the 
event was, etc. Over time the stats queries became rather expensive as 
the tables grew and grew. The tables were also full of IP address info 
that we really didn't need to keep around forever, and also information 
about old web spiders that we really didn't care about.  (As you can 
tell, this is all very parallel to the current Solr Statistics issues.)

So, as I said, we aggregated things. We decided to only keep IP 
addresses/full statistical events for a period of *one month*.  After 
that, all non-spider hits were aggregated/totaled into a "monthly 
totals" table (we threw out anything that was a web spider -- as that 
data was not useful and just made tables larger and queries more complex).

Although I don't think we went this far at U of Illinois, you could do a 
secondary aggregation and then aggregate/total stats again at a *yearly* 
level.

The idea here is that you make conscious decisions around what 
information is important and aggregate it. Stuff that is not important 
to keep forever (e.g. exact IP addresses for all hits, information from 
known-spiders) can just be discarded during the aggregation process. 
The aggregation simplifies larger queries (especially ones for 
yearly/monthly info, as you no longer need to perform complex 
calculations -- it's just a simple lookup).

If we brought this same sort of idea forward into Solr, I think you'd be 
less likely to encounter such performance issues. We'd only keep around 
full event details for a limited period of time (a month / 6 months), 
after which we'd discard information which was not necessary to generate 
the reports and aggregate everything else.

Just an idea -- I've never tried this before with the Solr Statistics 
engine. But a Solr-savvy person could likely figure out a way to 
implement this for the benefit of all of us.
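
To make the aggregation step concrete, here is an untested SolrJ sketch. 
It assumes the stock statistics fields (type, id, time, isBot), that 
type:2 means item views, and an example URL; where the monthly totals 
actually get stored is left open:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;

public class MonthlyTotals {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8080/solr/statistics");

        SolrQuery q = new SolrQuery();
        // Item views for September 2011, with robot hits excluded.
        q.setQuery("type:2 AND -isBot:true"
                + " AND time:[2011-09-01T00:00:00Z TO 2011-10-01T00:00:00Z]");
        q.setRows(0);           // only the facet counts are needed
        q.addFacetField("id");  // one bucket per item id
        q.setFacetMinCount(1);
        q.setFacetLimit(-1);    // no cap on the number of buckets

        FacetField perItem = solr.query(q).getFacetField("id");
        for (FacetField.Count c : perItem.getValues()) {
            // A real version would write (item id, month, count) into a
            // "monthly totals" store; printing stands in for that step.
            System.out.println(c.getName() + "\t" + c.getCount());
        }
    }
}

After storing these totals, the month's raw events could then be deleted 
to keep the index small.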

- Tim

On 10/18/2011 7:52 AM, Mark H. Wood wrote:
 [quoted text trimmed]

Re: [Dspace-tech] Java fatal error on dspace import

2011-10-18 Thread André
Hi Andrea, Jose and Mark. Thank you!

I tried Jose's suggestion of importing one by one, but the error seemed to be
random.
I switched to Sun Java 1.6, but at the same time I reset the database (it
was on our test installation), and the problem was gone.

The problem could have been Java 7, but it could also have been the database; I
should have tested them separately.
But at least these messages could be a good tip for anyone who runs into the same
problem in the future.

Thanks again
André Assada


On 14 October 2011 17:15, Andrea Bollini boll...@cilea.it wrote:

  Hi André,
 I noticed that you use Java 7. I have no direct experience with this, but
 there are a lot of posts on the web reporting issues using Java 7 with
 Lucene/Solr.
 See for example: http://www.infoq.com/news/2011/08/java7-hotspot
 Hope this helps,
 Andrea


 On 14/10/2011 19:44, André wrote:

 Dear all,

 I'm trying to import 157 records into DSpace 1.6.2 by calling
 [dspace]/bin/dspace import --add --eperson=andre.ass...@usp.br
 --collection=123456789/32 --source=/home/andre/xImpAleph/impTeste111014/xvi_fd
 --mapfile=./xvi_fd --workflow


 It starts the process OK, but in the middle I get the following error
 message:


 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x7fea376dc440, pid=20001, tid=140644013197072
 #
 # JRE version: 7.0-b147
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (21.0-b17 mixed mode
 linux-amd64 compressed oops)
 # Problematic frame:
 # J
 org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(Lorg/apache/lucene/document/Fieldable;Lorg/apache/lucene/analysis/Analyzer;I)V
 #
 # Core dump written. Default location: /dspace/bin/core or core.20001 (max
 size 1 kB). To ensure a full core dump, try "ulimit -c unlimited" before
 starting Java again
 #
 # An error report file with more information is saved as:
 # /dspace/bin/hs_err_pid20001.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://bugreport.sun.com/bugreport/crash.jsp
 #
 ./dspace: line 69: 20001 Aborted java $JAVA_OPTS -classpath
 $FULLPATH $LOG org.dspace.app.launcher.ScriptLauncher $@





 If I retry the import with the --resume option, it restarts very slowly,
 and in dspace.log I get the following message:





 2011-10-14 14:01:26,342 ERROR org.dspace.search.DSIndexer @ Lock obtain
 timed out: SimpleFSLock@/dspace/search/write.lock
 org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
 SimpleFSLock@/dspace/search/write.lock
 at org.apache.lucene.store.Lock.obtain(Lock.java:85)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:691)
 at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:452)
 at org.dspace.search.DSIndexer.openIndex(DSIndexer.java:781)
 at org.dspace.search.DSIndexer.writeDocument(DSIndexer.java:853)
 at org.dspace.search.DSIndexer.buildDocument(DSIndexer.java:1138)
 at org.dspace.search.DSIndexer.indexContent(DSIndexer.java:299)
 at org.dspace.search.DSIndexer.updateIndex(DSIndexer.java:584)
 at org.dspace.search.DSIndexer.main(DSIndexer.java:545)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:601)
 at
 org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:212)




 Searching the archive of this list, I found some people solved this by
 deleting the write.lock and afterwards forcing reindexing by running
 ./dsrun org.dspace.search.DSIndexer -c


 This solves the slowdown problem but doesn't solve the import problem.
 I tried stopping Tomcat before importing, to guarantee nothing was accessing
 the index at the same time, but this didn't solve the problem.
 I also gave Java more memory with JAVA_OPTS=-Xmx512m and also -Xmx1024m,
 but that didn't do the trick either.

 Has anyone had this problem? Could you share any ideas?

 Thanks in advance

 Andre Assada





 --
 Dott. Andrea Bollini
 boll...@cilea.it
 ph. +39 06 59292853 - mob. +39 348 8277525 - fax +39 06 5913770
 CILEA - Consorzio Interuniversitariohttp://www.cilea.it/disclaimer



 

[Dspace-tech] Approval of submitted items by site/collection admin

2011-10-18 Thread Alicia Verno
Hi everyone,

 

I have a question regarding item submission and policies in DSpace 1.7.1.
I set up COLLECTION_x_WORKFLOW_STEP_2 and COLLECTION_x_ADMIN groups
for all of the collections within my instance of DSpace and only
included myself in each group.  I then set up COLLECTION_x_SUBMIT groups
including all of our users.  My goal is to be notified when users submit
items so I can review the submission, make additions if necessary, and
approve/reject the submissions.

 

However, I found that when I submit items myself, I am made to go
through the approval process for my own submissions, even though I am
the admin for the site as well as the collection.  Does anyone know how
to get around this so that submissions from the collection admin group
(or the site admin) are automatically archived and do not need approval?
The extra step is proving to be quite tedious!

 

Here is an example of my current collection policies:

 

ID      ACTION                  GROUP
114209  ADMIN                   COLLECTION_8_ADMIN (includes only my account)
90255   ADD                     COLLECTION_8_WORKFLOW_STEP_2 (includes only my account)
90254   ADD                     COLLECTION_8_SUBMIT (includes group of all users)
78      DEFAULT_BITSTREAM_READ  BBC (group includes all users)
77      DEFAULT_ITEM_READ       BBC
76      READ                    BBC



 

 

Thanks,

Alicia Verno

Information Services Manager, Boston Biomedical Consultants



Re: [Dspace-tech] Regenerate monthly reports

2011-10-18 Thread Stuart Lewis
Hi Alan,

 Another question, there's nothing that gets modified in the database
 when this happens, so I shouldn't need to restart Tomcat, right?

Yes - that is correct.  Everything happens on disk (not in the DB): 

  .log files -> .dat files -> .html reports

The .html reports are then loaded when required.

Thanks,


Stuart Lewis
Digital Development Manager
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928




Re: [Dspace-tech] solr statistics

2011-10-18 Thread Stuart Lewis
Hi Tint,

 Our repository is running on 1.6.2 and we have been using solr for a few 
 months now. There seems to be some problem with solr statistics. Bitstreams 
 for some items were downloaded more than a few thousand times within a month 
 from the same place. How can I filter out such systematic access (by 
 bots/spiders etc)?

Take a look at the following tool:

/dspace/bin/dspace stats-util -h
usage: StatisticsClient
   
 -b,--reindex-bitstreams  Reindex the bitstreams to ensure we have
  the bundle name
 -r,--remove-deleted-bitstreams   While indexing the bundle names remove
  the statistics about deleted bitstreams
 -u,--update-spider-files Update Spider IP Files from internet
  into /dspace/config/spiders
 -f,--delete-spiders-by-flag  Delete Spiders in Solr By isBot Flag
 -i,--delete-spiders-by-ipDelete Spiders in Solr By IP Address
 -m,--mark-spidersUpdate isBot Flag in Solr
 -h,--helphelp
 -o,--optimizeRun maintenance on the SOLR index


You might need to first register the IP address of the bots in 
/dspace/config/spiders/
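
i.e. something along these lines:

/dspace/bin/dspace stats-util -u   (fetch the published spider IP lists)
/dspace/bin/dspace stats-util -m   (set the isBot flag on matching hits)
/dspace/bin/dspace stats-util -f   (delete everything flagged as a bot)
/dspace/bin/dspace stats-util -o   (run maintenance/optimize on the index)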

I hope that helps,


Stuart Lewis
Digital Development Manager
Te Tumu Herenga The University of Auckland Library
Auckland Mail Centre, Private Bag 92019, Auckland 1142, New Zealand
Ph: +64 (0)9 373 7599 x81928




[Dspace-tech] Export Authority Key in Metadata

2011-10-18 Thread Jim Tuttle
I'm using the authority control tool to associate canonical university 
identifiers with authors in DSpace 1.7.1 and would like to export metadata 
containing the authority key.  I was hoping that the metadata export would 
contain an authority key field delimited in the same way that the author field 
is exported.  Is there any way to do this other than querying the database?  If 
not, might any of you have done this and have advice?
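
If scripting is acceptable, a small command-line class against the 1.7 
API could dump the keys without raw SQL. This is an untested sketch; it 
assumes DCValue exposes the authority field, which I believe it has done 
since authority control was introduced:

import org.dspace.content.DCValue;
import org.dspace.content.Item;
import org.dspace.content.ItemIterator;
import org.dspace.core.Context;

public class AuthorityDump {
    public static void main(String[] args) throws Exception {
        Context context = new Context();
        ItemIterator items = Item.findAll(context);
        while (items.hasNext()) {
            Item item = items.next();
            // One handle|author|authority-key line per author value.
            for (DCValue dcv :
                    item.getMetadata("dc", "contributor", "author", Item.ANY)) {
                System.out.println(item.getHandle() + "|"
                        + dcv.value + "|" + dcv.authority);
            }
        }
        context.abort();  // read-only run, so discard the context
    }
}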

Thanks,
jt

--
Jim Tuttle
Digital Repository Program Coordinator
Duke University Libraries
919.613.6831



Re: [Dspace-tech] Unable to send mail

2011-10-18 Thread Andrea Schweer
Hi,

On 19/10/11 16:25, Justin A. Diana wrote:
 Unfortunately, that causes me even more confusion as it successfully sent
 the email and I successfully received it externally.
 
 It honestly looks like the app is never even attempting to send the email
 (nothing in the messages, maillog or dspace.log when I get that error in the
 UI).

Very strange. Have you tried other situations in which normally e-mails
are sent by DSpace (i.e. not registration)? Could you subscribe to a
collection, then add a new item to that collection and run the sub-daily
script -- do you get an e-mail then?

[dspace]/bin/dspace sub-daily -- see
https://wiki.duraspace.org/display/DSDOC/Installation#Installation-%27cron%27Jobs

cheers,
Andrea

-- 
Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand



[Dspace-tech] solr statistics: Internal Server Error

2011-10-18 Thread Osama Alkadi
Hello,

We are running DSpace 1.6.2 JSPUI with tomcat-6.0.32 and we have been using 
SOLR for the last 5 months. However, I just realized that no statistics were 
generated for the current month. I've tried tools such as reindex-update, 
stats-util, and stats-log-importer.

Any help would be greatly appreciated.

My SOLR conf

solr.log.server = https://(mydspace.edu)/solr/statistics
solr.dbfile = ${dspace.dir}/config/GeoLiteCity.dat
statistics.item.authorization.admin=false

solr.spiderips.urls = http://iplists.com/google.txt, \
  http://iplists.com/inktomi.txt, \
  http://iplists.com/lycos.txt, \
  http://iplists.com/infoseek.txt, \
  http://iplists.com/altavista.txt, \
  http://iplists.com/excite.txt, \
  http://iplists.com/misc.txt, \
  http://iplists.com/non_engines.txt

tomcat server.xml
<Context path="/solr" docBase="/usr/local/dspace/app/webapps/solr" debug="0" 
reloadable="true" cachingAllowed="false" allowLinking="true"/>


Errors in my dspace.log

2011-10-19 14:20:15,390 ERROR org.dspace.statistics.SolrLogger @ Internal 
Server Error

Internal Server Error

request: https://mydspace.edu/solr/statistics/update?wt=javabin&version=2.2
org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: https://mydspace.edu/solr/statistics/update?wt=javabin&version=2.2
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:343)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:183)
at 
org.apache.solr.client.solrj.request.UpdateRequest.process(UpdateRequest.java:217)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:63)
at org.dspace.statistics.SolrLogger.post(SolrLogger.java:245)
at 
org.dspace.statistics.SolrLoggerUsageEventListener.receiveEvent(SolrLoggerUsageEventListener.java:41)
at 
org.dspace.services.events.SystemEventService.fireLocalEvent(SystemEventService.java:154)
at 
org.dspace.services.events.SystemEventService.fireEvent(SystemEventService.java:97)
at 
org.dspace.app.webui.servlet.HandleServlet.doDSGet(HandleServlet.java:259)
at 
org.dspace.app.webui.servlet.DSpaceServlet.processRequest(DSpaceServlet.java:151)
at 
org.dspace.app.webui.servlet.DSpaceServlet.doGet(DSpaceServlet.java:99)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.dspace.utils.servlet.DSpaceWebappServletFilter.doFilter(DSpaceWebappServletFilter.java:112)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:190)
at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:291)
at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:776)
at 
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:705)
at 
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:898)
at 
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:690)
at java.lang.Thread.run(Thread.java:662)

Regards,
OA
