Can defragmentation improve performance on a SATA class 10 disk (~10000 rpm)?

2021-02-21 Thread Danilo Tomasoni
Hello all,
we are running a Solr instance with around 41 million documents on a SATA
class 10 disk at around 10,000 rpm.
We are experiencing very slow query responses (on the order of hours) with an
average of 205 segments.
We ran a test on an ordinary PC with an SSD, and there the same Solr instance
with the same data and the same number of segments was around 45 times
faster.
We also tried a force optimize to improve performance, but it was very slow,
so we abandoned it.

Since we do not yet have enterprise server SSDs, we are wondering whether
defragmenting the solrdata folder could help in the meantime.
The idea is that, due to many updates, each segment file is fragmented across
different physical blocks.
Put another way, each segment file is non-contiguous on disk, and this can
slow down Solr's responses.

What do you suggest?
Is this roughly equivalent to a force optimize, or can it be faster?

Thank you.
Danilo

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu

As for the European General Data Protection Regulation 2016/679 on the 
protection of natural persons with regard to the processing of personal data, 
we inform you that all the data we possess are object of treatment in the 
respect of the normative provided for by the cited GDPR.
It is your right to be informed on which of your data are used and how; you may 
ask for their correction, cancellation or you may oppose to their use by 
written request sent by recorded delivery to The Microsoft Research – 
University of Trento Centre for Computational and Systems Biology Scarl, Piazza 
Manifattura 1, 38068 Rovereto (TN), Italy.
Please don't print this e-mail unless you really need to
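
For readers who want to revisit the force optimize mentioned above in a less drastic form, here is a minimal sketch that only builds the request URL for a partial optimize. The `optimize` and `maxSegments` parameters belong to Solr's update handler; the core name `mycore` and the target of 20 segments are hypothetical values chosen for illustration. The thread does not establish whether this would help on the disk in question.

```python
from urllib.parse import urlencode


def optimize_url(base_url: str, core: str, max_segments: int) -> str:
    """Build an update URL asking Solr to merge down to at most max_segments.

    base_url and core are placeholders; adjust them to the real deployment.
    """
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    # optimize=true triggers a forced merge; maxSegments caps the segment count
    # so the merge does not have to go all the way down to a single segment.
    params = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base_url.rstrip('/')}/{core}/update?{params}"
```

Calling `optimize_url("http://localhost:8983/solr", "mycore", 20)` yields a URL that can be requested with any HTTP client. A merge of this kind still rewrites large files, so it may also be slow on a spinning disk.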


Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-21 Thread Danilo Tomasoni
Congratulations Jan!

Danilo Tomasoni


From: Yonik Seeley 
Sent: Sunday, 21 February 2021 05:51
To: solr-user@lucene.apache.org 
Cc: Lucene Dev 
Subject: Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!



Congrats Jan! Go Solr!
-Yonik


On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta  wrote:

> Hi everyone,
>
> I’d like to inform everyone that the newly formed Apache Solr PMC nominated
> and elected Jan Høydahl for the position of the Solr PMC Chair and Vice
> President. This decision was approved by the board in its February 2021
> meeting.
>
> Congratulations Jan!
>
> --
> Anshum Gupta
>


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread Shawn Heisey

On 2/21/2021 3:07 PM, cratervoid wrote:

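> Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
> example in the 7.7.3 installation to the 8.8.0 installation, restarted the
> server and it now works. Comparing the two files, it looks like, as you said,
> this section was left out of the _default/solrconfig.xml file in version
> 8.8.0:
>
>   <requestHandler name="/update/extract"
>                   startup="lazy"
>                   class="solr.extraction.ExtractingRequestHandler" >
>     <lst name="defaults">
>       <str name="lowernames">true</str>
>       <str name="fmap.meta">ignored_</str>
>       <str name="fmap.content">_text_</str>
>     </lst>
>   </requestHandler>
>
> So those trying out the tutorial will need to add this section to get it to
> work for sample.html.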



This line from that config also is involved:

  <lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />


That loads the contrib jars needed for the ExtractingRequestHandler to 
work right.  There are a LOT of jars there.  Tika is a very heavyweight 
piece of software.


Thanks,
Shawn


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
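Thanks Shawn, I copied the solrconfig.xml file from the gettingstarted
example in the 7.7.3 installation to the 8.8.0 installation, restarted the
server and it now works. Comparing the two files, it looks like, as you said,
this section was left out of the _default/solrconfig.xml file in version
8.8.0:

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.meta">ignored_</str>
      <str name="fmap.content">_text_</str>
    </lst>
  </requestHandler>

So those trying out the tutorial will need to add this section to get it to
work for sample.html.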
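
As an alternative to editing solrconfig.xml by hand, the same handler could in principle be registered through Solr's Config API, which accepts an `add-requestHandler` command as JSON. The sketch below only builds that payload. It assumes the three defaults are named `lowernames`, `fmap.meta` and `fmap.content`, as in the 7.x default configset, and it does not load the extraction jars, which still have to be on the classpath.

```python
import json


def extract_handler_command() -> dict:
    """Return a Config API command that defines the /update/extract handler.

    The parameter names under "defaults" are an assumption based on the
    Solr 7.x default configset.
    """
    return {
        "add-requestHandler": {
            "name": "/update/extract",
            "class": "solr.extraction.ExtractingRequestHandler",
            "startup": "lazy",
            "defaults": {
                "lowernames": "true",
                "fmap.meta": "ignored_",
                "fmap.content": "_text_",
            },
        }
    }


def as_request_body() -> str:
    """Serialize the command for an HTTP POST with Content-type application/json."""
    return json.dumps(extract_handler_command())
```

The resulting body would be POSTed to http://localhost:8983/solr/gettingstarted/config, using the core name from this thread.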



On Sat, Feb 20, 2021 at 4:21 PM Shawn Heisey  wrote:

> On 2/20/2021 3:58 PM, cratervoid wrote:
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
>
> The problem here is that the solrconfig.xml in use by the index named
> "gettingstarted" does not define a handler at /update/extract.
>
> Typically a handler defined at that URL path will utilize the extracting
> request handler class.  This handler uses Tika (another Apache project)
> to extract usable data from rich text formats like PDF, HTML, etc.
>
>
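> <requestHandler name="/update/extract"
>                 startup="lazy"
>                 class="solr.extraction.ExtractingRequestHandler" >
>   <lst name="defaults">
>     <str name="lowernames">true</str>
>     <str name="fmap.meta">ignored_</str>
>     <str name="fmap.content">_text_</str>
>   </lst>
> </requestHandler>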
>
>
> Note that using this handler will require adding some contrib jars to Solr.
>
> Tika can become very unstable because it deals with undocumented file
> formats, so we do not recommend using that handler in production.  If
> the functionality is important, Tika should be included in a program
> that's separate from Solr, so that if it crashes, it does not take Solr
> down with it.
>
> Thanks,
> Shawn
>
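
The advice above to keep extraction in a program separate from Solr can be sketched as follows. This is not Tika: it swaps in Python's standard `html.parser` to pull plain text out of an HTML file, which only covers HTML and is far less capable. The field names `id` and `_text_` are assumptions about the target schema.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_solr_doc(doc_id: str, html: str) -> dict:
    """Turn an HTML string into a flat document ready to be sent as JSON."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # "id" and "_text_" are assumed field names; adapt them to the schema.
    return {"id": doc_id, "_text_": " ".join(parser.chunks)}
```

The returned dictionary can be serialized with `json.dumps` and posted to the `/update/json/docs` endpoint that the post tool already uses for JSON files. If the extraction step crashes on a malformed file, Solr itself is unaffected.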


Re: HTML sample.html not indexing in Solr 8.8

2021-02-21 Thread cratervoid
Thanks Alex. I copied the solrconfig.xml over from 7.7.3 to the 8.8.0 conf
folder and restarted the server.  Now indexing works without errors on
sample.html.  There is a 1K difference between the two files, so I'll diff
them to see what was left out of the 8.8 version.
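
The diff mentioned above can be done with any diff tool. As a minimal sketch, Python's standard `difflib` produces a unified diff from two lists of lines. The file labels are hypothetical defaults, and reading the two solrconfig.xml files from disk is left to the caller.

```python
import difflib


def config_diff(old_lines, new_lines,
                old_name="7.7.3/solrconfig.xml",
                new_name="8.8.0/solrconfig.xml"):
    """Return the unified diff between two configs as a list of lines.

    Both inputs are lists of lines without trailing newlines, for example
    the result of open(path).read().splitlines().
    """
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=old_name, tofile=new_name,
        lineterm="",
    ))
```

Lines starting with `-` exist only in the first file, so a section dropped from the newer config shows up as a run of `-` lines.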

On Sat, Feb 20, 2021 at 4:27 PM Alexandre Rafalovitch 
wrote:

> Most likely issue is that your core configuration (solrconfig.xml)
> does not have the request handler for that. The same config may have
> had that in 7.x, but changed since.
>
> More details:
> https://lucene.apache.org/solr/guide/8_8/uploading-data-with-solr-cell-using-apache-tika.html
>
> Regards,
>Alex.
>
> On Sat, 20 Feb 2021 at 17:59, cratervoid  wrote:
> >
> > I am trying out indexing the exampledocs in the examples folder with the
> > SimplePostTool on windows 10 using solr 8.8.  All the documents index
> > except sample.html. For that file I get the errors below.  I then
> > downloaded solr 7.7.3 and indexed the exampledocs folder with no errors,
> > including sample.html.
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\sample.html
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file sample.html (text/html) to [base]/extract
> > SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > SimplePostTool: WARNING: Response: 
> > 
> > 
> > Error 404 Not Found
> > 
> > HTTP ERROR 404 Not Found
> > 
> > URI:/solr/gettingstarted/update/extract
> > STATUS:404
> > MESSAGE:Not Found
> > SERVLET:default
> > 
> >
> > 
> > 
> > SimplePostTool: WARNING: IOException while reading response:
> > java.io.FileNotFoundException:
> >
> http://localhost:8983/solr/gettingstarted/update/extract?resource.name=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html&literal.id=C%3A%5Csolr-8.8.0%5Cexample%5Cexampledocs%5Csample.html
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > Time spent: 0:00:00.086
> > ```
> >
> > However the json and all other file types index with no problem. For
> > example:
> > ```
> > PS C:\solr-8.8.0> java -jar -Dc=gettingstarted -Dauto
> > example\exampledocs\post.jar example\exampledocs\books.json
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,jsonl,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file books.json (application/json) to [base]/json/docs
> > 1 files indexed.
> > COMMITting Solr index changes to
> > http://localhost:8983/solr/gettingstarted/update...
> > ```
> > Just following this tutorial:
> > https://lucene.apache.org/solr/guide/8_8/post-tool.html#post-tool-windows-support
>