Hi Jan,
> Inspired by SOLR-10981 "Allow update to load gzip files” where the proposal
> is to obey the
> Content-Encoding HTTP request header to update a compressed stream, I
> started looking at other
> headers to do things in more industry-standard ways.
> 
> Accept:
> 
>   Advertises which content types, expressed as MIME types, the client is able
> to understand
>   https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
> 
>   Could replace or at least be an alternative to “wt”. Examples:
>   Accept: application/xml
>   Accept: text/csv
> 
>   Issue: Most browsers send a long accept header, typically
> application/xml,text/html,*/*, and now
>   that json is default for Solr, we’d need to serve JSON if the accept header
> includes “*/*"

That's known as "Content Negotiation". I come from the scientific / 
library publishing world... We use it every day!

So the problem you describe is well known and can be solved by an algorithm 
that takes care of browsers. "Accept" headers carry additional scores after 
the media types. When parsing the Accept header, you split on commas and then 
parse each item, looking for the score (the "q" parameter). In addition, browsers 
generally send some special media types that clearly identify them as browsers 😊

One example from the scientific publishing world, where access to digital 
object identifiers has been standardized on the "Content Negotiation" mechanism 
for roughly ten years, is this one:

Accept: application/rdf+xml;charset=ISO-8859-1;q=0.5, 
application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1

This tells the web server that you would like to get the citation of the DOI as 
citeproc-json, but would alternatively take it as RDF. The */* is there because, 
as a last resort, you would also accept anything else (like HTML).

So both the order and the scores are important: first sort by the "q" scores in 
descending order, and for equal scores keep the order in the list. First wins.
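The parse-and-sort steps above can be sketched like this (an illustrative toy, not anything from Solr; the function name and the float-based q handling are my own):

```python
def parse_accept(header):
    """Parse an HTTP Accept header into media types sorted by preference.

    Returns a list of (media_type, q) tuples, highest q first; ties keep
    the order in which they appeared in the header (first wins).
    """
    entries = []
    for index, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0]
        q = 1.0  # a missing q parameter means q=1
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    pass  # ignore malformed q values, keep the default
        entries.append((media_type, q, index))
    # sort by q descending; the original position breaks ties ("first wins")
    entries.sort(key=lambda e: (-e[1], e[2]))
    return [(t, q) for t, q, _ in entries]
```

Feeding it the citation example from above yields citeproc-json first, RDF second, and the */* catch-all last.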

The algorithm is used throughout the library/scientific publishing world and is 
well understood. E.g., see this DOI (Digital Object Identifier) and the URL to 
its landing page (I work for PANGAEA, too...):

https://doi.pangaea.de/10.1594/PANGAEA.867475

By default it shows the landing page when visited by a browser, but if you want 
the metadata in JSON-LD format, do:

Uwe Schindler@VEGA:~ > curl -H 'Accept: application/ld+json' 
'https://doi.pangaea.de/10.1594/PANGAEA.867475'                     
{"@context":"http://schema.org/","@type":"Dataset","identifier":"doi:10.1594/PANGAEA.867475","url":"https://doi.pangaea.de/10.1594/PANGAEA.867475","name":"Response
 of Arctic benthic bacterial deep-sea communities to different detritus 
composition during an ex-situ high pressure 
experiment","creator":[{"@type":"Person","name":"Hoffmann, Katy...]}

If you want to download the data behind the URL (or, if it were a scientific 
paper, the PDF):

curl -H 'Accept: text/tab-separated-values, */*;q=.5' 
'https://doi.pangaea.de/10.1594/PANGAEA.867475'

Here I also added */* with a lower score. Since the server is able to serve 
text/tab-separated-values, it returns that by preference.

If your client accepts BibTeX citations or EndNote (RIS) ones, you can send a 
header for those, too. So you can fetch the citation of an item in a machine-readable 
format the same way - and you can ask the server for any variant - standardized 
across all scientific publishers! Which one you got back is reported in the 
response's Content-Type 😊

If the server cannot satisfy any of your Accepts, it will send an HTTP 406 error:

Uwe Schindler@VEGA:~ > curl -I -H 'Accept: foo/bar' 
'https://doi.pangaea.de/10.1594/PANGAEA.867475'
HTTP/1.1 406 Not Acceptable
Server: PANGAEA/1.0
Date: Fri, 30 Jun 2017 21:49:31 GMT
X-robots-tag: noindex,nofollow,noarchive
Content-length: 139
Content-type: text/html
X-ua-compatible: IE=Edge
X-content-type-options: nosniff
Strict-transport-security: max-age=31536000

The IDF / CrossRef / DataCite organizations (including PANGAEA...) have good 
code that parses the "Accept" header defensively, so that stupid browsers with 
many plugins (like Internet Explorer) don't kill you. Basically, you look for 
your specific media types plus the catch-all, and if the header looks like it 
came from a browser, you treat the request as a browser request. E.g., Internet 
Explorer always sends application/xml with a high score.

With this type of content negotiation, you can safely remove the wt=xxx param 
or make it optional.
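To make that concrete, here is a minimal self-contained sketch of how such a negotiation step could pick a response writer. Everything here is hypothetical - negotiate_wt, the supported-types map, and the "wt" names are mine for illustration, not Solr's actual API:

```python
def negotiate_wt(accept_header, supported=None):
    """Map an Accept header to a hypothetical Solr "wt" value, or None for 406.

    `supported` maps media types to writer names (illustrative defaults below).
    Requests that look like a browser (text/html in the Accept header) get the
    JSON default instead of whichever specific type scored highest.
    """
    if supported is None:
        supported = {
            "application/json": "json",
            "application/xml": "xml",
            "text/csv": "csv",
        }
    # Parse "type;q=score" items, defaulting q to 1, ignoring other parameters.
    prefs = []
    for index, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    pass
        prefs.append((parts[0], q, index))
    prefs.sort(key=lambda e: (-e[1], e[2]))  # highest q first, first wins ties
    if any(t == "text/html" for t, _, _ in prefs):
        return "json"  # looks like a browser: serve the site default
    for media_type, q, _ in prefs:
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if media_type in supported:
            return supported[media_type]
        if media_type == "*/*":
            return "json"  # wildcard: fall back to the default writer
    return None  # nothing acceptable: respond 406 Not Acceptable
```

A browser-style header like "application/xml,text/html,*/*" then gets the JSON default, while an explicit "application/xml;q=0.9, */*;q=0.1" gets XML.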

For compression, you normally do the same (the gzip filter in Jetty uses the 
same algorithm), although browsers behave well with compression, so you can 
trust the header when the client sends it. The problem with sending data *to* 
Solr is that you don't know what the server accepts, because you are sending 
data first...
 
> Accept-Encoding:
> 
>   Advertises which content encoding, usually a compression algorithm, the
> client is able to understand
>   https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding

That's usual practice; every stupid browser on earth does it by default. To 
enable it in Solr, just add the gzip filter to Jetty. For sending data TO Solr 
it's not so easy, see above.

>   Could enable compression of large search results. SOLR-856 suggests that
> this is implemented,
>   but it does not work. Seems it is only implemented for replication. I’d 
> expect
> this to be useful for
>   large /export or /stream requests. Example:
>   Accept-Encoding: gzip
> 
> 
> 
> What do you think?

Strong +1

> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]

