Hi Jan,

> Inspired by SOLR-10981 "Allow update to load gzip files", where the proposal
> is to obey the Content-Encoding HTTP request header when updating from a
> compressed stream, I started looking at other headers to do things in more
> industry-standard ways.
>
> Accept:
>
> Advertises which content types, expressed as MIME types, the client is able
> to understand.
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept
>
> Could replace or at least be an alternative to "wt". Examples:
> Accept: application/xml
> Accept: text/csv
>
> Issue: Most browsers send a long Accept header, typically
> application/xml,text/html,*/*, and now that JSON is the default for Solr,
> we'd need to serve JSON if the Accept header includes "*/*"
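What the last paragraph asks for ("serve JSON if the Accept header includes */*") could be sketched like this. This is purely illustrative - the mapping table and function names are made up, not actual Solr code:

```python
# Illustrative only -- a made-up media-type -> wt mapping, not Solr's
# actual response-writer registry.
WT_BY_MEDIA_TYPE = {
    "application/json": "json",
    "application/xml": "xml",
    "text/csv": "csv",
}

def wt_from_accept(accept_header, default_wt="json"):
    """Choose a response writer from the Accept header.  A catch-all
    */* anywhere in the list (as browsers send it) falls back to the
    JSON default instead of honoring the browser's bogus XML wish."""
    media_types = [item.split(";")[0].strip()
                   for item in accept_header.split(",")]
    if "*/*" in media_types:
        return default_wt
    for mt in media_types:
        if mt in WT_BY_MEDIA_TYPE:
            return WT_BY_MEDIA_TYPE[mt]
    return None  # nothing matched -> respond 406 Not Acceptable
```

With this, a browser sending "application/xml,text/html,*/*" still gets JSON, while an API client asking only for text/csv gets CSV.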
That's known under the term "Content Negotiation". I come from the scientific / library publisher world... We use that every day! So the problem you describe is well known and must be solved by an algorithm that takes care of browsers.

All "Accept" headers carry additional scores behind the media types. When parsing the Accept header, you split on commas, then parse each item and look for the score (in the parameter "q"). In addition, browsers generally send some special media types that clearly identify them as browsers 😊

One example from the scientific publishing world, where access to digital object identifiers has been standardized on the "Content Negotiation" mechanism for roughly 10 years, is this one:

Accept: application/rdf+xml;charset=ISO-8859-1;q=0.5, application/vnd.citationstyles.csl+json;q=1.0, */*;q=0.1

This tells the web server that you would like to get the citation of the DOI as citeproc JSON, but would alternatively take it as RDF. The */* is just there because, as a last resort, you would also accept anything else (like HTML). So the order and the scores are important: first sort by "q" score descending, and for equal scores keep the order in the list. The first match wins. This algorithm is used in the library / scientific publishing world and is well understood.
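The algorithm above can be sketched in a few lines of Python (a sketch only - the function name and the "available" list are my own, and real parsers need more robustness):

```python
def negotiate(accept_header, available):
    """Content negotiation as described above: parse the Accept header,
    order entries by q score (highest first, ties keep list order), and
    return the first media type the server can produce.  None means the
    server should answer "406 Not Acceptable"."""
    parsed = []
    for pos, item in enumerate(accept_header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        q = 1.0  # per RFC 7231, a missing q parameter means q=1
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    pass  # ignore a malformed score
        parsed.append((parts[0], q, pos))
    parsed.sort(key=lambda t: (-t[1], t[2]))  # stable: first wins on ties
    for media_type, q, _ in parsed:
        if q <= 0:
            continue  # q=0 means "explicitly not acceptable"
        if media_type == "*/*":
            return available[0]  # catch-all: server's own preference
        if media_type in available:
            return media_type
    return None
```

For the example header above, negotiate() returns the citeproc JSON type when the server offers it, falls back to RDF otherwise, and only takes the */* branch as a last resort.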
E.g., see this DOI (Digital Object Identifier) and the URL of its landing page (I work for PANGAEA, too...):

https://doi.pangaea.de/10.1594/PANGAEA.867475

By default it shows the landing page when visited by a browser, but if you want the metadata in JSON-LD format, do:

Uwe Schindler@VEGA:~ > curl -H 'Accept: application/ld+json' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
{"@context":"http://schema.org/","@type":"Dataset","identifier":"doi:10.1594/PANGAEA.867475","url":"https://doi.pangaea.de/10.1594/PANGAEA.867475","name":"Response of Arctic benthic bacterial deep-sea communities to different detritus composition during an ex-situ high pressure experiment","creator":[{"@type":"Person","name":"Hoffmann, Katy...]}

If you want to download the data behind the URL (or, if it were a scientific paper, the PDF):

curl -H 'Accept: text/tab-separated-values, */*;q=.5' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'

Here I also added */* with a lower score. As the server is able to serve text/tab-separated-values, it returns that by preference. If your client accepts BibTeX or EndNote (RIS) citations, you can send such a header, too. So you can fetch the citation of an item in a machine-readable format the same way - and you can ask the server for any variant - standardized across all scientific publishers! Which one you got back is stated in the response's Content-Type 😊

If the server cannot satisfy any of your Accept entries, it sends an HTTP 406 error:

Uwe Schindler@VEGA:~ > curl -I -H 'Accept: foo/bar' 'https://doi.pangaea.de/10.1594/PANGAEA.867475'
HTTP/1.1 406 Not Acceptable
Server: PANGAEA/1.0
Date: Fri, 30 Jun 2017 21:49:31 GMT
X-robots-tag: noindex,nofollow,noarchive
Content-length: 139
Content-type: text/html
X-ua-compatible: IE=Edge
X-content-type-options: nosniff
Strict-transport-security: max-age=31536000

The IDF / CrossRef / DataCite organizations (including PANGAEA...)
have good code that parses the "Accept" header defensively, so that stupid browsers with many plugins (like Internet Explorer) don't kill you. So basically you look for the specific media types and the catch-all Accept entry, and if the header looks like it came from a browser, you ignore its bogus preferences and serve the default. E.g., Internet Explorer always sends application/xml with a high score.

With this type of content negotiation, you can safely remove the wt=xxx parameter or make it optional.

For compression you normally do the same (the gzip filter in Jetty uses the same algorithm), although browsers behave well with compression, so you can trust the header when the client sends it. The problem with sending data *to* Solr is that you don't know what the server accepts, because you are sending the data first...

> Accept-Encoding:
>
> Advertises which content encoding, usually a compression algorithm, the
> client is able to understand
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding

That's usual practice; every stupid browser on earth does it by default. To enable it in Solr, just add the gzip filter to Jetty. For sending data TO Solr it's not so easy, see above.

> Could enable compression of large search results. SOLR-856 suggests that
> this is implemented, but it does not work. Seems it is only implemented
> for replication. I'd expect this to be useful for large /export or /stream
> requests. Example:
> Accept-Encoding: gzip
>
> What do you think?

Strong +1

> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
