On 7 November 2016 at 14:36, John D. Ament <[email protected]> wrote:
> On Mon, Nov 7, 2016 at 9:23 AM sebb <[email protected]> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <[email protected]> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements, and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy of
>> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which runs
>> >> > against _all.
>> >> >
>> >> > https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/"
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in elastic search (reference:
>> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html )
>> > the default field unless specified is "_all". I can't find anything in the
>> > pony code that changes this field. As a result, its going to search _all
>> > by default.
>>
>> stats.lua changes the generic query into:
>>
>> "query_string": {
>>     "default_field": "subject",
>>     "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
>> }
>>
>> Which does not use the _all field AFAICT
>>
>
> Ok, this is what I was looking for ( but couldn't find ). But to reiterate
> my notes from above - this means that the only mappings that matter are
> these fields. Other field mappings don't matter.
>

Surely all the text fields 'matter' - i.e. need to have a mapping?
Otherwise the default is to analyse them.

It's just a question of whether a field is used for searching, and if
so, what type(s) of searches are done.

It looks like from/subject/body need to support word matching, so need
to be analysed.
However message id and many other fields need only support keyword
matching. So these only need to be indexed.

>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something building up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >> >
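
To make the analysed vs. keyword-only distinction concrete, here is a minimal Python sketch. It is not the actual contents of setup.py or stats.lua. The function names (build_mbox_mapping, build_query, escape_phrase) are hypothetical, and the field lists are simply the ones named in this thread. It assumes an ES version that still uses the "string" type with "index": "not_analyzed", as discussed above; later ES versions express the same split with separate "text" and "keyword" types.

```python
import json

# Fields the stats.lua query searches by word, so they need analysis.
ANALYZED_FIELDS = ("from", "subject", "body")

# Fields named in this thread that are only ever matched as whole ids.
KEYWORD_FIELDS = ("message-id", "mid", "in-reply-to", "references")


def build_mbox_mapping():
    """Return a hypothetical mapping body for the 'mbox' doc type."""
    properties = {}
    for name in ANALYZED_FIELDS:
        # Analysed: tokenised so that individual words can be matched.
        properties[name] = {"type": "string", "index": "analyzed"}
    for name in KEYWORD_FIELDS:
        # Indexed but not analysed: only exact-value matches are possible.
        properties[name] = {"type": "string", "index": "not_analyzed"}
    return {"mbox": {"properties": properties}}


def escape_phrase(term):
    """Escape backslashes and double quotes for use inside a quoted phrase."""
    return term.replace("\\", "\\\\").replace('"', '\\"')


def build_query(term):
    """Build a query_string shaped like the one quoted from stats.lua."""
    phrase = '"%s"' % escape_phrase(term)
    return {
        "query_string": {
            "default_field": "subject",
            "query": "(from:%s) OR (subject:%s) OR (body:%s)"
                     % (phrase, phrase, phrase),
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mbox_mapping(), indent=2, sort_keys=True))
    print(json.dumps(build_query("mappings"), indent=2, sort_keys=True))
```

Because the query names from/subject/body explicitly, it never touches _all; the id fields still get a mapping, just one that skips analysis.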
