On Sun, Nov 6, 2016 at 9:03 PM sebb <[email protected]> wrote:

> On 7 November 2016 at 01:36, John D. Ament <[email protected]> wrote:
> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
> >
> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]> wrote:
> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]> wrote:
> >> >
> >> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> >> > Fields such as message-id are stored as text strings, but they are
> >> >> > only really intended to be used as ids. They don't contain independent
> >> >> > text parts.
> >> >> >
> >> >> > From what I have understood so far from reading the ES docs, such
> >> >> > fields should be tagged as
> >> >> >
> >> >> > "index": "not_analyzed"
> >> >> >
> >> >> > AIUI this reduces the analysis overhead and storage requirements, and
> >> >> > also makes it harder to find fields with
> >> >> > This probably applies to other fields in "mbox":
> >> >> >
> >> >> > mid
> >> >> > possibly in-reply-to
> >> >> > also references
> >> >> >
> >> >> > And of course the auto-created fields such as attachments
> >> >> >
> >> >> > Likewise the doc types currently missing from setup.py:
> >> >> >
> >> >> > notifications
> >> >> > account
> >> >> > mailinglists
> >> >> >
> >> >> > These are internal use only so are not intended for searching.
> >> >> >
> >> >> > Or have I got this completely wrong?
> >> >> >
> >> >>
> >> >> message-id is set to not be analyzed, by the setup script (it's in the
> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
> >> >> should probably also be not analyzed, although mid is really a copy of
> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> >> list_raw), neither is the raw from address
> >> >>
> >> >
> >> > So I notice the query process is an arbitrary full text query, which runs
> >> > against _all.
> >> >
> >> > https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
> >>
> >> Huh?
> >>
> >> The query starts:
> >>
> >> local url = config.es_url .. doc .. "/_search?q="..query
> >>
> >> where
> >>
> >> es_url = "http://localhost:9200/ponymail/"
> >>
> >> and
> >>
> >> doc = "mbox" by default.
> >>
> >> Where does the _all come in?
> >>
> >
> > When you do a query string query in elastic search (reference:
> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
> > the default field unless specified is "_all". I can't find anything in the
> > pony code that changes this field. As a result, its going to search _all
> > by default.
> >
>
> Sorry, I thought you were referring to the _all doc type.
>
> But I'm not sure what this has to do with my original e-mail about
> which fields should be indexed, and which should not.
>

Everything, actually.
https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

Basically, the mappings we use on the individual fields are moot (except for
the epoch field), since all searches are performed against the _all field's
value, which is just a big blob of everything smushed together.

The interesting thing, though: I just tried searching by message ID, and that
doesn't seem to work on the ASF version out there -
https://lists.apache.org/[email protected]:lte=1M:%[email protected]%3E

John

>
> >>
> >> > unless
> >> > I need to dig into it a bit further to see if there's something building up
> >> > query a bit different.
> >> >
> >> > So... that means most of these mappings are moot.
> >> >
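
To make the point about the default field concrete, here is a minimal Python sketch. It is not Pony Mail's actual code (that is the Lua line quoted above); it only mirrors how that line concatenates the search URL. The base URL and the "mbox" doc type come from the thread. The function name search_url, the example query values, and the choice of "mid" as a target field are assumptions for illustration. The "df" parameter is Elasticsearch's standard URI-search option for naming the default field; whether Pony Mail should use it is a separate question.

```python
from urllib.parse import quote, urlencode

# Values quoted in the thread.
ES_URL = "http://localhost:9200/ponymail/"
DOC = "mbox"


def search_url(query, default_field=None):
    """Build a URI-search URL (hypothetical helper, for illustration only).

    With no default_field, the URL carries only q=..., so Elasticsearch
    falls back to its configured default field for unqualified terms
    ("_all" in the setup discussed in this thread).
    With default_field set, the df parameter tells the query string
    parser which field unqualified terms should be matched against.
    """
    params = {"q": query}
    if default_field is not None:
        params["df"] = default_field
    # quote_via=quote encodes spaces as %20 rather than "+".
    return ES_URL + DOC + "/_search?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    # Unqualified query: searched against the default field.
    print(search_url("hello world"))
    # Same mechanism, but pointed at one field (example value is made up).
    print(search_url("abc123", default_field="mid"))
```

With the first form, a per-field "not_analyzed" mapping has no effect on whether a term matches, because the term is never compared against that field. It only matters once a query names the field, either through df or through a field:value prefix in the query string itself.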
