On 7 November 2016 at 02:21, John D. Ament <[email protected]> wrote:
> On Sun, Nov 6, 2016 at 9:03 PM sebb <[email protected]> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <[email protected]> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements, and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy of
>> >> >> the doc ID, IIRC.
>> >> >> the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which runs
>> >> > against _all.
>> >> >
>> >> > https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/"
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in elastic search (reference:
>> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
>> > ) the default field unless specified is "_all". I can't find anything in the
>> > pony code that changes this field. As a result, its going to search _all
>> > by default.
>> >
>>
>> Sorry, I thought you were referring to the _all doc type.
>>
>> But I'm not sure what this has to do with my original e-mail about
>> which fields should be indexed, and which should not.
>>
>
> Everything actually.

I assume you mean everything should *not* be indexed?
That will surely depend on whether there are any specific field
searches, e.g. Subject and From are shown as separate fields in the
Advanced search.

> https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html

In which case we should disable the _all field for all but the mbox mapping.
Most of those will not have many documents, apart from mbox_source,
and that does not have many text fields.
So maybe it won't make much difference.

> Basically, the mappings we use are moot on the individual fields (except
> for the epoch field) since all searches are performed against the _all
> field's value, which is just a big lob of everything smushed together.

Since epoch is double (why is it not long?), not a string, it's not
analysed anyway.

> Although the interesting thing, I just tried searching by message ID, and
> that doesn't seem to work on the ASF version out there -
> https://lists.apache.org/[email protected]:lte=1M:%[email protected]%3E

message-id is flagged as not_analyzed: maybe that excludes it from _all

> John
>
>
>>
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something building up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>
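
For illustration, here is a rough sketch of the two ideas discussed above: marking id-like fields as not_analyzed and switching off _all for the internal doc types, plus passing an explicit default field to the URI search. It assumes ES 2.x-style mapping syntax and the URI search "df" parameter. The field list, the helper names build_mappings and build_search_url, and the choice of "subject" as default field are hypothetical and are not taken from setup.py or elastic.lua.

```python
import json
from urllib.parse import urlencode

# Assumption: id-like fields in the "mbox" doc type, as named in this thread.
ID_FIELDS = ["message-id", "mid", "in-reply-to", "references"]

# Doc types described in the thread as internal use only.
INTERNAL_DOC_TYPES = ["notifications", "account", "mailinglists"]


def build_mappings():
    """Build an ES 2.x-style mappings dict (a sketch, not Pony Mail's real one)."""
    mappings = {
        "mbox": {
            "properties": {
                # Store the value as a single exact term, with no analysis.
                name: {"type": "string", "index": "not_analyzed"}
                for name in ID_FIELDS
            }
        }
    }
    for doc_type in INTERNAL_DOC_TYPES:
        # Not meant for full-text search, so skip building the _all field.
        mappings[doc_type] = {"_all": {"enabled": False}}
    return mappings


def build_search_url(es_url, doc, query, default_field=None):
    """Build a URI search URL; 'df' overrides the default field (_all)."""
    params = {"q": query}
    if default_field is not None:
        params["df"] = default_field
    return es_url + doc + "/_search?" + urlencode(params)


if __name__ == "__main__":
    print(json.dumps({"mappings": build_mappings()}, indent=2))
    print(build_search_url("http://localhost:9200/ponymail/", "mbox",
                           "lucene", default_field="subject"))
```

Without default_field the URL has the same shape as the one built in elastic.lua, so the query string query would still fall back to _all.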
