Re: Index not_analysed for fields used as ids?

John D. Ament Mon, 07 Nov 2016 07:08:07 -0800

On Mon, Nov 7, 2016 at 9:54 AM sebb <[email protected]> wrote:

> On 7 November 2016 at 14:36, John D. Ament <[email protected]> wrote:
> > On Mon, Nov 7, 2016 at 9:23 AM sebb <[email protected]> wrote:
> >
> >> On 7 November 2016 at 01:36, John D. Ament <[email protected]>
> wrote:
> >> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
> >> >
> >> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]>
> >> wrote:
> >> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]>
> >> >> wrote:
> >> >> >
> >> >> >> On 11/06/2016 03:18 PM, sebb wrote:
> >> >> >> > Fields such as message-id are stored as text strings, but they
> are
> >> >> >> > only really intended to be used as ids. They don't contain
> >> independent
> >> >> >> > text parts.
> >> >> >> >
> >> >> >> > From what I have understood so far from reading the ES docs,
> such
> >> >> >> > fields should be tagged as
> >> >> >> >
> >> >> >> > "index": "not_analyzed"
> >> >> >> >
> >> >> >> > AIUI this reduces the analysis overhead and storage
> requirements,
> >> and
> >> >> >> > also makes it harder to find fields with
> >> >> >> > This probably applies to other fields in "mbox":
> >> >> >> >
> >> >> >> > mid
> >> >> >> > possibly in-reply-to
> >> >> >> > also references
> >> >> >> >
> >> >> >> > And of course the auto-created fields such as attachments
> >> >> >> >
> >> >> >> > Likewise the doc types currently missing from setup.py:
> >> >> >> >
> >> >> >> > notifications
> >> >> >> > account
> >> >> >> > mailinglists
> >> >> >> >
> >> >> >> > These are internal use only so are not intended for searching.
> >> >> >> >
> >> >> >> > Or have I got this completely wrong?
> >> >> >> >
> >> >> >>
> >> >> >> message-id is set to not be analyzed, by the setup script (it's in
> >> the
> >> >> >> mappings it sends to ES when creating the index). mid and
> in-reply-to
> >> >> >> should probably also be not analyzed, although mid is really a
> copy
> >> of
> >> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
> >> >> >> list_raw), neither is the raw from address
> >> >> >>
> >> >> >
> >> >> > So I notice the query process is an arbitrary full text query,
> which
> >> runs
> >> >> > against _all.
> >> >> >
> >> >>
> >>
> https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
> >> >>
> >> >> Huh?
> >> >>
> >> >> The query starts:
> >> >>
> >> >> local url = config.es_url .. doc .. "/_search?q="..query
> >> >>
> >> >> where
> >> >>
> >> >> es_url = "http://localhost:9200/ponymail/";
> >> >>
> >> >> and
> >> >>
> >> >> doc = "mbox" by default.
> >> >>
> >> >> Where does the _all come in?
> >> >>
> >> >
> >> > When you do a query string query in elastic search (reference:
> >> >
> >>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html
> >> )
> >> > the default field unless specified is "_all".  I can't find anything
> in
> >> the
> >> > pony code that changes this field.  As a result, it's going to search
> >> > _all by default.
> >>
> >> stats.lua changes the generic query into:
> >>
> >> "query_string": {
> >>   "default_field": "subject",
> >>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
> >> }
> >>
> >> Which does not use the _all field AFAICT
> >>
> >
> > Ok, this is what I was looking for ( but couldn't find ).  But to
> reiterate
> > my notes from above - this means that the only mappings that matter are
> > these fields.  Other field mappings don't matter.
> >
>
> Surely all the text fields 'matter' - i.e. need to have a mapping?
> Otherwise the default is to analyse them.
>
>
Not based on the query in use.  The only three fields being searched are
"from", "subject" and "body" - so only their mappings matter when doing
search.
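
As a rough illustration of that point, here is a minimal Python sketch that
rebuilds a body shaped like the query_string quoted above from stats.lua and
lists the fields it names.  The helper names (build_query, fields_referenced)
and the SEARCH_FIELDS tuple are made up for this example; they are not taken
from the Pony Mail code.

```python
import re

# Assumption: these are the three fields named in the query quoted from stats.lua.
SEARCH_FIELDS = ("from", "subject", "body")


def build_query(term):
    """Build a query_string body shaped like the one quoted in this thread."""
    # Escape backslashes and double quotes so the term stays inside its phrase.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    clauses = ['(%s:"%s")' % (field, escaped) for field in SEARCH_FIELDS]
    return {
        "query_string": {
            "default_field": "subject",
            "query": " OR ".join(clauses),
        }
    }


def fields_referenced(body):
    """Return the field names that appear as '(field:' prefixes in the query."""
    return set(re.findall(r"\((\w+):", body["query_string"]["query"]))


if __name__ == "__main__":
    body = build_query("QUERY")
    print(body["query_string"]["query"])
    print(sorted(fields_referenced(body)))
```

Every clause carries an explicit field prefix, so neither default_field nor
_all is consulted for this particular query.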


One of the concepts behind ES is that you model your index based on the
queries you want to execute.  There are two points of view on that: only
store the things that are relevant, or make everything relevant.


> It's just a question of whether a field is used for searching, and if
> so, what type(s) of searches are done.
>
> It looks like from/subject/body need to support word matching, so need
> to be analysed.
>

We may want to consider things like partial matching as well - fuzziness,
ranking, ngrams, etc.
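
To make the ngram idea concrete, here is a small pure-Python sketch of
character n-grams for a single token.  It only shows the kind of terms an
ngram-based analysis would put in the index; it is not Elasticsearch's
implementation, and the function name and the min_gram/max_gram defaults are
assumptions chosen for the example.

```python
def char_ngrams(token, min_gram=2, max_gram=3):
    """Return all character n-grams of token with min_gram <= length <= max_gram."""
    grams = []
    for n in range(min_gram, max_gram + 1):
        # Slide a window of width n across the token.
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


def partial_match(fragment, token, min_gram=2, max_gram=3):
    """True if fragment is one of the indexed n-grams of token."""
    return fragment.lower() in char_ngrams(token.lower(), min_gram, max_gram)


if __name__ == "__main__":
    print(char_ngrams("pony"))
    print(partial_match("ony", "Pony"))
```

The trade-off is index size: every token expands into several terms, which is
why this would only be worth doing for fields that really need partial
matching.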


>
> However message id and many other fields need only support keyword
> matching.
> So these only need to be indexed.
>

Yes and no.  ES 5 introduced the keyword type, which may be what message-id
should be mapped to.  Email message IDs also contain characters such as "-"
that need to be treated specially in queries.
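
A hedged sketch of what that could look like, written as plain Python dicts
rather than calls to any client library.  The pre-5 form is the
"index": "not_analyzed" setting discussed at the top of this thread; the
5.x form assumes the keyword type.  The helper names and the example
message id are hypothetical.  A term query is used for the lookup because it
is not run through the query_string parser, so characters like "-" are
matched literally.

```python
def id_field_mapping(es_major_version):
    """Return a field mapping for an id-like field that must not be analyzed."""
    if es_major_version >= 5:
        # ES 5.x: keyword fields are indexed as a single, unanalyzed term.
        return {"type": "keyword"}
    # ES 1.x/2.x: string field with analysis switched off.
    return {"type": "string", "index": "not_analyzed"}


def exact_id_query(field, value):
    """Build a term query that matches the stored value exactly."""
    return {"query": {"term": {field: value}}}


if __name__ == "__main__":
    # Hypothetical message id, used for illustration only.
    msgid = "<abc-123@example.org>"
    print(id_field_mapping(5))
    print(id_field_mapping(2))
    print(exact_id_query("message-id", msgid))
```

This only covers exact lookups by id; anything that goes through the
query_string syntax would still need the special characters escaped or
quoted.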


>
> >>
> >> >
> >> >>
> >> >> > unless
> >> >> > I need to dig into it a bit further to see if there's something
> >> building
> >> >> up
> >> >> > query a bit different.
> >> >> >
> >> >> > So... that means most of these mappings are moot.
> >> >>
> >>
>
