Re: Index not_analysed for fields used as ids?

sebb Mon, 07 Nov 2016 06:54:42 -0800

On 7 November 2016 at 14:36, John D. Ament <[email protected]> wrote:
> On Mon, Nov 7, 2016 at 9:23 AM sebb <[email protected]> wrote:
>
>> On 7 November 2016 at 01:36, John D. Ament <[email protected]> wrote:
>> > On Sun, Nov 6, 2016 at 8:22 PM sebb <[email protected]> wrote:
>> >
>> >> On 6 November 2016 at 14:37, John D. Ament <[email protected]> wrote:
>> >> > On Sun, Nov 6, 2016 at 9:27 AM Daniel Gruno <[email protected]> wrote:
>> >> >
>> >> >> On 11/06/2016 03:18 PM, sebb wrote:
>> >> >> > Fields such as message-id are stored as text strings, but they are
>> >> >> > only really intended to be used as ids. They don't contain independent
>> >> >> > text parts.
>> >> >> >
>> >> >> > From what I have understood so far from reading the ES docs, such
>> >> >> > fields should be tagged as
>> >> >> >
>> >> >> > "index": "not_analyzed"
>> >> >> >
>> >> >> > AIUI this reduces the analysis overhead and storage requirements, and
>> >> >> > also makes it harder to find fields with
>> >> >> > This probably applies to other fields in "mbox":
>> >> >> >
>> >> >> > mid
>> >> >> > possibly in-reply-to
>> >> >> > also references
>> >> >> >
>> >> >> > And of course the auto-created fields such as attachments
>> >> >> >
>> >> >> > Likewise the doc types currently missing from setup.py:
>> >> >> >
>> >> >> > notifications
>> >> >> > account
>> >> >> > mailinglists
>> >> >> >
>> >> >> > These are internal use only so are not intended for searching.
>> >> >> >
>> >> >> > Or have I got this completely wrong?
>> >> >> >
>> >> >>
>> >> >> message-id is set to not be analyzed, by the setup script (it's in the
>> >> >> mappings it sends to ES when creating the index). mid and in-reply-to
>> >> >> should probably also be not analyzed, although mid is really a copy of
>> >> >> the doc ID, IIRC. the list ID is also not analyzed by default (as
>> >> >> list_raw), neither is the raw from address
>> >> >>
>> >> >
>> >> > So I notice the query process is an arbitrary full text query, which runs
>> >> > against _all.
>> >> >
>> >> > https://github.com/apache/incubator-ponymail/blob/master/site/api/lib/elastic.lua#L44
>> >>
>> >> Huh?
>> >>
>> >> The query starts:
>> >>
>> >> local url = config.es_url .. doc .. "/_search?q="..query
>> >>
>> >> where
>> >>
>> >> es_url = "http://localhost:9200/ponymail/";
>> >>
>> >> and
>> >>
>> >> doc = "mbox" by default.
>> >>
>> >> Where does the _all come in?
>> >>
>> >
>> > When you do a query string query in Elasticsearch (reference:
>> > https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html)
>> > the default field unless specified is "_all".  I can't find anything in the
>> > pony code that changes this field.  As a result, it's going to search _all
>> > by default.
>>
>> stats.lua changes the generic query into:
>>
>> "query_string": {
>>   "default_field": "subject",
>>   "query": "(from:\"QUERY\") OR (subject:\"QUERY\") OR (body:\"QUERY\")"
>> }
>>
>> Which does not use the _all field AFAICT
>>
>
> Ok, this is what I was looking for ( but couldn't find ).  But to reiterate
> my notes from above - this means that the only mappings that matter are
> these fields.  Other field mappings don't matter.
>


Surely all the text fields 'matter' - i.e. need to have a mapping?
Otherwise the default is to analyse them.

It's just a question of whether a field is used for searching, and if
so, what type(s) of searches are done.

It looks like from/subject/body need to support word matching, so need
to be analysed.
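
For reference, the query_string shape quoted above from stats.lua can be
sketched in Python as below. This is only an illustration: build_query is
a hypothetical helper, and the quote escaping is my own addition, not
something taken from the Pony Mail code.

```python
def build_query(user_query):
    """Build a query_string body shaped like the one quoted from stats.lua."""
    # Assumption: escape backslashes and double quotes so the user's text
    # stays inside the quoted phrase. The real code may handle this differently.
    escaped = user_query.replace("\\", "\\\\").replace('"', '\\"')
    # The same phrase is searched in each of the three analysed fields.
    clauses = ['(%s:"%s")' % (field, escaped)
               for field in ("from", "subject", "body")]
    return {
        "query_string": {
            "default_field": "subject",
            "query": " OR ".join(clauses),
        }
    }
```

Every clause names its field explicitly, so neither default_field nor
_all is consulted for these terms.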

However, message-id and many other fields need only support exact
(keyword) matching, so these need to be indexed but not analysed.
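
A minimal sketch of what such a mapping could look like, written as a
Python dict because setup.py is Python. The field lists are taken from
this thread, not from the actual setup.py, and build_mbox_mapping is a
hypothetical helper; the "string" / "not_analyzed" syntax assumes the
ES 1.x/2.x mapping format.

```python
# Assumption: id-like fields named in this thread; the real list may differ.
ID_FIELDS = ("message-id", "mid", "in-reply-to", "references")
# Fields that the stats.lua query searches by word.
TEXT_FIELDS = ("from", "subject", "body")


def build_mbox_mapping():
    """Return a mapping body for the 'mbox' doc type (ES 1.x/2.x style)."""
    props = {}
    for name in ID_FIELDS:
        # Indexed as a single exact term: no tokenising, no lowercasing.
        props[name] = {"type": "string", "index": "not_analyzed"}
    for name in TEXT_FIELDS:
        # Default behaviour for strings: analysed, so word matching works.
        props[name] = {"type": "string"}
    return {"mbox": {"properties": props}}
```

Any text field left out of the mapping would still be analysed by
default, which is why the id fields have to be listed explicitly.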

>>
>> >
>> >>
>> >> > unless
>> >> > I need to dig into it a bit further to see if there's something building up
>> >> > query a bit different.
>> >> >
>> >> > So... that means most of these mappings are moot.
>> >>
>>
