[Koha-bugs] [Bug 38101] ES skips records with huge fields

bugzilla-daemon--- via Koha-bugs Sat, 12 Oct 2024 12:08:31 -0700

https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=38101


--- Comment #6 from Thomas Klausner <[email protected]> ---
I have now (finally, sorry) looked at the mappings as used in ktd, and it seems
that the sub-fields of `note` (which `500` is indexed into by default) are
indeed set up as `type: keyword`:

Start ktd, then run `ktd --shell`:

```
kohadev-koha@kohadevbox:koha(main)$ curl -s 'http://es:9200/koha_kohadev_biblios/_mapping/field/note?pretty'

{
  "koha_kohadev_biblios" : {
    "mappings" : {
      "note" : {
        "full_name" : "note",
        "mapping" : {
          "note" : {
            "type" : "text",
            "fields" : {
              "ci_raw" : {
                "type" : "keyword",
                "normalizer" : "icu_folding_normalizer"
              },
              "phrase" : {
                "type" : "text",
                "analyzer" : "analyzer_phrase"
              },
              "raw" : {
                "type" : "keyword",
                "normalizer" : "nfkc_cf_normalizer"
              }
            },
            "analyzer" : "analyzer_standard"
          }
        }
      }
    }
  }
}

```

Here you see that fields "raw" and "ci_raw" are of type "keyword".
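
The same check can be scripted instead of read off by eye. A minimal sketch in Python, assuming the response has the shape shown above; `RESPONSE` is a trimmed copy of that output and `keyword_subfields` is a helper name made up for this illustration, not part of Koha:

```python
import json

# Trimmed copy of the _mapping/field/note response shown above.
RESPONSE = json.loads("""
{
  "koha_kohadev_biblios": {
    "mappings": {
      "note": {
        "full_name": "note",
        "mapping": {
          "note": {
            "type": "text",
            "fields": {
              "ci_raw": {"type": "keyword", "normalizer": "icu_folding_normalizer"},
              "phrase": {"type": "text", "analyzer": "analyzer_phrase"},
              "raw": {"type": "keyword", "normalizer": "nfkc_cf_normalizer"}
            },
            "analyzer": "analyzer_standard"
          }
        }
      }
    }
  }
}
""")


def keyword_subfields(response, index, field):
    """Return the sorted names of the sub-fields of `field` mapped as keyword."""
    mapping = response[index]["mappings"][field]["mapping"][field]
    return sorted(
        name
        for name, spec in mapping.get("fields", {}).items()
        if spec.get("type") == "keyword"
    )


print(keyword_subfields(RESPONSE, "koha_kohadev_biblios", "note"))
# prints ['ci_raw', 'raw']
```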

Now, to test whether this is indeed the problem, we have to fiddle with the
Elasticsearch mappings, which is not very easy, because the web interface does
not have any effect on the actual mappings, which are stored in
`admin/searchengine/elasticsearch/mappings.yaml`. BUT we actually don't care
that much about the mappings (i.e. which MARC21 fields go into which search
field). We care about the definition of the "note" field, which has no 'type',
so it uses the default type, which we find in
`admin/searchengine/elasticsearch/field_config.yaml`:

```
  default:
    type: text
    analyzer: analyzer_standard
    search_analyzer: analyzer_standard
    fields:
      phrase:
        type: text
        analyzer: analyzer_phrase
        search_analyzer: analyzer_phrase
      raw:
        type: keyword
        normalizer: nfkc_cf_normalizer
      ci_raw:
        type: keyword
        normalizer: icu_folding_normalizer
```

Because I'm currently just exploring, I just deleted `raw` and `ci_raw`, but
(spoiler alert) this wasn't enough, because `analyzer_phrase` has the same
problem. So I removed all the subfields from "default", so we only have:

```
  default:
    type: text
    analyzer: analyzer_standard
    search_analyzer: analyzer_standard
```

Now I can recreate the ES index:

kohadev-koha@kohadevbox:koha(main)$ perl misc/search_tools/rebuild_elasticsearch.pl -r

And reindex my test entry (where I added ~40k of text to 500):

perl misc/search_tools/rebuild_elasticsearch.pl --biblios --bn 284 -v -v

And it works!!

And I can find the book when I search for some of the text I entered (even if
the text is at the end of the 40k).

BUT (a very big BUT):

This is NOT the proper solution, just proof that the problem lies in the
use of `keyword` and/or `analyzer_phrase` (where `analyzer_phrase` is defined
in `admin/searchengine/elasticsearch/index_config.yaml` and also uses
`keyword`).
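
Why plain `text` copes with the 40k note while `keyword` and a keyword-tokenized analyzer do not can be illustrated with a toy model. This is a sketch, not Elasticsearch code: it assumes `analyzer_phrase` uses the keyword tokenizer as described above, uses `str.split` as a crude stand-in for the standard tokenizer, and ignores normalizers and filters. The 32766-byte figure is Lucene's maximum length for a single term; all function names are made up for illustration.

```python
LUCENE_MAX_TERM_BYTES = 32766  # Lucene's maximum size of one indexed term, in UTF-8 bytes


def keyword_tokenize(text):
    """Model of the keyword tokenizer: the whole value becomes a single term."""
    return [text]


def whitespace_tokenize(text):
    """Crude stand-in for the standard tokenizer: split on whitespace."""
    return text.split()


def largest_term_bytes(text, tokenize):
    """Return the UTF-8 size of the longest term the tokenizer produces."""
    return max((len(term.encode("utf-8")) for term in tokenize(text)), default=0)


note = "word " * 8000  # about 40k of text, like the test record

for name, tokenize in [("keyword", keyword_tokenize), ("whitespace", whitespace_tokenize)]:
    size = largest_term_bytes(note, tokenize)
    verdict = "too long" if size > LUCENE_MAX_TERM_BYTES else "ok"
    print(name, size, verdict)
```

In this model the keyword tokenizer yields one 40000-byte term, which is over the limit, while the split version never produces a term longer than 4 bytes.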

One thing we could (easily) do is to use `ignore_above` for type=keyword (which
would behave similarly to your patch, in that too-long text is left out of those
subfields):

      raw:
        type: keyword
        normalizer: nfkc_cf_normalizer
        ignore_above: 20000
      ci_raw:
        type: keyword
        normalizer: icu_folding_normalizer
        ignore_above: 20000

But this does not work for `analyzer_phrase` :-(
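
One caveat on picking the `ignore_above` value, sketched here under stated assumptions: Lucene's limit for a single term is 32766 bytes of UTF-8, while `ignore_above` counts characters, so a value of 20000 only stays under the limit for mostly ASCII text. The helper names are made up for illustration, Python's `len()` is only an approximation of how Elasticsearch counts characters, and the effect of the normalizers on the length is ignored.

```python
LUCENE_MAX_TERM_BYTES = 32766  # Lucene's maximum size of one indexed term, in UTF-8 bytes


def utf8_bytes(text):
    """Size of the value once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def passes_ignore_above(text, ignore_above):
    """True if the value is short enough, in characters, to be indexed."""
    return len(text) <= ignore_above


def safe_ignore_above(max_bytes_per_char=4):
    """Largest character count that cannot exceed the byte limit."""
    return LUCENE_MAX_TERM_BYTES // max_bytes_per_char


ascii_note = "a" * 20000        # 1 byte per character
umlaut_note = "\u00e4" * 20000  # "a" with diaeresis, 2 bytes per character

for note in (ascii_note, umlaut_note):
    print(passes_ignore_above(note, 20000), utf8_bytes(note) <= LUCENE_MAX_TERM_BYTES)

print(safe_ignore_above())
```

Both notes pass an `ignore_above` of 20000, but the second one is 40000 bytes and would still be too large as a single term. Since a UTF-8 character takes at most 4 bytes, 32766 / 4 = 8191 is the conservative value for text with many non-ASCII characters.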

I guess the correct (but very hard) solution would be to figure out why and
where we need those subfields (esp. "phrase", but also "raw" and "ci_raw") and
decide if we can use ignore_above for "raw" and "ci_raw". And figure out a fix
for "phrase".

Or, much easier: we define a new search type "long_text" which does not include
all those subfields (and therefore will not support a phrase search). Then you
can change the search_mappings for "note" on your instance from "default" to
"long_text" and everything should work. Or we might even decide that "note"
should be a "long_text" by default.
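
A rough sketch of what such a `long_text` type would amount to, in Python for illustration: take the `default` definition from `field_config.yaml` and drop its `fields` block. The name `long_text` is only the proposal from the paragraph above, not an existing Koha type as of this comment, and `without_subfields` is a made-up helper.

```python
import copy

# The `default` definition from field_config.yaml, transcribed as a dict.
DEFAULT = {
    "type": "text",
    "analyzer": "analyzer_standard",
    "search_analyzer": "analyzer_standard",
    "fields": {
        "phrase": {
            "type": "text",
            "analyzer": "analyzer_phrase",
            "search_analyzer": "analyzer_phrase",
        },
        "raw": {"type": "keyword", "normalizer": "nfkc_cf_normalizer"},
        "ci_raw": {"type": "keyword", "normalizer": "icu_folding_normalizer"},
    },
}


def without_subfields(field_def):
    """Return a deep copy of field_def with its `fields` block removed."""
    stripped = copy.deepcopy(field_def)
    stripped.pop("fields", None)
    return stripped


# Hypothetical "long_text" type: same analyzers, no phrase/raw/ci_raw subfields.
LONG_TEXT = without_subfields(DEFAULT)
print(LONG_TEXT)
```

The result matches the trimmed-down `default` used in the experiment above, so a field mapped this way would be searchable as analyzed text but would have no `phrase`, `raw` or `ci_raw` variants.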

Unfortunately, Elasticsearch is a complex beast, and the Koha ES implementation
has a lot of improvement opportunities (let's call it that...).
