[
https://issues.apache.org/jira/browse/METRON-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632792#comment-15632792
]
Jon Zeolla edited comment on METRON-517 at 11/3/16 2:04 PM:
------------------------------------------------------------
After some discussion on the dev list ("[DISCUSS] Search Concerns") I'm
comfortable with the following recommendation:
1. Add truncation in the indexingBolt for all fields longer than $value (32766
bytes), and add key-value pairs recording which fields were truncated, each
field's pre-truncation length and hash, and the timestamp of truncation.
2. Add a multifield mapping to the URI field in Elasticsearch for bro_index*.
3. Use the profiler to monitor contains_truncated:T (after a very brief review
of the profile I'm thinking foreach fields_truncated, onlyif fields_truncated
!= null, groupBy ).
These don't all need to be part of the same PR; in fact, this should probably
be broken into 3.
Concerns
1. How do we make sure the value 32766 gets updated if the limitation changes
in Lucene?
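
As a rough illustration of recommendation 1, here is a minimal Python sketch of
the truncation step. It is not the indexingBolt implementation (that would be
Java); it only shows the idea. The key names contains_truncated and
fields_truncated come from the comment above, while the per-field
`_pre_truncated_length` / `_pre_truncated_hash` suffixes, the
truncation_timestamp key, the choice of SHA-256, and the function names are all
assumptions made for the example. The limit is a parameter rather than a
hard-coded literal, which is one way to keep it in a single place given the
concern above.

```python
import hashlib
import time

# Per-term byte limit cited in this issue. Kept as one named value so it
# only has to change in one place if the Lucene limit ever changes.
MAX_TERM_BYTES = 32766


def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut value to at most max_bytes of UTF-8 without splitting a character."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def truncate_message(message: dict, max_bytes: int = MAX_TERM_BYTES, now=None) -> dict:
    """Return a copy of message with oversized string fields truncated.

    All metadata key names except contains_truncated and fields_truncated
    are hypothetical and chosen only for this sketch.
    """
    result = dict(message)
    truncated = []
    for field, value in message.items():
        if not isinstance(value, str):
            continue
        raw = value.encode("utf-8")
        if len(raw) <= max_bytes:
            continue
        result[field] = truncate_utf8(value, max_bytes)
        # Record the pre-truncation length (in bytes) and a hash of the full value.
        result[field + "_pre_truncated_length"] = len(raw)
        result[field + "_pre_truncated_hash"] = hashlib.sha256(raw).hexdigest()
        truncated.append(field)
    if truncated:
        result["contains_truncated"] = True
        result["fields_truncated"] = truncated
        # Epoch milliseconds; "now" can be injected to make tests deterministic.
        result["truncation_timestamp"] = now if now is not None else int(time.time() * 1000)
    return result
```

For example, `truncate_message({"uri": "a" * 38623})` would shorten uri to
32766 bytes and add the metadata keys, while a message with no oversized fields
comes back unchanged.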
> Update elasticsearch bro templates for uri
> ------------------------------------------
>
> Key: METRON-517
> URL: https://issues.apache.org/jira/browse/METRON-517
> Project: Metron
> Issue Type: Bug
> Reporter: Jon Zeolla
> Assignee: Jon Zeolla
> Original Estimate: 1h
> Remaining Estimate: 1h
>
> The bro uri field in
> [HTTP::Info](https://www.bro.org/sphinx/scripts/base/protocols/http/main.bro.html#type-HTTP::Info)
> can exceed the Lucene-imposed limit of 32766 bytes per term (not_analyzed
> fields are treated as a single term, and we are setting uri as not_analyzed here -
> https://github.com/apache/incubator-metron/blob/master/metron-deployment/roles/metron_elasticsearch_templates/files/es_templates/bro_index.template).
> The resolution options that I've been able to find appear to be:
> 1. Set index to
> "[no](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html)",
> which will not add that field to the index, so it cannot be queried.
> 2. Change the type to
> [binary](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html),
> which is not stored by default and is not searchable.
> 3. Use
> "[ignore_above](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)"
> to set a limit, above which strings are not indexed.
> 4. Set the field as "analyzed".
> Here is an example error message:
> ```
> [4]: index [bro_index_2016.10.25.21], type [bro_doc], id
> [AVf-iCuooLg3mHEm2PpH], message [java.lang.IllegalArgumentException: Document
> contains at least one immense term in field="uri" (whose UTF8 encoding is
> longer than the max length 32766), all of which were skipped. Please correct
> the analyzer to not produce such terms. The prefix of the first immense term
> is: '[<redacted>]...', original message: bytes can be at most 32766 in
> length; got 38623]
> ```
> Relevant Lucene documentation:
> https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH
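
To make options 3 and 4 from the description concrete, here is a hedged sketch
of what a uri mapping combining ignore_above with an analyzed sub-field could
look like, written as a Python dict that serializes to the JSON a template
would contain. This is not the mapping from bro_index.template; it assumes the
Elasticsearch 2.x string type, and the sub-field name "analyzed" and the
function name are hypothetical. Note that ignore_above counts characters while
Lucene counts bytes, and a UTF-8 character can take up to 4 bytes, so the
conservative setting is 32766 / 4 = 8191.

```python
import json

# Per-term byte limit cited in this issue.
MAX_TERM_BYTES = 32766
# ignore_above is a character count; dividing by 4 covers the worst case
# of 4 bytes per UTF-8 character.
IGNORE_ABOVE_CHARS = MAX_TERM_BYTES // 4


def uri_mapping(ignore_above: int = IGNORE_ABOVE_CHARS) -> dict:
    """Build an illustrative mapping fragment for the uri field."""
    return {
        "uri": {
            "type": "string",
            # Exact-match field: values longer than ignore_above are not indexed.
            "index": "not_analyzed",
            "ignore_above": ignore_above,
            "fields": {
                # Hypothetical sub-field name; analyzed text is split into
                # tokens rather than indexed as one large term.
                "analyzed": {"type": "string", "index": "analyzed"}
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(uri_mapping(), indent=2))
```

With a mapping along these lines, oversized URIs would be skipped for the
not_analyzed field instead of failing the whole document, and would remain
searchable through the analyzed sub-field.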