[ 
https://issues.apache.org/jira/browse/METRON-517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15632792#comment-15632792
 ] 

Jon Zeolla edited comment on METRON-517 at 11/3/16 2:05 PM:
------------------------------------------------------------

After some discussion on the dev list ("[DISCUSS] Search Concerns") I'm 
comfortable with the following recommendation:
1. Add truncation in the indexingBolt for all fields longer than $value (32766 
bytes) and add key-value pairs recording it (the fields that were truncated, 
their pre-truncation length and hash, and the timestamp of truncation).  
2. Add multifield mapping to the URI field in elasticsearch for bro_index*.
3. Use the profiler to monitor contains_truncated:T (after a very brief review 
of the profiler, I'm thinking foreach fields_truncated, onlyif fields_truncated 
!= null, groupBy ip_src_addr).
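
As a rough illustration of item 1, here is a minimal Python sketch of the truncation step. It is not the indexingBolt implementation (that would be Java inside Metron); the function name and the metadata key names (`uri:original_length`, `uri:original_sha256`, `truncation_ts`) are hypothetical, and SHA-256 is an assumed choice of hash. Only `contains_truncated` and `fields_truncated` come from the comment above. The limit is measured in UTF-8 bytes, since that is what the Lucene error reports.

```python
import hashlib
import time

# Lucene's per-term byte limit as quoted in this thread; kept in one place.
MAX_TERM_BYTES = 32766


def truncate_fields(message, limit=MAX_TERM_BYTES, now_ms=None):
    """Return a copy of `message` with oversized string fields truncated.

    Metadata key names are hypothetical, chosen only for illustration.
    """
    out = dict(message)
    truncated = []
    for key, value in message.items():
        if not isinstance(value, str):
            continue
        encoded = value.encode("utf-8")
        if len(encoded) <= limit:
            continue
        # Cut at the byte limit, then drop any partial trailing character.
        out[key] = encoded[:limit].decode("utf-8", errors="ignore")
        out[key + ":original_length"] = len(encoded)
        out[key + ":original_sha256"] = hashlib.sha256(encoded).hexdigest()
        truncated.append(key)
    if truncated:
        out["contains_truncated"] = True
        out["fields_truncated"] = sorted(truncated)
        if now_ms is None:
            now_ms = int(time.time() * 1000)
        out["truncation_ts"] = now_ms
    return out
```

Messages with no oversized fields pass through unchanged, so a profile keyed on `contains_truncated` would only see the records that were actually cut.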

These don't all need to be part of the same PR - in fact, the work should 
probably be broken into 3.  

Concerns
1. How do we make sure the value 32766 gets updated if the limitation changes 
in Lucene?
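
One partial answer to that concern is to keep 32766 in a single named constant and derive every other number from it, so a change in Lucene means one edit. The Python sketch below does that for the multifield mapping from item 2. It is only an assumption about what that mapping could look like: it uses Elasticsearch 2.x style syntax, the subfield name `analyzed` is hypothetical, and it divides by 4 because `ignore_above` counts characters while Lucene counts bytes. It does not detect a changed limit automatically.

```python
import json

# Lucene's per-term byte limit as quoted in this thread.
LUCENE_MAX_TERM_BYTES = 32766
# A UTF-8 character occupies at most 4 bytes.
MAX_UTF8_BYTES_PER_CHAR = 4


def uri_mapping(max_term_bytes=LUCENE_MAX_TERM_BYTES):
    """Build a hypothetical multifield mapping for the bro uri field."""
    return {
        "uri": {
            "type": "string",
            "index": "not_analyzed",
            # ignore_above is a character count, so stay conservatively
            # below the byte limit.
            "ignore_above": max_term_bytes // MAX_UTF8_BYTES_PER_CHAR,
            "fields": {
                # Hypothetical analyzed subfield for full-text search.
                "analyzed": {"type": "string", "index": "analyzed"}
            },
        }
    }


template_fragment = {
    "template": "bro_index*",
    "mappings": {"bro_doc": {"properties": uri_mapping()}},
}

print(json.dumps(template_fragment, indent=2))
```

The same constant could feed the truncation limit in the indexingBolt, which keeps the template and the bolt from drifting apart.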



> Update elasticsearch bro templates for uri
> ------------------------------------------
>
>                 Key: METRON-517
>                 URL: https://issues.apache.org/jira/browse/METRON-517
>             Project: Metron
>          Issue Type: Bug
>            Reporter: Jon Zeolla
>            Assignee: Jon Zeolla
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The bro uri field in 
> [HTTP::Info](https://www.bro.org/sphinx/scripts/base/protocols/http/main.bro.html#type-HTTP::Info)
>  can exceed the Lucene-imposed limit of 32766 bytes per term (non-analyzed fields 
> are treated as a single term, and we are setting it as not_analyzed here - 
> https://github.com/apache/incubator-metron/blob/master/metron-deployment/roles/metron_elasticsearch_templates/files/es_templates/bro_index.template).
>   The resolution options that I've been able to find appear to be:
> 1. Set index to 
> "[no](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html)",
>  which will not add that field to the index, making it not queryable.
> 2. Change the type to 
> [binary](https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html),
>  which will not store it by default.
> 3. Use 
> "[ignore_above](https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html)"
>  to set a limit, above which strings are not indexed.
> 4. Set the field as "analyzed".  
> Here is an example error message:
> ```
> [4]: index [bro_index_2016.10.25.21], type [bro_doc], id 
> [AVf-iCuooLg3mHEm2PpH], message [java.lang.IllegalArgumentException: Document 
> contains at least one immense term in field="uri" (whose UTF8 encoding is 
> longer than the max length 32766), all of which were skipped.  Please correct 
> the analyzer to not produce such terms.  The prefix of the first immense term 
> is: '[<redacted>]...', original message: bytes can be at most 32766 in 
> length; got 38623]
> ```
> Relevant Lucene documentation:  
> https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
