[ 
https://issues.apache.org/jira/browse/METRON-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480764#comment-16480764
 ] 

ASF GitHub Bot commented on METRON-1567:
----------------------------------------

GitHub user justinleet opened a pull request:

    https://github.com/apache/metron/pull/1020

    METRON-1567: Large error message can't be written in Solr

    ## Contributor Comments
    This PR is against the feature branch.
    
    Solr string fields have a hard limit of ~32 KB (32766 bytes).  This PR 
migrates the raw_message fields to an unanalyzed TextField type and makes 
raw_message a dynamic field to handle the split into raw_message_<number>.
    
    An integration test is added for this behavior. It ensures that no error 
is thrown on huge text, that multiple values work the same way, and that a 
plain string field does actually cause problems.
    
    ## Pull Request Checklist
    
    Thank you for submitting a contribution to Apache Metron.  
    Please refer to our [Development 
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
 for the complete guide to follow for contributions.  
    Please refer also to our [Build Verification 
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
 for complete smoke testing guides.  
    
    
    In order to streamline the review of the contribution we ask you follow 
these guidelines and ask you to double check the following:
    
    ### For all changes:
    - [x] Is there a JIRA ticket associated with this PR? If not one needs to 
be created at [Metron 
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
    - [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA 
number you are trying to resolve? Pay particular attention to the hyphen "-" 
character.
    - [x] Has your PR been rebased against the latest commit within the target 
branch (typically master)?
    
    
    ### For code changes:
    - [x] Have you included steps to reproduce the behavior or problem that is 
being changed or addressed?
    - [x] Have you included steps or a guide to how the change may be verified 
and tested manually?
    - [ ] Have you ensured that the full suite of tests and checks has been 
executed in the root metron folder via:
      ```
      mvn -q clean integration-test install && 
dev-utilities/build-utils/verify_licenses.sh 
      ```
    
    - [x] Have you written or updated unit tests and/or integration tests to 
verify your changes?
    - [ ] Have you verified the basic functionality of the build by building 
and running locally with Vagrant full-dev environment or the equivalent?
    
    #### Note:
    Please ensure that once the PR is submitted, you check travis-ci for build 
issues and submit an update to your PR as soon as possible.
    It is also recommended that [travis-ci](https://travis-ci.org) is set up 
for your personal repository such that your branches are built there before 
submitting a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/justinleet/metron errorRawSchema

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/metron/pull/1020.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1020
    
----
commit 49c6971de3c9caa87db557a1c3a2a27ef6ef886c
Author: justinjleet <justinjleet@...>
Date:   2018-05-18T14:47:29Z

    Schema change and integration test

----


> Large error message can't be written in Solr
> --------------------------------------------
>
>                 Key: METRON-1567
>                 URL: https://issues.apache.org/jira/browse/METRON-1567
>             Project: Metron
>          Issue Type: Sub-task
>            Reporter: Justin Leet
>            Assignee: Justin Leet
>            Priority: Major
>
> Error message on the feature branch:
> {code:java}
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error 
> from server at 
> http://ip-11-0-1-51.us-west-2.compute.internal:8983/solr/error: Exception 
> writing document id cd6db5c1-f41b-4dcf-8f68-583c7fc08575 to the index; 
> possible analysis error: Document contains at least one immense term in 
> field="raw_message_1" (whose UTF8 encoding is longer than the max length 
> 32766), all of which were skipped. Please correct the analyzer to not produce 
> such terms. The prefix of the first immense term is: '[123, 34, 101, 120, 99, 
> 101, 112, 116, 105, 111, 110, 34, 58, 34, 106, 97, 118, 97, 46, 105, 111, 46, 
> 70, 105, 108, 101, 78, 111, 116, 70]...', original message: bytes can be at 
> most 32766 in length; got 165866. Perhaps the document has an indexed string 
> field (solr.StrField) which is too large
> at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:612)
>  ~[stormjar.jar:?]
> ...{code}
> This is a hard limit of string fields, per 
> https://lucene.apache.org/solr/guide/6_6/field-types-included-with-solr.html
> It also mentions they aren't tokenized or analyzed, so it doesn't seem like 
> we'd be able to turn this limit off.
> Text fields don't list any sort of limit (although they may still have one), 
> so we may want to switch to that, but it would require testing.
> Additionally, it appears that raw_message is dynamic (since it's getting _1, 
> but we don't define it in the schema).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
