[
https://issues.apache.org/jira/browse/METRON-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480764#comment-16480764
]
ASF GitHub Bot commented on METRON-1567:
----------------------------------------
GitHub user justinleet opened a pull request:
https://github.com/apache/metron/pull/1020
METRON-1567: Large error message can't be written in Solr
## Contributor Comments
This PR is against the feature branch.
Solr string fields have a hard limit of ~32 KB (32766 bytes). This PR migrates
the raw_message field to an unanalyzed TextField type and makes it a
dynamic field to handle the split into raw_message_<number>.
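The limit applies to the UTF-8 encoded length of the value, not its character count, as the error quoted in the JIRA description below indicates. A minimal Python sketch of that check follows; `exceeds_term_limit` and `MAX_TERM_BYTES` are hypothetical names used for illustration and are not part of Metron or Solr.

```python
# Limit reported in the Solr error for METRON-1567, in bytes.
MAX_TERM_BYTES = 32766


def exceeds_term_limit(value: str, limit: int = MAX_TERM_BYTES) -> bool:
    """Return True if the UTF-8 encoding of value is longer than limit bytes."""
    return len(value.encode("utf-8")) > limit


small = "x" * 32766            # exactly at the limit
large = "x" * 165866           # the size reported in the error message
multibyte = "\u00e9" * 20000   # 20000 characters, but 2 bytes each in UTF-8

for name, value in [("small", small), ("large", large), ("multibyte", multibyte)]:
    print(name, len(value), exceeds_term_limit(value))
```

The multibyte case shows why a character-count check would not be enough: a value can be well under 32766 characters and still be rejected.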
An integration test is added for this behavior. It ensures that no error is
thrown on huge text, that multiple values work the same way,
and that a plain string field does actually cause problems.
## Pull Request Checklist
Thank you for submitting a contribution to Apache Metron.
Please refer to our [Development
Guidelines](https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61332235)
for the complete guide to follow for contributions.
Please refer also to our [Build Verification
Guidelines](https://cwiki.apache.org/confluence/display/METRON/Verifying+Builds?show-miniview)
for complete smoke testing guides.
In order to streamline the review of the contribution we ask you follow
these guidelines and ask you to double check the following:
### For all changes:
- [x] Is there a JIRA ticket associated with this PR? If not one needs to
be created at [Metron
Jira](https://issues.apache.org/jira/browse/METRON/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel).
- [x] Does your PR title start with METRON-XXXX where XXXX is the JIRA
number you are trying to resolve? Pay particular attention to the hyphen "-"
character.
- [x] Has your PR been rebased against the latest commit within the target
branch (typically master)?
### For code changes:
- [x] Have you included steps to reproduce the behavior or problem that is
being changed or addressed?
- [x] Have you included steps or a guide to how the change may be verified
and tested manually?
- [ ] Have you ensured that the full suite of tests and checks have been
executed in the root metron folder via:
```
mvn -q clean integration-test install &&
dev-utilities/build-utils/verify_licenses.sh
```
- [x] Have you written or updated unit tests and/or integration tests to
verify your changes?
- [ ] Have you verified the basic functionality of the build by building
and running locally with Vagrant full-dev environment or the equivalent?
#### Note:
Please ensure that once the PR is submitted, you check travis-ci for build
issues and submit an update to your PR as soon as possible.
It is also recommended that [travis-ci](https://travis-ci.org) is set up
for your personal repository such that your branches are built there before
submitting a pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/justinleet/metron errorRawSchema
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/metron/pull/1020.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1020
----
commit 49c6971de3c9caa87db557a1c3a2a27ef6ef886c
Author: justinjleet <justinjleet@...>
Date: 2018-05-18T14:47:29Z
Schema change and integration test
----
> Large error message can't be written in Solr
> --------------------------------------------
>
> Key: METRON-1567
> URL: https://issues.apache.org/jira/browse/METRON-1567
> Project: Metron
> Issue Type: Sub-task
> Reporter: Justin Leet
> Assignee: Justin Leet
> Priority: Major
>
> Error message on the feature branch:
> {code:java}
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at
> http://ip-11-0-1-51.us-west-2.compute.internal:8983/solr/error: Exception
> writing document id cd6db5c1-f41b-4dcf-8f68-583c7fc08575 to the index;
> possible analysis error: Document contains at least one immense term in
> field="raw_message_1" (whose UTF8 encoding is longer than the max length
> 32766), all of which were skipped. Please correct the analyzer to not produce
> such terms. The prefix of the first immense term is: '[123, 34, 101, 120, 99,
> 101, 112, 116, 105, 111, 110, 34, 58, 34, 106, 97, 118, 97, 46, 105, 111, 46,
> 70, 105, 108, 101, 78, 111, 116, 70]...', original message: bytes can be at
> most 32766 in length; got 165866. Perhaps the document has an indexed string
> field (solr.StrField) which is too large
> at
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:612)
> ~[stormjar.jar:?]
> ...{code}
> This is a hard limit of string fields, per
> https://lucene.apache.org/solr/guide/6_6/field-types-included-with-solr.html
> It also mentions they aren't tokenized or analyzed, so it doesn't seem like
> we'd be able to turn this limit off.
> Text fields don't list any sort of limit (although they may still have one),
> so we may want to switch to that, but it would require testing.
> Additionally, it appears that raw_message is dynamic (since it's getting _1,
> but we don't define it in the schema).
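The byte prefix that Solr prints for the immense term can be decoded to see which value was rejected. A short Python sketch, using only the 30 bytes shown in the error above (the rest of the term is not in the message, so nothing beyond the prefix is recoverable):

```python
# Prefix of the first immense term, copied from the error message.
prefix = [
    123, 34, 101, 120, 99, 101, 112, 116, 105, 111, 110, 34, 58, 34, 106,
    97, 118, 97, 46, 105, 111, 46, 70, 105, 108, 101, 78, 111, 116, 70,
]

# The error reports the term as UTF-8 bytes, so decode them as UTF-8.
decoded = bytes(prefix).decode("utf-8")
print(decoded)  # prints {"exception":"java.io.FileNotF
```

The prefix is the start of a JSON document with an `exception` key, which fits the report that the oversized value is a raw error message.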