To clarify, it only needs to truncate fields longer than 32766 bytes that need a full/exact string match search run on them (analyzed fields generally would not hit this limit, though in theory they could). However, that's probably every field that can exceed 32766 bytes, because I'm assuming those will all be strings.
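For illustration only, the truncation-plus-metadata idea could look roughly like the Python sketch below. This is not Metron code: the function name `truncate_field`, the `_truncated`, `_original_length` and `_sha256` field suffixes, and the choice of SHA-256 are all assumptions, and it treats 32766 as Lucene's per-term limit measured in UTF-8 bytes.

```python
import hashlib

# Assumed limit: Lucene's maximum term length, in UTF-8 encoded bytes.
MAX_TERM_BYTES = 32766


def truncate_field(name, value, limit=MAX_TERM_BYTES):
    """Return the key-value pairs to index for one string field.

    Short values pass through unchanged. Oversized values are cut to
    the byte limit and accompanied by hypothetical metadata fields.
    """
    raw = value.encode("utf-8")
    if len(raw) <= limit:
        return {name: value}

    # Cut at the byte limit, then drop any partial multi-byte character
    # left at the end so the result is still valid UTF-8.
    truncated = raw[:limit].decode("utf-8", errors="ignore")
    return {
        name: truncated,
        name + "_truncated": True,
        name + "_original_length": len(raw),  # pre-truncation length in bytes
        name + "_sha256": hashlib.sha256(raw).hexdigest(),  # hash of the full value
    }
```

Storing the hash of the full value would let a client that holds the complete string still do an exact match, by hashing it and querying the hash field rather than the truncated text.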
I also think using the profiler to monitor the truncation action could be a useful default.

Jon

On Wed, Nov 2, 2016, 21:08 [email protected] <[email protected]> wrote:

> That would break searching on uri entirely unless you queried and knew to
> truncate at 32766 because it's not analyzed. I don't like pushing that
> complication to the end user.
>
> I would suggest truncation in the indexingBolt (not using stellar because
> you'd want this across the board) for all fields > 32766 (how do we make
> sure this gets updated if the limitation changes in Lucene?) and adding
> metadata key-value pairs (pre-trunc length, hash, truncated bool, etc.).
> In the URI scenario I would also suggest doing a multifield mapping by
> default because of the way that data is useful (not sure which analyser to
> use though - maybe write or find a good URI analyzer?). Since timestamp is
> a required field for all messages (I'm pretty sure?) I'm ok with timestamp
> and field value used as the UID, but would prefer something better.
>
> Jon
>
> On Wed, Nov 2, 2016, 20:33 James Sirota <[email protected]> wrote:
>
> Jon,
>
> For METRON-517 would it suffice to have a stellar statement to take a URI
> string and truncate it to length of 32766 in the ES writer? But still
> write the actual string to HDFS? You can then search against ES on the
> truncated portion, but retrieve the actual timestamp from HDFS. It's easy
> to do because you know the timestamp from the original message. So you
> know which logs in HDFS to search through to find the data.
>
> 02.11.2016, 14:12, "[email protected]" <[email protected]>:
> > I personally would like to see the following things done before things
> > leave BETA:
> > (1) Address data integrity concerns (Specifically thinking of METRON-370,
> > METRON-517)
> > (2) Make cluster tuning easier and more consistent (METRON-485, METRON-470,
> > and the "[DISCUSS] moving parsers back to flux" which I can't find a JIRA
> > for).
> >
> > I would also want to see the upgrade path (as opposed to rebuild) be more
> > thoroughly and regularly tested once things leave BETA. From my
> > perspective I think the project is very close but not yet ready.
> >
> > Jon
> >
> > On Wed, Nov 2, 2016 at 4:44 PM Casey Stella <[email protected]> wrote:
> >
> > Hello Everyone,
> >
> > Now that the discussion around the next release has started, it has been
> > proposed and I think it's a good time to discuss what to name this next
> > release. Before, we have adopted the BETA suffix. I think it might be
> > time to drop it and call the next release 0.2.2
> >
> > Thoughts?
> >
> > Best,
> >
> > Casey
> >
> > --
> >
> > Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon

--
Jon
