To clarify, it only needs to truncate fields > 32766 which need a
full/exact string match search to be run on them (analyzed fields generally
would not hit this limitation but I guess in theory they could).  However,
that's probably every field which can get > 32766 because I'm assuming
those will all be strings.
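
A minimal sketch of the truncation discussed in this thread, not Metron's actual indexing code. It assumes the 32766 limit is counted in UTF-8 bytes, and the function name and the `_truncated`, `_original_length` and `_sha256` key suffixes are hypothetical, chosen only to illustrate the metadata key-value pairs suggested below.

```python
import hashlib

# Limit quoted in this thread; kept in one place so it is easy to update
# if the underlying limit ever changes.
MAX_TERM_BYTES = 32766


def truncate_fields(message, limit=MAX_TERM_BYTES):
    """Return a copy of `message` with oversized string fields truncated.

    For every truncated field, hypothetical metadata keys record that the
    truncation happened, the original length in bytes, and a hash of the
    full original value.
    """
    out = dict(message)
    for key, value in message.items():
        if not isinstance(value, str):
            continue  # only string fields are considered here
        raw = value.encode("utf-8")
        if len(raw) <= limit:
            continue  # field already fits, leave it untouched
        # Cut at the byte limit and drop any partial trailing character.
        out[key] = raw[:limit].decode("utf-8", errors="ignore")
        out[key + "_truncated"] = True
        out[key + "_original_length"] = len(raw)
        out[key + "_sha256"] = hashlib.sha256(raw).hexdigest()
    return out
```

Under these assumptions the full value would still be written to HDFS unchanged, and the hash gives a way to match an indexed record to its original without relying on the truncated text alone.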

I also think using the profiler to monitor the truncation action could be a
useful default.

Jon

On Wed, Nov 2, 2016, 21:08 [email protected] <[email protected]> wrote:

> That would break searching on uri entirely unless you queried and knew to
> truncate at 32766 because it's not analyzed.  I don't like pushing that
> complication to the end user.
>
> I would suggest truncation in the indexingBolt (not using stellar because
> you'd want this across the board) for all fields > 32766 (how do we make
> sure this gets updated if the limitation changes in Lucene?) and adding
> metadata key-value pairs (pre-trunc length, hash, truncated bool, etc.).
> In the URI scenario I would also suggest doing a multifield mapping by
> default because of the way that data is useful (not sure which analyzer to
> use though - maybe write or find a good URI analyzer?).  Since timestamp is
> a required field for all messages (I'm pretty sure?) I'm ok with timestamp
> and field value used as the UID, but would prefer something better.
>
> Jon
>
> On Wed, Nov 2, 2016, 20:33 James Sirota <[email protected]> wrote:
>
> Jon,
>
> For METRON-517 would it suffice to have a stellar statement to take a URI
> string and truncate it to length of 32766 in the ES writer?  But still
> write the actual string to HDFS? You can then search against ES on the
> truncated portion, but retrieve the actual string from HDFS.  It's easy
> to do because you know the timestamp from the original message.  So you
> know which logs in HDFS to search through to find the data.
>
> 02.11.2016, 14:12, "[email protected]" <[email protected]>:
> > I personally would like to see the following things done before things
> > leave BETA:
> > (1) Address data integrity concerns (Specifically thinking of METRON-370,
> > METRON-517)
> > (2) Make cluster tuning easier and more consistent (METRON-485, METRON-470,
> > and the "[DISCUSS] moving parsers back to flux" which I can't find a JIRA
> > for).
> >
> > I would also want to see the upgrade path (as opposed to rebuild) be more
> > thoroughly and regularly tested once things leave BETA. From my
> > perspective I think the project is very close but not yet ready.
> >
> > Jon
> >
> > On Wed, Nov 2, 2016 at 4:44 PM Casey Stella <[email protected]> wrote:
> >
> > Hello Everyone,
> >
> > Now that the discussion around the next release has started, it has been
> > proposed and I think it's a good time to discuss what to name this next
> > release. Before, we have adopted the BETA suffix. I think it might be
> > time to drop it and call the next release 0.2.2
> >
> > Thoughts?
> >
> > Best,
> >
> > Casey
> >
> > --
> >
> > Jon
>
> -------------------
> Thank you,
>
> James Sirota
> PPMC- Apache Metron (Incubating)
> jsirota AT apache DOT org
>
> --
>
> Jon
>
-- 

Jon
