Right, that is the current state, and the short-term fix here is to use
ignore_above in the template, which will allow us to drop only that field
and not the entire message.  I'm not a huge fan of that solution, but it's
better than the alternative, and I've tried to make sure this issue is well
documented (METRON-542 and all sub-tasks) so it will see eventual resolution
in a way that can be easily mirrored (or even abstracted) to other fields
that have a similar situation.
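
For reference, this is roughly what ignore_above looks like in an Elasticsearch 2.x template.  This is a sketch only - the template and field names are illustrative, not Metron's actual template.  Strings longer than the limit are stored with the document but not indexed, so the oversized field silently becomes unsearchable while the rest of the message still indexes.  Note that ignore_above counts characters while Lucene's 32766 limit is bytes, hence the conservative value (32766 / 4 bytes per UTF-8 character).

```json
{
  "template": "sensor_index*",
  "mappings": {
    "sensor_doc": {
      "properties": {
        "original_string": {
          "type": "string",
          "index": "not_analyzed",
          "ignore_above": 8191
        }
      }
    }
  }
}
```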

Jon

On Thu, Nov 3, 2016 at 3:50 PM Matt Foley <[email protected]> wrote:

> I’ll ping a couple people I know, too.  Can’t guarantee a prompt response :-)
>
> You said the current solution actually rejects indexing any message > 32K.  That seems unfortunate.  But I would consider a completely “dumb” truncation solution (simplest form of your Solution #2, even without any warnings or indicators) to be very acceptable.  After all, even the current situation deals with the vast majority of messages, and truncating would deal with 99% of use cases for the remainder.  Speaking loosely of course!
>
> --Matt
>
> On 11/3/16, 11:25 AM, "[email protected]" <[email protected]> wrote:
>
>     I totally agree - unfortunately I wasn't able to find anybody who publicly posted a similar enough scenario to make use of.  I did reach out to some people on the elastic and lucene teams for validation of what I put forward in METRON-542 and got a thumbs up, but I'm always ok with more review if you happen to have some contacts.  I will say that I'm comfortable with the current solution, especially if I'm able to realize some benefit to monitoring abuse with the profiler.  What do you think?
>
>     Jon
>
>     On Thu, Nov 3, 2016 at 2:15 PM Matt Foley <[email protected]> wrote:
>
>     > Right, for both (1) and (2) there is the problem that if you happen to break the long string in the middle of a substring that you want to use for a search term, you don’t see anything with the non-analyzed string search, and may not see anything with the analyzed string search if you broke or truncated in the middle of a token instead of between tokens.  And I agree entirely with you about the need for full data integrity as an option for those willing to spend the storage.
>     >
>     > Agree this is a hard problem.  The best response I can think of is that this has to be a previously solved problem (or perhaps a previously-worked-around problem, since this looks like it may not have a complete solution).  We really should consult with ES/Solr experts about best practice in these cases.  I’m not one, although I’ve done a couple projects.  Of course if you are one, please don’t take this wrong :-)  But I would definitely welcome input from folks with a lot more experience than I in full-text indexing of large documents specifically.
>     >
>     > I guess my take on it is to stop thinking of such large bodies of text as a “field” and start thinking of it as a “document”, where things like ordered proximity matching have known (if imperfect) solutions.
>     >
>     > Thanks,
>     > --Matt
>
>     > On 11/1/16, 6:36 AM, "[email protected]" <[email protected]> wrote:
>     >
>     >     Hi Matt,
>     >
>     >     First off, thanks for the great response.  I'm going to try and keep this one brief, but there's a lot of nuance here that I want to make sure is obvious.
>     >
>     >     We need to store long non_analyzed fields because (1) wildcards <https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html> are only available on not_analyzed fields in Elasticsearch (because they function at the term level <https://www.elastic.co/guide/en/elasticsearch/guide/current/_wildcard_and_regexp_queries.html>), and (2) proximity matching <https://www.elastic.co/guide/en/elasticsearch/guide/current/proximity-matching.html> and ordering terms are not straightforward/available for string fields (a way to fix this would be a custom analyzer (the standard analyzer doesn't apply here)).  For instance, when trying to find a specific web shell that's being used on your network, you may just care to search for a URI that contains "c99shell.php" or "r57shell.php", which is a much faster search with an analyzed field.  You may also have an APT that planted a specific shell with a certain name on your system, or a certain path, so you'd want to search "/this/is/a/bad/shell.php*", as is only allowed with a not_analyzed field (accounts for query strings, etc.).  Other fields have similar scenarios, such as searching a cookie value for shellshock, etc.
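
For illustration, the wildcard search described above maps to an Elasticsearch query along these lines.  This is a sketch: the `uri` field name is hypothetical, and actual Metron field names may differ.

```json
{
  "query": {
    "wildcard": {
      "uri": "/this/is/a/bad/shell.php*"
    }
  }
}
```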
>     >
>     >     It's also important to note that I'm not currently considering HDFS data "reasonable" for a SOC analyst to access within Metron at this time, but one of the things that I am proposing is that we make it reasonable, or at least start to lay the foundation so it can be reasonable in the future.
>     >
>     >     For your follow-up to (a), I think that the timestamp + 32K string prefix wouldn't be a bad indicator of a unique record, if it is only being used to connect HDFS to Lucene.  However, my understanding is that there is already a unique indicator being used upstream which just isn't persisted (I would defer to others on this one, this is just hearsay...) and if that exists, I'd vote to just persist it and use that.  I would also argue both that (1) incredibly high integrity of data is important for both the security use case and data at this scale, and (2) these fields could be intentionally manipulated by the client to avoid indexing.
>     >
>     >     For your follow-up to (b), I agree that this is another avenue we could take.  Personally, because of the relative infrequency of this event (when not intentionally abused) and the fact that we may have people directly querying Elasticsearch (and not using the UI), I prefer a truncated field which then has an obvious indicator in the JSON like truncated:true, but I guess we could also have a field_split:true which performs similarly.  Probably the best reason I can give for preferring the former is to keep the amount of data in Elasticsearch smaller and trust HDFS to be the true, large ("full") datastore at an architectural level.  I could see an argument that it makes out-of-band querying/third-party UIs slightly more complicated, because they would need to query both Elasticsearch and HDFS, as opposed to doing a more complicated query but only to Elasticsearch.
>     >
>     >     Regarding multi-field, you're exactly right.  In my last-minute editing of the first email I forgot to add that it would need to be truncated as well.
>     >
>     >     Finally, I'm not sure I follow your last paragraph.  What does "other records will simply lack a URI field" mean, exactly?  I don't see a scenario where that is necessary, unless you are referring to non-(Bro+HTTP) records.  I think that this does have to be federated into search, because (1) this field is under the control of the client (potentially an adversary) and thus can be purposefully abused, and (2) if Metron goes 1.0 without fully resolving this issue (including in the UIs), it could significantly impact uptake and reliance.
>     >
>     >     Jon
>     >
>     >     On Mon, Oct 31, 2016 at 6:28 PM Matt Foley <[email protected]> wrote:
>     >
>     >     > Hi Jon, interesting topic.  A couple questions come to mind:
>     >     >
>     >     > What is the reason we need to store very long non_analyzed values in the Index (which evidently is not its design sweet spot)?  Is it because:
>     >     > a. It is valuable to be able to Search against the whole string even when that string is > 32K?
>     >     > b. If HDFS is not also being used, the Index is the only source of historical truth for these records?
>     >     >
>     >     > If the answer is (a), the follow-up question is:  Isn’t it safe to assume that there’s a timestamp near the beginning of the raw string?  So if we provided a 32K prefix string match, including the timestamp, wouldn’t that be pretty much as good?  Or are there lots of cases where the first 32K+, including the timestamp, are truly the same, yet they differ after 32K of text?  If 32K prefix string match is sufficient for 99% of cases, then a fixed-length truncation limit slightly less than 32K, on both Index and Search, will suffice.  This is essentially your approach #2 – and it’s simple.
>     >     >
>     >     > If the answer is (b), then we can be satisfied with any approach that splits the very long strings up somehow and stores them all in a way that allows reassembling them in the correct sequence.  It does require a reformulator for querying in the UI, as you note in your “Concerns - Thoughts #1” below.
>     >     >
>     >     > I’m probably misunderstanding, but I don’t see how multi-field helps.  According to the elastic.co doc you reference, multi-field allows storing and searching both the analyzed and not_analyzed sub-fields without doubling the storage size (which is clearly very useful), but the non_analyzed sub-field should still have the 32K limit.  Is this not so?  Or are you proposing that a multi-field mapping could encapsulate the several sub-strings needed to contain a >32K string?  Eg, as “raw”, “raw1”, “raw2”, etc., where each is <32K?
>     >     >
>     >     > A sub-case of approach #2, relating to your second “Other Thoughts”, would be:  Always truncate the indexed string to slightly less than 32K, but store the full value of any such string in HDFS, and include in the Index a reference (file URI with offset) that allows retrieving it.  This solution can be limited to just the >32K strings, so other records will simply lack a URI field.  And it doesn’t have to be federated into Search as you suggest:  The 32K prefix string Search should be quite adequate as suggested above, and then the whole string can be read from HDFS if needed for historical reasons.
>     >     >
>     >     > Cheers,
>     >     > --Matt
>     >     >
>     >     > On 10/31/16, 1:38 PM, "[email protected]" <[email protected]> wrote:
>     >     >
>     >     >     Hi All,
>     >     >
>     >     >     I've been doing a bit of bug hunting with bro logs within the search tier of Metron over the last couple of weeks.  During this (in this thread I'm primarily referring to METRON-517 <https://issues.apache.org/jira/browse/METRON-517>) I've found a couple of issues that I'd like to discuss with the wider population.  The primary issue here is due to a limitation in Lucene itself, meaning we will encounter the same problem with either Elasticsearch or Solr as far as I can tell.
>     >     >
>     >     >     *Problems*
>     >     >
>     >     >     1. Lucene, including the latest version (6.2.1), appears to have a hard-coded maximum term length of 32766 (reference here <https://lucene.apache.org/core/6_2_1/core/constant-values.html#org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH>; for details see <https://github.com/apache/lucene-solr/search?utf8=%E2%9C%93&q=32766>).  If the indexingBolt attempts to input a non_analyzed <https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2> tuple (set via these files <https://github.co>) which exceeds that limit, the entire message is rejected.
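
A note on the limit itself: 32766 is a byte limit on the UTF-8 encoding of a single term, not a character count.  A minimal sketch of the kind of guard an indexing bolt could apply (the constant mirrors Lucene's IndexWriter.MAX_TERM_LENGTH; the function name is illustrative):

```python
# Lucene's per-term limit (IndexWriter.MAX_TERM_LENGTH) is measured in bytes
# of the UTF-8 encoding, not in characters.
MAX_TERM_LENGTH = 32766

def exceeds_term_limit(value: str) -> bool:
    """True if indexing `value` as a single not_analyzed term would be rejected."""
    return len(value.encode("utf-8")) > MAX_TERM_LENGTH
```

Note that 20,000 copies of a two-byte character already exceed the limit even though the character count is well under 32766.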
>     >     >
>     >     >      - If you simply analyze the field, it reduces the size of any individual term, but it also throws a wrench in your queries when you are searching for a match of that entire field.
>     >     >
>     >     >     2. From what I can tell, failures are only logged via /var/log/elasticsearch/metron.log and in the Storm UI under the Bolt Id's "Last error" column.
>     >     >      - It looks like this is already partially documented as METRON-307 <https://issues.apache.org/jira/browse/METRON-307>.
>     >     >
>     >     >     From here on out I'm going to focus on Problem 1.
>     >     >
>     >     >     *Thoughts*
>     >     >
>     >     >     1. We could use multifield <https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-fields.html> mappings to be able to do both a full and partial search for fields that exceed 32766 in length.
>     >     >
>     >     >     2. Truncate fields in the indexingBolt to keep non-analyzed values below the 32766 limit.
>     >     >
>     >     >     3. Ensure that any field with the ability to grow beyond the 32766 limit is analyzed, and that no single term surpasses the max term limit.
>     >     >
>     >     >     There are some other ways to fix the problem, such as to not store the field <https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html>, not index the field <https://www.elastic.co/guide/en/elasticsearch/refe>, or ignore fields larger than a set value <https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html>, etc., but I personally see these as confusing (to the end user) and not very helpful.  Others have brought up dynamic templates <https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-templates.html> as well, but I haven't looked into them yet.
>     >     >
>     >     >     *Concerns*
>     >     >
>     >     >     Thought #1
>     >     >
>     >     >     - My current favourite (this is also what logstash does), but it requires that we analyze the whole message and store a truncated version of the whole message as a single large term.  If truncation occurs we would need to:
>     >     >
>     >     >         - Add key-value pairs to the tuple that indicate that it was truncated, what field(s) was/were truncated, the pre-truncated size of the field(s), a hash of the pre-truncated field, and a timestamp of truncation (i.e. to detect data tampering).
>     >     >
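
Those key-value pairs could be sketched as follows.  This is a sketch only: key names like "truncated" and "uri.orig_length" are illustrative, not a committed Metron schema.

```python
import hashlib
import time

MAX_TERM_BYTES = 32000  # safely under Lucene's 32766-byte term limit

def truncate_field(message: dict, field: str) -> dict:
    """Truncate message[field] to MAX_TERM_BYTES of UTF-8 and record metadata."""
    raw = message[field].encode("utf-8")
    if len(raw) <= MAX_TERM_BYTES:
        return message
    # Cut on a byte boundary, then drop any partial trailing character.
    message[field] = raw[:MAX_TERM_BYTES].decode("utf-8", errors="ignore")
    message["truncated"] = True
    message["truncated_fields"] = message.get("truncated_fields", []) + [field]
    message[field + ".orig_length"] = len(raw)
    message[field + ".orig_hash"] = hashlib.sha256(raw).hexdigest()
    message["truncation_timestamp"] = int(time.time() * 1000)
    return message
```

The hash and original length let an analyst (or the profiler) verify whether the full value in HDFS still matches what was truncated at index time.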
>     >     >
>     >     >         - Provide UI elements that clearly show that a specific message was truncated.
>     >     >
>     >     >     - May need to abstract querying in the UI.  If so, this requires a sub-task to METRON-195 and looking into an interim solution with Kibana.
>     >     >
>     >     >     - See "Other thoughts".
>     >     >
>     >     >     Thought #2
>     >     >
>     >     >     - If we go this path we’d need to address how to do a full string match (i.e. abstract a copy/paste of a > 32766-length URI to use as a query in the UI).  This may or may not be possible with Kibana – if not, this needs to be a subtask in METRON-195.
>     >     >
>     >     >     - Add key-value pairs to the tuple that indicate that it was truncated, what field(s) was/were truncated, the pre-truncated size of the field(s), a hash of the pre-truncated field, and a timestamp of truncation (i.e. to detect data tampering).
>     >     >
>     >     >     - Provide UI elements that clearly show that a specific message was truncated.
>     >     >
>     >     >     - See "Other thoughts".
>     >     >
>     >     >     Thought #3
>     >     >
>     >     >     - Not a huge fan of this solution because of how it affects whole-string matching.
>     >     >
>     >     >     - May need a custom analyzer to cut up the URI properly.  Here are some relevant materials if we go that path:
>     >     >       <https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-custom-analyzer.html>
>     >     >       <http://docs.oracle.com/javase/7/docs/api/java/net/URL.html>
>     >     >       <http://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html>
>     >     >       <https://tools.ietf.org/html/rfc3986#section-3>
>     >     >       <http://download.java.net/jdk7/archive/b123/docs/api/java/net/URI.html>
>     >     >       <http://www.regexplanet.com/advanced/java/index.html>
>     >     >       <http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html>
>     >     >
>     >     >     *Other thoughts*
>     >     >
>     >     >     - Maybe we can use the profiler to monitor truncated=True and watch for people messing with this on purpose.
>     >     >
>     >     >     - We could add a persistent UUID to every tuple and map HDFS against Elasticsearch data.  This could be used by a UI/frontend to query across both datastores.  Very useful in the case of truncation - provide a configurable setting that is false by default, but if set to true it will query HDFS for data it got which has truncated:true in the indexed store.
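
A minimal sketch of the persistent-UUID idea (the "guid" key name is illustrative):

```python
import uuid

def assign_guid(message: dict) -> dict:
    """Attach a persistent, globally unique id that both HDFS and the index keep."""
    message.setdefault("guid", str(uuid.uuid4()))
    return message

# The same record, written to both stores, can then be joined on "guid".
record = assign_guid({"uri": "/index.html", "truncated": True})
```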
>     >     >
>     >     >     I have more thoughts but this has gotten more than long enough already and I wanted to send it off today.  Thoughts?
>     >     >
>     >     >     Jon

Jon
