Agree with others that we should work from a download rather than hitting
their API.  I haven't had a chance to download the data to look at it yet,
but assuming it's a reasonable size that seems like a better way to go.
I'm not sure that we would need any REST API in the executors - why not
just load it into HDFS and read it into a dataframe?  What kind of
enrichment did we have in mind?  Having looked only at the web site and not
the data download, it seems more like a reference data set that could
provide additional information on external IP addresses in netflow data,
but I don't know that the ODM tables would need to be enriched.
Couldn't we just do a JOIN when we query to display in the UI?
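
To make the JOIN idea concrete, here is a minimal sketch. It uses SQLite
purely so it runs standalone; in practice this would be whatever SQL engine
sits on top of HDFS. The table names (flow, ip_reference), the columns, and
the sample rows are all hypothetical and do not reflect the real ODM tables
or the layout of the Censys download, which I have not looked at yet.

```python
import sqlite3

# Hypothetical schemas: neither table reflects the real Spot ODM or the
# Censys snapshot layout; they only illustrate a query-time join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flow (src_ip TEXT, dst_ip TEXT, bytes INTEGER);
    CREATE TABLE ip_reference (ip TEXT PRIMARY KEY, asn INTEGER, services TEXT);
""")

# Made-up netflow rows (documentation-range IPs stand in for external hosts).
conn.executemany("INSERT INTO flow VALUES (?, ?, ?)", [
    ("10.0.0.5", "198.51.100.7", 1200),
    ("10.0.0.9", "203.0.113.20", 450),
])

# Made-up reference rows loaded from a downloaded snapshot.
conn.executemany("INSERT INTO ip_reference VALUES (?, ?, ?)", [
    ("198.51.100.7", 64500, "443/https"),
])


def flows_with_context(conn):
    """Join flows to reference data at query time.

    LEFT JOIN keeps flows whose external IP is absent from the snapshot;
    their reference columns come back as None.
    """
    return conn.execute("""
        SELECT f.src_ip, f.dst_ip, f.bytes, r.asn, r.services
        FROM flow AS f
        LEFT JOIN ip_reference AS r ON r.ip = f.dst_ip
        ORDER BY f.dst_ip
    """).fetchall()


for row in flows_with_context(conn):
    print(row)
```

The point is that the flow table is never rewritten: the reference table
can be refreshed from a new download on its own schedule, and the UI query
picks up whatever context is current.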

I like the idea of Spot having a set of ODM schemas and a set of supported
reference data schemas, of which perhaps this could be the first.

Michael

On Tue, Jun 27, 2017 at 9:31 PM, [email protected] <[email protected]>
wrote:

> If we query every time we receive data, we will kill the API. However, if
> we query after Spot has produced results, we are adding context to the
> suspicious results. Can we explore what happens if we store the "common"
> results and only query for things that are out of range? How much
> information we need to store is the other side of the question.
> Agree with Vartika: if we can enrich the data downstream, we will add value
> to the solution.
> Regards
>
> 2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:
>
> > This looks interesting. I understand we can either query the database
> > directly or download point-in-time snapshots at a specified interval.
> >
> > Ideally the enrichment should be done in a Streaming job based on the
> > snapshot downloaded.
> >
> > (I'm not sure we would want to query a REST API on the public internet
> > from within a Spot flow. Alternatively, we could query the downloaded
> > snapshot through a REST API from the executors, but that may require
> > some additional tuning. That's theoretical at this point.)
> >
> > Complexity would be defined by the size of the data snapshot downloaded
> > as well as the volume of external IP addresses flowing in each
> > micro-batch.
> >
> > I have seen such enrichment done successfully in the past at large
> > scale, for both the enrichment data and the IP addresses, on a 50-node
> > cluster with a batch interval of about 4 seconds. The IP addresses were
> > of the order of 400K and the enrichment data was of the order of 400K.
> > It involved using a map-side lookup and join and then sending the
> > enriched data further downstream to a Kafka topic.
> >
> > Thoughts?
> >
> > On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]> wrote:
> >
> > > Folks,
> > >
> > > I'd like to discuss the possibility of incorporating Censys for
> > > profiling and context enrichment of external IPv4 addresses on Spot.
> > > This community approach leverages ZMAP and ZGRAB to scan the Internet
> > > on a recurring basis and is building a complex map of running
> > > services, ASN state, and SSL changes.
> > >
> > > High-level description of the project below:
> > >
> > > "Censys is a search engine that allows computer scientists to ask
> > > questions about the devices and networks that compose the Internet.
> > > Driven by Internet-wide scanning, Censys lets researchers find specific
> > > hosts and create aggregate reports on how devices, websites, and
> > > certificates are configured and deployed."
> > >
> > > They have a REST API for queries at volume, so it could be
> > > incorporated through some type of extension/plugin manager that can
> > > enable/disable it on demand.
> > >
> > > A deep dive on the research done can be found over here:
> > >
> > > https://www.censys.io/static/censys.pdf
> > >
> > > Let's open the discussion and determine next steps from here.
> > >
> > >
> > > Thanks,
> > >
> > > Cesar
> > >
> >
> >
> >
> > --
> > Vartika Singh
> > Senior Solutions Architect
> > Cloudera
> >
>



-- 
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
