I agree with the others that we should work from a download rather than hitting their API. I haven't had a chance to download the data and look at it yet, but assuming it's a reasonable size, that seems like the better way to go. I'm not sure we would need any REST API in the executors. Why not just load it into HDFS and read it into a dataframe?

What kind of enrichment did we have in mind? Having looked only at the website and not at the data download, it seems to me more like a reference data set that could provide additional information on external IP addresses in netflow data, but I don't know that the ODM tables would need to be enriched. Couldn't we just do a JOIN when we query to display in the UI?
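To make the JOIN-at-query-time idea concrete, here is a rough sketch. It uses Python's built-in sqlite3 as a stand-in for whatever engine we would really query (Hive, Impala, or Spark SQL), and the table and column names (flow, ip_reference, asn, services) are invented for illustration. They are not the actual Spot ODM schema or the Censys download format, which I have not looked at yet.

```python
import sqlite3


def enrich_at_query_time(flows, reference):
    """LEFT JOIN flow rows to reference rows on the destination IP.

    Hypothetical schemas, for illustration only:
      flow(src_ip, dst_ip, byte_count)   stands in for an ODM netflow table
      ip_reference(ip, asn, services)    stands in for the downloaded snapshot
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE flow (src_ip TEXT, dst_ip TEXT, byte_count INTEGER)"
        )
        conn.execute(
            "CREATE TABLE ip_reference (ip TEXT PRIMARY KEY, asn INTEGER, services TEXT)"
        )
        conn.executemany("INSERT INTO flow VALUES (?, ?, ?)", flows)
        conn.executemany("INSERT INTO ip_reference VALUES (?, ?, ?)", reference)
        # The flow table is never modified; the reference columns are attached
        # only in the result set. Flows with no reference entry still appear,
        # with NULL (None) in the reference columns.
        cursor = conn.execute(
            "SELECT f.src_ip, f.dst_ip, f.byte_count, r.asn, r.services "
            "FROM flow AS f "
            "LEFT JOIN ip_reference AS r ON r.ip = f.dst_ip "
            "ORDER BY f.rowid"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample rows using documentation IP ranges.
flows = [
    ("10.0.0.5", "203.0.113.7", 1200),
    ("10.0.0.9", "198.51.100.2", 300),
]
reference = [("203.0.113.7", 64500, "443/https")]

rows = enrich_at_query_time(flows, reference)
for row in rows:
    print(row)
```

The point of the LEFT JOIN is that the ODM data stays as ingested and the reference data can be refreshed from a new snapshot independently. Whether that performs acceptably for the UI would depend on the snapshot size, which we don't know yet.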
I like the idea of Spot having a set of ODM schemas and a set of supported reference data schemas, of which perhaps this could be the first.

Michael

On Tue, Jun 27, 2017 at 9:31 PM, [email protected] <[email protected]> wrote:

> If we query every time we receive data, we will kill the API. However, if
> we query after Spot has results, we are adding context to the suspicious
> results. Can we explore what happens if we store the "common" results and
> only query for things that are out of range? How much information we need
> to store is the other side of the question.
>
> Agree with Vartika: if we can enrich the data downstream, we will add
> value to the solution.
>
> Regards
>
> 2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:
>
> > This looks interesting. I understand we can either query the database
> > directly or download point-in-time snapshots at a specified, frequent
> > interval.
> >
> > Ideally the enrichment should be done in a streaming job based on the
> > downloaded snapshot.
> >
> > (Not sure whether, from within a Spot flow, we would want to query the
> > REST API available on the public internet. Or we can query the
> > downloaded snapshot using a REST API from the executors, but then it
> > may require some additional tuning. That's theoretical at this point.)
> >
> > Complexity would be defined by the size of the downloaded data snapshot
> > as well as the number of external IP addresses flowing in the
> > micro-batch.
> >
> > I have seen such enrichment work successfully in the past, on
> > large-scale enrichment data as well as IP addresses, on a 50-node
> > cluster with a batch interval of about 4 seconds. The IP addresses were
> > of the order of 400K and the enrichment data was of the order of 400K.
> > It involved using a map-side lookup and join and then sending the
> > enriched data further downstream to a Kafka topic.
> >
> > Thoughts?
> >
> > On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]> wrote:
> >
> > > Folks,
> > >
> > > I'd like to discuss the possibility of incorporating Censys for
> > > profiling and context enrichment of external IPv4 addresses on Spot.
> > > This community approach, which leverages ZMAP and ZGRAB to scan the
> > > Internet on a recurrent basis, is building a complex map of running
> > > services, ASN state, and SSL changes.
> > >
> > > High-level description of the project below:
> > >
> > > "Censys is a search engine that allows computer scientists to ask
> > > questions about the devices and networks that compose the Internet.
> > > Driven by Internet-wide scanning, Censys lets researchers find
> > > specific hosts and create aggregate reports on how devices, websites,
> > > and certificates are configured and deployed."
> > >
> > > They have a REST API to do queries at volume, so it could be
> > > incorporated through some type of extension/plugin manager that can
> > > enable/disable it on demand.
> > >
> > > A deep dive on the research done can be found here:
> > >
> > > https://www.censys.io/static/censys.pdf
> > >
> > > Let's get the discussion open and determine next steps from here.
> > >
> > > Thanks,
> > >
> > > Cesar
> >
> > --
> > Vartika Singh
> > Senior Solutions Architect
> > Cloudera
>

--
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
