I just registered for an account to see what the data looks like.  The IPv4
download is 800+ GB compressed, the X.509 certificate download is 3 TB
compressed, and the Alexa dataset is 50 GB compressed.  Not unreasonable
dataset sizes for a big data cluster, but not going to fit on my laptop :-)
This could also be a challenge for them to support if a lot of people start
downloading on a daily or weekly basis.  Anyway, the infrastructure hosting
issues are not a primary concern for us, but we may want to reach out to
them about looking into something like BitTorrent to distribute these files.

As far as working with them in a development environment, I guess we'll
need to figure out a subset to work with.  If I have time later I will try
to spin up a cluster to download them and see what the data looks like.
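If we go the subset route, a first pass could be as simple as streaming a
sample of records and tallying what's in them. A rough stdlib-only sketch,
assuming the snapshot is newline-delimited JSON (the field names below are
made up for illustration, not the actual Censys schema):

```python
import json
from collections import Counter

# Hypothetical sample: a few newline-delimited JSON records in the shape the
# IPv4 snapshot is assumed to use (field names here are illustrative only).
sample_lines = [
    '{"ip": "198.51.100.7", "autonomous_system": {"asn": 64500}, "ports": [80, 443]}',
    '{"ip": "203.0.113.9", "autonomous_system": {"asn": 64501}, "ports": [22]}',
    '{"ip": "192.0.2.4", "autonomous_system": {"asn": 64500}, "ports": [443]}',
]

def summarize(lines, limit=1000):
    """Peek at the first `limit` records: count top-level keys and open ports."""
    keys, ports = Counter(), Counter()
    for i, line in enumerate(lines):
        if i >= limit:
            break
        rec = json.loads(line)
        keys.update(rec.keys())
        ports.update(rec.get("ports", []))
    return keys, ports

keys, ports = summarize(sample_lines)
print(keys.most_common())
print(ports.most_common())
```

The same loop pointed at the first N lines of the real download would tell us
quickly which fields are worth keeping in a working subset.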

Michael

On Wed, Jun 28, 2017 at 1:34 PM, Smith, Nathanael P <
[email protected]> wrote:

> +1 on the download
>
> I think that this might fit nicely into some ideas for streaming ingest.
> For instance, data can be ingested via a Spark Streaming worker, normalized
> (questions of schemas aside), and sent to another streaming worker, perhaps
> a Structured Streaming worker.
> A Censys DataFrame can be loaded with the latest download and used to check
> against some conditions.
> Results can be published to a streaming-specific table, or downstream to
> another streaming job waiting only for suspicious events/bad actors.
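The shape of that pipeline (ingest -> normalize -> check against a Censys
reference -> publish suspicious results) can be sketched with plain
generators standing in for the streaming workers. All names and record
shapes here are invented for illustration:

```python
# Toy stand-in for the pipeline above: micro-batches flow through
# ingest -> normalize -> check-against-reference -> publish. In the real
# design each stage would be a Spark (Structured) Streaming job; here each
# stage is just a generator so the flow is easy to see.

def ingest(batches):
    for batch in batches:
        yield batch

def normalize(stream):
    # Schema questions aside: rename keys, keep only what we need.
    for batch in stream:
        yield [{"ip": rec["IP"], "port": rec["Port"]} for rec in batch]

def check(stream, censys_ref):
    # Flag records whose (ip, port) is absent from the reference snapshot.
    for batch in stream:
        yield [r for r in batch if (r["ip"], r["port"]) not in censys_ref]

def publish(stream):
    suspicious = []
    for batch in stream:
        suspicious.extend(batch)  # stand-in for a streaming table / Kafka
    return suspicious

censys_ref = {("198.51.100.7", 443), ("203.0.113.9", 22)}
batches = [
    [{"IP": "198.51.100.7", "Port": 443}, {"IP": "192.0.2.99", "Port": 8080}],
    [{"IP": "203.0.113.9", "Port": 22}],
]
flagged = publish(check(normalize(ingest(batches)), censys_ref))
print(flagged)
```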
>
> The streaming table could take some thought, or perhaps it's just a
> directory of Parquet data that can be loaded and queried; the latter could
> be troublesome with large implementations.
> Another option would be pushing that data to Solr/Grafana/etc.
>
> The UI should have a dashboard making calls to the GraphQL API.
> The question is how to structure a streaming time-series dashboard, or
> perhaps integrating third-party tools with the Spot UI is the best option.
>
> - Nathanael
>
>
>
> > On Jun 28, 2017, at 10:09 AM, Mark Grover <[email protected]> wrote:
> >
> > Thanks for starting this thread, Cesar.
> > I don't know enough about Censys to have a strong opinion on this.
> >
> > I just looked around from a licensing and workflow perspective. Their
> > Python client <https://github.com/censys/censys-python> seems to be ASLv2
> > licensed, so that's a good thing. I think download makes sense as well,
> > but do you have any thoughts on the workflow there? Currently the
> > download seems to require a user account, so does that mean every update
> > of the downloaded data from Censys would be manual? Do you know if that
> > can be automated, with only ASLv2 or compatibly licensed tooling?
> >
> > On Wed, Jun 28, 2017 at 9:49 AM, Michael Ridley <[email protected]>
> > wrote:
> >
> >> Agree with others that we should work from a download rather than
> >> hitting their API.  I haven't had a chance to download the data to look
> >> at it yet, but assuming it's a reasonable size, that seems like a better
> >> way to go.
> >> I'm not sure that we would need any REST API in the executors - why not
> >> just load it into HDFS and read it into a dataframe?  What kind of
> >> enrichment did we have in mind?  Having only looked at the web site and
> >> not the data download, it seems more of a reference data set that could
> >> provide additional information on external IP addresses in netflow
> >> data, but I don't know that the ODM tables would need to be enriched.
> >> Couldn't we just do a JOIN when we query to display in the UI?
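The JOIN-at-query-time idea can be sketched with stdlib sqlite3 standing in
for the SQL engine over HDFS; the table and column names below are invented
for illustration, not Spot's actual ODM schema:

```python
import sqlite3

# In-memory SQLite standing in for Hive/Impala over HDFS; schemas here are
# illustrative only, not the real Spot ODM or Censys layout.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flow (src_ip TEXT, dst_ip TEXT, dst_port INTEGER);
    CREATE TABLE censys_ref (ip TEXT, asn INTEGER, open_ports TEXT);
""")
con.executemany("INSERT INTO flow VALUES (?, ?, ?)", [
    ("10.0.0.5", "198.51.100.7", 443),
    ("10.0.0.6", "192.0.2.99", 8080),
])
con.executemany("INSERT INTO censys_ref VALUES (?, ?, ?)", [
    ("198.51.100.7", 64500, "80,443"),
])

# LEFT JOIN at query time: enrich flows for display in the UI without
# touching the stored ODM tables at all.
rows = con.execute("""
    SELECT f.src_ip, f.dst_ip, f.dst_port, c.asn, c.open_ports
    FROM flow f LEFT JOIN censys_ref c ON f.dst_ip = c.ip
""").fetchall()
for row in rows:
    print(row)
```

Flows with no matching reference row simply come back with NULL enrichment
columns, which is probably the right behavior for display.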
> >>
> >> I like the idea of Spot having a set of ODM schemas and a set of
> supported
> >> reference data schemas, of which perhaps this could be the first.
> >>
> >> Michael
> >>
> >> On Tue, Jun 27, 2017 at 9:31 PM, [email protected] <
> >> [email protected]>
> >> wrote:
> >>
> >>> If we query every time we receive data we will kill the API; however,
> >>> if we do it after the fact, once Spot has results, we are adding
> >>> context to the suspicious results. Can we explore what happens if we
> >>> store the "common" results and only query things that are out of
> >>> range? How much information we need to store is the other side of the
> >>> question.
> >>> Agree with Vartika: if we can enrich the data downstream we will add
> >>> value to the solution.
> >>> Regards
> >>>
> >>> 2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:
> >>>
> >>>> This looks interesting. I understand we can either directly query the
> >>>> database, or download point-in-time snapshots at a specified
> >>>> interval.
> >>>>
> >>>> Ideally the enrichment should be done in a streaming job based on the
> >>>> downloaded snapshot.
> >>>>
> >>>> (Not sure if, from within a Spot flow, we would want to query the
> >>>> REST API available on the public internet. Or we could query the
> >>>> downloaded snapshot using a REST API from the executors, but then it
> >>>> may require some additional tuning. That's theoretical at this
> >>>> point.)
> >>>>
> >>>> Complexity would be defined by the size of the snapshot downloaded as
> >>>> well as the external IP addresses flowing in the micro-batch.
> >>>>
> >>>> I have seen this kind of enrichment work successfully in the past at
> >>>> large scale: IP-address enrichment on a 50-node cluster with about a
> >>>> 4-second batch interval. The IP addresses were on the order of 400K
> >>>> and the enrichment data was on the order of 400K records. It involved
> >>>> a map-side lookup and join, then sending the enriched data further
> >>>> downstream to a Kafka topic.
> >>>>
> >>>> Thoughts?
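The map-side lookup-and-join pattern described above (a broadcast join, in
Spark terms) reduces in miniature to a shared dict consulted per record;
the field names here are illustrative only:

```python
# Miniature version of a map-side lookup join: the reference snapshot is
# small enough to ship to every worker (a broadcast variable in Spark), so
# each record is enriched with a local dict lookup -- no shuffle needed.
censys_by_ip = {
    "198.51.100.7": {"asn": 64500, "ports": [80, 443]},
    "203.0.113.9": {"asn": 64501, "ports": [22]},
}

def enrich(record, lookup):
    """Attach reference data (or None) to a single flow record."""
    return {**record, "censys": lookup.get(record["dst_ip"])}

micro_batch = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.7"},
    {"src_ip": "10.0.0.6", "dst_ip": "192.0.2.99"},
]
# In Spark this map would run per partition of the micro-batch before the
# enriched records are written downstream to a Kafka topic.
enriched = [enrich(r, censys_by_ip) for r in micro_batch]
print(enriched)
```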
> >>>>
> >>>> On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]>
> wrote:
> >>>>
> >>>>> Folks,
> >>>>>
> >>>>> I'd like to discuss the possibility of incorporating Censys for
> >>>>> profiling and context enrichment of external IPv4 addresses in Spot.
> >>>>> This community project leverages ZMap and ZGrab to scan the Internet
> >>>>> on a recurring basis and is building a detailed map of running
> >>>>> services, ASN state, and SSL changes.
> >>>>>
> >>>>> High level description of the project below:
> >>>>>
> >>>>> "Censys is a search engine that allows computer scientists to ask
> >>>>> questions about the devices and networks that compose the Internet.
> >>>>> Driven by Internet-wide scanning, Censys lets researchers find
> >>>>> specific hosts and create aggregate reports on how devices,
> >>>>> websites, and certificates are configured and deployed."
> >>>>>
> >>>>> They have a REST API for queries at volume, so it could be
> >>>>> incorporated through some type of extension/plugin manager that can
> >>>>> enable/disable it on demand.
> >>>>>
> >>>>> A deep dive on the research done can be found over here:
> >>>>>
> >>>>> https://www.censys.io/static/censys.pdf
> >>>>>
> >>>>> Let's get the discussion open and determine next steps from here.
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Cesar
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Vartika Singh
> >>>> Senior Solutions Architect
> >>>> Cloudera
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Michael Ridley <[email protected]>
> >> office: (650) 352-1337
> >> mobile: (571) 438-2420
> >> Senior Solutions Architect
> >> Cloudera, Inc.
> >>
>
>


-- 
Michael Ridley <[email protected]>
office: (650) 352-1337
mobile: (571) 438-2420
Senior Solutions Architect
Cloudera, Inc.
