+1 on the download

I think this might fit nicely into some ideas for streaming ingest.
For instance, data can be ingested via a Spark Streaming worker, normalized
(talk of schemas aside), and sent to another streaming worker, perhaps a
Structured Streaming worker. A Censys DataFrame can be loaded from the latest
download and used to check incoming records against some conditions.
Results can be published to a streaming-specific table, or sent downstream to
another streaming job that waits for suspicious events/bad actors only.
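
To make the lookup-and-filter step concrete, here is a rough plain-Python
sketch of what one micro-batch might go through. It is only an illustration
under assumptions: the record fields (src_ip, dst_ip, bytes), the snapshot
fields (asn, open_ports) and the "exposes telnet" condition are all made up
for the example and are not taken from the actual Censys download or the Spot
schemas. In Spark the snapshot would more likely be a DataFrame joined against
each batch rather than a dict.

```python
# Hypothetical Censys snapshot keyed by IP; field names are illustrative only.
censys_snapshot = {
    "203.0.113.7": {"asn": 64500, "open_ports": [22, 23, 80]},
    "198.51.100.4": {"asn": 64501, "open_ports": [443]},
}

# Assumed condition for the example: flag hosts that expose telnet.
SUSPICIOUS_PORTS = {23}


def enrich_batch(records, snapshot):
    """Attach snapshot context to every record of one micro-batch."""
    enriched = []
    for rec in records:
        # None when the destination IP is not present in the snapshot.
        context = snapshot.get(rec["dst_ip"])
        enriched.append({**rec, "censys": context})
    return enriched


def suspicious_only(enriched):
    """Keep only records whose Censys context meets the assumed condition."""
    return [
        rec for rec in enriched
        if rec["censys"] is not None
        and SUSPICIOUS_PORTS & set(rec["censys"]["open_ports"])
    ]


# One normalized micro-batch, as it might arrive from the first worker.
batch = [
    {"src_ip": "10.0.0.5", "dst_ip": "203.0.113.7", "bytes": 1200},
    {"src_ip": "10.0.0.9", "dst_ip": "198.51.100.4", "bytes": 300},
    {"src_ip": "10.0.0.9", "dst_ip": "192.0.2.1", "bytes": 50},
]

enriched = enrich_batch(batch, censys_snapshot)  # would go to the streaming table
flagged = suspicious_only(enriched)              # would go to the downstream job
```

Keeping the enrichment and the filter as separate steps mirrors the two
outputs described above: everything enriched can go to the table, while only
the flagged subset goes to the job watching for bad actors.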

The streaming table could take some thought, or perhaps it’s just a directory
of Parquet data that can be loaded and queried. The latter could be troublesome
with large implementations.
Another option would be to push that data to Solr, Grafana, etc.

The UI should have a dashboard making calls to the GraphQL API.
The question is how to structure a streaming time-series dashboard, or whether
using third-party tools integrated with the Spot UI is the best option.

- Nathanael



> On Jun 28, 2017, at 10:09 AM, Mark Grover <[email protected]> wrote:
> 
> Thanks for starting this thread, Cesar.
> I don't know enough Censys to have a strong opinion on this.
> 
> I just looked around from a licensing and workflow perspective. Their python
> client <https://github.com/censys/censys-python> seems to be ASL v2
> licensed so that's a good thing. I think download makes sense as well but
> do you have any thoughts on the workflow there? Currently the download
> seems to require a user account, so does that mean every update of the
> downloaded data from Censys would be manual? Do you know if that can be
> automated, with only ASLv2 or compatibly licensed tooling?
> 
> On Wed, Jun 28, 2017 at 9:49 AM, Michael Ridley <[email protected]>
> wrote:
> 
>> Agree with others that we should work from a download rather than hitting
>> their API.  I haven't had a chance to download the data to look at it yet,
>> but assuming it's a reasonable size that seems like a better way to go.
>> I'm not sure that we would need any REST API in the executors - why not
>> just load it into HDFS and read it into a dataframe?  What kind of
>> enrichment did we have in mind?  Just having looked at the web site without
>> looking at the data download, it seems more of a reference data set that
>> could provide additional information on external IP addresses in netflow
>> data, but I don't know that the ODM tables would need to be enriched.
>> Couldn't we just do a JOIN when we query to display in the UI?
>> 
>> I like the idea of Spot having a set of ODM schemas and a set of supported
>> reference data schemas, of which perhaps this could be the first.
>> 
>> Michael
>> 
>> On Tue, Jun 27, 2017 at 9:31 PM, [email protected] <
>> [email protected]>
>> wrote:
>> 
>>> If we query every time that we receive data we will kill the API. However,
>>> if we do it after Spot has results, we are adding context to the
>>> suspicious results. Can we explore what happens if we store the "common"
>>> results and just query things that are out of the range? How much
>>> information we need to store is the other side of the question.
>>> Agree with Vartika: if we can enrich the data downstream we will add value
>>> to the solution.
>>> Regards
>>> 
>>> 2017-06-27 19:51 GMT-05:00 Vartika Singh <[email protected]>:
>>> 
>>>> This looks interesting. I understand we can either directly query the
>>>> database, or download point-in-time snapshots at a specified, frequent
>>>> interval.
>>>> 
>>>> Ideally the enrichment should be done in a Streaming job based on the
>>>> snapshot downloaded.
>>>> 
>>>> (Not sure if from within a Spot flow we would want to query the REST API
>>>> available on the public internet. Or we can query the downloaded snapshot
>>>> using REST API from the executors but then it may require some additional
>>>> tuning. That's theoretical at this point.)
>>>> 
>>>> Complexity would be defined by the size of the snapshot data downloaded
>>>> as well as the external IP addresses flowing in the micro-batch.
>>>> 
>>>> I have seen such enrichment done successfully in the past, with large-scale
>>>> enrichment data as well as IP addresses, on a 50-node cluster with a batch
>>>> interval of about 4 seconds. The IP addresses were of the order of 400K
>>>> and the enrichment data was of the order of 400K. It involved using a
>>>> map-side lookup and join and then sending the enriched data further
>>>> downstream to a Kafka topic.
>>>> 
>>>> Thoughts?
>>>> 
>>>> On Tue, Jun 27, 2017 at 7:26 PM, Cesar Berho <[email protected]> wrote:
>>>> 
>>>>> Folks,
>>>>> 
>>>>> I'd like to discuss the possibility of incorporating Censys for profiling
>>>>> and context enrichment of external IPv4 addresses on Spot. This community
>>>>> approach leverages ZMAP and ZGRAB to scan the Internet on a
>>>>> recurrent basis and is building a complex map of running services, ASN
>>>>> state, and SSL changes.
>>>>> 
>>>>> High level description of the project below:
>>>>> 
>>>>> "Censys is a search engine that allows computer scientists to ask questions
>>>>> about the devices and networks that compose the Internet. Driven by
>>>>> Internet-wide scanning, Censys lets researchers find specific hosts and
>>>>> create aggregate reports on how devices, websites, and certificates are
>>>>> configured and deployed."
>>>>> 
>>>>> They have a REST API to do queries at volume, so then it can be
>>>>> incorporated through some type of extension/plugin manager that can
>>>>> enable/disable it on demand.
>>>>> 
>>>>> A deep dive on the research done can be found over here:
>>>>> 
>>>>> https://www.censys.io/static/censys.pdf
>>>>> 
>>>>> Let's get the discussion open and determine next steps from here.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Cesar
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Vartika Singh
>>>> Senior Solutions Architect
>>>> Cloudera
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Michael Ridley <[email protected]>
>> office: (650) 352-1337
>> mobile: (571) 438-2420
>> Senior Solutions Architect
>> Cloudera, Inc.
>> 
