Bumping this thread... Still looking for any material anybody might have on the grammar-construction portion of the DNP3 work: references they used, etc. I've also appended a couple of rough sketches below, after the quoted thread, to make the ideas more concrete.
Derick

On Wed, Feb 24, 2016 at 9:56 AM, Derick Winkworth <ccie15...@gmail.com> wrote:
> All:
>
> I actually didn't receive Sven's reply, and it's not in my spam folder.
> :-( So I'll elaborate a little more.
>
> I've been working in the ML field for about 9 months (so still fresh),
> specifically doing feature extraction/engineering/data representation.
> In particular, we are trying to apply data science/ML tools and techniques
> to communications network infrastructure telemetry.
>
> We are pursuing multiple angles in this regard, but one of the
> possibilities was leveraging existing NLP work by building a sort of
> grammar for the raw telemetry coming in. I'll give an example.
>
> In an IP network (like the internet), nodes are able to export flow
> telemetry. Usually this is sub-sampled, as generating it is CPU- and
> bandwidth-intensive. However, on firewalls and IPSes, where individual
> flows are tracked, you get a much more complete picture of what is
> going on. A "flow" record usually contains a source address, destination
> address, source port, destination port, and some other fields.
>
> Suppose we have three hosts, "A", "B", and "C". Host A wants to talk to
> Host C, but must resolve a DNS name to the address of C by sending a DNS
> request to Host B, which is a DNS server. When Host B responds, Host A
> then starts communicating with Host C. The records would look like this:
>
> t0 - src: A, srcPort: 11435, dst: B, dstPort: 53
> t1 - src: B, srcPort: 53, dst: A, dstPort: 11435
> t2 - src: A, srcPort: 22987, dst: C, dstPort: 443
>
> Unfortunately, this telemetry data does not tell us what URL Host A was
> resolving, and in the real world Host C is hosted somewhere out on the
> internet, so it wouldn't be possible to reverse-resolve the IP to the URL.
>
> However, Host B keeps DNS logs. Assuming that all devices are running NTP
> and have their system times synced, we might be able to correlate a log
> message from Host B with these flow records. We might also gather logs
> from a third source, a web proxy, that tell us the exact URI that Host A
> sent to Host C. This is where a constructed grammar would be useful...
>
> noun    verb   noun                prep  noun  prep  noun
> ---------------------------------------------------------------------------
> "A"  -  sent - DNS request      -  to -  "B" - with - "drink.more.beer.com"
> "B"  -  sent - DNS response     -  to -  "A" - with - "C"
> "A"  -  sent - HTTP/SSL request -  to -  "C" - with - "http://drink.more.beer.com/sendFreeBeerNow.html"
>
> If you're looking at this and thinking "that's terrible," then you now
> understand why I sent the original email to the group. The thing is,
> getting an understanding of what has actually transpired in IT
> infrastructure (hardware and software) requires correlating logs,
> information from protocols like SNMP, information received from message
> buses like RabbitMQ, information retrieved from databases and APIs, etc.
> And it's all structured differently, and even atoms of data are
> represented differently from one source to the next. If we take this raw
> information in the preprocessing phase and construct a dialogue such as
> the above using some artificial grammar, we might be able to leverage
> existing NLP tools like word2vec and other algorithms to identify
> specific events. Or we might be able to train them on what is "normal,"
> so that they can identify when some piece of "dialogue" built with this
> grammar does not make sense or is otherwise irregular.
> Derick
>
> On Sun, Feb 21, 2016 at 3:58 PM, Jeffrey Goldberg <jeff...@goldmark.org> wrote:
>
>> > On 2016-02-17, at 10:50 AM, Sven M. Hallberg <pe...@khjk.org> wrote:
>> >
>> > Derick Winkworth <ccie15...@gmail.com> writes:
>> >> In this case, they went through the process of defining a grammar for an
>> >> existing protocol. This process might actually have another application in
>> >> the realm of machine learning and language processing.
>> >
>> > Interesting, could you elaborate? I would believe natural language
>> > processing includes a lot of grammar construction; Meredith should be
>> > able to tell.
>>
>> For some approaches to natural language processing (and much else in
>> Linguistics), trying to work out an explicit grammar from just having
>> instances of what is (and, if you are lucky, what isn't) in a language is
>> the fun part.
>>
>> Now, when linguists do this, they typically have the ability to check
>> whether something is or isn't in the language. For example, I can check
>> whether (1) is English by consulting my intuitions.
>>
>> (1) *language the in isn't or is something whether check to
>>
>> But when you don't have the opportunity to interrogate a parser that
>> knows the language, you are stuck with working from (mostly) positive
>> examples of what is in the language. Note that, to a substantial extent,
>> children learning their native language are confronted with the same
>> problem.
>>
>> So when presented with a bunch of grammatical sentences made up from a
>> set, w, of words in a language, the simplest grammar would be
>>
>> w*
>>
>> This, of course, is not what people should do. We have expectations of
>> what a natural language grammar should look like and of the kinds of
>> things that the target language actually does.
>>
>> Anyway, there is a whole bunch of research on what sorts of assumptions
>> about the nature of the grammar need to be in place to be able to "learn"
>> the grammar from positive instances of it.
>>
>> One thing to keep in mind is that natural language grammars allow for
>> ambiguity.
>>
>> (2) She saw the boy with the binoculars.
>>
>> There are clearly two distinct parse trees available for this:
>>
>> She [saw [the boy] [with [the binoculars]]]
>>
>> She [saw [the boy with the binoculars]]
>>
>> And see how many you can get for
>>
>> (3) We saw her duck
>>
>> Anyway, trying to figure out what a grammar is from instances of the
>> language and various expectations about what the language is supposed to
>> do is loads of fun. But the languages that Linguists look at are
>> enormously more complex than the tiny languages of these protocols, so I
>> don't really think that many of the specific techniques are useful.
>>
>> Now, as no two Linguists ever agree on anything, I await Meredith's
>> explanation of what I am wrong about.
>>
>> Cheers,
>>
>> -j
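To make the preprocessing idea from my Feb 24 mail a bit more concrete, here is a rough sketch in Python. The field names, the port-to-service table, and the one-second correlation window are all invented for illustration; it only shows the shape of turning flow records and DNS log entries into fixed-form token "sentences" that something like word2vec (gensim here) could consume:

# Rough sketch only: field names, the port->service table, and the
# correlation rule are hypothetical, not from any real collector.
from gensim.models import Word2Vec  # assumes gensim 4.x

flow_records = [
    {"t": 0, "src": "A", "srcPort": 11435, "dst": "B", "dstPort": 53},
    {"t": 1, "src": "B", "srcPort": 53, "dst": "A", "dstPort": 11435},
    {"t": 2, "src": "A", "srcPort": 22987, "dst": "C", "dstPort": 443},
]

# Hypothetical DNS log from Host B, already parsed into dicts.
dns_log = [
    {"t": 0, "client": "A", "query": "drink.more.beer.com", "answer": "C"},
]

SERVICE_BY_PORT = {53: "DNS", 443: "HTTP/SSL"}

def flow_to_sentence(rec):
    """Render one flow record as a fixed-shape noun/verb token list."""
    to_service = rec["dstPort"] in SERVICE_BY_PORT
    service = SERVICE_BY_PORT.get(
        rec["dstPort"], SERVICE_BY_PORT.get(rec["srcPort"], "unknown"))
    kind = "request" if to_service else "response"
    sentence = [rec["src"], "sent", service, kind, "to", rec["dst"]]
    # Correlate with the DNS log by loose timestamp and participants
    # (this is where synced NTP clocks matter) to fill in the object.
    for entry in dns_log:
        if abs(entry["t"] - rec["t"]) <= 1 and entry["client"] in (rec["src"], rec["dst"]):
            sentence += ["with", entry["query"] if kind == "request" else entry["answer"]]
    return sentence

sentences = [flow_to_sentence(r) for r in flow_records]
for s in sentences:
    print(" ".join(s))
# A sent DNS request to B with drink.more.beer.com
# B sent DNS response to A with C
# A sent HTTP/SSL request to C

# With a large corpus of such sentences, train embeddings and go looking
# for tokens or sequences that land far from "normal":
model = Word2Vec(sentences, vector_size=32, window=3, min_count=1)

Obviously a real pipeline would need a smarter correlation rule than a one-second window, plus the third source (the proxy logs) to fill in the URI on the HTTP/SSL line, but the shape of the output is the point.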
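And on Jeff's ambiguity point, for anyone who wants to watch sentence (2) come apart mechanically: here is a toy CFG run through NLTK's chart parser. The grammar is invented for this one sentence (assuming NLTK is installed); it's a sketch, not a fragment of English anyone should defend.

import nltk

# Toy grammar covering only sentence (2).
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Pro | Det N | NP PP
    VP -> V NP | VP PP
    PP -> P NP
    Pro -> 'she'
    Det -> 'the'
    N  -> 'boy' | 'binoculars'
    V  -> 'saw'
    P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("she saw the boy with the binoculars".split()):
    print(tree)
# Prints exactly two trees: one with the PP attached to the VP (she used
# the binoculars to see) and one with it attached to the NP (the boy had
# the binoculars), matching the two bracketings in Jeff's mail.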