Dear PowerDNS Developers,

Every now and then one of our internal customers calls and says, "This or that record doesn't resolve, whereas it works when using Google, OpenDNS or dig +trace."
And they are right :-( For example:

dig -x  194.95.67.2

pdns_recursor 3.3 sometimes returns only the CNAME (and a SERVFAIL) and sometimes both the CNAME and the queried PTR record.

I have no idea why 8.8.8.8 always returns the PTR, while sometimes even dig +trace fails.

To understand these problems on a live system, I would like to have some sort of tracing facility in pdns_recursor that can be turned on and off without restarting the service.

Ideally, pdns_recursor would provide some sort of CLI that can be used to create output channels and to create, list and delete filters:

pcli create output logfile1 '/var/tmp/logfile_servfail'

There should be two types of filters: simple filters matching only single log entries ("entry-filter") and filters that output the complete transaction if any of the log entries matches ("transaction-filter").

One should be able to create filters on every field:

pcli create entry-filter f1 as query='%67.95.194.in-addr.arpa' to logfile1, stdout
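I have not specified the matching rules yet. One possible reading of the '%' in the filter above is an SQL LIKE-style wildcard. A minimal sketch in Python under that assumption (the function name like_match is made up for illustration and is not part of any PowerDNS tool):

```python
import re


def like_match(pattern: str, value: str) -> bool:
    """Match value against a LIKE-style pattern in which '%' matches any run of characters."""
    # Escape the literal parts and join them with '.*' wherever a '%' stood.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    # DNS names are compared case-insensitively.
    return re.fullmatch(regex, value, flags=re.IGNORECASE) is not None


print(like_match("%67.95.194.in-addr.arpa", "2.67.95.194.in-addr.arpa"))  # True
print(like_match("%67.95.194.in-addr.arpa", "google.com"))                # False
```

Note that with this reading '%67.95.194.in-addr.arpa' would also match names under 167.95.194.in-addr.arpa, so a real implementation might prefer matching on label boundaries.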

Log entries contain the following information:

* traId: transaction ID; uniquely identifies a transaction within a thread
* thrId: thread Id
* Proto: TCP or UDP (shortened to P in example)
* Version: 4 or 6 (shortened to V in example)
* srcIP: no need to explain
* dstIP: no need to explain
* QueryDirection (QD):
  - cQ client query: query received by server from a client
  - sQ server query: query sent by the server to authoritative DNS Servers
  - PC lookup packet cache
  - QC lookup query cache
  - sA server answer: answer received by server
  - cA client answer: answer sent to client
* Ty: type of resource asked (A, PTR, RP, ...)
* Query: the question values as string 'google.com', '0.63.67.95.194.in-addr.arpa'
* Status: NXDOMAIN, NOERROR, SERVFAIL...
* flags: qr, aa, rd, ra, ...
* Time: null for cQ and sQ; for sA, how long the individual query took; for cA, how long it took from receiving the cQ until the cA was constructed (including wait time in queues)
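To make the field list concrete, here is a rough sketch of such a log entry as a Python data structure. The class and attribute names are only my assumptions, nothing like this exists in pdns_recursor today. The sample values are taken from the cA row of the example table in this mail, with the shortened query string left as it is shown there.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LogEntry:
    tra_id: int              # traId: unique only within one thread
    thr_id: int              # thrId
    version: int             # 4 or 6
    src_ip: str
    dst_ip: str
    proto: str               # "T" or "U" in the shortened form
    qd: str                  # cQ, sQ, PC, QC, sA or cA
    qtype: str               # A, PTR, RP, ...
    query: str
    status: Optional[str]    # empty for entries that carry no status
    flags: Tuple[str, ...]
    time: Optional[int]      # null for cQ and sQ; the unit is left open here

    def transaction_key(self) -> Tuple[int, int]:
        # traId alone is ambiguous across threads, so pair it with thrId.
        return (self.thr_id, self.tra_id)


client_answer = LogEntry(
    tra_id=123, thr_id=123, version=4,
    src_ip="1.0.0.2", dst_ip="1.2.2.1", proto="U",
    qd="cA", qtype="PTR", query="2.67.95.194.in-...",
    status="NO", flags=("qr", "rd", "ra"), time=28,
)
print(client_answer.transaction_key())  # (123, 123)
```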

traId thrId V srcIP   dstIP   P QD Ty  Query              St Fl     time
123   123   4 1.2.2.1 1.0.0.2 U cQ PTR 2.67.95.194.in-...    qr rd
123   123   4 4.0.0.3 2...:35 U sQ PTR 2.67.95.194.in-... NO qr rd
123   123   4 2...:35 4.0.0.3 U sA NS  a.in-addr....arpa. NO qr rd    27
123   123   4 1.0.0.2 1.2.2.1 U cA PTR 2.67.95.194.in-... NO qr rd ra 28

The example does not show the lookups in the packet cache and query cache, nor the wait time in the receive queue. Ideally, the time spent there would be shown as well.

This should be implemented in a fashion where I could run
- entry-filter QD="cA" and status="SERVFAIL"
- entry-filter QD="sA" and time > 500

to send these log entries to a monitoring system where they can be aggregated and alarms can be generated.
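Expressed as code, the two entry-filters above are simple predicates on a single log entry. A small Python sketch, assuming entries are plain dicts keyed by the field names used in the filters; the function names and the sample entries are made up for illustration:

```python
def servfail_to_client(entry):
    """entry-filter QD="cA" and status="SERVFAIL"."""
    return entry["QD"] == "cA" and entry["status"] == "SERVFAIL"


def slow_server_answer(entry, threshold=500):
    """entry-filter QD="sA" and time > 500."""
    t = entry["time"]
    # time is null for cQ and sQ, so guard against None before comparing.
    return entry["QD"] == "sA" and t is not None and t > threshold


# Made-up sample entries, reduced to the fields the filters look at.
entries = [
    {"QD": "cQ", "status": None, "time": None},
    {"QD": "sA", "status": "NOERROR", "time": 730},
    {"QD": "sA", "status": "NOERROR", "time": 27},
    {"QD": "cA", "status": "SERVFAIL", "time": 735},
]

# Entries that would be forwarded to the monitoring system.
alerts = [e for e in entries if servfail_to_client(e) or slow_server_answer(e)]
for e in alerts:
    print(e)
```

With the sample data only the slow sA entry and the SERVFAIL cA entry are selected.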

The transaction-filter will be mainly used to debug why things are happening.
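A transaction-filter needs to buffer the entries of a transaction until it is complete and then output all of them if any single entry matched. A rough Python sketch of that idea, assuming a transaction ends with its cA entry and that entries are dicts with thrId, traId, QD and status keys; all names and the sample data are made up for illustration:

```python
from collections import defaultdict


def transaction_filter(entries, predicate):
    """Yield complete transactions in which at least one entry matches predicate."""
    pending = defaultdict(list)
    for entry in entries:
        # traId is unique only within a thread, so key on both IDs.
        key = (entry["thrId"], entry["traId"])
        pending[key].append(entry)
        # Assumption: the client answer (cA) closes the transaction.
        if entry["QD"] == "cA":
            transaction = pending.pop(key)
            if any(predicate(e) for e in transaction):
                yield transaction


# Made-up sample: three interleaved transactions, one ending in SERVFAIL.
entries = [
    {"thrId": 1, "traId": 123, "QD": "cQ", "status": None},
    {"thrId": 1, "traId": 124, "QD": "cQ", "status": None},
    {"thrId": 1, "traId": 123, "QD": "cA", "status": "SERVFAIL"},
    {"thrId": 1, "traId": 124, "QD": "cA", "status": "NOERROR"},
    {"thrId": 2, "traId": 123, "QD": "cQ", "status": None},
    {"thrId": 2, "traId": 123, "QD": "cA", "status": "NOERROR"},
]

matched = list(transaction_filter(entries, lambda e: e["status"] == "SERVFAIL"))
for transaction in matched:
    for e in transaction:
        print(e)
```

The buffering is the costly part: every open transaction has to be kept in memory until it ends, which is one reason to make such filters switchable at runtime.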

Does anyone else sometimes need to dig deeply into how the recursor is working and which servers in the outside world are failing?

Is this idea worth opening a wishlist ticket?

Regards Thomas

_______________________________________________
Pdns-users mailing list
[email protected]
http://mailman.powerdns.com/mailman/listinfo/pdns-users
