Eberhard, can you, and possibly others, explain the approach you use for matching and classifying user-agent strings? This will probably help guide our approach.
So when I wrote dClass, a C pattern-matching implementation, it actually worked with both WURFL and OpenDDR. With WURFL, the user agents have to go through a parser, which used several techniques to break the user agents down into basic sets of identifiable tokens. This had problems, since a lot of devices had random user agents containing tokens which would throw the algorithm off. This required a lot of human tuning. OpenDDR already has the tokens parsed, and it's done very cleanly and accurately. This removes a large chunk of complexity from the process.

dClass indexes tokens into a dtree (decision tree) and then walks the input string while walking the dtree, looking for matching tokens. Tokens can have a variety of different attributes which tell the algorithm how to treat the match. Given the structure of the dtree, performance will always be O(m), where m is the length of the token being matched. Performance is not dependent on n, the number of patterns in the dtree. So the dtree has a performance profile unlike most trees.

Also, the dtree achieves 2 types of natural data compression. First, all common prefixes are reused. Second, I implemented system pointer compression on top of my memory allocation algorithm. These factors give it runtime performance in the sub-1us range and good memory efficiency.

Finally, I attached a set of key-value pairs to each matchable token. This gives the system the characteristics of a document-oriented database (albeit a very advanced one).

I did a write-up which talks more about the justification here:
http://www.rezsoft.org/device_detection/
And here:
http://mail-archives.apache.org/mod_mbox/incubator-devicemap-dev/201208.mbox/%3C3B961B5DBE03B04EAB618084BD661E4F51AEF42F%40PRTMB02.corp.weather.com%3E

dClass is pretty much in a steady state right now. When combined with OpenDDR, it's highly accurate and extremely fast.
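To make the token-tree walk concrete, here is a minimal sketch in Python. It is not dClass code (dClass is written in C) and it leaves out the pointer compression and the per-token matching attributes. The names TokenTree, add, longest_at and classify are hypothetical, and the sample tokens are hand-written for illustration, not taken from OpenDDR. It only shows the general idea: tokens share common prefixes in one tree, each complete token carries key-value pairs, and a lookup follows characters of the input rather than scanning the pattern list.

```python
class Node:
    """One tree node; attrs is set only where a complete token ends."""
    __slots__ = ("children", "attrs")

    def __init__(self):
        self.children = {}   # next character -> Node
        self.attrs = None    # key-value pairs attached to a matchable token


class TokenTree:
    """Hypothetical prefix tree of tokens (an illustration, not the dClass dtree)."""

    def __init__(self):
        self.root = Node()

    def add(self, token, attrs):
        # Tokens with a common prefix reuse the same nodes.
        node = self.root
        for ch in token.lower():
            node = node.children.setdefault(ch, Node())
        node.attrs = dict(attrs)

    def longest_at(self, text, start):
        # Follow the tree from text[start]; remember the longest complete token.
        node = self.root
        best = None
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.attrs is not None:
                best = (text[start:i + 1], node.attrs)
        return best

    def classify(self, user_agent):
        # Walk the input string, trying a tree walk at each position.
        text = user_agent.lower()
        matches = []
        i = 0
        while i < len(text):
            hit = self.longest_at(text, i)
            if hit:
                matches.append(hit)
                i += len(hit[0])   # skip past the matched token
            else:
                i += 1
        return matches


# Sample tokens for illustration only.
tree = TokenTree()
tree.add("android", {"os": "Android"})
tree.add("gt-i9100", {"vendor": "Samsung", "model": "GT-I9100"})
tree.add("gt-i9300", {"vendor": "Samsung", "model": "GT-I9300"})

ua = "Mozilla/5.0 (Linux; U; Android 2.3.3; en-gb; GT-I9100 Build/GINGERBREAD)"
matches = tree.classify(ua)
for token, attrs in matches:
    print(token, attrs)
```

Each call to longest_at does at most as many steps as the longest token has characters, no matter how many tokens have been added, which is the property described above. The two GT-I9x00 tokens share the nodes for their common prefix "gt-i9".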
I would like to see it become a part of DeviceMap and maybe incorporate features from other classification algorithms. I would also like to extend dClass into a more powerful decision classifier (dClass+) and, with some new features, use it to power and tackle larger classification problems.

Thanks,
Reza Naghibi
[email protected]

--- Sent from Blackberry Bold 9900

----- Original Message -----
From: eberhard speer jr. [mailto:[email protected]]
Sent: Thursday, December 20, 2012 05:13 AM
To: [email protected] <[email protected]>
Subject: User-Agent strings

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

a while ago I saw a request for user-agent strings. There was some debate about IP addresses and privacy, but I don't know where things went from there. I understand the idea is to see how 'complete' the OpenDDR resource data is, i.e. where the gaps are.

I can contribute a list of 65,757 user-agents with the following data/columns:

UserAgent : device user-agent string
Device    : OpenDDR device Id, "unknown" for unresolved
Elapsed   : time taken in ms to resolve UserAgent

I just ran the complete dataset through my test setup using the OpenDDR 1.13 resources and OpenDDR resolver code. This data can shed some light on gaps as well as strengths and weaknesses.

I also have a subset of this dataset, 12,272 user-agent strings, with:

Manufacture OEM Model name UserAgent screen-width screen-height

So, basically this is a list of 12,272 *unique* device models with a sample user-agent string!

If you like, I can make the data available for download on one of my servers. Also, I will gladly run any set of user-agent strings made available to me through the test setup.
Regards,
esjr

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (MingW32)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQEcBAEBAgAGBQJQ0uTFAAoJEOxywXcFLKYcCPcH/Al1xVyqIa2y1B4siOmSEIMh
6pfndVUizAKCWkVjd4j5Vn3qLLLzubxi0Js+f/IuFaOWtjS5eLK1mkXr0/nUg3b6
Qk3qbzTsTV2Gx7ZeubhuCjhnKD8orI0rmuPIpyrTccBGdsMl35BFGQWxYAuOjamI
I1tM538+H5PSFiUvfBmKxohGIG0j/GBUhIjIVhezJlC0e9ceowoM5S8GqdKlxdal
8QITiCn8F8PPXt2BzdKygZwNE6dRYdZF8vm89w50ECsdtYcuRWlo90FudAfPpZbS
0VM/M8uTOsen3LYEJGSXeUUa66cawNKx+kVwIzZ/Q2FlN/kkPnYsylxq1HV+NG4=
=9yO0
-----END PGP SIGNATURE-----
