So I did some more tests and had to widen the set of pattern separator chars from just space to: " -_/\" (space, hyphen, underscore, slash, backslash). Splitting on space alone was a bit too naive. I also increased the word group threshold from 3 to 4. Eberhard, can you make this change before doing your tests?
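
For anyone porting this, here is a rough Python sketch of the tokenization and longest-pattern rule described in the quoted message below. It is not the DeviceMapClient code: the function names, the dict-based pattern index, and the device ids are made up for illustration, and it assumes "word group threshold" means the maximum number of consecutive words joined into one candidate.

```python
import re


def normalize(text):
    """Lowercase the text and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def candidate_tokens(user_agent, separators=" ", max_words=3):
    """Return every run of 1..max_words consecutive normalized words."""
    # Turn each separator character into a space, then split on whitespace.
    table = str.maketrans({ch: " " for ch in separators})
    words = [normalize(w) for w in user_agent.translate(table).split()]
    words = [w for w in words if w]  # drop words that normalized to nothing

    tokens = []
    for start in range(len(words)):
        for size in range(1, max_words + 1):
            if start + size > len(words):
                break
            tokens.append("".join(words[start:start + size]))
    return tokens


def longest_match(tokens, pattern_index):
    """Return the value for the longest token found in pattern_index.

    pattern_index is a hypothetical dict: normalized pattern -> device id.
    Returns None when no token matches.
    """
    hits = [t for t in tokens if t in pattern_index]
    if not hits:
        return None
    return pattern_index[max(hits, key=len)]


if __name__ == "__main__":
    # Original settings: space separator, groups of up to 3 words.
    print(candidate_tokens("This (1234.5 Agent) Test"))

    # Widened separator set and groups of up to 4 words.
    print(candidate_tokens("HTC-Aria/1.0 Test_UA", separators=" -_/\\", max_words=4))

    # Made-up index entries, only to show the longest pattern winning.
    index = {"droid": "generic-droid", "droidbionic4g": "droid-bionic-4g"}
    print(longest_match(candidate_tokens("Motorola DROID BIONIC 4G"), index))
```

With the original settings, the first call reproduces the token list from the example in the quoted message. Widening the separators mainly matters for user agents that glue model names together with "-", "_" or "/".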

http://svn.apache.org/viewvc/incubator/devicemap/trunk/devicemapjava/src/main/java/org/apache/devicemap/client/DeviceMapClient.java?view=markup
Line 72, 76

________________________________
From: Reza <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Sunday, June 23, 2013 12:05 PM
Subject: Re: device map java client - .Net version

Nice, good to hear. I tried to keep it as simple as possible, nothing too exotic :]

Another thing I want to do is make sure the algorithm is accurate. For example, I just checked in a fix to always choose the longest pattern:

http://svn.apache.org/viewvc/incubator/devicemap/trunk/devicemapjava/src/main/java/org/apache/devicemap/client/DeviceMapClient.java?view=markup
Line 105

So let me know if you see any bad classifications. If this algorithm is suitable, it should be simple to port it over to other languages.

Just to explain how it works: all patterns are stripped of regex and normalized into pure alphanumeric.

Example: DROID.?BIONIC.?4G => droidbionic4g

The same treatment is given to the input string. All possible single, double, and triple word combinations are passed through the pattern index. Right now spaces are used as the default token separator.

Example: This (1234.5 Agent) Test =>
this this12345 this12345agent 12345 12345agent 12345agenttest agent agenttest test

Then some simple rules are used to choose the best match and filter out false positives (incomplete TwoStepDeviceBuilder patterns).

I think this approach will be pretty accurate. If not, adjustments can be made. So let me know. If we want to use this algorithm across our clients, then we should adjust our pattern data to better suit it so we don't get conflicts down the road. So keep me posted on any sort of accuracy results.

________________________________
From: eberhard speer jr. <[email protected]>
To: Apache Device Map DEV <[email protected]>
Sent: Sunday, June 23, 2013 11:12 AM
Subject: device map java client - .Net version

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi,

managed to get a .Net version of Reza's java client up & running, simple enough... Nice !

Using the same user-agent strings for 'testing' I obtain a similar average of around 1 ms

slowest : HTC Aria : 6.5651 ms
fastest : iPhone : 0.1801 ms

Nice indeed !

Next : a much larger test set to see if the device IDs match the ones returned by the 'old' version...

I had thought about a 'parser' along the lines of Reza's current java client but shelved it, thinking it might return the wrong device ID in some cases. Well, I guess now we're going to find out...

I'll use this test data :
https://svn.apache.org/repos/asf/incubator/devicemap/trunk/openddr/test-data/src/main/resources/test-data/dmap_20130522.txt

esjr
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (MingW32)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRxxBCAAoJEOxywXcFLKYcARQH/0uXnKYgrKsWocMMNBqv68Jk
TlnDd4RmfGBtTa5hOzp8DOl8aXrB3M6EJSgtAewWqCxzksYMWVE9SUHlCyjRqe0A
RC6NONC+XLXEyfC0UP46Yd88FLnkBH+Xy78BerLWFB44QSwpU06M7FX7K0lJW/Zr
GnIeSy8tRoQNKsd8vLZd43usb3yT2ICyjzYok0/6tTOC747gArBxashJIY1TbbZO
Cgj9EZlQoC9FxO+QE7c/QreeVj5sOTmY512g8IDA945Yl2At3S5YScBtpCGYpd0L
i1vCzs2K5mOMgSxvzR3uw3J0Jc5pzFCVhzfhSttytE0tJ7/fQNT/Fgvn4sSmSSg=
=NDx0
-----END PGP SIGNATURE-----
