[jira] [Comment Edited] (DMAP-107) Performance optimizations for DeviceMapClient.classify()

Reza Naghibi (JIRA) Mon, 13 Jul 2015 09:11:36 -0700

    [ 
https://issues.apache.org/jira/browse/DMAP-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624864#comment-14624864
 ]


Reza Naghibi edited comment on DMAP-107 at 7/13/15 4:10 PM:
------------------------------------------------------------

Regarding removing the 2 test strings, looks like "HTC One X" and "HTC One X+" 
match the same since 1.0 normalizes out the regex. Otherwise, im guessing the + 
is an error because its likely meant to be a literal {{\+}} and not a regex. 
Since the patterns are identical, the device choosen is the first one reached 
during iteration. The patch changes this because the data structure changes 
from a HashMap to a HashSet, so iteration order is different.

I could add logic to the ranking function to fix this, but at this point there 
is no use. Getting matching to properly work on the ODDR data will never be 
perfect because the data has many errors like the one above. So as always, the 
solution here is the fix the bad pattern.

Also, some problems with your split() method. It shouldn't be static and you 
can remove the reference to Apache commons by using String.isEmpty(). Not sure 
we need the null check, but null is allowed in normalize(), so its best to err 
on the side of safety. Below is the corrected version:

{code:java}
private List<String> split(String text) {
        List<String> nonemptyParts = new ArrayList<String>();

        String[] parts = TEXT_SPLIT_PATTERN.split(text);

        for (String part : parts) {
            String normalizedPart = Pattern.normalize(part);

            if (normalizedPart != null && !normalizedPart.isEmpty()) {
                nonemptyParts.add(normalizedPart);
            }
        }
        
        return nonemptyParts;
    }
{code}

Also, the style of the 1.0 Java client is to be explicit with imports and not 
use the wildcard. Just a small style nitpick. So if you can correct the above 
split() function (and fix the imports), your patch should be good to go with 
the HTC One X tests removed.


was (Author: rezan):
Regarding removing the 2 test strings, looks like "HTC One X" and "HTC One X+" 
match the same since 1.0 normalizes out the regex. Otherwise, im guessing the + 
is an error because its likely meant to be a literal \\+ and not a regex. Since 
the patterns are identical, the device choosen is the first one reached during 
iteration. The patch changes this because the data structure changes from a 
HashMap to a HashSet, so iteration order is different.

I could add logic to the ranking function to fix this, but at this point there 
is no use. Getting matching to properly work on the ODDR data will never be 
perfect because the data has many errors like the one above. So as always, the 
solution here is the fix the bad pattern.

Also, some problems with your split() method. It shouldn't be static and you 
can remove the reference to Apache commons by using String.isEmpty(). Not sure 
we need the null check, but null is allowed in normalize(), so its best to err 
on the side of safety. Below is the corrected version:

{code:java}
private List<String> split(String text) {
        List<String> nonemptyParts = new ArrayList<String>();

        String[] parts = TEXT_SPLIT_PATTERN.split(text);

        for (String part : parts) {
            String normalizedPart = Pattern.normalize(part);

            if (normalizedPart != null && !normalizedPart.isEmpty()) {
                nonemptyParts.add(normalizedPart);
            }
        }
        
        return nonemptyParts;
    }
{code}

Also, the style of the 1.0 Java client is to be explicit with imports and not 
use the wildcard. Just a small style nitpick. So if you can correct the above 
split() function (and fix the imports), your patch should be good to go with 
the HTC One X tests removed.

> Performance optimizations for DeviceMapClient.classify()
> --------------------------------------------------------
>
>                 Key: DMAP-107
>                 URL: https://issues.apache.org/jira/browse/DMAP-107
>             Project: DeviceMap
>          Issue Type: Improvement
>          Components: Java Client
>    Affects Versions: 1.1.0 Java
>            Reporter: Volkan Yazıcı
>             Fix For: 1.1.1 Java
>
>         Attachments: classify.diff
>
>
> This patch removes redundant {{DeviceType}} checks and User-Agent string 
> normalization calls. Performance gain is more than 2x. Check out the 
> devicemap-client-benchmark (introduced in issue DMAP-106) output:
> {code}
> $ export userAgentFile=/path/to/user-agents.txt
> $ wc -l $userAgentFile
> 195325
> $ java \
>     -jar 
> devicemap/java/classifier-benchmark/target/devicemap-client-benchmark.jar \
>     -jvmArgsAppend "-server -XX:+TieredCompilation -XX:+AggressiveOpts 
> -Xms1024m -Xmx4096m -DuserAgentFile=$userAgentFile" \
>     -wi 5 -i 5 -bm avgt -tu ms -f 3 \
>     ".*DeviceMapClientBenchmark.*"
> # Using the most recent trunk.
> Result: 12079.408 ±(99.9%) 1240.628 ms/op [Average]
>   Statistics: (min, avg, max) = (11232.424, 12079.408, 16011.000), stdev = 
> 1160.484
>   Confidence interval (99.9%): [10838.781, 13320.036]
> # Using the enhanced classify().
> Result: 5505.355 ±(99.9%) 441.748 ms/op [Average]
>   Statistics: (min, avg, max) = (5060.269, 5505.355, 6508.699), stdev = 
> 413.211
>   Confidence interval (99.9%): [5063.607, 5947.103]
> {code}
> JVM version: 1.8.0_25
> OS: OS X 10.9.5
> Processor: 2.6 GHz Intel Core i5



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (DMAP-107) Performance optimizations for DeviceMapClient.classify()

Reply via email to