[
https://issues.apache.org/jira/browse/TEXT-155?focusedWorklogId=210066&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-210066
]
ASF GitHub Bot logged work on TEXT-155:
---------------------------------------
Author: ASF GitHub Bot
Created on: 08/Mar/19 10:32
Start Date: 08/Mar/19 10:32
Worklog Time Spent: 10m
Work Description: aherbert commented on issue #109: TEXT-155: Add a
generic IntersectionSimilarity measure
URL: https://github.com/apache/commons-text/pull/109#issuecomment-470882543
I'd like to change this a bit to support the `Bag` functionality. One big
nit I had was that:
```
"aa" vs "aa" == 1
"aa" vs "aaaaaaaa" == 1 (with no duplicates)
"aa" vs "aaaaaaaa" == 2 / 8 = 0.25 (with duplicates)
```
I will update the API so that the `Function` is altered to a
`Function<CharSequence, Collection<T>>`. The user then just needs to split the
`CharSequence` into a collection. If it is a `Set` then the assumption would be
that the user would like to perform an intersection without duplicates.
Anything else can be converted internally to a `Bag` representation. I'll put
the bag functionality as a private class and only borrow what is needed from
commons-collections. The amount of code will be minimal as it is just a
`HashMap<T, MutableInteger>` so the count of the items in the set can be
adjusted.
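A minimal sketch of the set versus bag distinction described above, not the code in the pull request. The class name `BagIntersectionSketch`, its helper methods and the two converter constants are hypothetical and exist only for illustration. It uses a plain `HashMap<T, Integer>` in place of the `HashMap<T, MutableInteger>` mentioned above, and it assumes the similarity shown in the example is the Jaccard index (intersection size divided by union size).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch; not the code in the pull request. */
class BagIntersectionSketch {

    /** Splits a sequence into single characters, keeping duplicates (a bag). */
    static final Function<CharSequence, Collection<Character>> TO_LIST = cs -> {
        final List<Character> list = new ArrayList<>(cs.length());
        for (int i = 0; i < cs.length(); i++) {
            list.add(cs.charAt(i));
        }
        return list;
    };

    /** Splits a sequence into distinct characters (a set). */
    static final Function<CharSequence, Collection<Character>> TO_SET =
        cs -> new HashSet<>(TO_LIST.apply(cs));

    /** Counts the occurrences of each element in the collection. */
    static <T> Map<T, Integer> toCounts(Collection<T> items) {
        final Map<T, Integer> counts = new HashMap<>();
        for (final T item : items) {
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    /** Intersection size: for each element, the smaller of its two counts. */
    static <T> int intersection(Collection<T> a, Collection<T> b) {
        final Map<T, Integer> countsA = toCounts(a);
        final Map<T, Integer> countsB = toCounts(b);
        int size = 0;
        for (final Map.Entry<T, Integer> e : countsA.entrySet()) {
            size += Math.min(e.getValue(), countsB.getOrDefault(e.getKey(), 0));
        }
        return size;
    }

    /** Jaccard index, with union size = size(a) + size(b) - intersection. */
    static <T> double jaccard(Collection<T> a, Collection<T> b) {
        final int intersection = intersection(a, b);
        final int union = a.size() + b.size() - intersection;
        // Assumption: two empty inputs are treated as identical.
        return union == 0 ? 1.0 : (double) intersection / union;
    }
}
```

With `TO_SET` the inputs `"aa"` and `"aaaaaaaa"` both reduce to a single element and score 1; with `TO_LIST` the intersection is 2 and the union is 8, giving 0.25.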
This will get it working. If a dependency on commons-collections is not a
problem then the internal class can be removed later without breaking binary
compatibility.
I'll also change F1Score to SorensenDiceCoefficient so the name matches the
parallel work in #103.
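For reference, the Sorensen-Dice coefficient of two sets A and B is twice the intersection size divided by the sum of the two set sizes. A minimal sketch over character bigrams follows; the class `SorensenDiceSketch` and its methods are hypothetical names for illustration and are not taken from this pull request or from #103. Returning 1 when neither input has any bigrams is an assumption of the sketch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; not the code in either pull request. */
class SorensenDiceSketch {

    /** Collects the distinct pairs of adjacent characters in the sequence. */
    static Set<String> bigrams(CharSequence cs) {
        final Set<String> set = new HashSet<>();
        for (int i = 0; i + 1 < cs.length(); i++) {
            set.add(cs.subSequence(i, i + 2).toString());
        }
        return set;
    }

    /** Computes 2 * |intersection| / (|A| + |B|) over the bigram sets. */
    static double dice(CharSequence left, CharSequence right) {
        final Set<String> a = bigrams(left);
        final Set<String> b = bigrams(right);
        final int total = a.size() + b.size();
        if (total == 0) {
            // Assumption: inputs with no bigrams at all are treated as identical.
            return 1.0;
        }
        final Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return 2.0 * common.size() / total;
    }
}
```

For example, `"night"` and `"nacht"` each produce four bigrams and share only `"ht"`, so the coefficient is 2 * 1 / 8 = 0.25.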
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 210066)
Time Spent: 1h 50m (was: 1h 40m)
> Add a generic SetSimilarity measure
> -----------------------------------
>
> Key: TEXT-155
> URL: https://issues.apache.org/jira/browse/TEXT-155
> Project: Commons Text
> Issue Type: New Feature
> Affects Versions: 1.6
> Reporter: Alex D Herbert
> Priority: Minor
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> The {{SimilarityScore<T>}} interface can be used to compute a generic result.
> I propose to add a class that can compute the intersection between two sets
> formed from the characters. The sets must be formed from the {{CharSequence}}
> input to the {{apply}} method using a {{Function<CharSequence, Set<T>>}} to
> convert the {{CharSequence}}. This function can be passed to the
> {{SimilarityScore<T>}} during construction.
> The result then contains the size of each set and the size of their
> intersection.
> I have created an implementation that can compute the equivalent of the
> {{JaccardSimilarity}} class by creating {{Set<Character>}} and also the
> F1-score using bigrams (pairs of characters) by creating {{Set<String>}}.
> This relates to
> [TEXT-126|https://issues.apache.org/jira/projects/TEXT/issues/TEXT-126] which
> suggested an algorithm for the Sorensen-Dice similarity, also known as the
> F1-score.
> Here is an example:
> {code:java}
> // Match the functionality of the JaccardSimilarity class
> Function<CharSequence, Set<Character>> converter = (cs) -> {
>     final Set<Character> set = new HashSet<>();
>     for (int i = 0; i < cs.length(); i++) {
>         set.add(cs.charAt(i));
>     }
>     return set;
> };
> IntersectionSimilarity<Character> similarity = new IntersectionSimilarity<>(converter);
> IntersectionResult result = similarity.apply("something", "something else");
> {code}
> The result has the size of set A, set B and the intersection between them.
> This class was inspired by my look through the various similarity
> implementations. All of them except the {{CosineSimilarity}} perform single
> character matching between the input {{CharSequence}}s. The
> {{CosineSimilarity}} tokenises using whitespace to create words.
> This more generic type of implementation will allow a user to determine how
> to divide the {{CharSequence}} to create the sets that are compared, e.g.
> single characters, words, bigrams, etc.