GitHub user danielblazevski opened a pull request:
https://github.com/apache/flink/pull/1220
Flink 1745
I added a quadtree data structure for the knn algorithm. @chiwanpark made
originally made a pull request for a kNN algorithm, and we coordinated so that
I incorporate a tree structure. The quadtree scales very well with the number
of training + test points, but scales poorly with the dimension (even the
R-tree scales poorly with the dimension). I added a flag that is automatically
determines whether or not to use the quadtree. My implementation needed to use
the Euclidean or SquaredEuclidean distance since I needed a specific notion of
the distance between a test point and a box in the quadtree. I added another
test KNNQuadTreeSuite in addition to Chiwan Park's KNNITSuite, since C. Park's
parameters will automatically choose the brute-force non-quadtree method.
For more details on the quadtree + how I used it for the KNN query, please
see another branch I created that has a README.md:
https://github.com/danielblazevski/flink/tree/FLINK-1745-devel/flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/nn
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/danielblazevski/flink FLINK-1745
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1220.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1220
----
commit c7e5056c6d273f6f0f841f77e0fdd91ca221602d
Author: Chiwan Park <[email protected]>
Date: 2015-06-30T08:41:25Z
[FLINK-1745] [ml] Add exact k-nearest-neighbor join
commit 9d0c7942c09086324fadb29bdce749683a0d1a7e
Author: danielblazevski <[email protected]>
Date: 2015-09-15T21:49:05Z
modified kNN test to familiarize with Flink and KNN.scala
commit 611248e57166dc549f86f805b590dd4e45cb3df5
Author: danielblazevski <[email protected]>
Date: 2015-09-15T21:49:17Z
modified kNN test to familiarize with Flink and KNN.scala
commit 1fd8231ce194b52b5a1bd55bbc5e135b3fa5775b
Author: danielblazevski <[email protected]>
Date: 2015-09-16T01:26:57Z
nightly commit, minor changes: got the filter to work, working on mapping
the training set to include box lables
commit 15d7d2cb308b23e24c43d103b85a76b0e665cbd3
Author: danielblazevski <[email protected]>
Date: 2015-09-22T02:02:51Z
commit before incporporating quadtree
commit 8f2da8a66516565c59df8828de2715b45397cb7f
Author: danielblazevski <[email protected]>
Date: 2015-09-22T15:49:25Z
did a basic import of QuadTree and Test; to-do: modify QuadTree to allow
KNN.scala to make use of
commit e1cef2c5aea65c6f204caeff6348e2778231f98d
Author: danielblazevski <[email protected]>
Date: 2015-09-22T21:03:04Z
transfered ListBuffers for objects in leaf nodes to Vectors
commit c3387ef2ef59734727b56ea652fdb29af957d20b
Author: danielblazevski <[email protected]>
Date: 2015-09-23T00:41:29Z
basic test on 2D unit box seems to work -- need to generalize, e.g. to
include automated bounding box
commit 48294ff37a5f800e5111280da5a3c03f4375028d
Author: danielblazevski <[email protected]>
Date: 2015-09-23T15:03:06Z
had to debug quadtree -- back to testing 2D
commit 6403ba14e240ed8d67a296ac789e7e00dece800d
Author: danielblazevski <[email protected]>
Date: 2015-09-23T15:22:46Z
Testing 2D looks good, strong improvement in run time compared to
brute-force method
commit 426466a40bc2625f390fe0d912f56a346e46c8f8
Author: danielblazevski <[email protected]>
Date: 2015-09-23T19:04:52Z
added automated detection of bounding box based on min/max values of both
training and test sets
commit c35543b828384aa4ce04d56dfcb3d73db46d1e6d
Author: danielblazevski <[email protected]>
Date: 2015-09-24T00:28:56Z
added automated radius about test point to define localized neighborhood,
result runs. TO-DO: Lots of tests
commit 8e2d2e78f8533d4192aebe9b4baa7efbfa5928a5
Author: danielblazevski <[email protected]>
Date: 2015-09-24T00:54:06Z
Note for future: previous commit passed test of Chiwan Park had in intial
knn implementation
commit d6fd40cb88d6e198e52c368e829bf7d32d432081
Author: danielblazevski <[email protected]>
Date: 2015-09-24T01:56:38Z
Note for future: previous commit passed 3D version of the test that Chiwan
Park had in the intial knn implementation
commit 0ec1f4866157ca073341672e7fe9a50871ac0b7c
Author: danielblazevski <[email protected]>
Date: 2015-09-24T14:27:20Z
changed filename of QuadTreeTest to QuadTreeSuite, about to make test more
comprehensive and similar format to other Flink tests
commit ac81561cad27b65d158ae08fd0fb15bdb51d1c8b
Author: danielblazevski <[email protected]>
Date: 2015-09-24T19:51:32Z
refactored testing of QuadTree, and added more tests
commit b17f82d5ce0214617c8dbc4a387410057d6f3832
Author: danielblazevski <[email protected]>
Date: 2015-09-24T22:49:10Z
added KNNBenchmark to check runtimes
commit 530565835d4b5934fcac9e0e51105bb669fec9be
Author: danielblazevski <[email protected]>
Date: 2015-09-24T22:49:17Z
added KNNBenchmark to check runtimes
commit 1f946cb30450604e92bbd0f5959ce9a60eb4c41b
Author: danielblazevski <[email protected]>
Date: 2015-09-25T01:13:00Z
fixed bug -- in find siblings, needed to search for minimal bounding boxes
commit 22e4eb7b57795ad1ca4392ca1c1a8bdae76afa8e
Author: danielblazevski <[email protected]>
Date: 2015-09-27T03:34:16Z
added more thorough benchmark files; about to modify bounding box to only
bound training set and modify search for the testing set
commit 3723f6b09ec7d45f6444df70a5f699cbf998a4bb
Author: danielblazevski <[email protected]>
Date: 2015-09-27T03:34:21Z
added more thorough benchmark files; about to modify bounding box to only
bound training set and modify search for the testing set
commit c41d3e1029bf81a37cf3594f202b904e2d99e3ac
Author: danielblazevski <[email protected]>
Date: 2015-09-28T03:53:12Z
major simplification in choosing the radius to look at nearby neighbors
commit 7c77ea20fd9e8a0c4a33c81b83187c84b380d6b2
Author: danielblazevski <[email protected]>
Date: 2015-09-28T04:03:59Z
cleaned up; commit before deleting previous sibling search
commit cf4aa5d75611db19040466040c3d29432cb0e5f7
Author: danielblazevski <[email protected]>
Date: 2015-09-28T14:02:51Z
added new devel branch to push temp changes on github
commit ec6ddb0a57136075b4f77616e6e48eb5bcc50a11
Author: danielblazevski <[email protected]>
Date: 2015-10-01T00:17:33Z
cleaned up; removed comments and a unused method
commit 7ed9926d8207b5f59b4ceb968d7ebd732029f5c3
Author: danielblazevski <[email protected]>
Date: 2015-10-01T20:28:31Z
fixed bug in bFiltVect, and renamed to trainingFiltered
commit 4b3bb2ec92396bf754d0d207ffc6853406ce7c39
Author: danielblazevski <[email protected]>
Date: 2015-10-01T21:05:46Z
fixed bug: if there are fewer than maxPerBox total training points, do not
do heap construction, just make siblingsQueue = root.objects
commit 1662b38822dfdfbbca272d377fa3b94f8246e9e6
Author: danielblazevski <[email protected]>
Date: 2015-10-01T22:03:42Z
added metric to constructor, and added a flag to test whether it is
Euclidean or SquaredEuclidean
commit f654b841cc8d91fa861f188469831404c288b227
Author: danielblazevski <[email protected]>
Date: 2015-10-01T22:48:51Z
changed name of test that uses the Quadtree along with KNN -- modified from
CHiwan Park's test to ensure flag to use Quadtree will pass
commit 7928798281dba5554eeb63df7e67400a42e7a381
Author: danielblazevski <[email protected]>
Date: 2015-10-01T22:54:25Z
fixed QuadTree test to conform to using a min-heap
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---