CoreEventListener + service to build 64-bit semantic hashes of documents with
text content (PDF, office, xhtml, ...)
------------------------------------------------------------------------------------------------------------------
Key: NXSEM-8
URL: http://jira.nuxeo.org/browse/NXSEM-8
Project: Nuxeo Semantic R&D
Issue Type: Task
Reporter: Olivier Grisel
Assignee: Olivier Grisel
Using stacked denoising autoencoders (SDA) [1], spectral hashing (SH) [2],
locality sensitive hashing (LSH) [3][4] or binary reconstructive embeddings
(BRE) [5], build a service that is able to extract 64-bit colliding hashes of
documents such that low Hamming distances in the hash space mean highly related
content in the implicit human semantic space.
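To make the idea concrete, here is a minimal sketch of one member of the LSH family, random hyperplane hashing (sign of random projections), applied to sparse TF-IDF vectors. It is only an illustration and not the implementation planned for this issue: the function names, the dictionary-based vector layout and the sample documents are all assumptions, and a real service would more likely rely on lshkit or a comparable library.

```python
import random

HASH_BITS = 64  # width of the semantic hash requested in this issue


def make_hyperplanes(vocabulary, bits=HASH_BITS, seed=0):
    """Draw one Gaussian coefficient per (term, bit) pair (hypothetical helper)."""
    rng = random.Random(seed)
    # Sort the terms so the same seed always yields the same hyperplanes.
    return {term: [rng.gauss(0.0, 1.0) for _ in range(bits)]
            for term in sorted(vocabulary)}


def semantic_hash(tfidf, hyperplanes, bits=HASH_BITS):
    """Project a sparse TF-IDF vector on each hyperplane and keep the signs."""
    sums = [0.0] * bits
    for term, weight in tfidf.items():
        plane = hyperplanes.get(term)
        if plane is None:  # term outside the known vocabulary: ignored
            continue
        for i in range(bits):
            sums[i] += weight * plane[i]
    code = 0
    for i, value in enumerate(sums):
        if value >= 0.0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Made-up TF-IDF weights, for illustration only.
    doc_a = {"semantic": 0.9, "hashing": 0.8, "document": 0.3}
    doc_b = {"semantic": 0.8, "hashing": 0.9, "search": 0.2}
    doc_c = {"invoice": 0.9, "payment": 0.7, "customer": 0.4}
    planes = make_hyperplanes(set(doc_a) | set(doc_b) | set(doc_c))
    hash_a, hash_b, hash_c = (semantic_hash(d, planes) for d in (doc_a, doc_b, doc_c))
    print("a vs b:", hamming(hash_a, hash_b))
    print("a vs c:", hamming(hash_a, hash_c))
```

With this scheme the expected Hamming distance grows with the angle between the two TF-IDF vectors, which is why documents sharing most of their vocabulary tend to collide on most bits. Learned methods such as SDA, SH or BRE aim for codes that track semantic relatedness more closely than purely random projections can.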
The lshkit [6] project provides SH and LSH implementations. The libsgd project
[7] should also soon provide an SDA implementation, albeit with a dense
representation that might not scale to the several tens of hundreds of
dimensions of the documents' TF-IDF input space. Maybe SDA and libsgd should
first be tested on picture semantic hashing instead.
Before starting the implementation of this service, several algorithms /
implementations should be benchmarked on a small tokenized / TF-IDF'ed Wikipedia
subset to get a grasp of the performance requirements (CPU time / memory usage)
of each option.
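A benchmark of this kind could start from a very small harness such as the sketch below. It assumes each candidate is wrapped as a Python callable taking one TF-IDF dictionary; the names bench and placeholder_hash are hypothetical, and placeholder_hash is only a stand-in, not one of the algorithms listed above. Note that tracemalloc only sees memory allocated by the Python interpreter, so native libraries such as lshkit would need an OS-level measurement instead.

```python
import time
import tracemalloc


def bench(name, hash_fn, documents):
    """Measure CPU time and peak Python heap usage of hashing all documents."""
    # First pass: CPU time only, without the tracing overhead.
    start = time.process_time()
    codes = [hash_fn(doc) for doc in documents]
    cpu_seconds = time.process_time() - start

    # Second pass: peak memory allocated by Python objects while hashing.
    tracemalloc.start()
    for doc in documents:
        hash_fn(doc)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {"name": name, "docs": len(codes),
            "cpu_seconds": cpu_seconds, "peak_bytes": peak_bytes}


def placeholder_hash(tfidf):
    """Stand-in for a real algorithm: a 64-bit value derived from the term set."""
    return hash(frozenset(tfidf)) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    # Made-up corpus; the real benchmark would load the Wikipedia subset.
    corpus = [{"semantic": 0.9, "hashing": 0.8}, {"invoice": 0.7}] * 1000
    print(bench("placeholder", placeholder_hash, corpus))
```

Running the same harness over each candidate with the same corpus gives comparable figures, which is enough for a first ranking before investing in a full integration.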
The end user goal of semantic hashing is to complement the fulltext
indexes with another very scalable implementation of content-based search
(using keyword queries), or to allow browsing the content of the Nuxeo document
repository based on document similarities instead of workspace
location. Such a browsing user interface could be built upon the JS
InfoViz Toolkit lib [8].
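As a rough illustration of similarity-based browsing, the sketch below ranks stored documents by Hamming distance to a query hash. The in-memory dictionary, the most_similar function and the document ids are assumptions made for this example; a scalable service would avoid the linear scan, for instance by probing hash buckets that differ from the query by a few bits.

```python
def hamming(a, b):
    """Number of differing bits between two integer hash codes."""
    return bin(a ^ b).count("1")


def most_similar(query_code, index, limit=5, max_distance=8):
    """Return (distance, doc_id) pairs within max_distance, closest first."""
    hits = []
    for doc_id, code in index.items():
        distance = hamming(query_code, code)
        if distance <= max_distance:
            hits.append((distance, doc_id))
    hits.sort()
    return hits[:limit]


if __name__ == "__main__":
    # Hypothetical index mapping document ids to their 64-bit hashes.
    index = {
        "doc-a": 0b1011,
        "doc-b": 0b1010,
        "doc-c": 0xFFFFFFFFFFFFFFFF,
    }
    print(most_similar(0b1011, index))
```

The same function serves both use cases mentioned above: the query hash can come from the document currently displayed (browsing by similarity) or from hashing the keywords of a search query.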
[1] http://www.cs.toronto.edu/.../aistats_2009_robust_interdependent.pdf
[2] http://people.csail.mit.edu/torralba/.../spectralhashing.pdf
[3] http://www.mit.edu/~andoni/LSH/
[4] http://www.cs.utexas.edu/~grauman/papers/iccv2009_klsh.pdf
[5] http://www.eecs.berkeley.edu/~kulis/pubs/hashing_bre_tr.pdf
[6] http://lshkit.sourceforge.net/
[7] http://bitbucket.org/ogrisel/libsgd/src/
[8] http://thejit.org/