CoreEventListener + service to build 64-bit semantic hashes of documents with 
text content (PDF, office, xhtml, ...)
------------------------------------------------------------------------------------------------------------------

                 Key: NXSEM-8
                 URL: http://jira.nuxeo.org/browse/NXSEM-8
             Project: Nuxeo Semantic R&D
          Issue Type: Task
            Reporter: Olivier Grisel
            Assignee: Olivier Grisel


Using stacked denoising autoencoders (SDA) [1], spectral hashing (SH) [2], 
locality sensitive hashing (LSH) [3][4] or binary reconstructive embeddings 
(BRE) [5], build a service that extracts 64-bit colliding hashes of documents 
such that low Hamming distances in the hash space mean highly related content 
in the implicit human semantic space.
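
As a rough illustration of the idea, here is a minimal sketch of a 64-bit 
document fingerprint compared by Hamming distance. It uses SimHash, a simple 
member of the LSH family, and is not one of the SDA, SH or BRE methods cited 
above. The helper names (term_hash, simhash, hamming) and the use of raw term 
counts as weights are assumptions made for this example only.

```python
import hashlib
from collections import Counter

HASH_BITS = 64


def term_hash(term):
    """Stable 64-bit hash of a term (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Fold a sparse {term: weight} vector into a 64-bit fingerprint."""
    acc = [0.0] * HASH_BITS
    for term, weight in weights.items():
        h = term_hash(term)
        for bit in range(HASH_BITS):
            # Each term votes +weight or -weight on every bit position.
            acc[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(acc):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Assumption: plain term counts stand in for real TF-IDF weights here.
doc_a = Counter("semantic hashing of text documents".split())
doc_b = Counter("semantic hashing of office documents".split())
print(hamming(simhash(doc_a), simhash(doc_b)))
```

Documents sharing most of their weighted terms tend to agree on most bits, so 
their fingerprints tend to be close in Hamming distance. A learned method 
such as SH or SDA would replace simhash() while keeping the same comparison.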

The lshkit [6] project provides SH and LSH implementations. The libsgd project 
[7] should also soon provide an SDA implementation, albeit with a dense 
representation that might not scale to the several tens of hundreds of 
dimensions of the documents' TF-IDF input space. SDA and libsgd should perhaps 
be tested first on picture semantic hashing instead.

Before starting the implementation of this service, several algorithms / 
implementations should be benchmarked on a small tokenized / TF-IDF'ed Wikipedia 
subset to get a grasp of the performance requirements (CPU time / memory usage) 
of each option.
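
A benchmark of this kind could be sketched as follows. This is only an 
assumed harness: benchmark() and toy_hash() are hypothetical names, the CRC32 
stand-in is not a semantic hash, and tracemalloc only sees memory allocated 
by Python itself, not by native libraries such as lshkit.

```python
import time
import tracemalloc
import zlib


def benchmark(name, hash_fn, documents):
    """Measure CPU time and peak Python memory while hashing documents."""
    tracemalloc.start()
    start = time.process_time()  # CPU time of this process, not wall time
    hashes = [hash_fn(doc) for doc in documents]
    cpu_seconds = time.process_time() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "name": name,
        "docs": len(hashes),
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
    }


def toy_hash(text):
    """Placeholder candidate: CRC32 of the text (hypothetical stand-in)."""
    return zlib.crc32(text.encode("utf-8"))


# Assumption: a tiny in-memory corpus replaces the Wikipedia subset.
corpus = ["first tokenized document", "second tokenized document"]
print(benchmark("crc32 placeholder", toy_hash, corpus))
```

Each candidate implementation would be passed in as hash_fn and run on the 
same corpus so that the reported numbers are comparable.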

The end user goal of semantic hashing is to complement the fulltext indexes 
with another very scalable implementation of content-based search (using 
keyword queries), or to allow browsing the content of the Nuxeo document 
repository based on document similarities instead of workspace location. 
Such a browsing user interface could be built upon the JS InfoViz Toolkit 
lib [8].
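
Similarity browsing over stored hashes could look roughly like the sketch 
below, which lists the documents whose hash lies within a given Hamming 
radius of a query hash. The neighbors() function, the in-memory dict index 
and the document ids are assumptions for illustration; a real service would 
need an index structure that avoids scanning every hash.

```python
def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def neighbors(query_hash, index, max_distance=3):
    """Return (distance, doc_id) pairs within max_distance, closest first."""
    hits = []
    for doc_id, doc_hash in index.items():
        distance = hamming(query_hash, doc_hash)
        if distance <= max_distance:
            hits.append((distance, doc_id))
    return sorted(hits)


# Hypothetical index mapping document ids to (shortened) hash values.
index = {
    "doc-a": 0b11110000,
    "doc-b": 0b11110001,
    "doc-c": 0b00001111,
}
print(neighbors(0b11110000, index, max_distance=2))
# prints [(0, 'doc-a'), (1, 'doc-b')]
```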

[1] http://www.cs.toronto.edu/.../aistats_2009_robust_interdependent.pdf
[2] http://people.csail.mit.edu/torralba/.../spectralhashing.pdf
[3] http://www.mit.edu/~andoni/LSH/
[4] http://www.cs.utexas.edu/~grauman/papers/iccv2009_klsh.pdf
[5] http://www.eecs.berkeley.edu/~kulis/pubs/hashing_bre_tr.pdf
[6] http://lshkit.sourceforge.net/
[7] http://bitbucket.org/ogrisel/libsgd/src/ 
[8] http://thejit.org/
