Hello everyone,

Greetings!

I am a contributor to the Galaxy project and am currently working on
finding similarities between Galaxy tools (as part of my master's thesis).
The aim is to find, for any Galaxy tool, the tools most similar to it based
on the tools' descriptions and their input/output file types.

For example, the tools similar to "bowtie2" could be "bwa", "bwameth" or
"bwa_mem", among others.

To see the results of this project online, please visit
<https://rawgit.com/anuprulez/similar_galaxy_tools/master/viz/similarity_viz.html>
(works in Firefox and Chrome). The page loads a large JSON file (~100 MB)
asynchronously, so it may take a few seconds before the tools appear in the
select list. Once they are loaded, choose a tool to see the tools most
similar to it. The similar tools are listed in descending order of their
probability scores (the top 20 are shown). The list is a mixture of tools
selected on the basis of the chosen tool's description and of its file
types, so some tools appear because their descriptions or functions are
similar and others because their file types match. There are also a few
plots at the end of the page.
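
To illustrate how such a mixed, ranked list can be produced, here is a
minimal sketch in Python. It is not the project's actual code: the function
name top_similar, the fixed weight of 0.5, the tool "other_tool" and all
score values are made up for illustration only.

```python
import numpy as np

def top_similar(tool_names, desc_scores, filetype_scores, weight=0.5, k=20):
    """Rank tools by a weighted mix of two score vectors, highest first.

    `weight` (an assumed parameter) is the share given to the
    description-based scores; the rest goes to the file-type scores.
    """
    desc = np.asarray(desc_scores, dtype=float)
    ftype = np.asarray(filetype_scores, dtype=float)
    combined = weight * desc + (1.0 - weight) * ftype
    order = np.argsort(-combined)[:k]  # indices in descending score order
    return [(tool_names[i], float(combined[i])) for i in order]

# Made-up scores for a hypothetical query tool.
names = ["bwa", "bwa_mem", "bwameth", "other_tool"]
ranked = top_similar(names, [0.9, 0.8, 0.6, 0.1], [0.7, 0.9, 0.5, 0.2], k=3)
for name, score in ranked:
    print(name, round(score, 3))
```

With k=20 the same function would give a top-20 list like the one on the
page, assuming one score per tool from each of the two sources.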

To read more about this project, see the code repository
<https://github.com/anuprulez/similar_galaxy_tools>.

I used the following approach to compute the similar tools:

   1. Text mining to collect and preprocess the tools' keywords (which
   represent a tool) - BM25 <https://en.wikipedia.org/wiki/Okapi_BM25>
   2. Matrix factorization to extract important concepts (and not just
   words) - Latent Semantic Analysis
   <https://en.wikipedia.org/wiki/Latent_semantic_analysis>
   3. Optimization to combine probability distributions - Gradient Descent
   <https://en.wikipedia.org/wiki/Gradient_descent> and Backtracking Line
   Search <https://en.wikipedia.org/wiki/Backtracking_line_search>
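
The three steps above can be sketched with NumPy as follows. This is a toy
illustration under stated assumptions, not the code from the repository:
the function names, the BM25 parameters k1 and b, the rank, the toy
keyword lists and in particular the squared-error objective used to fit the
mixing weight are all hypothetical.

```python
import numpy as np

def bm25_matrix(docs, k1=1.5, b=0.75):
    """Step 1: build a tools x terms matrix of BM25 weights from token lists."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: j for j, term in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for term in doc:
            tf[i, index[term]] += 1
    df = (tf > 0).sum(axis=0)  # number of tools containing each term
    idf = np.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)
    doc_len = tf.sum(axis=1, keepdims=True)
    norm = k1 * (1.0 - b + b * doc_len / doc_len.mean())
    return idf * tf * (k1 + 1.0) / (tf + norm)

def lsa_similarity(matrix, rank=2):
    """Step 2: truncated SVD, then cosine similarity between reduced rows."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    reduced = u[:, :rank] * s[:rank]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    unit = reduced / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def fit_mixture_weight(a, b, target, w=0.5, iters=50, beta=0.5, c=1e-4):
    """Step 3: gradient descent with backtracking (Armijo) line search.

    Assumed objective: mean squared error between w*a + (1-w)*b and target.
    """
    a, b, target = (np.asarray(x, dtype=float) for x in (a, b, target))
    loss = lambda v: np.mean((v * a + (1.0 - v) * b - target) ** 2)
    for _ in range(iters):
        grad = 2.0 * np.mean((w * a + (1.0 - w) * b - target) * (a - b))
        step = 1.0
        # Shrink the step until the sufficient-decrease condition holds.
        while step > 1e-12 and loss(w - step * grad) > loss(w) - c * step * grad ** 2:
            step *= beta
        w -= step * grad
    return float(w)

# Toy keyword lists for three hypothetical tools.
docs = [["map", "reads", "genome"], ["align", "reads", "genome"],
        ["plot", "histogram"]]
sim = lsa_similarity(bm25_matrix(docs), rank=2)

# Toy score vectors; the target is built with a known weight of 0.7.
a = np.array([0.9, 0.1, 0.5])
b = np.array([0.1, 0.9, 0.5])
w = fit_mixture_weight(a, b, 0.7 * a + 0.3 * b)
```

In this toy example the first two tools share keywords and the third does
not, so the first two end up more similar to each other than to the third
after the rank reduction.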

Further, I have plans to include:

   - A deep learning approach to compute similarity between paragraphs,
   inspired by this work
   <https://cs.stanford.edu/~quocle/paragraph_vector.pdf>
   - Similarity in workflows

If you have any comments or feedback, please write. It would be immensely
helpful.

Thanks a lot!


Regards,
Anup Kumar