Hi Patrick,

Thanks for the data and the feedback!

I hadn't thought about logging malformed structures, which seems like
something good to build into the data registration process. My mentors
(Greg Landrum, Peter Gedeck, and Marco Stenta) and I also discussed
possible approaches to pre-processing molecules and data registration
today. From what I gathered, there's a lot of ongoing discussion about
identity search and what constitutes a duplicate molecule. Could you
clarify what a duplicate means from your end? (e.g., do we count
different tautomers as duplicates?)
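
To make the question concrete, here is a rough sketch of a registration
step in which the definition of "duplicate" is a pluggable function. The
names register_unique and identity_key are hypothetical, made up for this
illustration, and nothing below is RDKit or MongoDB API. The only
assumption is that identity_key returns None when a structure cannot be
parsed.

```python
import logging

logger = logging.getLogger("registration")


def register_unique(smiles_list, identity_key):
    """Deduplicate raw SMILES strings under a caller-supplied identity.

    identity_key (hypothetical hook) maps a raw SMILES string to a hashable
    identity, or returns None if the structure cannot be parsed.
    Returns (unique, malformed): a dict of identity -> first raw string
    seen, and a list of (line number, raw string) for skipped inputs.
    """
    unique = {}
    malformed = []
    for line_no, raw in enumerate(smiles_list, start=1):
        key = identity_key(raw)
        if key is None:
            # Skip the record, but keep a trace of it for later review.
            logger.warning("line %d: could not parse %r", line_no, raw)
            malformed.append((line_no, raw))
            continue
        # Keep only the first record seen for each identity.
        unique.setdefault(key, raw)
    return unique, malformed
```

With RDKit, identity_key could wrap Chem.MolFromSmiles (which returns
None on a parse failure) followed by Chem.MolToSmiles. Treating tautomers
as duplicates would then only mean swapping in a key that normalizes
tautomers first, so the answer to the question above changes one function
rather than the whole pipeline.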

I'll keep the other features you mentioned in mind going forward as
well. While they're not quite optimized yet, we can already run the
queries you mention, create indexes, and canonicalize SMILES.
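
For the similarity side, a minimal unoptimized sketch looks like the
following. It assumes that "80% similarity" means a Tanimoto coefficient
of at least 0.8 and that each fingerprint is stored as a collection of
on-bit indices. Both are assumptions for illustration, and the function
names are hypothetical. In a real setup the linear scan would be replaced
by an indexed database query.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def similar(query_bits, library, threshold=0.8):
    """Return (id, score) pairs with score >= threshold, best first.

    library maps a molecule id to its on-bit indices.
    """
    hits = []
    for mol_id, bits in library.items():
        score = tanimoto(query_bits, bits)
        if score >= threshold:
            hits.append((mol_id, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

For example, a query with bits {1, 2, 3, 4} against a stored fingerprint
{1, 2, 3, 4, 5} shares 4 of 5 distinct bits, so it scores 0.8 and just
passes the default threshold.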

Best,
Chris

On Wed, Jul 8, 2020 at 7:03 PM Patrick Fuller <patrickful...@gmail.com>
wrote:

> Chris,
>
> That sounds like a great idea! Optimized similarity and substructure
> searches are hard to get right, and most libraries leave it as an exercise
> to the reader to choose the right fingerprinting and db structure. I think
> the hardest part will be figuring out a robust end-user experience. You'll
> be writing the "glue" between two domain-specific libraries so you'll need
> extensibility, error handling, and lots of tutorial documentation.
>
> I attached a 1000-line sample of a much larger raw dataset I have lying
> around. I think the script should canonicalize the SMILES, remove
> duplicates, skip and log malformed structures, build fingerprints,
> ensureIndex on the MongoDB, and be able to quickly query things like
> carboxylic acid substructure or 80% similarity to terephthalic acid. Hope
> this helps!
>
> Pat
>
> On Wed, Jul 8, 2020 at 4:27 PM Christopher Zou <cw...@berkeley.edu> wrote:
>
>> Dear RDKit Community,
>>
>> Hope you're all well! I'm a student from UC Berkeley building an
>> integration between RDKit and MongoDB as part of Google Summer of Code.
>>
>> The idea of the project is twofold:
>>
>>    1. Provide tools for building a chemically-intelligent MongoDB
>>    database.
>>    2. Provide high-performance similarity and substructure search that
>>    leverage MongoDB.
>>
>> If you use or would like to use MongoDB as part of your work, I'd love to
>> get some input from you, either via email or through a short call. What
>> kinds of Mongo setups are all of you using? What kinds of information would
>> you like to store? What are some examples of searches? This would help me
>> build something as usable as possible for all of you.
>>
>> Many thanks. I'm incredibly excited to be contributing to this community.
>>
>> Best,
>> Chris
>>
>>
>>
>> --
>> *Christopher Zou *
>> Computer Science and Biochemistry,
>> UC Berkeley '22
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>

-- 
*Christopher Zou *
Computer Science and Biochemistry,
UC Berkeley '22
