Re: [Cdk-user] Raw fingerprints impossible to calculate

2020-02-25 Thread John Mayfield
Okay,

I'm going to presume you want to search the data, i.e. to retrieve similar
compounds or substructures. If not, then just store the hexadecimal
fingerprint.
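
For the "just store it" case, a minimal sketch of a hex round trip on a plain java.util.BitSet (the kind of object asBitSet() returns). The class and method names are hypothetical, and the byte order is simply whatever BitSet.toByteArray() produces (byte 0 holds bits 0-7); it is an assumption, not a CDK or MongoDB convention. Note that toByteArray() drops trailing zero bytes, so the hex string length varies with the highest set bit.

```java
import java.util.BitSet;

// Hypothetical helper, not part of the CDK.
final class FingerprintHex {

    /** Encodes the set bits as lowercase hex, two characters per byte. */
    static String toHex(BitSet bits) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bits.toByteArray()) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Decodes a string produced by toHex back into a BitSet. */
    static BitSet fromHex(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return BitSet.valueOf(bytes);
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet(1024);
        bits.set(43);
        bits.set(46);
        bits.set(51);
        String hex = toHex(bits);
        System.out.println(hex);
        // The decoded set has the same set bits as the original.
        System.out.println(fromHex(hex).equals(bits));
    }
}
```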

It's not impossible to do searching in MongoDB; see a talk from Matt Swain,
... and my follow-ups:
http://efficientbits.blogspot.com/2014/11/memory-mapped-fingerprint-index-part-i.html
http://efficientbits.blogspot.com/2014/12/memory-mapped-fingerprint-index-part-ii.html

However, my view is (as I make clear in those blog posts) that MongoDB is
the wrong technology for this, but you could convert the binary
fingerprint to a vector. In fact *toString* works well:

System.out.println(new Fingerprinter().getBitFingerprint(mol).asBitSet().toString());


{43, 46, 51, 60, 65, 70, 72, 86, 95, 99, 111, 114, 123, 128, 144, 157, 158,
161, 166, 174, 185, 188, 204, 213, 222, 253, 271, 275, 278, 311, 315, 320,
335, 364, 371, 379, 390, 409, 446, 449, 463, 486, 498, 520, 523, 535, 540,
565, 574, 586, 588, 611, 628, 632, 637, 647, 649, 655, 667, 725, 742, 756,
770, 793, 845, 859, 865, 918, 951, 954, 959, 1015}

You could then use and/or queries to find fingerprint subsets or compute
Tanimotos etc.
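
A minimal sketch of the Tanimoto part, assuming the fingerprints are kept as lists of set-bit indices in the text form shown above. It is plain Java on java.util.BitSet rather than a MongoDB query, and the class and method names are hypothetical. The coefficient is the number of bits set in both fingerprints divided by the number set in either.

```java
import java.util.BitSet;

// Hypothetical helper, not part of the CDK.
final class TanimotoSketch {

    /** Parses the text produced by BitSet.toString(), e.g. "{43, 46, 51}". */
    static BitSet parse(String text) {
        BitSet bits = new BitSet();
        String body = text.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the braces
        if (body.isEmpty()) {
            return bits;
        }
        for (String token : body.split(",")) {
            bits.set(Integer.parseInt(token.trim()));
        }
        return bits;
    }

    /** Tanimoto coefficient: |A and B| / |A or B|; defined here as 0 for two empty sets. */
    static double tanimoto(BitSet a, BitSet b) {
        BitSet both = (BitSet) a.clone();
        both.and(b);
        int common = both.cardinality();
        int union = a.cardinality() + b.cardinality() - common;
        return union == 0 ? 0.0 : (double) common / union;
    }

    public static void main(String[] args) {
        BitSet a = parse("{43, 46, 51, 60}");
        BitSet b = parse("{46, 51, 60, 65}");
        System.out.println(tanimoto(a, b));
    }
}
```

The same two counts (bits in common, bits in either) are all a database-side implementation would need to compute from the stored integer arrays.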

John

On Mon, 24 Feb 2020 at 13:44, Maria Sorokina 
wrote:

> I see the problem.
>
> Well, originally, I wanted to check what the raw fingerprints look like.
> I am storing all the data (and the fingerprints) in MongoDB, and I am still
> not sure whether, if I save the BitFingerprints directly in there (which is
> possible when the field has an Object type), they will be parseable by
> the mongo engine as fingerprints (without retrieving them to be read with
> CDK). So this is why I wanted to check the raw fingerprints, as they should
> be in a more JSON-friendly format, and the mongo engine would be able to
> read those integers and strings for further similarity search.
>
> Kind regards,
> Maria
>
>
> Dr. Maria Sorokina
> Steinbeck Research Group
> Analytical Chemistry - Cheminformatics and Chemometrics
> Friedrich-Schiller-University Jena, Germany
> http://cheminf.uni-jena.de
>
> On 21 Feb 2020 at 19:31, John Mayfield wrote:
>
> Okay looking at it the Substructure fingerprint would be easy to adapt...
> but it's not hard to just count the substructures. Utility code like that
> is difficult to justify, every line is more to maintain.
>
> The other problem is I don't like the fingerprint APIs so it's a toss-up
> between using effort to implement something I (or hopefully someone else)
> will ultimately rewrite in future. "Deprecated on arrival" I believe Egon
> has said before.
>
> On Fri, 21 Feb 2020 at 18:25, John Mayfield 
> wrote:
>
>> What do you think the "raw" fingerprint is? Why would you expect it for
>> the Substructure one?
>>
>> On Fri, 21 Feb 2020 at 09:47, Maria Sorokina 
>> wrote:
>>
>>> I tried 7 fingerprinters in total (PubChem, Substructure, MACCS,
>>> KlekotaRoth, Circular, ShortestPath and Hybridization) and none worked. For
>>> some, I'm not surprised, but I was really expecting to have the raw
>>> fingerprints for the Substructure one.
>>>
>>>
>>> Dr. Maria Sorokina
>>> Steinbeck Research Group
>>> Analytical Chemistry - Cheminformatics and Chemometrics
>>> Friedrich-Schiller-University Jena, Germany
>>> http://cheminf.uni-jena.de
>>>
>>> On 21 Feb 2020 at 10:39, John Mayfield wrote:
>>>
>>> ... I do have some patches for an updated fingerprint API stack that
>>> would also add this in to more places. Essentially it was added to the
>>> public API but only implemented in a few places and left as a "ToDo"
>>> elsewhere. Might be something for the hack-a-thon.
>>>
>>> I should say PubChem fingerprints are binary in nature though, so you
>>> would probably never want the RAW version. *getBitFingerprint()* is
>>> always implemented.
>>>
>>> John
>>>
>>> On Fri, 21 Feb 2020 at 09:34, John Mayfield 
>>> wrote:
>>>
 Hi Maria,

 Not all fingerprinters support the "RAW" and count options.

 John

 On Fri, 21 Feb 2020 at 09:31, Maria Sorokina 
 wrote:

> Dear community,
>
> It is decidedly the substructure search and fingerprinting period of the
> year!
>
> I want to create (and store) raw fingerprints of a range of different
> fingerprint types for a large number of complex molecules (natural
> products).
>
> For example this:
>
> PubchemFingerprinter pubchemFingerprinter =
>     new PubchemFingerprinter(SilentChemObjectBuilder.getInstance());
>
> System.out.println(pubchemFingerprinter.getRawFingerprint(myAtomContainer));
>
> For all my molecules I am getting an "UnsupportedOperationException",
> which according to the documentation only means that the fingerprinter
> cannot produce the raw fingerprint.
> I am using the latest (2.3) version of the CDK.
> Can anybody help me with this issue?
>
>
> Kind regards,
> 

Re: [Cdk-user] Substructure search using ShortestPathFingerprinter

2020-02-25 Thread John Mayfield
Yes good idea, I added a comment at the bottom but it does explicitly say
that at the top.

On Tue, 25 Feb 2020 at 08:43, nicepeopleproject 
wrote:

> Thank you!
> The documentation for the ShortestPathFingerprinter class says "Fingerprints
> allow for a fast screening step to exclude candidates for a substructure
> search in a database. They are also a means for determining the similarity
> of chemical structures.". Perhaps it’s worth removing so that there are
> no contradictions.
>
> On Thu, 20 Feb 2020 at 18:28, John Mayfield wrote:
>
>> I've added a warning in the doc, there was already a warning on MACCS 166
>> keys.
>>
>> https://github.com/cdk/cdk/commit/82cb4f8d49283e117696f40d09538c70790a18fd
>>
>> On Thu, 20 Feb 2020 at 15:20, John Mayfield 
>> wrote:
>>
>>> *wrote :-)
>>>
>>> On Thu, 20 Feb 2020 at 15:20, John Mayfield 
>>> wrote:
>>>
 Only *Fingerprinter* or *ExtendedFingerprinter* obey this transitivity
 property.

 Relevant post I wrong in 2015:
 https://nextmovesoftware.com/blog/2015/02/16/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration/
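
To make the screening condition concrete, here is a minimal sketch using the two bit sets quoted in the original question of this thread. A fingerprint screen only keeps a target when every bit set in the query is also set in the target; that is the property a fingerprint must guarantee for true substructures before it can be used this way. The class and method names are hypothetical, and the check works on plain java.util.BitSet objects.

```java
import java.util.BitSet;

// Hypothetical helper, not part of the CDK.
final class ScreenSketch {

    /** True when every bit set in the query is also set in the target. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet missing = (BitSet) query.clone();
        missing.andNot(target); // keeps only the query bits absent from the target
        return missing.isEmpty();
    }

    /** Builds a BitSet from a list of set-bit indices. */
    static BitSet of(int... indices) {
        BitSet bits = new BitSet();
        for (int index : indices) {
            bits.set(index);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet butane = of(115, 503, 540, 653, 893);
        BitSet cyclopentane = of(115, 503, 542, 653, 893);
        // Prints false: bit 540 is set for the query but not for the target,
        // so the screen rejects the target before any atom-by-atom matching.
        System.out.println(passesScreen(butane, cyclopentane));
    }
}
```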

 On Thu, 20 Feb 2020 at 10:44, nicepeopleproject <
 nicepeopleproj...@gmail.com> wrote:

> Hello!
> I'm trying to implement substructure search. As I understand it, the
> ShortestPathFingerprinter is suitable for this. I ran into the following
> problem. I attach two files (in molecules.zip). When using butane.mol as
> the query, the search should find ciclopentane.mol. When I computed the
> BitSet for butane I got:
> {115, 503, 540, 653, 893}
> {115, 503, 542, 653, 893} - for cyclopentane.
> So I cannot find cyclopentane. Is there a way to make it work?
>
> --
> Best regards,
> Николаев Артём
> ___
> Cdk-user mailing list
> Cdk-user@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/cdk-user
>

>
> --
> Best regards,
> Николаев Артём
>


Re: [Cdk-user] Substructure search using ShortestPathFingerprinter

2020-02-25 Thread nicepeopleproject
Thank you!
The documentation for the ShortestPathFingerprinter class says "Fingerprints
allow for a fast screening step to exclude candidates for a substructure
search in a database. They are also a means for determining the similarity
of chemical structures.". Perhaps it’s worth removing so that there are no
contradictions.

On Thu, 20 Feb 2020 at 18:28, John Mayfield wrote:

> I've added a warning in the doc, there was already a warning on MACCS 166
> keys.
>
> https://github.com/cdk/cdk/commit/82cb4f8d49283e117696f40d09538c70790a18fd
>
> On Thu, 20 Feb 2020 at 15:20, John Mayfield 
> wrote:
>
>> *wrote :-)
>>
>> On Thu, 20 Feb 2020 at 15:20, John Mayfield 
>> wrote:
>>
>>> Only *Fingerprinter* or *ExtendedFingerprinter* obey this transitivity
>>> property.
>>>
>>> Relevant post I wrong in 2015:
>>> https://nextmovesoftware.com/blog/2015/02/16/for-every-fingerprint-optimisation-there-is-an-equal-and-opposite-fingerprint-deterioration/
>>>
>>> On Thu, 20 Feb 2020 at 10:44, nicepeopleproject <
>>> nicepeopleproj...@gmail.com> wrote:
>>>
 Hello!
 I'm trying to implement substructure search. As I understand it, the
 ShortestPathFingerprinter is suitable for this. I ran into the following
 problem. I attach two files (in molecules.zip). When using butane.mol as
 the query, the search should find ciclopentane.mol. When I computed the
 BitSet for butane I got:
 {115, 503, 540, 653, 893}
 {115, 503, 542, 653, 893} - for cyclopentane.
 So I cannot find cyclopentane. Is there a way to make it work?

 --
 Best regards,
 Николаев Артём

>>>

-- 
Best regards,
Николаев Артём