On May 20, 2021, at 03:17, Francois Berenger <mli...@ligand.eu> wrote:
> Weren't the path-based FPs formally described somewhere?

What does "formally" mean?

Daylight rarely participated in the academic literature tradition.

They instead preferred to publish their information directly, as Pat mentions:

On May 21, 2021, at 02:26, Patrick Walters <wpwalt...@gmail.com> wrote:
> There's also some information on path fingerprints in the Daylight Theory 
> Manual 
> https://www.daylight.com/dayhtml/doc/theory/theory.finger.html

If you're looking for a citation, you can try something like:

     Daylight Theory Manual 4.9, Daylight Chemical Information Systems, Inc.,
     Laguna Niguel, CA.
     https://www.daylight.com/dayhtml/doc/theory/theory.finger.html


You can see that this page is the source of the general understanding of 
path fingerprints. Compare Daylight's:

    The fingerprinting algorithm examines the molecule and generates
    the following:

        • a pattern for each atom
        • a pattern representing each atom and its nearest neighbors
          (plus the bonds that join them)
        • a pattern representing each group of atoms and bonds connected
          by paths up to 2 bonds long
        • ... atoms and bonds connected by paths up to 3 bonds long
        • ... continuing, with paths up to 4, 5, 6, and 7 bonds long.

    ...
    the output of which is a set of bits (typically 4 or 5 bits
    per pattern); the set of bits thus produced is added (with a
    logical OR) to the fingerprint.


with, for example, the textbook "An Introduction to Chemoinformatics" by Leach 
and Gillet:

   It is produced by generating all possible linear paths of connected
   atoms through the molecule containing between one and a defined number
   of atoms (typically seven).

    ...

   Each of these paths in turn serves as the input to a second program
   that uses a hashing procedure to set a small number of bits (typically four
   or five) to “1” in the fingerprint bitstring.

The "typically" in the latter is there because those are the defaults in 
Daylight's algorithm.
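
To make the two descriptions above concrete, here is a minimal sketch of the general idea: enumerate the linear paths, hash each one, and OR a few bits per path into the fingerprint. It is not Daylight's implementation. The dictionary-based molecule representation, the path strings, the use of SHA-256, and all of the function names are assumptions made for illustration, and ring-closure handling is left out.

```python
import hashlib


def enumerate_paths(atoms, bonds, max_bonds=7):
    """Return canonical strings for all linear paths of 0..max_bonds bonds."""
    # Build an adjacency list: atom index -> [(neighbor index, bond symbol)].
    adjacency = {index: [] for index in atoms}
    for (i, j), symbol in bonds.items():
        adjacency[i].append((j, symbol))
        adjacency[j].append((i, symbol))

    found = set()

    def walk(path, tokens):
        # A path and its reverse are the same pattern; keep the lesser string.
        forward = "".join(tokens)
        backward = "".join(reversed(tokens))
        found.add(min(forward, backward))
        if len(path) - 1 == max_bonds:
            return
        for neighbor, symbol in adjacency[path[-1]]:
            if neighbor not in path:  # linear paths only, no atom revisited
                walk(path + [neighbor], tokens + [symbol, atoms[neighbor]])

    for start in atoms:
        walk([start], [atoms[start]])
    return found


def bit_positions(pattern, n_bits, fp_size):
    """Hash a pattern to n_bits positions (assumed scheme: salted SHA-256)."""
    positions = []
    for k in range(n_bits):
        digest = hashlib.sha256(f"{pattern}/{k}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % fp_size)
    return positions


def path_fingerprint(atoms, bonds, max_bonds=7, n_bits=4, fp_size=1024):
    """OR the bits for every path pattern into one integer bitstring."""
    patterns = enumerate_paths(atoms, bonds, max_bonds)
    fp = 0
    for pattern in patterns:
        for position in bit_positions(pattern, n_bits, fp_size):
            fp |= 1 << position  # the logical OR from the Daylight description
    return fp, patterns


# Toy input: the heavy atoms of ethanol, C-C-O.
ethanol_atoms = {0: "C", 1: "C", 2: "O"}
ethanol_bonds = {(0, 1): "-", (1, 2): "-"}
fp, patterns = path_fingerprint(ethanol_atoms, ethanol_bonds)
print(sorted(patterns))
print(bin(fp).count("1"), "bits set")
```

Because different patterns can hash to the same position, the number of bits set can be smaller than the number of patterns times n_bits.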


What I've learned, in researching this detail, is that Daylight's "typically 4 
or 5 bits per pattern" meant a bit count 'n' which was a function of the path 
length. That detail was known to a few Daylight users that I've talked to, and 
I've been told it was guided by information-theoretic reasoning.

In RDKit and Open Babel, 'n' is fixed for the given fingerprint type.
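
As a rough sketch of the difference between a fixed 'n' and a length-dependent 'n', the following compares the two on a handful of hand-written path patterns. The schedule in length_dependent_n is purely hypothetical: I don't know Daylight's actual function, nor even whether 'n' grows or shrinks with path length. The fixed value, the hashing scheme, and all names are likewise assumptions, and none of this is the RDKit or Open Babel API.

```python
import hashlib


def bit_positions(pattern, n_bits, fp_size=1024):
    """Hash a pattern to n_bits positions (assumed scheme: salted SHA-256)."""
    positions = []
    for k in range(n_bits):
        digest = hashlib.sha256(f"{pattern}/{k}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % fp_size)
    return positions


def fixed_n(num_bonds):
    """Same number of bits for every path, whatever its length."""
    return 2  # arbitrary value chosen for illustration


def length_dependent_n(num_bonds):
    """HYPOTHETICAL schedule, not Daylight's: 4 bits for short paths, 5 for long."""
    return 4 if num_bonds < 4 else 5


def fingerprint(patterns, bits_for_length, fp_size=1024):
    """OR together the bits of each (pattern string, bond count) pair."""
    fp = 0
    for text, num_bonds in patterns:
        n_bits = bits_for_length(num_bonds)
        for position in bit_positions(text, n_bits, fp_size):
            fp |= 1 << position
    return fp


# Hand-written patterns with their bond counts.
patterns = [("C", 0), ("C-C", 1), ("C-C-O", 2), ("C-C-C-C-N", 4)]

fp_fixed = fingerprint(patterns, fixed_n)
fp_variable = fingerprint(patterns, length_dependent_n)
print(bin(fp_fixed).count("1"), "bits set with fixed n")
print(bin(fp_variable).count("1"), "bits set with length-dependent n")
```

In this sketch the first k positions for a pattern do not depend on 'n', so the fixed-n fingerprint is a subset of the length-dependent one. That is a property of this toy hashing scheme, not a claim about any real toolkit.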

In Indigo, which includes non-linear subgraphs, 'n' is a function of both the 
size and the internal symmetry. The Indigo authors said that was more effective 
in their experiments. (I would have to dig up old emails to confirm that 
memory.)

I know of no papers which explore this detail. I always figured it would be a 
good Master's paper for someone interested in old-school cheminformatics.

Cheers,


                                Andrew
                                da...@dalkescientific.com



